Skip to content

Conversation

benjaminking
Copy link
Collaborator

@benjaminking benjaminking commented Jun 28, 2025

I am creating a PR for quotation denormalization even though it still needs to be rebased. Eli had thought it would be helpful to go through it and make some comments, even if there were some minor changes still coming.


This change is Reviewable

@benjaminking benjaminking force-pushed the quotation_denormalization branch from 5afb42c to defe7dc Compare June 28, 2025 20:33
@codecov-commenter
Copy link

codecov-commenter commented Jun 28, 2025

Codecov Report

❌ Patch coverage is 99.06030% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.08%. Comparing base (fb65daf) to head (f291d4d).

Files with missing lines Patch % Lines
...e/corpora/punctuation_analysis/quote_convention.py 91.01% 8 Missing ⚠️
...achine/corpora/fallback_quotation_mark_resolver.py 92.75% 5 Missing ⚠️
...on_analysis/depth_based_quotation_mark_resolver.py 97.64% 5 Missing ⚠️
...on_analysis/preliminary_quotation_mark_analyzer.py 98.57% 3 Missing ⚠️
...unctuation_analysis/quotation_mark_string_match.py 96.84% 3 Missing ⚠️
...rpora/punctuation_analysis/quote_convention_set.py 96.93% 3 Missing ⚠️
...a/punctuation_analysis/usfm_structure_extractor.py 95.00% 3 Missing ⚠️
.../quote_convention_detection_resolution_settings.py 92.85% 2 Missing ⚠️
...e_convention_changing_usfm_block_update_handler.py 99.22% 2 Missing ⚠️
.../punctuation_analysis/quote_convention_detector.py 97.29% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #203      +/-   ##
==========================================
+ Coverage   88.89%   91.08%   +2.18%     
==========================================
  Files         282      330      +48     
  Lines       17061    21614    +4553     
==========================================
+ Hits        15167    19687    +4520     
- Misses       1894     1927      +33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting to see this, Ben! I left just a few comments regarding function names, getters/setters, and unused functions. These are representative of similar situations elsewhere in the code. I think these changes could significantly reduce the amount of boiler-plate code necessary as well as improve readability - both of which will make porting much faster. Thank you for your hard work and my apologies for only now seeing that you'd put this PR in. I usually get notified 🤔.

Reviewed 6 of 49 files at r1, all commit messages.
Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions


machine/corpora/analysis/quotation_mark_metadata.py line 47 at r1 (raw file):

        return self.text_segment

    def get_start_index(self) -> int:

Unused function


machine/corpora/analysis/quotation_mark_tabulator.py line 50 at r1 (raw file):

        self.quotation_counts_by_depth_and_direction[key].count_quotation_mark(quotation_mark)

    def _has_depth_and_direction_been_observed(self, depth: int, direction: QuotationMarkDirection) -> bool:

Would _have_depth_and_direction... be better?


machine/corpora/analysis/text_segment.py line 29 at r1 (raw file):

        if self.immediate_preceding_marker != value.immediate_preceding_marker:
            return False
        return True

Why not also check the other properties?


machine/corpora/analysis/text_segment.py line 31 at r1 (raw file):

        return True

    def get_text(self) -> str:

I don't think many of these helper methods are necessary. In cases where the field is not read-only and there's no other logic in the getters and setters, why not just use the bare variable - i.e., text_segment.previous_segment rather than set_.../get_...? If it is read-only, you can use the @property decorator (you can search for lots of examples in the machine.py codebase). That would cut down the size of this boiler-plate code a lot. It would also make the logic elsewhere more succinct and readable.


machine/corpora/analysis/text_segment.py line 43 at r1 (raw file):

        return self.text[index:]

    def get_immediate_preceding_marker_type(self) -> UsfmMarkerType:

Unused function


machine/corpora/analysis/text_segment.py line 62 at r1 (raw file):

    def replace_substring(self, start_index: int, end_index: int, replacement: str) -> None:
        self.text = self.text[:start_index] + replacement + self.text[end_index:]

What about substring_before(start_index) + replacement + substring_after(start_index)?


machine/corpora/analysis/quotation_mark_string_match.py line 48 at r1 (raw file):

        return regex_pattern.search(self.get_quotation_mark()) is not None

    def does_next_character_match(self, regex_pattern: regex.Pattern) -> bool:

Maybe next_character_matches()? I think <noun>_<verb>(e)s() is cleaner than does_<noun>_<verb>(). It's shorter, it sounds more natural in if statements, and I don't think does_... is used elsewhere in machine.py.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 6 of 49 files reviewed, 7 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/text_segment.py line 62 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

What about substring_before(start_index) + replacement + substring_after(start_index)?

*Sorry, end_index, but you get the idea :)

@benjaminking
Copy link
Collaborator Author

Unused function

Removed

Would _have_depth_and_direction... be better?

I have changed this, along with several other similarly-named functions.

Why not also check the other properties?

In this case, the object has references to previous and next instances (a linked list basically), so including it in the equality function would lead to infinite recursion.

I don't think many of these helper methods are necessary. In cases where the field is not read-only and there's no other logic in the getters and setters, why not just use the bare variable - i.e., text_segment.previous_segment rather than set_.../get_...? If it is read-only, you can use the @property decorator (you can search for lots of examples in the machine.py codebase). That would cut down the size of this boiler-plate code a lot. It would also make the logic elsewhere more succinct and readable.

I have made these changes throughout my code. Thanks for the suggestion.

Unused function

Removed.

What about substring_before(start_index) + replacement + substring_after(start_index)?

Good suggestion. Done.

Maybe next_character_matches()? I think <noun>_<verb>(e)s() is cleaner than does_<noun>_<verb>(). It's shorter, it sounds more natural in if statements, and I don't think does_... is used elsewhere in machine.py.

Changed as part of the naming changes I mentioned earlier.

I also did a fair bit of minor refactoring to make things more consistent (that will be the majority of the changes in the commit).

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

Reviewed 1 of 39 files at r2, all commit messages.
Reviewable status: 2 of 49 files reviewed, 1 unresolved discussion


machine/corpora/analysis/text_segment.py line 25 at r2 (raw file):

        if self._index_in_verse != value._index_in_verse:
            return False
        if self._index_in_verse != value._index_in_verse:

I don't think you meant to check this twice. Maybe you meant to check one or more of the other non-TextSegment properties?

@Enkidu93
Copy link
Collaborator

Enkidu93 commented Jul 1, 2025

I'll try and give it a more thorough review soon - thank you for the refactor and cleanup! I'll also begin porting more of the code now. I'm sure Damien will have more comments 🤪.

@benjaminking
Copy link
Collaborator Author

I don't think you meant to check this twice. Maybe you meant to check one or more of the other non-TextSegment properties?

Yup, I meant to check _num_segments_in_verse. Fixed.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - my comments might trickle in now as I notice things. (Also, your replies are showing up in the wrong spot for me in Reviewable 🤔? Do you see that too?)

Reviewable status: 2 of 49 files reviewed, 2 unresolved discussions


machine/corpora/analysis/quote_convention.py line 60 at r3 (raw file):

        return True

    def get_name(self) -> str:

Looks like this method should be replaced with a property. .name is used in QuoteConvetionSet and get_name() elsewhere.


machine/corpora/analysis/quote_convention.py line 63 at r3 (raw file):

        return self.name

    def get_num_levels(self) -> int:

It's your call, but a property might be nice for this as well.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 49 files reviewed, 4 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/quote_convention.py line 73 at r3 (raw file):

    def get_expected_quotation_mark(self, depth: int, direction: QuotationMarkDirection) -> str:
        if depth > len(self.levels) or depth < 1:

Could use get_num_levels/num_levels


machine/corpora/analysis/quote_convention.py line 122 at r3 (raw file):

        return QuoteConvention(self.get_name() + "_normalized", [level.normalize() for level in self.levels])

    def print_summary(self) -> None:

Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of _get_summary_message, you could override __repr__ or __str__.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 49 files at r1, 37 of 39 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @benjaminking and @Enkidu93)


machine/corpora/analysis/quote_convention.py line 122 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Damien might have other thoughts. But I don't like having printing methods in non-command-line-y classes. Maybe instead of _get_summary_message, you could override __repr__ or __str__.

I agree.


machine/corpora/analysis/quotation_mark_tabulator.py line 76 at r3 (raw file):

        return 1 - (num_differences / num_total_quotation_marks)

    def print_summary(self) -> None:

Consider using __str__ or __rep__. If that is inappropriate, you can have the method return a string description instead of calling print directly.


machine/corpora/analysis/quotation_mark_finder.py line 13 at r3 (raw file):

class QuotationMarkFinder:
    quote_pattern = regex.compile(r"(\p{Quotation_Mark}|<<|>>|<|>)", regex.U)

This should be all caps and probably prefixed with an underscore.


machine/corpora/scripture_embed.py line 0 at r3 (raw file):
Is this module used anywhere?


machine/corpora/analysis/quotation_mark_resolution_settings.py line 28 at r3 (raw file):

    @abstractmethod
    def should_rely_on_paragraph_markers(self) -> bool: ...

This could be a property.


machine/corpora/analysis/quote_convention_set.py line 161 at r3 (raw file):

        return (best_quote_convention, best_similarity)

    def print_summary(self) -> None:

Consider using __str__ or __rep__ instead of direct print calls.


machine/corpora/analysis/quote_convention_detector.py line 18 at r3 (raw file):

@dataclass

Consider marking this dataclass as immutable.


machine/corpora/analysis/standard_quote_conventions.py line 4 at r3 (raw file):

from .quote_convention_set import QuoteConventionSet

standard_quote_conventions: QuoteConventionSet = QuoteConventionSet(

The naming convention for constants is all caps.


machine/corpora/analysis/quotation_mark_string_match.py line 16 at r3 (raw file):

    # extra stuff in the regex to handle Western Cham
    letter_pattern: Pattern = regex.compile(r"[\p{L}\U0001E200-\U0001E28F]", regex.U)

You should use all caps for constant names. I think this are only used internally, so they should be prefixed with an underscore.


machine/corpora/analysis/quote_convention.py line 6 at r3 (raw file):

from .quotation_mark_direction import QuotationMarkDirection

quote_normalization_map: Dict[str, str] = {

The naming convention for constants is all caps. If the constant is only used locally in the module, then you should prefix with an underscore.


machine/corpora/analysis/quote_convention.py line 22 at r3 (raw file):

@dataclass

If a dataclass is intended to be immutable, consider using @dataclass(frozen=True).


machine/corpora/analysis/__init__.py line 0 at r3 (raw file):
I think this package name isn't specific enough, unless you were thinking that other kinds of "analysis" classes might end up here in the future. What about punctuation_analysis?


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 104 at r3 (raw file):

class QuotationMarkCategorizer:
    apostrophe_pattern = regex.compile(r"[\'\u2019\u2018]", regex.U)

This should be all caps and prefixed with an underscore.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another batch of comments - wanted to get a chunk in so that they aren't all coming in at once.

Reviewed 1 of 39 files at r2.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/quote_convention_set.py line 25 at r3 (raw file):

        opening_quotation_marks: Set[str] = set()
        closing_quotation_marks: Set[str] = set()
        all_quotation_marks: Set[str] = set()

Since this is just the union of the other two, could we just set it as such downstream?


machine/corpora/analysis/quote_convention_set.py line 27 at r3 (raw file):

        all_quotation_marks: Set[str] = set()

        if len(self._conventions) > 0:

I don't think this if statement is needed


machine/corpora/analysis/quote_convention_set.py line 39 at r3 (raw file):

            if len(all_quotation_marks) > 0:
                self._opening_quotation_mark_regex: Pattern = regex.compile(
                    r"[" + "".join(sorted(list(opening_quotation_marks))) + "]"

Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?


machine/corpora/analysis/quote_convention_set.py line 49 at r3 (raw file):

        if len(opening_quotation_marks) == 0:
            self._opening_quotation_mark_regex = regex.compile(r"")

I don't know how computationally expensive it would be to just make these the defaults and avoid these if statements? Or would that not work in some cases?


machine/corpora/analysis/quote_convention_set.py line 55 at r3 (raw file):

            self._all_quotation_mark_regex = regex.compile(r"")

    def _create_quotation_mark_pair_map(self) -> None:

Could this function be combined with the above function without the combined function getting too hairy? Since they are called in the same place as a way of initializing everything and are iterating through the same collection in the same way, it would save a few lines and possibly be cleaner (?)


machine/corpora/analysis/quote_convention_set.py line 88 at r3 (raw file):

    def get_all_quote_convention_names(self) -> List[str]:
        return sorted([qc.name for qc in self._conventions])

More sorting - why is this needed?


machine/corpora/analysis/quotation_mark_string_match.py line 57 at r3 (raw file):

    @property
    def previous_character(self) -> Union[str, None]:
        if self._start_index == 0:

Use is_at_start_of_segment?


machine/corpora/analysis/quotation_mark_tabulator.py line 8 at r3 (raw file):

class QuotationMarkCounts:

I wonder if you could leverage python's Counter class to simplify this or remove it altogether?


machine/corpora/analysis/quotation_mark_tabulator.py line 10 at r3 (raw file):

class QuotationMarkCounts:
    def __init__(self):
        self._string_counts: Dict[str, int] = dict()

Could this name be more specific?


machine/corpora/analysis/quotation_mark_tabulator.py line 35 at r3 (raw file):

    def __init__(self):
        self.quotation_counts_by_depth_and_direction: dict[tuple[int, QuotationMarkDirection], QuotationMarkCounts] = (

Should this be made private, as it were?


machine/corpora/analysis/quotation_mark_tabulator.py line 43 at r3 (raw file):

            self._count_quotation_mark(quotation_mark)

    def _count_quotation_mark(self, quote: QuotationMarkMetadata) -> None:

Throughout the code, there seems to be inconsistent use of quote, quotationMark, and quotation in method names and variables (or at least I haven't figured out the pattern). It would be nice to make this consistent.


machine/corpora/analysis/quotation_mark_tabulator.py line 59 at r3 (raw file):

    def calculate_similarity(self, quote_convention: QuoteConvention) -> float:
        num_differences = 0

Could we qualify this name weighted_count... or something? The name suggests to me that it's an int but it's really a float.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 32 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/preliminary_quotation_analyzer.py line 43 at r3 (raw file):

    def count_word_initial_apostrophe(self, quotation_mark: str) -> None:
        if quotation_mark not in self._word_initial_occurrences:

This very same pattern of checking if a key exists then updating the value based on the previous value happens a lot in the code. It might be worth using python's defaultdict to avoid this. I have an extension in C# UpdateValue that I can use, I think, to a similar end.


machine/corpora/analysis/preliminary_quotation_analyzer.py line 76 at r3 (raw file):

        num_initial_marks: int = self._get_word_initial_occurrences(quotation_mark)
        num_total_marks: int = self._get_total_occurrences(quotation_mark)
        return num_total_marks > 0 and num_initial_marks / num_total_marks < 0.1

This class uses a lot of "magic numbers". I don't know if it would be possible to describe them and put them into constants??


machine/corpora/analysis/preliminary_quotation_analyzer.py line 103 at r3 (raw file):

        self._later_quotation_mark_counts: Dict[str, int] = dict()

    def record_earlier_quotation_mark(self, quotation_mark: str) -> None:

Above you use count...() for method names but here you use record...().


machine/corpora/analysis/preliminary_quotation_analyzer.py line 155 at r3 (raw file):

        self._group_quotation_marks(quotation_marks)

    def _group_quotation_marks(self, quotation_marks: List[QuotationMarkStringMatch]) -> None:

You could use functools.reduce or itertools.grouby to make this a one-liner. It looks like we use groupby in thot_smt_model_trainer.py.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another handful of comments. I leave it to Damien's discretion if he'd like to leave any of these unaddressed for now in order to expedite getting the code in. I tried to make requested changes that I thought were less important non-blocking comments. I'm nearly through porting the code itself (almost all tests left), so my trickle of comments should be running dry soon. Thank you for doing this & my apologies for all the minor requested changes!

Reviewed 3 of 49 files at r1, 16 of 39 files at r2, 1 of 2 files at r3.
Reviewable status: all files reviewed, 44 unresolved discussions (waiting on @benjaminking)


machine/corpora/analysis/preliminary_quotation_analyzer.py line 273 at r3 (raw file):

    def __init__(self, quote_conventions: QuoteConventionSet):
        self._quote_conventions = quote_conventions

In some places QuoteConventionSet objects are called quoteConventionSets and in others quoteConventions. It would be nice if the variable naming were consistent.


machine/corpora/analysis/text_segment.py line 70 at r3 (raw file):

    # These setters need to be implemented outside the builder to avoid circular dependencies
    def set_previous_segment(self, previous_segment: "TextSegment") -> None:

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 26 at r3 (raw file):

    @property
    def current_depth(self) -> int:
        return self._current_depth + 1

This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1 in the calling code? (...with a comment as needed)


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 49 at r3 (raw file):

        if depth > len(self._quotation_stack):
            raise RuntimeError(
                "get_opening_quotation_mark_at_depth() was called with a depth greater than the quotation stack size."

I don't think you necessarily need to pass the name of the function since it would be available in a stack trace. You may, however, want to pass the stack size and depth.


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 56 at r3 (raw file):

        if not self.has_open_quotation_mark():
            raise RuntimeError(
                "get_deepest_opening_quotation_mark() was called when the stack of quotation marks was empty."

I don't think this error message is correct.


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 77 at r3 (raw file):

    @property
    def current_depth(self) -> int:

See above comment. I might be missing something: Why +1 above but not here?


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 130 at r3 (raw file):

            if quote_match._start_index > 0:
                return False
            if quote_match.quotation_mark != self._quotation_mark_resolver_state.get_opening_quotation_mark_at_depth(

I think this logic can be simplified by taking this check and moving it outside the if/else - something like:

if quote_match.quotation_mark != ...:
    return False

if not _quotation_continuer_state.continuer_has_been_observed():
    ...

return true

machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 163 at r3 (raw file):

            if quote_match.quotation_mark != "»":
                return False
            if not self._settings.are_marks_a_valid_pair(

Same thing as above method - I think this can be moved outside the if/else unless one of the bool functions is changing the state somehow.


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 187 at r3 (raw file):

        self,
        quote_match: QuotationMarkStringMatch,
        previous_match: Union[QuotationMarkStringMatch, None],

previous_match and next_match are unused.


machine/corpora/analysis/quotation_mark_resolver.py line 13 at r3 (raw file):

    def __init__(self, settings: QuotationMarkResolutionSettings):
        self._settings = settings

Is a default constructor needed?


machine/corpora/analysis/quotation_mark_resolver.py line 20 at r3 (raw file):

    ) -> Generator[QuotationMarkMetadata, None, None]: ...

    def reset(self) -> None: ...

Why is this method not marked with abstractmethod?


machine/corpora/analysis/usfm_structure_extractor.py line 16 at r3 (raw file):

        self._reset()

    def _reset(self):

Why a separate private reset function? Could we just do this in __init__()?

@benjaminking
Copy link
Collaborator Author

Since there are a bunch of comments, I will just skip the ones where my response would just be "Done."

Is this module used anywhere?

It must have been an artefact left over from rebasing with a long history. I deleted it now.

I think this package name isn't specific enough, unless you were thinking that other kinds of "analysis" classes might end up here in the future. What about punctuation_analysis?

I think that early on, John had thought that we should group all the things that analyze USFM documents together, but I don't think we have any plans for other types of analysis right now, so I went ahead and renamed it to punctuation_analysis.

Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?

Yes, it's for testing, since list(some_set) doesn't put the elements in a predictable order.

Could this function be combined with the above function without the combined function getting too hairy? Since they are called in the same place as a way of initializing everything and are iterating through the same collection in the same way, it would save a few lines and possibly be cleaner (?)

I think I prefer to keep these separate. They work similarly and could be combined, but don't necessarily need to work the same way. I think we get better maintainability keeping them separate, in case the requirements for one of them change. And the performance difference should be negligible.

Throughout the code, there seems to be inconsistent use of quote, quotationMark, and quotation in method names and variables (or at least I haven't figured out the pattern). It would be nice to make this consistent.

My intention was to use "quote" to refer to "the words that a person is saying," while "quotation mark" refers to a punctuation mark that indicates the presence of a quote. I've gone through everything to make it more consistent. A couple notes about the terminology:

  • A "quote convention" is the set of rules used to mark where quotes start and end
  • A "quote introducer" is combination of punctuation and whitespace that indicates that a quote will follow. For example, the comma and space in Bob said, "<quote>" or the colon and space in Bob said: "<quote>"
  • A "quote continuer" is used to indicate that a quote is being continued across paragraphs. In English, you don't use any closing quotation marks at the end of the first paragraph, but use new opening quotation marks at the start of the second paragraph. These quotation marks are called "quote continuers."
  • I think everything else in the code should refer to "quotation marks."

This class uses a lot of "magic numbers". I don't know if it would be possible to describe them and put them into constants??

I have refactored them into constants. Hopefully the names make sense -- it's hard to come up with concise names for some of these things.

You could use functools.reduce or itertools.grouby to make this a one-liner. It looks like we use groupby in thot_smt_model_trainer.py.

It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using defaultdict.

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?

I went ahead and gave them property setters, so they should behave as if they're public.

This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1 in the calling code? (...with a comment as needed)

This is a good call. I changed current_depth to be defined as the length of the stack for both classes.

I don't think this error message is correct.

I'm not sure I see why it's incorrect. Can you expound?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Great. I'll look back through the comments and try to confirm that all have been addressed. If you could please reply in Reviewable within the thread specific to that comment, it would make the back and forth a lot easier as we continue.

I went ahead and ported the changes you've made so far.

This should be my last round of new comments 🤪 🤞.

It looks like neither of these are used in the project so far. Does it seem worth adding another library for this? I was able to reduce this to three lines using defaultdict.
itertools is in the poetry lock and groupby is used in thot_smt_model_trainer.py - what do you mean by 'project'? It's not a big difference if it's left as-is, but it would simplify the function.

I'm not sure I see why it's incorrect. Can you expound?
I see now. I misread it and thought it had been copy-pasted from the previous error message. I like the new error message you came up with better 👍.

Reviewed 1 of 49 files at r1.
Reviewable status: 3 of 49 files reviewed, 57 unresolved discussions (waiting on @benjaminking and @ddaspit)


machine/corpora/analysis/text_segment.py line 70 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Since these properties can be get'd and set'd and there's no special logic on either end, why not just make them public?

Since there's no custom logic in the getter or setter, I'd just ditch the properties and replace e.g. _previous_segment with previous_segment.


machine/corpora/quotation_mark_update_first_pass.py line 37 at r4 (raw file):

    ) -> bool:
        target_marks_by_source_marks: Dict[str, Set[str]] = {}
        for level in range(1, source_quote_convention.num_levels + 1):

Some places depth is used and elsewhere level. Is there a true difference in meaning?


machine/corpora/quotation_mark_update_first_pass.py line 71 at r4 (raw file):

        return self._choose_best_strategy_based_on_observed_issues(self._quotation_mark_resolver.get_issues())

    def _choose_best_strategy_based_on_observed_issues(self, issues) -> QuotationMarkUpdateStrategy:

Type-hinting for issues


machine/corpora/punctuation_analysis/text_segment.py line 78 at r4 (raw file):

        self._next_segment = next_segment

    def set_index_in_verse(self, index_in_verse: int) -> None:

Same for index_in_verse as the segment fields.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 69 at r4 (raw file):

    def _apply_fallback_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock:
        for element in block._elements:

Use Elements


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 74 at r4 (raw file):

    def _apply_standard_updating(self, block: UsfmUpdateBlock) -> UsfmUpdateBlock:
        for element in block._elements:

Use Elements


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 111 at r4 (raw file):

    def _create_text_segment(self, token: UsfmToken) -> Union[TextSegment, None]:
        self._next_scripture_text_segment_builder.set_usfm_token(token)

Might be a little cleaner to do something like:

text_segment_to_return = null
if token.text != null:
    ...
return text_segment_to_return

machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 132 at r4 (raw file):

            if scripture_ref.chapter_num != self._current_chapter_number:
                self._current_chapter_number = scripture_ref.chapter_num
                self._start_new_chapter(self._current_chapter_number)

No need to pass bare field as parameter


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 147 at r4 (raw file):

            ):
                self._current_verse_number = scripture_ref.verse_num
                self._start_new_verse(self._current_verse_number)

No need to pass bare field as parameter


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 149 at r4 (raw file):

                self._start_new_verse(self._current_verse_number)

    def _start_new_verse(self, new_chapter_number: int) -> None:

new_chapter_number is unused


machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py line 93 at r4 (raw file):

        self._quote_continuer_mark_stack.append(quotation_mark)
        self._continuer_style = quote_continuer_style
        if len(self._quote_continuer_mark_stack) == len(quotation_mark_resolver_state._quotation_stack):

Should you use current_depth + 1?


machine/corpora/punctuation_analysis/depth_based_quotation_mark_resolver.py line 168 at r4 (raw file):

                return False

            # Check the next quotation mark match, since quote continuers must appear consecutively

Very minor, but some comments are starting with a capital letter - others are lowercase


machine/corpora/punctuation_analysis/quote_convention_detector.py line 53 at r4 (raw file):

        self._quotation_mark_tabulator.tabulate(resolved_quotation_marks)

    def detect_quote_convention(self, print_summary: bool) -> Union[QuoteConventionAnalysis, None]:

Throughout the code: Maybe Optional[Type] rather than Union[Type, None]?


machine/corpora/quotation_mark_update_resolution_settings.py line 16 at r4 (raw file):

        self._source_quote_convention = source_quote_convention
        self._quote_convention_singleton_set = QuoteConventionSet([self._source_quote_convention])
        self._target_quote_convention = target_quote_convention

This is unused.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 46 of 46 files at r4, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @benjaminking and @Enkidu93)


machine/corpora/punctuation_analysis/quote_convention_detector.py line 53 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Throughout the code: Maybe Optional[Type] rather than Union[Type, None]?

For Python 3.9 and earlier Optional[Type] seems to be the preferred syntax. Once we move to 3.10+, Type | None seems to be preferred.


tests/corpora/test_fallback_quotation_mark_resolver.py line 15 at r4 (raw file):

def test_reset():
    english_quote_convention = standard_quote_conventions.STANDARD_QUOTE_CONVENTIONS.get_quote_convention_by_name(

This constant should probably be exported from the machine.corpora.punctuation_analysis package. It would make using it nicer.

@benjaminking
Copy link
Collaborator Author

A few higher-level notes:

  • I replied to this round of comments in Reviewable. Would it be helpful for me to go and reply to all the older comments there?
  • I added a new test that demonstrates the full process for performing quotation denormalization (since it involves a couple steps)
  • I have rebased to the latest revision of the main branch

And concerning itertools and functools, I hadn't realized they were part of the standard library -- my mistake. But I did try out refactoring the function in question to use reduce, and while it was technically reduced to one line, formatting turned it into 8 lines, so I decided to leave it as it was. (and it turns out that the functionality needed to make groupby work in this scenario is only available for later Python versions).

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 21 of 64 files reviewed, 46 unresolved discussions (waiting on @ddaspit and @Enkidu93)


machine/corpora/quotation_mark_update_first_pass.py line 37 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Some places depth is used and elsewhere level. Is there a true difference in meaning?

I've tried to use level in the context of rules and depth in the context of observed quotation marks. Hopefully this commit clarifies that a bit be


machine/corpora/quotation_mark_update_first_pass.py line 71 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Type-hinting for issues

Done.


machine/corpora/quotation_mark_update_resolution_settings.py line 16 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This is unused.

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 69 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Use Elements

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 74 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Use Elements

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 132 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

No need to pass bare field as parameter

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 147 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

No need to pass bare field as parameter

Done.


machine/corpora/quote_convention_changing_usfm_update_block_handler.py line 149 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

new_chapter_number is unused

Done.


machine/corpora/scripture_embed.py line at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Is this module used anywhere?

It seems to have been a relic of rebasing. I've removed it.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjaminking Something weird happened with your branch. There seems to be some recent commits from the main branch that have been added to your branch. The preferred way to catch up with the main branch is to rebase on top of the main branch. As a last resort, you can merge the main branch into your branch.

Reviewed 4 of 43 files at r5, all commit messages.
Reviewable status: 25 of 64 files reviewed, 46 unresolved discussions (waiting on @benjaminking and @Enkidu93)

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, Ben - thank you. Don't feel obliged to reply in all the Reviewable threads unless there's more to add.

Reviewed 1 of 49 files at r1, 19 of 46 files at r4, 43 of 43 files at r5, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @benjaminking and @ddaspit)


machine/corpora/analysis/quotation_mark_string_match.py line 57 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Use is_at_start_of_segment?

This is still unaddressed. It's fairly minor though - partially just for consistency with the next_character property.


machine/corpora/analysis/quote_convention.py line 122 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I agree.

This one's fixed but not elsewhere


machine/corpora/punctuation_analysis/quotation_mark_tabulator.py line 69 at r5 (raw file):

        return 1 - (weighted_difference / total_weight)

    def get_summary_message(self) -> str:

In general, it looks like the comments Damien and I left regarding overriding __repr__ or __str__ have not been addressed. I'm fine with either properly overloading those methods or just removing the code altogether (and supporting methods). @ddaspit, what do you think?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 43 files at r5.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @benjaminking and @Enkidu93)


machine/corpora/punctuation_analysis/quotation_mark_tabulator.py line 69 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

In general, it looks like the comments Damien and I left regarding overriding __repr__ or __str__ have not been addressed. I'm fine with either properly overloading those methods or just removing the code altogether (and supporting methods). @ddaspit, what do you think?

I mainly wanted to get rid of the print statements. I will leave it up to Ben to determine if __repr__, __str__, or a separate method is appropriate. Something like __str__ or __repr__ is normally used for short string representations of the object. If the string is a longer more detailed description or represents ancillary information, then a separate method might be more appropriate.

@benjaminking benjaminking force-pushed the quotation_denormalization branch from 7e7d577 to ee01a7f Compare July 22, 2025 16:57
Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rebased again and did a force-push. Hopefully that fixes the issues with the Git history.

This push also includes a commit to fix a bug that I found with multi-character quotation marks. This adds a few new tests to cover the corrected behavior.

Reviewable status: 45 of 64 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)


tests/corpora/test_fallback_quotation_mark_resolver.py line 15 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This constant should probably be exported from the machine.corpora.punctuation_analysis package. It would make using it nicer.

Done.


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 26 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This property seems like a potential source for errors. Could we settle on one or the other being properly called the current depth and passing the +/- 1 in the calling code? (...with a comment as needed)

I have changed the depth properties to be consistent with one another. Depth 1 should always refer to a first-level quotation mark, while a depth of 0 (in the resolver and continuer states) means that no quotation mark has been observed yet. I think the only time I'm adding one to the depth is when a new quote is opened, which should be a more appropriate use of addition (as opposed to bookkeeping between inconsistent properties).


machine/corpora/analysis/quotation_mark_string_match.py line 57 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This is still unaddressed. It's fairly minor though - partially just for consistency with the next_character property.

Done.


machine/corpora/analysis/quote_convention_set.py line 39 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why are these all sorted (and elsewhere there's similar sorting)? Is it mainly for testing or something like that?

Yes, it's for testing that the regexes are created correctly. The elements in the set aren't added in a consistent order without using the sorted function.


machine/corpora/punctuation_analysis/quotation_mark_tabulator.py line 69 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I mainly wanted to get rid of the print statements. I will leave it up to Ben to determine if __repr__, __str__, or a separate method is appropriate. Something like __str__ or __repr__ is normally used for short string representations of the object. If the string is a longer more detailed description or represents ancillary information, then a separate method might be more appropriate.

In this particular case, get_summary_message creates a message that's useful to communicate to users, but it doesn't fully represent everything that's present in QuotationMarkTabulator. It's possible for two meaningfully different instances of QuotationMarkTabulator to have identical summary messages. So it didn't seem like __str__ or __repr__ were a good fit for this function.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great - sounds good.

Reviewable status: 45 of 64 files reviewed, 2 unresolved discussions (waiting on @benjaminking and @ddaspit)


machine/corpora/analysis/depth_based_quotation_mark_resolver.py line 26 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

I have changed the depth properties to be consistent with one another. Depth 1 should always refer to a first-level quotation mark, while a depth of 0 (in the resolver and continuer states) means that no quotation mark has been observed yet. I think the only time I'm adding one to the depth is when a new quote is opened, which should be a more appropriate use of addition (as opposed to bookkeeping between inconsistent properties).

Yes, sorry - I meant to resolve this comment. This looks good.


machine/corpora/analysis/quote_convention_set.py line 39 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

Yes, it's for testing that the regexes are created correctly. The elements in the set aren't added in a consistent order without using the sorted function.

Yes - sorry - meant to resolve this. Makes sense. Something to beware of is that C# and Python do not always sort strings in the same way (oddly enough), so it will create some discrepancies in the tests (which other approaches like making both strings into sets of chars in the test itself wouldn't). But it's not a big deal.


machine/corpora/punctuation_analysis/quotation_mark_tabulator.py line 69 at r5 (raw file):

Previously, benjaminking (Ben King) wrote…

In this particular case, get_summary_message creates a message that's useful to communicate to users, but it doesn't fully represent everything that's present in QuotationMarkTabulator. It's possible for two meaningfully different instances of QuotationMarkTabulator to have identical summary messages. So it didn't seem like __str__ or __repr__ were a good fit for this function.

OK, I would say that if the main concern is that it doesn't capture all instance-specific info needed to recreate the object, then that would still be an appropriate place to use __str__ just not __repr__. But if you think it's too long or can't be reworked to make sense as a user-friendly string representation but is more like secondary, diagnostic info, then that's fine. I mainly am just trying to avoid committing code unnecessarily - especially since this function requires other methods that are only used here. Is there a reason that you believe this class in particular benefits from a summarizing function like this? This class isn't really user-facing, correct?

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 43 of 64 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @Enkidu93)


machine/corpora/punctuation_analysis/quotation_mark_tabulator.py line 69 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, I would say that if the main concern is that it doesn't capture all instance-specific info needed to recreate the object, then that would still be an appropriate place to use __str__ just not __repr__. But if you think it's too long or can't be reworked to make sense as a user-friendly string representation but is more like secondary, diagnostic info, then that's fine. I mainly am just trying to avoid committing code unnecessarily - especially since this function requires other methods that are only used here. Is there a reason that you believe this class in particular benefits from a summarizing function like this? This class isn't really user-facing, correct?

Your comment made me realize that I still had a print statement that was using this function. The message is one that the EITL team found helpful to include in the standalone script that detected the quotation convention. I made a change to include it in the QuoteConventionAnalysis object, so that the script can choose whether or not to print it. With that new structure in place, I think it is now clearer to leave the function as get_summary_message than to change it into __str__.

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more small changes:

  1. a check for an edge case with quote continuers that could lead to an out-of-bounds error
  2. A little bit of logic to create better guesses for ambiguous quotation marks in rare situations

Reviewable status: 40 of 64 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @Enkidu93)

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thank you

Reviewable status: 40 of 64 files reviewed, 9 unresolved discussions (waiting on @benjaminking and @ddaspit)


machine/corpora/punctuation_analysis/quotation_mark_tabulator.py line 69 at r5 (raw file):

Previously, benjaminking (Ben King) wrote…

Your comment made me realize that I still had a print statement that was using this function. The message is one that the EITL team found helpful to include in the standalone script that detected the quotation convention. I made a change to include it in the QuoteConventionAnalysis object, so that the script can choose whether or not to print it. With that new structure in place, I think it is now clearer to leave the function as get_summary_message than to change it into __str__.

Sounds good


tests/corpora/punctuation_analysis/test_quotation_mark_string_match.py line 157 at r7 (raw file):

    quotation_mark_string_match = QuotationMarkStringMatch(
        TextSegment.Builder().set_text("sample text").build(), 11, 12

I'm confused what an end index of twelve would mean here? It doesn't change the substance of the test though.


tests/corpora/punctuation_analysis/test_depth_based_quotation_mark_resolver.py line 996 at r7 (raw file):

    )
    assert not standard_swedish_quotation_mark_categorizer.is_opening_quotation_mark(
        QuotationMarkStringMatch(TextSegment.Builder().set_text("\u201d").build(), 1, 2)

I think you meant ...0, 1) not ...1, 2)


tests/corpora/punctuation_analysis/test_depth_based_quotation_mark_resolver.py line 2293 at r7 (raw file):

        QuotationMarkMetadata("\u201c", 3, QuotationMarkDirection.OPENING, text_segment, 10, 11),
        QuotationMarkMetadata("\u2018", 4, QuotationMarkDirection.OPENING, text_segment, 13, 14),
        # QuotationMarkMetadata("\u201c", 5, QuotationMarkDirection.Opening, text_segment, 20, 21),

Can this be removed?


tests/corpora/punctuation_analysis/test_quote_convention.py line 370 at r7 (raw file):

def test_print_summary() -> None:
    quote_convention = QuoteConvention(
        "test-quote-convention",

In this class - not sure if it's elsewhere - you're using kebab-case for quote convention names rather than the snake_case which is used elsewhere.


tests/corpora/test_quotation_denormalization.py line 53 at r7 (raw file):

    actual_denormalized_usfm = updater.get_usfm()

    assert actual_denormalized_usfm == expected_denormalized_usfm

For consistency, you can use the ignore_line_endings function in the helper class to avoid having to stick an \r at each line-ending


tests/corpora/punctuation_analysis/test_text_segment.py line 75 at r7 (raw file):

    different_text_segment = TextSegment.Builder().set_text("different text").build()

    assert basic_segment == basic_segment

This line and then following line don't seem to be what you meant to put


machine/corpora/quotation_mark_update_first_pass.py line 43 at r7 (raw file):

            if depth <= target_quote_convention.num_levels:
                target_marks_by_source_marks[opening_quotation_mark].add(
                    target_quote_convention.get_closing_quotation_mark_at_depth(depth)

This loop could be cut short - i.e., you shouldn't need to re-iterate.

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 36 of 64 files reviewed, 9 unresolved discussions (waiting on @ddaspit and @Enkidu93)


machine/corpora/quotation_mark_update_first_pass.py line 43 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This loop could be cut short - i.e., you shouldn't need to re-iterate.

Done.


tests/corpora/test_quotation_denormalization.py line 53 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

For consistency, you can use the ignore_line_endings function in the helper class to avoid having to stick an \r at each line-ending

Can you give more details on where the ignore_line_endings function is? I'm not able to find it.


machine/corpora/punctuation_analysis/quotation_mark_tabulator.py line 69 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Sounds good

Done.


tests/corpora/punctuation_analysis/test_depth_based_quotation_mark_resolver.py line 996 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think you meant ...0, 1) not ...1, 2)

Done.


tests/corpora/punctuation_analysis/test_depth_based_quotation_mark_resolver.py line 2293 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Can this be removed?

Done.


tests/corpora/punctuation_analysis/test_quotation_mark_string_match.py line 157 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I'm confused what an end index of twelve would mean here? It doesn't change the substance of the test though.

It was supposed to be 10 and 11. I've corrected the test.


tests/corpora/punctuation_analysis/test_quote_convention.py line 370 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

In this class - not sure if it's elsewhere - you're using kebab-case for quote convention names rather than the snake_case which is used elsewhere.

Done.


tests/corpora/punctuation_analysis/test_text_segment.py line 75 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This line and then following line don't seem to be what you meant to put

That actually was what I was meaning to write. The first line tests that an object is equal to itself, and the second tests that it's not equal to an object of a completely different class. Just some basic sanity checks.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 36 of 64 files reviewed, 4 unresolved discussions (waiting on @benjaminking and @ddaspit)


machine/corpora/quotation_mark_update_first_pass.py line 43 at r7 (raw file):

Previously, benjaminking (Ben King) wrote…

Done.

That is an improvement but not what I was thinking. Is there a reason something like this wouldn't work?

        source_marks: set = set()
        for depth in range(1, min(source_quote_convention.num_levels, target_quote_convention.num_levels) + 1):
            opening_quotation_mark = source_quote_convention.get_opening_quotation_mark_at_depth(depth)
            if opening_quotation_mark in source_marks:
                return False
            source_marks.add(opening_quotation_mark)
        return True

tests/corpora/test_quotation_denormalization.py line 53 at r7 (raw file):

Previously, benjaminking (Ben King) wrote…

Can you give more details on where the ignore_line_endings function is? I'm not able to find it.

Sure! Here


tests/corpora/punctuation_analysis/test_text_segment.py line 75 at r7 (raw file):

Previously, benjaminking (Ben King) wrote…

That actually was what I was meaning to write. The first line tests that an object is equal to itself, and the second tests that it's not equal to an object of a completely different class. Just some basic sanity checks.

I see - that makes sense.

@Enkidu93
Copy link
Collaborator

machine/corpora/quotation_mark_update_first_pass.py line 43 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

That is an improvement but not what I was thinking. Is there a reason something like this wouldn't work?

        source_marks: set = set()
        for depth in range(1, min(source_quote_convention.num_levels, target_quote_convention.num_levels) + 1):
            opening_quotation_mark = source_quote_convention.get_opening_quotation_mark_at_depth(depth)
            if opening_quotation_mark in source_marks:
                return False
            source_marks.add(opening_quotation_mark)
        return True

Sorry, more like this, I guess:

        target_mark_by_source_mark: Dict[str, str] = {}
        for depth in range(1, min(source_quote_convention.num_levels, target_quote_convention.num_levels) + 1):
            opening_quotation_mark = source_quote_convention.get_opening_quotation_mark_at_depth(depth)
            closing_quotation_mark = target_quote_convention.get_closing_quotation_mark_at_depth(depth)
            if (
                opening_quotation_mark in target_mark_by_source_mark
                and target_mark_by_source_mark[opening_quotation_mark] != closing_quotation_mark
            ):
                return False
            target_mark_by_source_mark[opening_quotation_mark] = closing_quotation_mark
        return True

Copy link
Collaborator Author

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 34 of 64 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)


machine/corpora/quotation_mark_update_first_pass.py line 43 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Sorry, more like this, I guess:

        target_mark_by_source_mark: Dict[str, str] = {}
        for depth in range(1, min(source_quote_convention.num_levels, target_quote_convention.num_levels) + 1):
            opening_quotation_mark = source_quote_convention.get_opening_quotation_mark_at_depth(depth)
            closing_quotation_mark = target_quote_convention.get_closing_quotation_mark_at_depth(depth)
            if (
                opening_quotation_mark in target_mark_by_source_mark
                and target_mark_by_source_mark[opening_quotation_mark] != closing_quotation_mark
            ):
                return False
            target_mark_by_source_mark[opening_quotation_mark] = closing_quotation_mark
        return True

Yes, that should work. Incidentally, your comment made me realize that there was actually a bug in this function. It should be comparing opening quotes to opening quotes and closing quotes to closing quotes, instead of comparing opening to closing quotes. The tests didn't catch it earlier since real quote conventions tend to be parallel across levels. I fixed the bug (using the function structure you suggested) and added a test that will fail for the old logic, but pass for the new logic.


tests/corpora/test_quotation_denormalization.py line 53 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Sure! Here

Done.


tests/corpora/punctuation_analysis/test_text_segment.py line 75 at r7 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I see - that makes sense.

Done.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Well done! It's been an adventurous code review. Thank you for all your hard work ❤️

Reviewable status: 34 of 64 files reviewed, 1 unresolved discussion (waiting on @ddaspit)


machine/corpora/quotation_mark_update_first_pass.py line 43 at r7 (raw file):

Previously, benjaminking (Ben King) wrote…

Yes, that should work. Incidentally, your comment made me realize that there was actually a bug in this function. It should be comparing opening quotes to opening quotes and closing quotes to closing quotes, instead of comparing opening to closing quotes. The tests didn't catch it earlier since real quote conventions tend to be parallel across levels. I fixed the bug (using the function structure you suggested) and added a test that will fail for the old logic, but pass for the new logic.

Great!

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome job, guys!

Reviewed 14 of 43 files at r5, 19 of 19 files at r6, 5 of 5 files at r7, 3 of 4 files at r8, 3 of 3 files at r9, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @benjaminking)

@ddaspit
Copy link
Contributor

ddaspit commented Jul 31, 2025

Don't forget to cleanup/squash the commits when merging.

@benjaminking
Copy link
Collaborator Author

Excellent! Do we usually squash down to a single commit when merging a branch? Or just organize it into larger milestones?

@benjaminking benjaminking merged commit 76adb63 into main Aug 1, 2025
14 checks passed
@benjaminking benjaminking deleted the quotation_denormalization branch August 1, 2025 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants