benjaminking
Collaborator

@benjaminking benjaminking commented Oct 8, 2025

This PR adds a new script, segment_verses.py, that segments passages into individual verses by aligning them against known verse segmentations (i.e., any Paratext project). It can also evaluate the accuracy of the segmentation.



@benjaminking benjaminking requested a review from Enkidu93 October 8, 2025 17:14
Collaborator

@Enkidu93 Enkidu93 left a comment

Still reviewing the algorithm, but I can continue to do that once this is in :). Looks great - just a couple small comments.

@Enkidu93 reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @benjaminking)


silnlp/alignment/group_verses_into_passages.py line 71 at r1 (raw file):

def main() -> None:
    parser = argparse.ArgumentParser(description="Collect verse counts and compute alignment scores")

Is this description correct?


silnlp/alignment/group_verses_into_passages.py line 74 at r1 (raw file):

    parser.add_argument("--project", help="Name of source Paratext project", required=True, type=str)
    parser.add_argument("--input-passages", help=".tsv file with target passages", required=True, type=str)
    parser.add_argument("--output-passages", help=".tsv file with target passages", required=True, type=str)

Should this say 'source passages'? I might not be following correctly.


silnlp/alignment/segment_verses.py line 85 at r1 (raw file):

        for line in load_corpus(passage_file):
            row = line.split("\t")
            if len(row) < 7:

Maybe != for robustness - not sure it matters much
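
To make the difference concrete, here is a minimal sketch. The column count of 7 comes from the snippet above; `parse_row`, its `strict` flag, and the sample rows are hypothetical and not part of the PR. With `<`, a row with extra columns passes silently; with `!=`, it is rejected along with short rows.

```python
from typing import List, Optional

EXPECTED_COLUMNS = 7  # taken from the `len(row) < 7` check quoted above


def parse_row(line: str, strict: bool = True) -> Optional[List[str]]:
    """Hypothetical helper: split a TSV line and reject malformed rows.

    strict=True uses `!=`, rejecting rows that are too short or too long.
    strict=False uses `<`, rejecting only rows that are too short.
    """
    row = line.rstrip("\n").split("\t")
    if strict:
        malformed = len(row) != EXPECTED_COLUMNS
    else:
        malformed = len(row) < EXPECTED_COLUMNS
    return None if malformed else row
```

Whether the stricter check is appropriate depends on whether the passage files can legitimately carry extra trailing columns, which the snippet does not show.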


silnlp/alignment/segment_verses.py line 247 at r1 (raw file):

                continue

            fewest_crossed_alignments = 1000000

Not necessary, but a cool thing you might not know about is that you can use underscores in numeric literals, e.g., 1_000_000. It doesn't change the value, but makes the number easier to interpret at a glance.
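
A quick illustration of the underscore syntax (PEP 515, available since Python 3.6). The `fewest_crossed` function is a hypothetical stand-in for a "track the minimum" loop, not the PR's code, and using `math.inf` as the starting value is an alternative I am assuming would fit, since it removes the need to pick a "large enough" number at all.

```python
import math
from typing import Iterable

# Underscores are ignored by the parser; this is the same int as 1000000.
LARGE_SENTINEL = 1_000_000


def fewest_crossed(counts: Iterable[int]) -> float:
    """Hypothetical sketch: return the smallest count seen.

    Starts from math.inf (an assumption, not from the PR), so any real
    count compares lower. Returns math.inf if `counts` is empty.
    """
    fewest = math.inf
    for count in counts:
        if count < fewest:
            fewest = count
    return fewest
```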
