fix normalization order in find_missing_characters() #105

mshannon-sil · 2024-04-05T22:33:56Z

To account for composite characters, I changed find_missing_characters() to normalize the training example before using it to calculate the set of all characters in the training data, rather than normalizing the set of characters.

In addition to the change, I modified one of the test cases for updating the tokenizer to include a check for handling a composite character.
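
A minimal sketch of why the order matters, assuming NFKC via Python's `unicodedata` as a stand-in for whatever normalizer the tokenizer actually applies. The two helper names are hypothetical and are not taken from the trainer code:

```python
import unicodedata


def charset_normalize_chars(text: str, form: str = "NFKC") -> set:
    # Old order (illustrative): collect the characters first, then
    # normalize each one in isolation. A base letter and its combining
    # mark are never seen together, so they cannot be composed.
    return {unicodedata.normalize(form, ch) for ch in set(text)}


def charset_normalize_text(text: str, form: str = "NFKC") -> set:
    # New order (illustrative): normalize the whole string, then collect
    # the characters, so composed characters appear in the set.
    return set(unicodedata.normalize(form, text))


decomposed = "e\u0301"  # 'e' followed by a combining acute accent
old_charset = charset_normalize_chars(decomposed)
new_charset = charset_normalize_text(decomposed)
print(sorted(old_charset), sorted(new_charset))
```

With the old order the set holds the bare `e` and the combining accent as separate entries; with the new order it holds the single precomposed `é`, which is the character that would need to be checked against the tokenizer vocabulary.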


codecov-commenter · 2024-04-05T22:35:05Z

Codecov Report

Attention: Patch coverage is 90.90909%, with 1 line in your changes missing coverage. Please review.

Project coverage is 88.34%. Comparing base (bda3b54) to head (7942dfe).

Files	Patch %	Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py	87.50%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #105   +/-   ##
=======================================
  Coverage   88.33%   88.34%           
=======================================
  Files         234      234           
  Lines       13816    13821    +5     
=======================================
+ Hits        12205    12210    +5     
  Misses       1611     1611


ddaspit

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

                for lang_code in lang_codes:
                    ex_text = ex[lang_code]
                    if isinstance(tokenizer, (NllbTokenizerFast)):

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.
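
One way to perform the check once is to evaluate it before the loops and reuse the boolean. This is only a sketch: `PlainTokenizer`, `SpecialTokenizer`, `collect_charset`, and the zero-width-space stripping step are hypothetical stand-ins, not the trainer's real classes or normalization logic.

```python
import unicodedata


class PlainTokenizer:
    """Hypothetical stand-in for a generic tokenizer."""


class SpecialTokenizer(PlainTokenizer):
    """Hypothetical stand-in for a tokenizer needing an extra step."""


def collect_charset(tokenizer, examples, lang_codes):
    # Evaluate the type check a single time, outside the loops.
    needs_extra_step = isinstance(tokenizer, SpecialTokenizer)
    charset = set()
    for ex in examples:
        for lang_code in lang_codes:
            ex_text = ex[lang_code]
            if needs_extra_step:
                # Hypothetical extra normalization: drop zero-width spaces.
                ex_text = ex_text.replace("\u200b", "")
            # Normalize the full text before collecting its characters.
            ex_text = unicodedata.normalize("NFKC", ex_text)
            charset |= set(ex_text)
    return charset


examples = [{"en": "a\u200bb", "fr": "e\u0301"}]
print(sorted(collect_charset(SpecialTokenizer(), examples, ["en", "fr"])))
```

The loop body then branches on a plain boolean, so the cost of the type check no longer scales with the number of training examples.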

mshannon-sil

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion (waiting on @ddaspit)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.

Done.

ddaspit

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

fix normalization order in find_missing_characters()

38cdb25

mshannon-sil added the bug label Apr 5, 2024

mshannon-sil requested a review from ddaspit April 5, 2024 22:33

mshannon-sil self-assigned this Apr 5, 2024

mshannon-sil linked an issue Apr 5, 2024 that may be closed by this pull request

normalize lines before getting charset #104

Closed

ddaspit requested changes Apr 10, 2024


don't check isinstance during iterations

7942dfe

mshannon-sil commented Apr 10, 2024


ddaspit approved these changes Apr 15, 2024


johnml1135 merged commit 5d05e6d into main Apr 16, 2024

ddaspit deleted the #104_normalize_order branch April 16, 2024 16:56
