Skip to content

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Apr 5, 2024

To account for composite characters, I changed find_missing_characters() to normalize the training example before using it to calculate the set of all characters in the training data, rather than normalizing the set of characters.

In addition to the change, I modified one of the test cases for updating the tokenizer to include a check for handling a composite character.


This change is Reviewable

@mshannon-sil mshannon-sil requested a review from ddaspit April 5, 2024 22:33
@mshannon-sil mshannon-sil self-assigned this Apr 5, 2024
@mshannon-sil mshannon-sil linked an issue Apr 5, 2024 that may be closed by this pull request
@codecov-commenter
Copy link

codecov-commenter commented Apr 5, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 88.34%. Comparing base (bda3b54) to head (7942dfe).

Files Patch % Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #105   +/-   ##
=======================================
  Coverage   88.33%   88.34%           
=======================================
  Files         234      234           
  Lines       13816    13821    +5     
=======================================
+ Hits        12205    12210    +5     
  Misses       1611     1611           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

                for lang_code in lang_codes:
                    ex_text = ex[lang_code]
                    if isinstance(tokenizer, (NllbTokenizerFast)):

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion (waiting on @ddaspit)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@johnml1135 johnml1135 merged commit 5d05e6d into main Apr 16, 2024
@ddaspit ddaspit deleted the #104_normalize_order branch April 16, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

normalize lines before getting charset

4 participants