Tokenizer #43
Conversation
… restore original tokenizer if there are no missing characters
Why are we defaulting this to `false`? Wouldn't the default behavior be to update the tokenizer? Is there some reason we would want to default to `false`?
Can this be modified through the build_params?
Why are you overwriting the tokenizer.json file? Why not create a new file in the project (or potentially, a small json diff that would be merged into tokenizer.json)?
Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/jobs/settings.yaml
line 8 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why are we defaulting this to false? Wouldn't the default behavior be to update the tokenizer? Is there some reason we would want to default to false?
Good catch, I just defaulted them to `false` since that's essentially what SILNLP does, but now that we've tested missing characters extensively we want `true` to be the default.
machine/tokenization/custom_normalizer/tokenizer_config.json
line 2 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Can this be modified through the build_params?
No need, this file is only needed for the normalizer it contains; no other part of it is used.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 175 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why are you overwriting the tokenizer.json file? Why not create a new file in the project (or potentially, a small json diff that would merged into tokenizer.json...)?
Well there's no need to keep the older tokenizer file, so I think it would be a bit of a waste of memory space to have an old and new version of tokenizer.json in the same project directory. As for merging a small json diff, I didn't think there was a way to get around overwriting the whole file. All the recommendations I see online just dump the whole updated json object into the file, overwriting all its contents. Do you have a link to an example of merging json into a json file in place without overwriting everything?
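For reference, the whole-object dump approach described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `tokenizer.json` layout with a `model.vocab` map; the real file's structure and the actual update logic in machine.py may differ.

```python
import json
import os
import tempfile

def update_tokenizer_json(path, new_tokens):
    """Read the tokenizer JSON, extend its vocab, and write the whole
    object back, overwriting the file (the approach discussed above)."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    vocab = data["model"]["vocab"]  # assumed layout; real tokenizer.json differs
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)

# tiny demo with a throwaway file
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"model": {"vocab": {"a": 0, "b": 1}}}, f)
update_tokenizer_json(path, ["ʔ", "b"])  # "b" already present, only "ʔ" is added
```

Merging a partial JSON diff would still require rewriting the file on disk; JSON files cannot be patched in place, which matches the observation in the reply above.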
…erFast, add test for slow tokenizer scenario, log missing characters added
Reviewable status: 5 of 8 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/jobs/settings.yaml
line 8 at r1 (raw file):
Previously, mshannon-sil wrote…
Good catch, I just defaulted them to `false` since that's essentially what SILNLP does, but now that we've tested missing characters extensively we want `true` to be the default.
Done.
Previously, mshannon-sil wrote…
That's fine - as long as it worked. The biggest concern would be if there were two successive runs using machine.py - the tokenizer.json would continue to drift over time. I am assuming that would not be desired, but neither would it be a huge issue. Also, we run these jobs on a fresh docker image each time. It would be more of a concern for incremental tuning - if we need to add more and more tokens over time.
Previously, mshannon-sil wrote…
I don't fully understand but I trust that you do :-). That's fine.
Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)
Reviewable status: 3 of 8 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 175 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
That's fine - as long as it worked. The biggest concern would be if there were two successive runs using machine.py - the tokenizer.json would continue to drift over time. I am assuming that would not be desired, but neither would it be a huge issue. Also, we run these jobs on a fresh docker image each time. It would be more of a concern for incremental tuning - if we need to add more and more tokens over time.
Each time we run a new job, it starts with an unmodified tokenizer, not from the tokenizer files saved in the directory, so we shouldn't run into any issues with drift.
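The no-drift property described here can be illustrated with a toy sketch: if every run starts by copying a pristine tokenizer file, edits never accumulate across runs. The file names and JSON layout below are hypothetical, not machine.py's actual files.

```python
import json
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
pristine = os.path.join(workdir, "tokenizer.base.json")  # hypothetical unmodified tokenizer
working = os.path.join(workdir, "tokenizer.json")

with open(pristine, "w", encoding="utf-8") as f:
    json.dump({"model": {"vocab": {"a": 0}}}, f)

def start_job():
    # every job begins from the unmodified tokenizer, so edits never accumulate
    shutil.copyfile(pristine, working)

for _ in range(3):  # three successive "runs"
    start_job()
    with open(working, encoding="utf-8") as f:
        data = json.load(f)
    data["model"]["vocab"]["x"] = len(data["model"]["vocab"])  # add one token
    with open(working, "w", encoding="utf-8") as f:
        json.dump(data, f)

with open(working, encoding="utf-8") as f:
    final = json.load(f)["model"]["vocab"]
# the vocab is identical after every run: no drift
```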
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)
Reviewed 2 of 8 files at r1, 1 of 3 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @mshannon-sil)
machine/tokenization/custom_normalizer/tokenizer_config.json
line 1 at r3 (raw file):
{
I would move this file over to the huggingface directory.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 87 at r3 (raw file):
max_source_length: Optional[int] = None,
max_target_length: Optional[int] = None,
config: Union[LazySettings, dict] = {},
Rather than using a generic config dictionary, just add explicit parameters for turning on adding missing characters. It should be turned on by default for target and turned off by default for source.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 161 at r3 (raw file):
mpn.substitutions = [(re.compile(r), sub) for r, sub in mpn.substitutions]
charset = {mpn.normalize(char) for char in charset}
charset = {tokenizer.backend_tokenizer.normalizer.normalize_str(char) for char in charset}  # type: ignore
Why are you ignoring the type error here? When we run into type errors, we want to fix them properly if possible.
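As context for the excerpt above, the chain (compile the substitution patterns, apply them per character, then run the backend normalizer) can be sketched with stand-ins. The substitution list and the NFKC step below are illustrative assumptions, not sacremoses' actual MosesPunctNormalizer rules or the real backend normalizer.

```python
import re
import unicodedata

# stand-in for mpn.substitutions: (pattern, replacement) pairs as plain strings,
# compiled up front exactly as in the excerpt above
substitutions = [(r"\u00a0", " "), (r"[\u2018\u2019]", "'")]
substitutions = [(re.compile(r), sub) for r, sub in substitutions]

def normalize(text):
    # apply the punctuation substitutions, then a Unicode normalization pass
    # (NFKC here stands in for the tokenizer's backend normalizer)
    for pattern, sub in substitutions:
        text = pattern.sub(sub, text)
    return unicodedata.normalize("NFKC", text)

# normalizing the charset character by character, as in the excerpt
charset = {"\u2019", "A", "\u00a0"}
charset = {normalize(char) for char in charset}
```

Normalizing the charset before comparing it against the vocabulary ensures missing-character detection sees the same forms the tokenizer will actually produce.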
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 193 at r3 (raw file):
)
else:
    norm_tok = PreTrainedTokenizerFast.from_pretrained(
What is `norm_tok` used for?
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
    "./machine/tokenization/custom_normalizer", use_fast=True
)
tokenizer.backend_tokenizer.normalizer = norm_tok.backend_tokenizer.normalizer  # type: ignore
What are the type errors here? Can they be fixed?
…tings rather than generic config, remove a type ignore
Reviewable status: 2 of 8 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 87 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Rather than using a generic config dictionary, just add explicit parameters for turning on adding missing characters. It should be turned on by default for target and turned off by default for source.
Done.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 161 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Why are you ignoring the type error here? When we run into type errors, we want to fix them properly if possible.
Well, I remember investigating it and not figuring out what could be causing the type error, and since there really shouldn't have been one, I added the ignore flag. However, I just removed the flag and there isn't a type error anymore, so some other code I wrote must have addressed it. Point noted about fixing them properly; I'll avoid using `# type: ignore` unless absolutely necessary.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 193 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
What is `norm_tok` used for?
It's just used to give the actual tokenizer the custom normalizer, which doesn't normalize away ZWJ, ZWNJ, and glottal stop characters.
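To illustrate why such a custom normalizer matters, here is a toy comparison. The "aggressive" rule below that deletes joiners is a hypothetical stand-in; the real NLLB normalizer's behavior is more involved.

```python
import re

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def aggressive_normalize(text):
    # a hypothetical default cleanup rule that deletes the joiners entirely
    return re.sub(r"[\u200c\u200d]", "", text)

def preserving_normalize(text):
    # the custom normalizer leaves joiners alone (only trims outer whitespace here)
    return text.strip()

word = "क\u200dष"  # Devanagari cluster whose rendering depends on ZWJ
assert ZWJ not in aggressive_normalize(word)  # joiner lost
assert ZWJ in preserving_normalize(word)      # joiner preserved
```

Since ZWJ/ZWNJ change how conjuncts render in scripts like Devanagari, stripping them during training silently corrupts the text the model sees.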
machine/tokenization/custom_normalizer/tokenizer_config.json
line 1 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would move this file over to the huggingface directory.
Done.
Reviewed 6 of 6 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)
machine/jobs/settings.yaml
line 6 at r4 (raw file):
data_dir: ~/machine
batch_size: 1024
tokenizer:
This should be moved under `huggingface`.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 86 at r4 (raw file):
max_source_length: Optional[int] = None,
max_target_length: Optional[int] = None,
update_src: bool = False,
I think a more descriptive name would be better. Maybe `add_unk_src_tokens`.
machine/tokenization/custom_normalizer/tokenizer_config.json
line 1 at r3 (raw file):
Previously, mshannon-sil wrote…
Done.
I would put this in a subdirectory of `huggingface`, similar to what you did before.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
What are the type errors here? Can they be fixed?
The first error is `Property "normalizer" has no defined setter`. However, some huggingface documentation (https://github.com/huggingface/tokenizers/blob/b24a2fc1781d5da4e6ebcd3ecb5b91edffc0a05f/bindings/python/examples/custom_components.py) uses this method of overwriting the normalizer, and I've tested that it works correctly, so there shouldn't be an issue. The second and third errors say `Cannot access member "source_corpus" for type "Dataset" - Member "source_corpus" is unknown` and `Cannot access member "target_corpus" for type "Dataset" - Member "target_corpus" is unknown` respectively. Not sure why these just started appearing now, since I didn't write that code and I don't think I changed anything that should have affected it. Any recommendations for how to resolve these?
Previously, ddaspit (Damien Daspit) wrote…
Should I also change these names to `add_unk_src_tokens` and `add_unk_trg_tokens` to match the changed argument names in HuggingfaceNmtModelTrainer?
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)
machine/jobs/settings.yaml
line 6 at r4 (raw file):
Previously, mshannon-sil wrote…
Should I also change these names to `add_unk_src_tokens` and `add_unk_trg_tokens` to match the changed argument names in HuggingfaceNmtModelTrainer?
Yes, it would make sense to keep them consistent.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
Previously, mshannon-sil wrote…
The first error is `Property "normalizer" has no defined setter`. However, some huggingface documentation (https://github.com/huggingface/tokenizers/blob/b24a2fc1781d5da4e6ebcd3ecb5b91edffc0a05f/bindings/python/examples/custom_components.py) uses this method of overwriting the normalizer, and I've tested that it works correctly, so there shouldn't be an issue. The second and third errors say `Cannot access member "source_corpus" for type "Dataset" - Member "source_corpus" is unknown` and `Cannot access member "target_corpus" for type "Dataset" - Member "target_corpus" is unknown` respectively. Not sure why these just started appearing now, since I didn't write that code and I don't think I changed anything that should have affected it. Any recommendations for how to resolve these?
I will take a look at them and see if I can figure out where the errors are coming from.
…uggingface directory
Why false and true? What if people are doing a backtranslation? Why would we not want both to be true?
Sometimes initialized to true, sometimes to false - why?
Reviewed 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @mshannon-sil)
Reviewed 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)
machine/jobs/settings.yaml
line 24 at r5 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Why false and true? What if people are doing a Backtranslation? Why would we not want both to be true?
Yeah, we should probably turn them both on by default for Serval.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I will take a look at them and see if I can figure out where the errors are coming from.
Did you fix these type issues?
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/jobs/settings.yaml
line 6 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Yes, it would make sense to keep them consistent.
Done.
machine/jobs/settings.yaml
line 24 at r5 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Yeah, we should probably turn them both on by default for Serval.
I agree that we should always want to add missing characters whether they're in the source or the target. I'll set them back to being both true.
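Under those defaults, the relevant settings.yaml fragment might look like this. The nesting under `huggingface` follows the earlier review comment; the exact layout is an assumption, not the file's confirmed contents.

```yaml
huggingface:
  # add characters found in the training data but missing from the
  # tokenizer's vocabulary; both on by default, per the discussion above
  add_unk_src_tokens: true
  add_unk_trg_tokens: true
```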
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Did you fix these type issues?
No, not yet; the last commit was to address the other comments.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 86 at r4 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I think a more descriptive name would be better. Maybe `add_unk_src_tokens`.
Done.
tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py
line 203 at r5 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Sometimes initialized to true, sometimes to false - why?
This is in the test directory, so I have different settings to make sure it works as expected.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
Previously, mshannon-sil wrote…
No not yet, the last commit was to address the other comments.
For the normalizer type error, I think we should just leave it like that. Any changes would be on Huggingface's end to implement a setter function, but it works regardless. In addition to the huggingface example I linked to earlier demonstrating a custom Normalizer replacing the original, there is this stack overflow answer where someone else did the same thing (https://stackoverflow.com/questions/74490128/nlp-huggingface-tokenizer-wont-let-me-set-pre-tokenizer-property).
As for the two corpus type errors, what's going on is that the type of self._corpus is defined as either Dataset or ParallelTextCorpus. I realized that my code is designed to work for ParallelTextCorpus and would fail if it receives a Dataset type corpus. However, looking at the codebase, I can't find a single instance where Dataset would be passed in instead of ParallelTextCorpus, and in my tests they have all been ParallelTextCorpus. I think it would be simpler to just set self._corpus to ParallelTextCorpus exclusively, unless there's a need for the Dataset type to be passed in.
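For the corpus errors, one standard way to satisfy the type checker without dropping `Dataset` support is isinstance narrowing. The sketch below uses placeholder classes, not the real `datasets.Dataset` or machine.py's `ParallelTextCorpus`; the row shapes are assumptions.

```python
from typing import List, Union

class Dataset:
    # placeholder stand-in: rows like [{"translation": {"en": "...", "es": "..."}}]
    def __init__(self, rows):
        self.rows = rows

class ParallelTextCorpus:
    # placeholder stand-in: pairs like [("source text", "target text")]
    def __init__(self, pairs):
        self.pairs = pairs

def target_segments(corpus: Union[Dataset, ParallelTextCorpus], trg_lang: str = "es") -> List[str]:
    # isinstance narrowing resolves the "unknown member" errors: inside each
    # branch, the checker knows exactly which members exist on the type
    if isinstance(corpus, Dataset):
        return [row["translation"][trg_lang] for row in corpus.rows]
    return [trg for _, trg in corpus.pairs]

print(target_segments(ParallelTextCorpus([("hello", "hola")])))
print(target_segments(Dataset([{"translation": {"en": "hi", "es": "hola"}}])))
```

With this shape, each branch type-checks cleanly and the `Union` signature can stay as it is.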
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
Previously, mshannon-sil wrote…
For the normalizer type error, I think we should just leave it like that. Any changes would be on Huggingface's end to implement a setter function, but it works regardless. In addition to the huggingface example I linked to earlier demonstrating a custom Normalizer replacing the original, there is this stack overflow answer where someone else did the same thing (https://stackoverflow.com/questions/74490128/nlp-huggingface-tokenizer-wont-let-me-set-pre-tokenizer-property).
As for the two corpus type errors, what's going on is that the type of self._corpus is defined as either Dataset or ParallelTextCorpus. I realized that my code is designed to work for ParallelTextCorpus and would fail if it receives a Dataset type corpus. However, looking at the codebase, I can't find a single instance where Dataset would be passed in instead of ParallelTextCorpus, and in my tests they have all been ParallelTextCorpus. I think it would be simpler to just set self._corpus to ParallelTextCorpus exclusively, unless there's a need for the Dataset type to be passed in.
Regarding the normalizer type error, I think ignoring the error is fine. We should add a comment explaining that we are using undocumented/unsupported behavior.
Although none of our code uses `Dataset`, you have to remember that Machine.py is a shared library. I would prefer not to remove this functionality, since it is useful to other applications that might embed Machine.py. I don't think it would be too difficult to update the code to support `Dataset` as well. You can look at the `preprocess_function` for an example of how to extract the source and target segments from the `Dataset`.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 157 at r5 (raw file):
charset = set()
for text in texts:
    for row in text._get_rows():
You should use the `get_rows` method instead of the internal `_get_rows` method. The code would look like this:

for text in texts:
    with text.get_rows() as rows:
        for row in rows:
            charset = charset | set(row.text)

Although you don't have to use the context manager (`with` statement), it is safer to use it, since it will ensure that any underlying files/streams are closed.
…aracters, refactor missing_characters to support dataset type, add normalization check to test case
Reviewable status: 5 of 9 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 196 at r3 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Regarding the normalizer type error, I think ignoring the error is fine. We should add a comment explaining that we are using undocumented/unsupported behavior.
Although none of our code uses `Dataset`, you have to remember that Machine.py is a shared library. I would prefer not to remove this functionality, since it is useful to other applications that might embed Machine.py. I don't think it would be too difficult to update the code to support `Dataset` as well. You can look at the `preprocess_function` for an example of how to extract the source and target segments from the `Dataset`.
Okay, I added an explanatory comment.
I ended up moving the creation of the `train_dataset` variable from self._corpus to an earlier point in the code, so that I can just pass `train_dataset` into `missing_characters()` along with the language codes corresponding to the src/trg needing updating. I correspondingly refactored `missing_characters()` to support the Dataset type instead of the List[Text] type. I figured this was more streamlined than having the `missing_characters()` function support both Datasets and List[Text], and I don't think that moving the creation of `train_dataset` has any negative impacts on the code. But if stylistically you'd prefer to keep the creation of `train_dataset` where it was, I can refactor the function again to support both types.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 157 at r5 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You should use the `get_rows` method instead of the internal `_get_rows` method. The code would look like this:

for text in texts:
    with text.get_rows() as rows:
        for row in rows:
            charset = charset | set(row.text)

Although you don't have to use the context manager (`with` statement), it is safer to use it, since it will ensure that any underlying files/streams are closed.
Changed to work with Dataset rather than List[Text].
Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)
Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
Added some code and tests for adding missing characters into machine.py.
Along the way, I ran into two issues related to existing code for adding language codes. The first issue was that if both lang_codes were new, it would assign the wrong index for the second one and then crash. I fixed this by upgrading to the latest version of transformers. However, the other issue is that for every tokenizer I've tried that isn't NllbTokenizerFast, added language codes don't save when the tokenizer is saved, and so when the tokenizer is reloaded it crashes when given the new language code. I haven't been able to resolve that yet, so I'm going to open a separate issue for it.
Also, in my tests I'm using the model `hf-internal-testing/tiny-random-nllb`. However, I think there's a mistake in the tokenizer_config.json file, because it has `"tokenizer_class": "MBartTokenizer"` when it should be `"tokenizer_class": "NllbTokenizer"`. Should I go ahead and submit a pull request on huggingface to change that, or is there a reason I'm not aware of why it should be MBartTokenizer?