
Conversation

@mshannon-sil (Collaborator) commented Oct 13, 2023

Added code and tests to machine.py for adding missing characters to the tokenizer.

Along the way, I ran into two issues with the existing code for adding language codes. The first was that if both lang_codes were new, it would assign the wrong index to the second one and then crash; I fixed this by upgrading to the latest version of transformers. The second issue is that for every tokenizer I've tried other than NllbTokenizerFast, added language codes are not saved when the tokenizer is saved, so the reloaded tokenizer crashes when given the new language code. I haven't been able to resolve that yet, so I'm going to open a separate issue for it.
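
For reference, the add-language-code flow in question looks roughly like this (a minimal sketch using the public transformers API; the model name and new codes are illustrative, not taken from this PR):

    from transformers import NllbTokenizerFast

    # Load the fast NLLB tokenizer and register two new language codes as
    # additional special tokens (illustrative codes, not real NLLB languages).
    tokenizer = NllbTokenizerFast.from_pretrained("facebook/nllb-200-distilled-600M")
    new_codes = ["xyz_Latn", "abc_Latn"]
    tokenizer.add_special_tokens(
        {"additional_special_tokens": tokenizer.additional_special_tokens + new_codes}
    )

    # Round-trip: with NllbTokenizerFast the added codes survive save/reload;
    # with the other tokenizer classes tried above, they reportedly do not.
    tokenizer.save_pretrained("./tokenizer_with_new_codes")
    reloaded = NllbTokenizerFast.from_pretrained("./tokenizer_with_new_codes")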

Also, in my tests I'm using the model hf-internal-testing/tiny-random-nllb. However, I think there's a mistake in the tokenizer_config.json file, because it has "tokenizer_class": "MBartTokenizer" when it should be "tokenizer_class": "NllbTokenizer". Should I go ahead and submit a pull request on huggingface to change that, or is there a reason I'm not aware of why it should be MBartTokenizer?



@johnml1135 (Collaborator):

machine/jobs/settings.yaml line 8 at r1 (raw file):

  tokenizer:
    update_src: false
    update_trg: false

Why are we defaulting this to false? Wouldn't the default behavior be to update the tokenizer? Is there some reason we would want to default to false?

@johnml1135 (Collaborator):

machine/tokenization/custom_normalizer/tokenizer_config.json line 2 at r1 (raw file):

{
  "additional_special_tokens": null,

Can this be modified through the build_params?

@johnml1135 (Collaborator):

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 175 at r1 (raw file):

                    data["model"]["vocab"][token] = vocab_len + i
                file.seek(0)
                json.dump(data, file, ensure_ascii=False, indent=4)

Why are you overwriting the tokenizer.json file? Why not create a new file in the project (or potentially a small JSON diff that would be merged into tokenizer.json)?

@johnml1135 (Collaborator) left a comment:

Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)

@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/jobs/settings.yaml line 8 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why are we defaulting this to false? Wouldn't the default behavior be to update the tokenizer? Is there some reason we would want to default to false?

Good catch. I defaulted them to false since that's essentially what SILNLP does, but now that we've tested missing characters extensively, we want true to be the default.


machine/tokenization/custom_normalizer/tokenizer_config.json line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Can this be modified through the build_params?

No need; this file exists only to supply the normalizer it contains, which overwrites the tokenizer's default. No other part of it is used.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 175 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why are you overwriting the tokenizer.json file? Why not create a new file in the project (or potentially a small JSON diff that would be merged into tokenizer.json)?

Well, there's no need to keep the older tokenizer file, so I think it would be a waste of disk space to have an old and a new version of tokenizer.json in the same project directory. As for merging a small JSON diff, I didn't think there was a way to get around overwriting the whole file; all the recommendations I've seen online just dump the entire updated JSON object into the file, overwriting its contents. Do you have a link to an example of merging JSON into a JSON file in place without overwriting everything?
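
For context, a minimal sketch of the in-place rewrite being discussed (assuming the standard tokenizers tokenizer.json layout with a model.vocab mapping; the new tokens shown are illustrative):

    import json

    # Open read/write, extend the vocab, then rewrite the whole file in place.
    with open("tokenizer.json", "r+", encoding="utf-8") as file:
        data = json.load(file)
        vocab = data["model"]["vocab"]
        for token in ["\u200c", "\u200d", "\u0294"]:  # example new tokens
            if token not in vocab:
                vocab[token] = len(vocab)
        file.seek(0)
        json.dump(data, file, ensure_ascii=False, indent=4)
        file.truncate()  # in case the new JSON is shorter than the old contents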

…erFast, add test for slow tokenizer scenario, log missing characters added
@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: 5 of 8 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/jobs/settings.yaml line 8 at r1 (raw file):

Previously, mshannon-sil wrote…

Good catch. I defaulted them to false since that's essentially what SILNLP does, but now that we've tested missing characters extensively, we want true to be the default.

Done.

@codecov-commenter commented Oct 17, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Files Coverage Δ
...huggingface/test_hugging_face_nmt_model_trainer.py 100.00% <100.00%> (ø)
...tion/huggingface/hugging_face_nmt_model_trainer.py 81.14% <94.54%> (+6.85%) ⬆️


@johnml1135 (Collaborator):

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 175 at r1 (raw file):

Previously, mshannon-sil wrote…

Well, there's no need to keep the older tokenizer file, so I think it would be a waste of disk space to have an old and a new version of tokenizer.json in the same project directory. As for merging a small JSON diff, I didn't think there was a way to get around overwriting the whole file; all the recommendations I've seen online just dump the entire updated JSON object into the file, overwriting its contents. Do you have a link to an example of merging JSON into a JSON file in place without overwriting everything?

That's fine, as long as it worked. The biggest concern would be two successive runs using machine.py, where tokenizer.json would continue to drift over time. I'm assuming that wouldn't be desired, but it wouldn't be a huge issue either, since we run these jobs on a fresh Docker image each time. It would be more of a concern for incremental tuning, if we need to add more and more tokens over time.

@johnml1135 (Collaborator):

machine/tokenization/custom_normalizer/tokenizer_config.json line 2 at r1 (raw file):

Previously, mshannon-sil wrote…

No need; this file exists only to supply the normalizer it contains, which overwrites the tokenizer's default. No other part of it is used.

I don't fully understand but I trust that you do :-). That's fine.

@johnml1135 (Collaborator) left a comment:

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @mshannon-sil)

@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: 3 of 8 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 175 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

That's fine, as long as it worked. The biggest concern would be two successive runs using machine.py, where tokenizer.json would continue to drift over time. I'm assuming that wouldn't be desired, but it wouldn't be a huge issue either, since we run these jobs on a fresh Docker image each time. It would be more of a concern for incremental tuning, if we need to add more and more tokens over time.

Each time we run a new job, it starts with an unmodified tokenizer, not from the tokenizer files saved in the directory, so we shouldn't run into any issues with drift.

@johnml1135 (Collaborator) left a comment:

Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @ddaspit)

@ddaspit (Contributor) left a comment:

Reviewed 2 of 8 files at r1, 1 of 3 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @mshannon-sil)


machine/tokenization/custom_normalizer/tokenizer_config.json line 1 at r3 (raw file):

{

I would move this file over to the huggingface directory.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 87 at r3 (raw file):

        max_source_length: Optional[int] = None,
        max_target_length: Optional[int] = None,
        config: Union[LazySettings, dict] = {},

Rather than using a generic config dictionary, just add explicit parameters for turning on adding missing characters. It should be turned on by default for target and turned off by default for source.
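
A sketch of what that explicit-parameter constructor might look like (hypothetical; the parameter names here anticipate the add_unk_src_tokens/add_unk_trg_tokens names settled on later in this review):

    from typing import Optional

    class HuggingFaceNmtModelTrainer:
        def __init__(
            self,
            max_source_length: Optional[int] = None,
            max_target_length: Optional[int] = None,
            add_unk_src_tokens: bool = False,  # off by default for source
            add_unk_trg_tokens: bool = True,   # on by default for target
        ) -> None:
            self._max_source_length = max_source_length
            self._max_target_length = max_target_length
            self._add_unk_src_tokens = add_unk_src_tokens
            self._add_unk_trg_tokens = add_unk_trg_tokens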


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 161 at r3 (raw file):

            mpn.substitutions = [(re.compile(r), sub) for r, sub in mpn.substitutions]
            charset = {mpn.normalize(char) for char in charset}
            charset = {tokenizer.backend_tokenizer.normalizer.normalize_str(char) for char in charset}  # type: ignore

Why are you ignoring the type error here? When we run into type errors, we want to fix them properly if possible.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 193 at r3 (raw file):

                )
            else:
                norm_tok = PreTrainedTokenizerFast.from_pretrained(

What is norm_tok used for?


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

                    "./machine/tokenization/custom_normalizer", use_fast=True
                )
                tokenizer.backend_tokenizer.normalizer = norm_tok.backend_tokenizer.normalizer  # type: ignore

What are the type errors here? Can they be fixed?

…tings rather than generic config, remove a type ignore
@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: 2 of 8 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 87 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Rather than using a generic config dictionary, just add explicit parameters for turning on adding missing characters. It should be turned on by default for target and turned off by default for source.

Done.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 161 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why are you ignoring the type error here? When we run into type errors, we want to fix them properly if possible.

I remember investigating it and not being able to figure out what could be causing the type error (there really shouldn't have been one), so I added the ignore flag. However, I just removed the flag and there's no longer a type error, so some other code I wrote must have addressed it. Point noted about fixing these properly; I'll avoid using # type: ignore unless absolutely necessary.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 193 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is norm_tok used for?

It's just used to give the actual tokenizer the custom normalizer, which doesn't normalize away ZWJ, ZWNJ, and glottal stop characters.
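
Roughly, the grafting step looks like this (a sketch; the normalizer rules shown are illustrative, not the actual contents of the custom tokenizer_config.json):

    from tokenizers import Regex, normalizers
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/nllb-200-distilled-600M", use_fast=True
    )

    # Build a replacement normalizer that applies NFKC and collapses whitespace
    # but contains no rule stripping ZWNJ (U+200C), ZWJ (U+200D), or the glottal
    # stop (U+0294).
    custom_normalizer = normalizers.Sequence(
        [normalizers.NFKC(), normalizers.Replace(Regex(r"\s+"), " ")]
    )

    # The tokenizers bindings let you swap the backend normalizer directly, even
    # though the Python type stubs declare no setter for this property.
    tokenizer.backend_tokenizer.normalizer = custom_normalizer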


machine/tokenization/custom_normalizer/tokenizer_config.json line 1 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would move this file over to the huggingface directory.

Done.

@ddaspit (Contributor) left a comment:

Reviewed 6 of 6 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)


machine/jobs/settings.yaml line 6 at r4 (raw file):

  data_dir: ~/machine
  batch_size: 1024
  tokenizer:

This should be moved under huggingface.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 86 at r4 (raw file):

        max_source_length: Optional[int] = None,
        max_target_length: Optional[int] = None,
        update_src: bool = False,

I think a more descriptive name would be better. Maybe add_unk_src_tokens.


machine/tokenization/custom_normalizer/tokenizer_config.json line 1 at r3 (raw file):

Previously, mshannon-sil wrote…

Done.

I would put this in a subdirectory of huggingface similar to what you did before.

@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What are the type errors here? Can they be fixed?

The first error is Property "normalizer" has no defined setter. However, a Hugging Face example (https://github.com/huggingface/tokenizers/blob/b24a2fc1781d5da4e6ebcd3ecb5b91edffc0a05f/bindings/python/examples/custom_components.py) uses this method of overwriting the normalizer, and I've tested that it works correctly, so there shouldn't be an issue. The second and third errors are Cannot access member "source_corpus" for type "Dataset" and Cannot access member "target_corpus" for type "Dataset", respectively. I'm not sure why these just started appearing, since I didn't write that code and I don't think I changed anything that should have affected it. Any recommendations for how to resolve them?

@mshannon-sil (Collaborator, Author):

machine/jobs/settings.yaml line 6 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be moved under huggingface.

Should I also change these names to add_unk_src_tokens and add_unk_trg_tokens to match the changed argument names in HuggingfaceNmtModelTrainer?

@ddaspit (Contributor) left a comment:

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil)


machine/jobs/settings.yaml line 6 at r4 (raw file):

Previously, mshannon-sil wrote…

Should I also change these names to add_unk_src_tokens and add_unk_trg_tokens to match the changed argument names in HuggingfaceNmtModelTrainer?

Yes, it would make sense to keep them consistent.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

Previously, mshannon-sil wrote…

The first error is Property "normalizer" has no defined setter. However, a Hugging Face example (https://github.com/huggingface/tokenizers/blob/b24a2fc1781d5da4e6ebcd3ecb5b91edffc0a05f/bindings/python/examples/custom_components.py) uses this method of overwriting the normalizer, and I've tested that it works correctly, so there shouldn't be an issue. The second and third errors are Cannot access member "source_corpus" for type "Dataset" and Cannot access member "target_corpus" for type "Dataset", respectively. I'm not sure why these just started appearing, since I didn't write that code and I don't think I changed anything that should have affected it. Any recommendations for how to resolve them?

I will take a look at them and see if I can figure out where the errors are coming from.

@johnml1135 (Collaborator):

machine/jobs/settings.yaml line 24 at r5 (raw file):

      batch_size: 16
    tokenizer:
      add_unk_src_tokens: false

Why false and true? What if people are doing a backtranslation? Why would we not want both to be true?

@johnml1135 (Collaborator):

tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 203 at r5 (raw file):

            max_target_length=20,
            add_unk_src_tokens=False,
            add_unk_trg_tokens=False,

Sometimes initialized to true, sometimes to false. Why?

@johnml1135 (Collaborator) left a comment:

Reviewed 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @mshannon-sil)

@ddaspit (Contributor) left a comment:

Reviewed 6 of 6 files at r5, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)


machine/jobs/settings.yaml line 24 at r5 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Why false and true? What if people are doing a backtranslation? Why would we not want both to be true?

Yeah, we should probably turn them both on by default for Serval.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I will take a look at them and see if I can figure out where the errors are coming from.

Did you fix these type issues?

@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/jobs/settings.yaml line 6 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, it would make sense to keep them consistent.

Done.


machine/jobs/settings.yaml line 24 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yeah, we should probably turn them both on by default for Serval.

I agree that we should always want to add missing characters, whether they're in the source or the target. I'll set them both back to true.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Did you fix these type issues?

No, not yet; the last commit was to address the other comments.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 86 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think a more descriptive name would be better. Maybe add_unk_src_tokens.

Done.


tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 203 at r5 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Sometimes initialized to true, sometimes to false. Why?

This is in the test directory, so I have different settings to make sure it works as expected.

@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

Previously, mshannon-sil wrote…

No, not yet; the last commit was to address the other comments.

For the normalizer type error, I think we should just leave it as is. Any change would be on Hugging Face's end to implement a setter function, and it works regardless. In addition to the Hugging Face example I linked earlier demonstrating a custom normalizer replacing the original, there is a Stack Overflow answer where someone else did the same thing (https://stackoverflow.com/questions/74490128/nlp-huggingface-tokenizer-wont-let-me-set-pre-tokenizer-property).

As for the two corpus type errors, the type of self._corpus is defined as either Dataset or ParallelTextCorpus. I realized that my code is designed to work with ParallelTextCorpus and would fail if it received a Dataset-type corpus. However, looking at the codebase, I can't find a single instance where a Dataset would be passed in instead of a ParallelTextCorpus, and in my tests they have all been ParallelTextCorpus. I think it would be simpler to restrict self._corpus to ParallelTextCorpus, unless there's a need for the Dataset type to be passed in.

@ddaspit (Contributor) left a comment:

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @johnml1135 and @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

Previously, mshannon-sil wrote…

For the normalizer type error, I think we should just leave it as is. Any change would be on Hugging Face's end to implement a setter function, and it works regardless. In addition to the Hugging Face example I linked earlier demonstrating a custom normalizer replacing the original, there is a Stack Overflow answer where someone else did the same thing (https://stackoverflow.com/questions/74490128/nlp-huggingface-tokenizer-wont-let-me-set-pre-tokenizer-property).

As for the two corpus type errors, the type of self._corpus is defined as either Dataset or ParallelTextCorpus. I realized that my code is designed to work with ParallelTextCorpus and would fail if it received a Dataset-type corpus. However, looking at the codebase, I can't find a single instance where a Dataset would be passed in instead of a ParallelTextCorpus, and in my tests they have all been ParallelTextCorpus. I think it would be simpler to restrict self._corpus to ParallelTextCorpus, unless there's a need for the Dataset type to be passed in.

Regarding the normalizer type error, I think ignoring the error is fine. We should add a comment explaining that we are using undocumented/unsupported behavior.

Although none of our code uses Dataset, you have to remember that Machine.py is a shared library. I would prefer not to remove this functionality, since it is useful to other applications that might embed Machine.py. I don't think it would be too difficult to update the code to support Dataset as well. You can look at the preprocess_function for an example of how to extract the source and target segments from the Dataset.
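
A hedged sketch of the Dataset branch, modeled on the usual Hugging Face translation preprocess_function (the "translation" column layout and language keys are assumptions about the Dataset, not the actual machine.py API):

    from datasets import Dataset

    def collect_charsets(dataset: Dataset, src_lang: str, trg_lang: str):
        # Gather the set of characters used on each side of the parallel data.
        src_charset: set = set()
        trg_charset: set = set()
        for example in dataset["translation"]:
            src_charset |= set(example[src_lang])
            trg_charset |= set(example[trg_lang])
        return src_charset, trg_charset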


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 157 at r5 (raw file):

            charset = set()
            for text in texts:
                for row in text._get_rows():

You should use the get_rows method instead of the internal _get_rows method. The code would look like this:

for text in texts:
    with text.get_rows() as rows:
        for row in rows:
            charset = charset | set(row.text)

Although you don't have to use the context manager (with statement), it is safer to use it, since it will ensure that any underlying files/streams are closed.

…aracters, refactor missing_characters to support dataset type, add normalization check to test case
@mshannon-sil (Collaborator, Author) left a comment:

Reviewable status: 5 of 9 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 196 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Regarding the normalizer type error, I think ignoring the error is fine. We should add a comment explaining that we are using undocumented/unsupported behavior.

Although none of our code uses Dataset, you have to remember that Machine.py is a shared library. I would prefer not to remove this functionality, since it is useful to other applications that might embed Machine.py. I don't think it would be too difficult to update the code to support Dataset as well. You can look at the preprocess_function for an example of how to extract the source and target segments from the Dataset.

Okay, I added an explanatory comment.

I ended up moving the creation of the train_dataset variable from self._corpus to an earlier point in the code, so that I can just pass train_dataset into missing_characters() along with the language codes corresponding to the src/trg that need updating. I correspondingly refactored missing_characters() to support the Dataset type instead of List[Text]. I figured this was more streamlined than having missing_characters() support both Dataset and List[Text], and I don't think moving the creation of train_dataset has any negative impact on the code. But if you'd stylistically prefer to keep the creation of train_dataset where it was, I can refactor the function again to support both types.
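
Conceptually, the refactored helper reduces to something like this (names and the "translation" column are assumptions for illustration; the real implementation also normalizes characters before the vocab check):

    def missing_characters(train_dataset, lang: str, tokenizer) -> set:
        # Collect every character that appears in the given language's segments.
        charset: set = set()
        for example in train_dataset["translation"]:
            charset |= set(example[lang])
        # Report the characters that the tokenizer's vocab does not cover.
        vocab = tokenizer.get_vocab()
        return {char for char in charset if char not in vocab}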


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 157 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should use the get_rows method instead of the internal _get_rows method. The code would look like this:

for text in texts:
    with text.get_rows() as rows:
        for row in rows:
            charset = charset | set(row.text)

Although you don't have to use the context manager (with statement), it is safer to use it, since it will ensure that any underlying files/streams are closed.

Changed to work with Dataset rather than List[Text].

@ddaspit (Contributor) left a comment:

:lgtm:

Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

@johnml1135 (Collaborator) left a comment:

:lgtm:

Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@johnml1135 johnml1135 merged commit c565239 into main Oct 20, 2023
@ddaspit ddaspit deleted the tokenizer branch October 20, 2023 20:14
Labels: none yet
Projects status: ✅ Done
Successfully merging this pull request may close: Update tokenizer to accept new characters for Huggingface models
4 participants