Skipping outputs #3116
Conversation
Nice. One question: do we want to have a …

I'm ok with both solutions. (By the way, in general terms, a lot of software can accept a combination of whitelist and/or blacklist; when both are present, it's usually "include the whitelist, and remove the blacklist".) If we do …

I agree with both of you. Furthermore, this approach (deleting from the dict …) … I'm implementing both your proposed changes, looking into fixing the above and into fast tokenizers. I'll then move on to the tests.
```python
return_attention_mask=True,
return_overflowing_tokens=False,
return_special_tokens_mask=False,
return_token_type_ids=None,
```
maybe explicitly define them as `Optional[bool] = None`?
Yes, I'll update the typings, doc & tests if @mfuntowicz and @thomwolf agree that this is the best way to do it.
Why do we move from True/False to None/XXX?
Putting None means that we can safely identify if the user has passed a value different from the default. If a value is None, then we can rely on the return_outputs attribute to return this value or not.
If it is not None, then its value is absolute (as it's an argument that was passed by the user).
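The pattern described above can be sketched in isolation. This is a minimal stand-in, not the actual transformers implementation; the class and method names here are hypothetical:

```python
from typing import List, Optional

class TokenizerSketch:
    # hypothetical class-level default, mirroring the PR's return_outputs attribute
    return_outputs: List[str] = ["attention_mask"]

    def resolve(self, return_attention_mask: Optional[bool] = None) -> bool:
        if return_attention_mask is None:
            # no explicit user choice: fall back to the class attribute
            return "attention_mask" in self.return_outputs
        # an explicit user choice is absolute and wins unconditionally
        return return_attention_mask

tok = TokenizerSketch()
print(tok.resolve())                            # True (from return_outputs)
print(tok.resolve(return_attention_mask=False)) # False (user override)
```

With a plain `True`/`False` default, the tokenizer could not tell "the user explicitly asked for this" apart from "the user said nothing"; `None` acts as the missing-argument sentinel.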
Ok I see, then I guess a clear typing like @julien-c suggested would be good
Force-pushed from aefcb84 to 0cc3eb5.
I like the solution, 👍. One question: it requires the user to know / look at the names of the parameters handled by …

```python
model = SomeModel(...)
tokenizer = AutoTokenizer.from_pretrained(..., return_outputs=model.input_names)
```
Indeed, such an attribute would be helpful! I'll add it and move on to the tests.
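The wiring suggested above can be sketched without transformers. `SomeModel`, `input_names`, and the constructor below are hypothetical stand-ins, not the real library API:

```python
from typing import List

class SomeModel:
    # hypothetical attribute: the encode_plus outputs this architecture can consume
    input_names: List[str] = ["input_ids", "attention_mask"]

class SomeTokenizer:
    # stand-in for a tokenizer accepting return_outputs at initialisation
    def __init__(self, return_outputs: List[str]):
        self.return_outputs = return_outputs

model = SomeModel()
tokenizer = SomeTokenizer(return_outputs=model.input_names)
print(tokenizer.return_outputs)  # ['input_ids', 'attention_mask']
```

The point of the suggestion is that the user never has to spell out the parameter names by hand; the model advertises them.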
ok did a review but I'm not sure it was finished enough to do one haha
anyway, here are some comments
```python
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
return_outputs = ["attention_mask"]
```
rename this model_inputs or model_input_names?
I'll let @julien-c answer that, as he offered that naming:

> If we do `keep_output`, maybe we name the attribute `return_outputs: List[str]` for consistency with `encode_xxx()` params?
```python
if return_attention_mask is None:
    return_attention_mask = "attention_mask" in self.return_outputs
if return_overflowing_tokens is None:
    return_overflowing_tokens = "overflowing_tokens" in self.return_outputs
```
"overflowing_tokens" will never be in self.return_outputs, will it?
Not by default, but a user could pass it as an argument so that it is always returned. Not as important as the other three though, I agree. Will remove it.
Regarding the suggestion of @mfuntowicz, in the end, this should be in a common configuration for models and tokenizers I guess, so maybe we could actually have this attribute as input to …

That value is currently managed by the … It still needs to be a class attribute in my opinion, as it should be overridden by children of …
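The class-attribute approach argued for above can be sketched as follows. The class names are illustrative stand-ins, not the actual transformers tokenizer classes:

```python
from typing import List

class PreTrainedTokenizerSketch:
    # base default: return every output
    return_outputs: List[str] = ["input_ids", "token_type_ids", "attention_mask"]

class DistilBertTokenizerSketch(PreTrainedTokenizerSketch):
    # DistilBERT-style models take no token_type_ids, so the child overrides
    return_outputs = ["input_ids", "attention_mask"]

print(PreTrainedTokenizerSketch.return_outputs)
print(DistilBertTokenizerSketch.return_outputs)  # ['input_ids', 'attention_mask']
```

Because the value lives on the class rather than in a shared configuration object, each tokenizer subclass declares what its matching model accepts simply by shadowing the attribute.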
Force-pushed from 0cc3eb5 to eaf0f97.
```diff
  return [
      {value: batch_encode_plus_sequences[value][i] for value in batch_encode_plus_sequences.keys()}
-     for i in range(len(batch_encode_plus_sequences))
+     for i in range(len(batch_encode_plus_sequences["input_ids"]))
```
bugfix
Force-pushed from 63a664d to 84b4176.
Should be good for review. I reverted the …
Codecov Report

```
@@            Coverage Diff             @@
##           master    #3116      +/-   ##
==========================================
+ Coverage   77.98%   77.99%   +<.01%
==========================================
  Files          98       98
  Lines       16645    16660      +15
==========================================
+ Hits        12981    12994      +13
- Misses       3664     3666       +2
```

Continue to review the full report at Codecov.
Force-pushed from dd336e1 to 96b2fa1.
* Minimal example
* Proposal 2
* Proposal 2 for fast tokenizers
* Typings
* Docs
* Revert "Docs" for easier review (reverts commit eaf0f97)
* Remove unnecessary assignments
* Tests
* Fix faulty type
* Remove prints
* return_outputs -> model_input_names
* Revert "Revert "Docs" for easier review" (reverts commit 6fdc694)
* code quality
Currently, `encode_plus` and `batch_encode_plus` return the same outputs for different models. This is sub-optimal, as we can't do the following for each model:

This will crash for DistilBERT, as the tokenizer would return `token_type_ids`, which can't be handled by the model. In order to fix this, each tokenizer has to return model-specific arguments. These default arguments are usually the same, but some models handle fewer of them (e.g. DistilBERT, RoBERTa).
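The crash described can be reproduced in a dependency-free sketch; `distilbert_forward` below is an illustrative stand-in for a DistilBERT-style `forward()` signature, not the real transformers code:

```python
# Stand-in for a model whose signature has no token_type_ids parameter,
# like DistilBERT's forward().
def distilbert_forward(input_ids, attention_mask=None):
    return input_ids

# A BERT-style tokenizer returns token_type_ids by default:
encoded = {
    "input_ids": [101, 4403, 117, 1293, 1132, 1128, 136, 102],
    "token_type_ids": [0, 0, 0, 0, 0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1],
}

try:
    distilbert_forward(**encoded)
except TypeError as err:
    # unexpected keyword argument 'token_type_ids'
    print(err)
```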
This is a mock PR offering a solution using a ~~`skip_outputs`~~ `return_outputs` argument to tokenizers. Returns a dictionary without the token type ids:

```python
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```

Specifying a custom ~~`skip_outputs`~~ `return_outputs` at initialisation works as expected:

```python
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```

or with a custom ~~skipped~~ output:

```python
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
```

This also works with saving/reloading:
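The saving/reloading behaviour can be sketched with a minimal stand-in that persists the attribute in a JSON config; the actual serialization in the PR may differ:

```python
import json
import os
import tempfile

class TokSketch:
    # stand-in for a tokenizer carrying a return_outputs attribute
    def __init__(self, return_outputs):
        self.return_outputs = return_outputs

    def save_pretrained(self, path):
        # persist the attribute alongside the rest of the tokenizer config
        with open(os.path.join(path, "tokenizer_config.json"), "w") as f:
            json.dump({"return_outputs": self.return_outputs}, f)

    @classmethod
    def from_pretrained(cls, path):
        with open(os.path.join(path, "tokenizer_config.json")) as f:
            return cls(**json.load(f))

tok = TokSketch(return_outputs=["input_ids", "token_type_ids"])
with tempfile.TemporaryDirectory() as d:
    tok.save_pretrained(d)
    reloaded = TokSketch.from_pretrained(d)
print(reloaded.return_outputs)  # ['input_ids', 'token_type_ids']
```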
Returns the following:

```python
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
{'input_ids': [101, 4403, 117, 1293, 1132, 1128, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}
```