Conversation

@mfuntowicz
Member

This can happen when using bert-base-multilingual-cased with an input containing a single space.
In that case, the tokenizer outputs an empty word_tokens list, leading to inconsistent behavior:
label_ids ends up with one more entry than the tokens vector.
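A minimal sketch of the guard described above: skip any word whose tokenization comes back empty, so label_ids stays aligned with tokens. Here `tokenize` is a stand-in for `tokenizer.tokenize()` from transformers (bert-base-multilingual-cased returns an empty list for a lone space); the function and variable names are illustrative, not the exact code in the PR.

```python
pad_token_label_id = -100  # value commonly used to mask sub-token labels in the loss


def tokenize(word):
    # Stand-in for tokenizer.tokenize(): a bare space yields no
    # sub-tokens, mimicking the bert-base-multilingual-cased behavior.
    return [] if word.strip() == "" else [word]


def convert_words_to_features(words, labels):
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = tokenize(word)
        # Guard: without this check, an empty word_tokens would still
        # append a label, leaving label_ids one entry longer than tokens.
        if len(word_tokens) > 0:
            tokens.extend(word_tokens)
            # Label the first sub-token; pad the rest so lengths match.
            label_ids.extend([label] + [pad_token_label_id] * (len(word_tokens) - 1))
    return tokens, label_ids


tokens, label_ids = convert_words_to_features([" ", "Hello"], ["O", "B-LOC"])
assert len(tokens) == len(label_ids)  # the lone space is dropped from both
```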

…s instead of hardcoded numbers.

Signed-off-by: Morgan Funtowicz <[email protected]>
…mpty.

Signed-off-by: Morgan Funtowicz <[email protected]>
@codecov-io

Codecov Report

Merging #2991 into master will decrease coverage by 1.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2991      +/-   ##
==========================================
- Coverage   77.16%   76.12%   -1.04%     
==========================================
  Files          98       98              
  Lines       15997    15997              
==========================================
- Hits        12344    12178     -166     
- Misses       3653     3819     +166
Impacted Files Coverage Δ
src/transformers/modeling_tf_pytorch_utils.py 8.72% <0%> (-81.21%) ⬇️
src/transformers/modeling_roberta.py 85.71% <0%> (-10%) ⬇️
src/transformers/modeling_xlnet.py 73.48% <0%> (-2.3%) ⬇️
src/transformers/modeling_ctrl.py 96.03% <0%> (-2.21%) ⬇️
src/transformers/modeling_openai.py 80.2% <0%> (-1.35%) ⬇️
src/transformers/modeling_utils.py 92.2% <0%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38f5fe9...57f312d. Read the comment docs.

@srush
Contributor

srush commented Feb 24, 2020

This looks good to me. Maybe we can make it a test? This might break some of the other examples as well; I will check.

@srush srush merged commit b08259a into master Mar 27, 2020
@srush srush deleted the ner_tokenizers branch March 27, 2020 14:59