
Conversation

shoarora (Owner) commented Apr 7, 2020

No description provided.

patrickvonplaten and others added 30 commits March 30, 2020 13:35
* Add clear description of how to train T5

* correct docstring in T5

* correct typo

* correct docstring format

* update t5 model docs

* implement collins feedback

* fix typo and add more explanation for sentinel tokens

* delete unnecessary todos
* make decoder input ids optional for t5 training

* lm_labels should not be shifted in t5

* add tests

* finish shift right functionality for PT T5 (sketched below, after this commit list)

* move shift right to correct class

* cleaner code

* replace -100 values with pad token id

* add assert statement

* remove unnecessary for loop

* make style
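
The T5 training commits above make `decoder_input_ids` optional by deriving them from the labels: shift the labels right by one, prepend the decoder start token, and replace any -100 (loss-ignore) values with the pad token id. A minimal standalone sketch of that idea (not the model's actual method):

```python
import torch

def shift_right(labels: torch.Tensor, pad_token_id: int, decoder_start_token_id: int) -> torch.Tensor:
    """Sketch: build decoder_input_ids from lm_labels by shifting one position to
    the right and replacing any -100 (loss-ignore) values with the pad token id."""
    shifted = labels.new_zeros(labels.shape)
    shifted[..., 1:] = labels[..., :-1].clone()
    shifted[..., 0] = decoder_start_token_id
    # -100 is only meaningful to the loss; the decoder must see real token ids.
    shifted.masked_fill_(shifted == -100, pad_token_id)
    assert torch.all(shifted >= 0), "decoder_input_ids must contain only valid token ids"
    return shifted
```
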
* Update the NER TF script to remove the softmax and set the pad token label id to -1 (sketched below)

* Reformat the quality and style

Co-authored-by: Julien Plu <[email protected]>
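
The change above removes the softmax from the TF NER script and uses -1 as the pad token label id so padded positions drop out of the loss. A minimal sketch of that masking idea (the function name and shapes are illustrative, not the script's actual code):

```python
import tensorflow as tf

def masked_token_classification_loss(labels: tf.Tensor, logits: tf.Tensor, pad_label_id: int = -1) -> tf.Tensor:
    """Cross-entropy over token labels, ignoring positions labeled with pad_label_id."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,  # raw logits, no softmax in the model head
        reduction=tf.keras.losses.Reduction.NONE,
    )
    active = tf.not_equal(labels, pad_label_id)   # mask out padded tokens
    active_labels = tf.boolean_mask(labels, active)
    active_logits = tf.boolean_mask(logits, active)
    return tf.reduce_mean(loss_fn(active_labels, active_logits))
```
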
…gingface#3437)

* Using loaded checkpoint with --do_predict

Without this fix, I get near-random validation performance for a trained model, and the performance differs between validation runs. I think this happens because the `model` variable isn't set to the loaded checkpoint, so evaluation uses a randomly initialized model. The model activations differ each time I run evaluation without this fix, but not with it (see the sketch after these commits).

* Update checkpoint loading

* Fixing model loading
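
The fix above loads the saved checkpoint into `model` before running prediction, instead of evaluating a freshly initialized model. A minimal sketch of the pattern, assuming a PyTorch Lightning-style `model` with `load_from_checkpoint`, plus `args` and `trainer` coming from the surrounding example script (paths and names are placeholders):

```python
import glob
import os

# Hypothetical sketch: pick the latest saved checkpoint and load its weights into
# `model` before prediction, rather than evaluating a randomly initialized model.
if args.do_predict:
    checkpoints = sorted(glob.glob(os.path.join(args.output_dir, "*.ckpt")))
    model = model.load_from_checkpoint(checkpoints[-1])  # use the last saved checkpoint
    trainer.test(model)
```
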
* feat: add model card bert-imdb

* feat: add model card gpt2-imdb-pos

* feat: add model card gpt2-imdb
* Create README.md

* Update README.md
- Show that the last uploaded version was trained on more data (custom_license files)
…for T5 and Bart (huggingface#3514)

* fix conflicts

* add model size argument to summarization

* correct wrong import

* fix isort

* correct imports

* other isort make style

* make style
…ingface#3367)

* add bad words list (usage sketched below)

* make style

* add bad_words_tokens

* make style

* better naming

* make style

* fix typo
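
The commits above add a bad-words list to generation. A minimal usage sketch of the resulting `bad_words_ids` argument (checkpoint name and phrases are placeholders):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Token id sequences that must not appear in the generated text; a leading space
# makes the ids match how the phrases are tokenized mid-sentence.
bad_words_ids = [tokenizer.encode(" " + phrase) for phrase in ["bad word", "another phrase"]]

input_ids = tokenizer.encode("The weather today is", return_tensors="pt")
output = model.generate(input_ids, max_length=30, bad_words_ids=bad_words_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
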
Create model card for: distilbert-multi-finetuned-for-xqua-on-tydiqa
* add bert bahasa readme

* update readme

* update readme

* added xlnet

* added tiny-bert and fix xlnet readme
Timoeller and others added 28 commits April 4, 2020 15:27
…am search (huggingface#3601)

* split beam search and no beam search test

* fix test

* clean generate tests
Co-Authored-By: Kevin Clark <[email protected]>
Co-Authored-By: Lysandre Debut <[email protected]>
* Update notebooks

* From local to global link

* from local links to *actual* global links
* added model card

* updated README

* updated README

* updated README

* added evals

* removed pico eval

* Tweaks

Co-authored-by: Julien Chaumond <[email protected]>
* Add model card for BERTeus

* Update README
* Create model card

* Fix model name in fine-tuning script
* Add model card

* Fix model name in fine-tuning script
* Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py

`convert_examples_to_features` sets `pad_token=0` by default, which is correct for BERT but incorrect for RoBERTa (`pad_token=1`) and XLNet (`pad_token=5`). I think the other arguments to `convert_examples_to_features` are correct, but it would be helpful if someone more familiar with this part of the codebase could check (see the sketch after these commits).

* Simplifying change to match recent commits
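
The fix above passes the tokenizer's real pad token id into feature conversion instead of the hard-coded 0, since RoBERTa pads with 1 and XLNet with 5. A hedged sketch of the call, with argument names modeled loosely on the example script (only `pad_token` is the point of the fix; the other arguments are placeholders):

```python
# Illustrative call: take padding ids from the tokenizer instead of assuming
# BERT's defaults, so RoBERTa (pad_token_id == 1) and XLNet (pad_token_id == 5)
# are padded correctly. Surrounding names come from the script itself.
features = convert_examples_to_features(
    examples,
    label_list,
    args.max_seq_length,
    tokenizer,
    pad_token=tokenizer.pad_token_id,
    pad_token_segment_id=tokenizer.pad_token_type_id,
)
```
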
* Renamed num_added_tokens to num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <[email protected]>

* Cherry-Pick: Partially fix space only input without special tokens added to the output huggingface#3091

Signed-off-by: Morgan Funtowicz <[email protected]>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast

Signed-off-by: Morgan Funtowicz <[email protected]>

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raises the same ValueError as the Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.

* Ensure tokenize() correctly forward add_special_tokens to rust.

* Adding None checking on top of encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping None values.

* unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL test run only if PyTorch is provided.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward (see the usage sketch after this commit list).

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in the majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding method names to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines works with BatchEncoding

* Turn off add_special_tokens on tokenize by default.

Signed-off-by: Morgan Funtowicz <[email protected]>

* Remove add_prefix_space from tokenize call in unittest.

Signed-off-by: Morgan Funtowicz <[email protected]>

* Style and quality

Signed-off-by: Morgan Funtowicz <[email protected]>

* Correct message for batch_encode_plus none input exception.

Signed-off-by: Morgan Funtowicz <[email protected]>

* Fix invalid list comprehension for offset_mapping overriding content every iteration.

Signed-off-by: Morgan Funtowicz <[email protected]>

* TransfoXL uses Strip normalizer.

Signed-off-by: Morgan Funtowicz <[email protected]>

* Bump tokenizers dependency to 0.7.0rc3

Signed-off-by: Morgan Funtowicz <[email protected]>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.

Signed-off-by: Morgan Funtowicz <[email protected]>

* SpecialTokensMixin can use slots for faster access to underlying attributes.

Signed-off-by: Morgan Funtowicz <[email protected]>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.

Signed-off-by: Morgan Funtowicz <[email protected]>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.

Co-authored-by: LysandreJik <[email protected]>
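
The tokenizer commits above return a `BatchEncoding` from `encode_plus`/`batch_encode_plus` and add an `is_pretokenized` flag for already-split input. A minimal usage sketch, assuming a BERT fast tokenizer (the checkpoint name and texts are placeholders):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# encode_plus returns a BatchEncoding: dict-style access keeps the old behaviour,
# and the fast (Rust) path can also expose mappings such as character offsets.
with_offsets = tokenizer.encode_plus("Hello world", return_offsets_mapping=True)
print(with_offsets["input_ids"])       # familiar dict access
print(with_offsets["offset_mapping"])  # char offsets from the Rust tokenizer

# Pre-tokenized (List[str]) input goes through the same API with is_pretokenized=True
# (flag name as added in these commits).
pretok = tokenizer.encode_plus(["Hello", "world"], is_pretokenized=True)
print(pretok["input_ids"])

# A BatchEncoding still unpacks directly into the model forward.
encoding = tokenizer.encode_plus("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)
```
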
* Optimize causal mask using torch.where

Instead of multiplying by a 1.0 float mask, use torch.where with a bool mask for increased performance (see the sketch after these commits).

* Maintain compatibility with torch 1.0.0 - thanks for PR feedback

* Fix typo

* reformat line for CI
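
The causal-mask change above swaps the float-mask multiplication for `torch.where` over a bool mask. A minimal sketch comparing the two forms on a small score matrix (shapes and the -1e4 fill value are illustrative):

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)

# Old style: multiply by a 1.0 float mask and push masked positions to a large negative value.
float_mask = torch.tril(torch.ones(seq_len, seq_len))
old = scores * float_mask - 1e4 * (1.0 - float_mask)

# New style: a bool mask with torch.where avoids the extra multiplications.
bool_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
new = torch.where(bool_mask, scores, torch.full_like(scores, -1e4))

assert torch.allclose(old, new)  # same masked scores, fewer float ops
```
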
* improve and add features to benchmark utils

* update benchmark style

* remove output files
* fix edge case for bert tokenization

* add Lysandres comments for improvement

* use new is_pretokenized flag
shoarora merged commit cb81c9b into glue-test-processors on Apr 7, 2020
shoarora pushed a commit that referenced this pull request May 23, 2020
* Initial commit to get BERT + run_glue.py on TPU

* Add README section for TPU and address comments.

* Cleanup TPU bits from run_glue.py (#3)

TPU runner is currently implemented in:
https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

We plan to upstream this directly into `huggingface/transformers`
(either `master` or `tpu`) branch once it's been more thoroughly tested.

* No need to call `xm.mark_step()` explicitly (huggingface#4)

Since for gradient accumulation we're accumulating on batches from a
`ParallelLoader` instance, which marks the step itself on next().

* Resolve R/W conflicts from multiprocessing (huggingface#5)

* Add XLNet in list of models for `run_glue_tpu.py` (huggingface#6)

* Add RoBERTa to list of models in TPU GLUE (huggingface#7)

* Add RoBERTa and DistilBert to list of models in TPU GLUE (huggingface#8)

* Use barriers to reduce duplicate work/resources (huggingface#9)

* Shard eval dataset and aggregate eval metrics (huggingface#10)

* Shard eval dataset and aggregate eval metrics

Also, instead of calling `eval_loss.item()` every time, do the summation with
tensors on device.

* Change defaultdict to float

* Reduce the pred, label tensors instead of metrics

As brought up during review, some metrics like F1 cannot be aggregated
via averaging. GLUE task metrics depend largely on the dataset, so
instead we sync the prediction and label tensors so that the metrics can
be computed accurately on those (see the sketch after these commits).

* Only use tb_writer from master (huggingface#11)

* Apply huggingface black code formatting

* Style

* Remove `--do_lower_case` as example uses cased

* Add option to specify tensorboard logdir

This is needed for our testing framework, which checks regressions
against key metrics written by the summary writer.

* Using configuration for `xla_device`

* Prefix TPU specific comments.

* num_cores clarification and namespace eval metrics

* Cache features file under `args.cache_dir`

Instead of under `args.data_dir`. This is needed as our test infra uses
data_dir with a read-only filesystem.

* Rename `run_glue_tpu` to `run_tpu_glue`

Co-authored-by: LysandreJik <[email protected]>
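
The eval commits above shard the dataset across TPU cores and gather raw predictions and labels rather than averaging per-core metrics, since scores like F1 cannot be averaged. A minimal sketch of that gathering step, assuming `torch_xla` is available and `compute_metrics` is a stand-in for the task-specific GLUE metric function:

```python
from typing import Callable, Dict

import numpy as np
import torch_xla.core.xla_model as xm  # assumes torch_xla is installed


def aggregate_and_score(
    preds: np.ndarray,
    labels: np.ndarray,
    compute_metrics: Callable[[np.ndarray, np.ndarray], Dict[str, float]],
) -> Dict[str, float]:
    """Gather per-core predictions/labels from every TPU process, then compute
    the metrics once on the full arrays so non-averageable scores stay exact."""
    all_preds = xm.mesh_reduce("eval_preds", preds, np.concatenate)
    all_labels = xm.mesh_reduce("eval_labels", labels, np.concatenate)
    return compute_metrics(all_preds, all_labels)
```
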