Merged
Changes from all commits
83 commits
5b44e0a
[T5] Add training documenation (#3507)
patrickvonplaten Mar 30, 2020
75ec6c9
[T5] make decoder input ids optional for t5 training (#3521)
patrickvonplaten Mar 30, 2020
296252c
fix lm lables in docstring (#3529)
patrickvonplaten Mar 30, 2020
6f5a12a
Release: v2.7.0
LysandreJik Mar 30, 2020
a009d75
Un-pin isort for v2.7.0 pypi
LysandreJik Mar 30, 2020
eff757f
Re-pin isort version
LysandreJik Mar 30, 2020
d38bbb2
Update the NER TF script (#3511)
Mar 30, 2020
cc598b3
[InputExample] Unfreeze for now, cf. #3423
julien-c Mar 30, 2020
1f72865
[BART] Update encoder and decoder on set_input_embedding (#3501)
dougian Mar 30, 2020
8deff3a
[bart-tiny-random] Put a 5MB model on S3 to allow faster exampl… (#3488)
sshleifer Mar 30, 2020
e5c393d
[Bug fix] Using loaded checkpoint with --do_predict (instead of… (#3437)
ethanjperez Mar 30, 2020
a6c4ee2
Add model cards (#3537)
lvwerra Mar 31, 2020
ebceeea
Add electra and alectra model cards (#3524)
shoarora Mar 31, 2020
99833a9
Create model card (#3487)
mrm8488 Mar 31, 2020
b48a1f0
Add text shown in example of usage (#3464)
mrm8488 Mar 31, 2020
c82ef72
Added CovidBERT-NLI model card (#3477)
gsarti Mar 31, 2020
c2cf192
Add link to 16 POS tags model (#3465)
mrm8488 Mar 31, 2020
bbedb59
Create README.md (#3393)
brandenchan Mar 31, 2020
4a56635
Create card for the model: GPT-2-finetuned-covid-bio-medrxiv (#3453)
mrm8488 Mar 31, 2020
a8d4dff
Update README.md (#3470)
mrm8488 Mar 31, 2020
57b0fab
Add better explanation to check `docs` locally. (#3459)
patrickvonplaten Mar 31, 2020
42e1e3c
Update usage doc regarding generate fn (#3504)
patrickvonplaten Mar 31, 2020
55bcae7
remove useless and confusing lm_labels line (#3531)
patrickvonplaten Mar 31, 2020
83d1fbc
[Docs] Add usage examples for translation and summarization (#3538)
patrickvonplaten Mar 31, 2020
0373b60
Update README.md (#3552)
mrm8488 Mar 31, 2020
ae6834e
[Examples] Clean summarization and translation example testing files …
patrickvonplaten Mar 31, 2020
b38d552
[Generate] Add bad words list argument to the generate function (#3367)
patrickvonplaten Mar 31, 2020
50e15c8
Tokenizers: Start cleaning examples a little (#3455)
julien-c Apr 1, 2020
c1a6252
Create model card (#3557)
mrm8488 Apr 1, 2020
8538ce9
Add tiny-bert-bahasa-cased model card (#3567)
huseinzol05 Apr 1, 2020
b815edf
[T5, Testst] Add extensive hard-coded integration tests and make sure…
patrickvonplaten Apr 1, 2020
9de9ceb
Correct output shape for Bert NSP models in docs (#3482)
Genius1237 Apr 1, 2020
06dd597
fix bug in warnings T5 pipelines (#3545)
patrickvonplaten Apr 1, 2020
a4ee4da
[T5, TF 2.2] change tf t5 argument naming (#3547)
patrickvonplaten Apr 1, 2020
ab5d06a
[T5, examples] replace heavy t5 models with tiny random models (#3556)
patrickvonplaten Apr 2, 2020
390c128
[Encoder-Decoder] Force models outputs to always have batch_size as t…
patrickvonplaten Apr 2, 2020
1b10159
Adding should_continue check for retraining (#3509)
xeb Apr 2, 2020
c50aa67
Resizing embedding matrix before sending it to the optimizer. (#3532)
Apr 2, 2020
f68d228
delete bogus print statement (#3595)
patrickvonplaten Apr 2, 2020
ddb1ce7
added model_cards for polish squad models
Apr 2, 2020
9f6349a
Create README.md
ahotrod Apr 1, 2020
81484b4
Create README.md (#3568)
redewiedergabe Apr 3, 2020
8e287d5
corrected mistake in polish model cards (#3611)
borhenryk Apr 3, 2020
e91692f
Update README.md (#3603)
ahotrod Apr 3, 2020
1ac6a24
Update README.md (#3604)
ahotrod Apr 3, 2020
216e167
Added albert-base-bahasa-cased README and fixed tiny-bert-bahasa-case…
huseinzol05 Apr 3, 2020
8594dd8
BertJapaneseTokenizer accept options for mecab (#3566)
tamuhey Apr 3, 2020
d5d7d88
ELECTRA (#3257)
LysandreJik Apr 3, 2020
c6acd24
Speed up GELU computation with torch.jit (#2988)
mryab Apr 3, 2020
3e4b4dd
[model_cards] Link to ExBERT visualisation
julien-c Apr 4, 2020
243e687
Create model card
mrm8488 Apr 3, 2020
94eb68d
weigths*weights
julien-c Apr 4, 2020
5d912e7
Tweak typing for #3566
julien-c Apr 4, 2020
fd9995e
Create README.md
ktrapeznikov Apr 4, 2020
ac40eed
Create README.md
ktrapeznikov Apr 4, 2020
4ab8ab4
Adjust model card to reflect changes to vocabulary
Timoeller Apr 3, 2020
b809d2f
Fix TF T5 docstring (#3636)
patrickvonplaten Apr 5, 2020
1789c7d
fix argument order (#3637)
patrickvonplaten Apr 5, 2020
2ee4105
[Generate, Test] Split generate test function into beam search, no be…
patrickvonplaten Apr 6, 2020
36bffc8
Release: v2.8.0
LysandreJik Apr 6, 2020
11c3257
unpin isort for pypi
LysandreJik Apr 6, 2020
ea6dba2
Re-pin isort
LysandreJik Apr 6, 2020
39a34cc
[model_cards] ELECTRA (w/ examples of usage)
julien-c Apr 6, 2020
261c4ff
Update notebooks (#3620)
LysandreJik Apr 6, 2020
529534d
BioMed Roberta-Base (AllenAI) (#3643)
kernelmachine Apr 6, 2020
47e1334
Add model card for BERTeus (#3649)
jjacampos Apr 6, 2020
760872d
Create README.md (#3662)
MichalMalyska Apr 6, 2020
6903a98
Create README.md
mrm8488 Apr 6, 2020
c4bcb01
Create model card (#3654)
mrm8488 Apr 6, 2020
769b60f
Add model card (#3655)
mrm8488 Apr 6, 2020
6bec88c
Create README.md
mrm8488 Apr 6, 2020
43eca3f
Add model card
mrm8488 Apr 6, 2020
326e6eb
Add model card
mrm8488 Apr 6, 2020
0ac33dd
Create README.md
ktrapeznikov Apr 6, 2020
e52d125
Fix RoBERTa/XLNet Pad Token in run_multiple_choice.py (#3631)
ethanjperez Apr 6, 2020
96ab75b
Tokenizers v3.0.0 (#3185)
mfuntowicz Apr 6, 2020
0a9d09b
fixed TransfoXLLMHeadModel documentation (#3661)
TevenLeScao Apr 6, 2020
11cc1e1
[model_cards] Turn down spurious warnings
julien-c Apr 7, 2020
5aa8a27
Fix roberta checkpoint conversion script (#3642)
Apr 7, 2020
0a4b106
Speedup torch summarization tests (#3663)
sshleifer Apr 7, 2020
05deb52
Optimize causal mask using torch.where (#2715)
Akababa Apr 7, 2020
80fa0f7
[Examples, Benchmark] Improve benchmark utils (#3674)
patrickvonplaten Apr 7, 2020
b0ad069
[Tokenization] fix edge case for bert tokenization (#3517)
patrickvonplaten Apr 7, 2020
5 changes: 3 additions & 2 deletions README.md
@@ -164,8 +164,9 @@ At some point in the future, you'll be able to seamlessly move from pre-training
14. **[MMBT](https://github.com/facebookresearch/mmbt/)** (from Facebook), released together with the paper a [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/pdf/1909.02950.pdf) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine.
15. **[FlauBERT](https://github.com/getalp/Flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
16. **[BART](https://github.com/pytorch/fairseq/tree/master/examples/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
17. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
18. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
17. **[ELECTRA](https://github.com/google-research/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
18. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
19. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.

These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).

2 changes: 2 additions & 0 deletions docs/README.md
@@ -47,6 +47,8 @@ Once you have setup `sphinx`, you can build the documentation by running the fol
make html
```

A folder called ``_build/html`` should have been created. You can now open the file ``_build/html/index.html`` in your browser.

---
**NOTE**

2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -26,7 +26,7 @@
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'2.6.0'
release = u'2.8.0'


# -- General configuration ---------------------------------------------------
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -104,3 +104,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/flaubert
model_doc/bart
model_doc/t5
model_doc/electra
2 changes: 1 addition & 1 deletion docs/source/migration.md
@@ -27,7 +27,7 @@ loss = outputs[0]
# In transformers you can also have access to the logits:
loss, logits = outputs[:2]

# And even the attention weigths if you configure the model to output them (and other outputs too, see the docstrings and documentation)
# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs
115 changes: 115 additions & 0 deletions docs/source/model_doc/electra.rst
@@ -0,0 +1,115 @@
ELECTRA
----------------------------------------------------

The ELECTRA model was proposed in the paper
`ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators <https://openreview.net/pdf?id=r1xMH1BtvB>`__.
ELECTRA is a new pre-training approach which trains two transformer models: the generator and the discriminator. The
generator's role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator,
which is the model we're interested in, tries to identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

*Masked language modeling (MLM) pre-training methods such as BERT corrupt
the input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. While they produce good results when transferred
to downstream NLP tasks, they generally require large amounts of compute to be
effective. As an alternative, we propose a more sample-efficient pre-training task
called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small
generator network. Then, instead of training a model that predicts the original
identities of the corrupted tokens, we train a discriminative model that predicts
whether each token in the corrupted input was replaced by a generator sample
or not. Thorough experiments demonstrate this new pre-training task is more
efficient than MLM because the task is defined over all input tokens rather than
just the small subset that was masked out. As a result, the contextual representations
learned by our approach substantially outperform the ones learned by BERT
given the same model size, data, and compute. The gains are particularly strong
for small models; for example, we train a model on one GPU for 4 days that
outperforms GPT (trained using 30x more compute) on the GLUE natural language
understanding benchmark. Our approach also works well at scale, where it
performs comparably to RoBERTa and XLNet while using less than 1/4 of their
compute and outperforms them when using the same amount of compute.*

Tips:

- ELECTRA is a pre-training approach, so almost no changes were made to the underlying model (BERT). The
only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
while the hidden size is larger. An additional (linear) projection layer is used to project the embeddings from
their embedding size to the hidden size. If the embedding size is the same as the hidden size, no
projection layer is used.
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
contain both the generator and the discriminator. The conversion script requires the user to specify which model to export
into which architecture. Once converted to the HuggingFace format, these checkpoints may nonetheless be loaded into all
available ELECTRA models. This means that the discriminator may be loaded in the ``ElectraForMaskedLM`` model,
and the generator may be loaded in the ``ElectraForPreTraining`` model (the classification head will be randomly
initialized as it doesn't exist in the generator). A usage sketch follows.
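
The sketch below assumes the ``google/electra-small-discriminator`` checkpoint name (an assumption for illustration; any converted ELECTRA discriminator works the same way). The discriminator outputs one score per token, indicating whether that token was replaced by the generator.

::

    from transformers import ElectraTokenizer, ElectraForPreTraining

    # load a converted discriminator checkpoint (checkpoint name assumed for illustration)
    tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
    model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

    input_ids = tokenizer.encode("The quick brown fox fake over the lazy dog", return_tensors="pt")
    logits = model(input_ids)[0]        # shape (batch_size, sequence_length)
    is_replaced = (logits > 0).long()   # 1 = predicted as replaced, 0 = predicted as original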


ElectraConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraConfig
:members:


ElectraTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraTokenizer
:members:


ElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraModel
:members:


ElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForPreTraining
:members:


ElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForMaskedLM
:members:


ElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForTokenClassification
:members:


TFElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraModel
:members:


TFElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForPreTraining
:members:


TFElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForMaskedLM
:members:


TFElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForTokenClassification
:members:
40 changes: 36 additions & 4 deletions docs/source/model_doc/t5.rst
@@ -16,13 +16,45 @@ To facilitate future work on transfer learning for NLP, we release our dataset,

The Authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_ .

Training
~~~~~~~~~~~~~~~~~~~~
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing.
This means that for training we always need an input sequence and a target sequence.
The input sequence is fed to the model using ``input_ids``. The target sequence is shifted to the right, *i.e.* prepended by a start-sequence token and fed to the decoder using the ``decoder_input_ids``. In teacher-forcing style, the target sequence is then appended with the EOS token and corresponds to the ``lm_labels``. The PAD token is used as the start-sequence token.
T5 can be trained / fine-tuned both in a supervised and an unsupervised fashion.

- Unsupervised denoising training
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a.* unique mask tokens)
and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
Each sentinel token represents a unique mask token for this sentence and should start with ``<extra_id_1>``, ``<extra_id_2>``, ... up to ``<extra_id_100>``. By default, 100 sentinel tokens are available in ``T5Tokenizer``.
*E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:

::

input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)

- Supervised training
In this setup the input sequence and output sequence form a standard sequence-to-sequence input-output mapping.
In translation, *e.g.* the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar." should
be processed as follows:

::

input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
lm_labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)

Tips
~~~~~~~~~~~~~~~~~~~~
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised
and supervised tasks and which each task is cast as a sequence to sequence task.
Therefore T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: *translate English to German: ..., summarize: ...*.
For more information about the which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generating the decoder output.
and supervised tasks and for which each task is converted into a text-to-text format.
T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: *translate English to German: ..., summarize: ...*.
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output. See the sketch after this list.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
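
As a quick illustration of the ``generate()`` tip above, here is a translation sketch (assuming the ``t5-small`` checkpoint; prompt and generation settings are illustrative only):

::

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # the task prefix tells T5 which task to perform
    input_ids = tokenizer.encode(
        "translate English to German: The house is wonderful.", return_tensors="pt"
    )
    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0]))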


1 change: 1 addition & 0 deletions docs/source/notebooks.md
16 changes: 0 additions & 16 deletions docs/source/notebooks.rst

This file was deleted.
