1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -103,3 +103,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/xlmroberta
model_doc/flaubert
model_doc/bart
model_doc/t5
69 changes: 69 additions & 0 deletions docs/source/model_doc/t5.rst
@@ -0,0 +1,69 @@
T5
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress. If you see something strange,
file a `GitHub Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_.

Overview
~~~~~~~~~~~~~~~~~~~~
The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu.
Here is the abstract:

*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format.
Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.*

The authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_.

Tips
~~~~~~~~~~~~~~~~~~~~
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised
  and supervised tasks, each of which is cast as a sequence-to-sequence task.
  T5 therefore works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g. for translation: *translate English to German: ...*, for summarization: *summarize: ...*.
  For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_.
- For sequence-to-sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()`` (see the sketch below). The method takes care of feeding the encoded input to the decoder via the cross-attention layers and auto-regressively generating the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
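
Below is a minimal usage sketch of the prefix + ``generate()`` workflow described above. It is an illustrative example (not the library's reference documentation) and assumes the ``t5-base`` checkpoint and a PyTorch backend::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    # The task prefix tells the multi-task checkpoint which task to perform.
    input_ids = tokenizer.encode(
        "translate English to German: The house is wonderful.", return_tensors="pt"
    )

    # generate() encodes the input once and decodes auto-regressively.
    output_ids = model.generate(input_ids)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Generation defaults (beam size, maximum length, etc.) may need to be adjusted per task via the arguments of ``generate()``.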


T5Config
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Config
:members:


T5Tokenizer
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Tokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary


T5Model
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Model
:members:


T5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5ForConditionalGeneration
:members:


TFT5Model
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFT5Model
:members:


TFT5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFT5ForConditionalGeneration
:members:
4 changes: 0 additions & 4 deletions docs/source/pretrained_models.rst
@@ -275,7 +275,6 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | | FlauBERT large architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``bart-large`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
@@ -285,6 +284,3 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | ``bart-large-cnn`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters (same as base) |
| | | | bart-large base architecture finetuned on cnn summarization task |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+


.. <https://huggingface.co/transformers/examples.html>`__
11 changes: 7 additions & 4 deletions examples/ner/utils_ner.py
@@ -112,12 +112,15 @@ def convert_examples_to_features(
label_ids = []
for word, label in zip(example.words, example.labels):
word_tokens = tokenizer.tokenize(word)
tokens.extend(word_tokens)
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

# bert-base-multilingual-cased sometimes outputs "nothing" ([]) when calling tokenize with just a space.
if len(word_tokens) > 0:
tokens.extend(word_tokens)
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

# Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
special_tokens_count = 3 if sep_token_extra else 2
special_tokens_count = tokenizer.num_added_tokens()
if len(tokens) > max_seq_length - special_tokens_count:
tokens = tokens[: (max_seq_length - special_tokens_count)]
label_ids = label_ids[: (max_seq_length - special_tokens_count)]
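
A hedged illustration of why the `len(word_tokens) > 0` guard above is needed (this assumes the `bert-base-multilingual-cased` checkpoint can be downloaded; exact behaviour may vary by tokenizer):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Whitespace-only "words" can tokenize to an empty list; without the guard this
# would append a label id with no corresponding token.
print(tokenizer.tokenize(" "))  # typically prints []
```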
23 changes: 8 additions & 15 deletions examples/summarization/bart/README.md
@@ -1,13 +1,15 @@
### Get the CNN Data
### Get Preprocessed CNN Data
To be able to reproduce the authors' results on the CNN/Daily Mail dataset, you first need to download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress the archives by running:

```bash
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz
```

This should make a directory called `cnn_dm/` with files like `test.source`.
To use your own data, copy that file format: each article to be summarized is on its own line.
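
For instance, a custom input file could be assembled along these lines (hypothetical article texts and paths):

```python
# Write one article per line to cnn_dm/test.source (contents are illustrative).
articles = ["First article text ...", "Second article text ..."]
with open("cnn_dm/test.source", "w") as f:
    f.write("\n".join(articles) + "\n")
```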

### Usage
### Evaluation
To create summaries for each article in the dataset, run:
```bash
python evaluate_cnn.py <path_to_test.source> cnn_test_summaries.txt
@@ -16,21 +18,12 @@ the default batch size, 8, fits in 16GB GPU memory, but may need to be adjusted


### Training



After downloading the CNN and Daily Mail datasets, preprocess the dataset:
```commandline
git clone https://github.com/artmatsak/cnn-dailymail
cd cnn-dailymail && python make_datafiles.py ../cnn/stories/ ../dailymail/stories/
```

Run the training script: `run_train.sh`

Run/modify `run_train.sh`

### Where is the code?
The core model is in `src/transformers/modeling_bart.py`. This directory only contains examples.

### (WIP) Rouge Scores
## (WIP) Rouge Scores

### Stanford CoreNLP Setup
```
10 changes: 7 additions & 3 deletions examples/summarization/bart/test_bart_examples.py
@@ -1,4 +1,5 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -8,6 +9,8 @@
from .evaluate_cnn import _run_generate


output_file_name = "output_bart_sum.txt"

articles = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]

logging.basicConfig(level=logging.DEBUG)
@@ -19,10 +22,11 @@ class TestBartExamples(unittest.TestCase):
def test_bart_cnn_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
tmp = Path(tempfile.gettempdir()) / "utest_generations_bart_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", str(tmp), "output.txt"]
testargs = ["evaluate_cnn.py", str(tmp), output_file_name]
with patch.object(sys, "argv", testargs):
_run_generate()
self.assertTrue(Path("output.txt").exists())
self.assertTrue(Path(output_file_name).exists())
os.remove(Path(output_file_name))
2 changes: 1 addition & 1 deletion examples/summarization/t5/README.md
@@ -1,4 +1,4 @@
***This script evaluates the multitask pre-trained checkpoint for ``t5-large`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a model fine-tuned on summarization, so the results here will be worse by approx. 0.5 ROUGE points***
***This script evaluates the multitask pre-trained checkpoint for ``t5-base`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a model fine-tuned on summarization, so the results here will be worse by approx. 0.5 ROUGE points***

### Get the CNN Data
First, you need to download the CNN data. It's about 400 MB and can be downloaded by
14 changes: 10 additions & 4 deletions examples/summarization/t5/evaluate_cnn.py
@@ -14,13 +14,13 @@ def chunks(lst, n):
yield lst[i : i + n]


def generate_summaries(lns, output_file_path, batch_size, device):
def generate_summaries(lns, output_file_path, model_size, batch_size, device):
output_file = Path(output_file_path).open("w")

model = T5ForConditionalGeneration.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained(model_size)
model.to(device)

tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer = T5Tokenizer.from_pretrained(model_size)

# update config with summarization specific params
task_specific_params = model.config.task_specific_params
@@ -61,6 +61,12 @@ def calculate_rouge(output_lns, reference_lns, score_path):

def run_generate():
parser = argparse.ArgumentParser()
parser.add_argument(
"model_size",
type=str,
help="T5 model size, either 't5-small', 't5-base' or 't5-large'. Defaults to base.",
default="t5-base",
)
parser.add_argument(
"input_path", type=str, help="like cnn_dm/test_articles_input.txt",
)
@@ -83,7 +89,7 @@ def run_generate():

source_lns = [x.rstrip() for x in open(args.input_path).readlines()]

generate_summaries(source_lns, args.output_path, args.batch_size, args.device)
generate_summaries(source_lns, args.output_path, args.model_size, args.batch_size, args.device)

output_lns = [x.rstrip() for x in open(args.output_path).readlines()]
reference_lns = [x.rstrip() for x in open(args.reference_path).readlines()]
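
With the added `model_size` argument, `generate_summaries` could also be called directly, roughly as follows (a sketch with hypothetical paths, assuming it is run from `examples/summarization/t5`):

```python
from evaluate_cnn import generate_summaries  # assumes evaluate_cnn.py is importable from the working directory

source_lns = ["New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
# Positional order follows the updated signature: lns, output_file_path, model_size, batch_size, device.
generate_summaries(source_lns, "output_t5_sum.txt", "t5-small", batch_size=8, device="cpu")
```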
14 changes: 10 additions & 4 deletions examples/summarization/t5/test_t5_examples.py
@@ -1,4 +1,5 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -8,6 +9,9 @@
from .evaluate_cnn import run_generate


output_file_name = "output_t5_sum.txt"
score_file_name = "score_t5_sum.txt"

articles = ["New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]

logging.basicConfig(level=logging.DEBUG)
@@ -19,11 +23,13 @@ class TestT5Examples(unittest.TestCase):
def test_t5_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
tmp = Path(tempfile.gettempdir()) / "utest_generations_t5_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", str(tmp), "output.txt", str(tmp), "score.txt"]
testargs = ["evaluate_cnn.py", "t5-small", str(tmp), output_file_name, str(tmp), score_file_name]
with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path("output.txt").exists())
self.assertTrue(Path("score.txt").exists())
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))
25 changes: 20 additions & 5 deletions examples/translation/t5/test_t5_examples.py
@@ -1,4 +1,5 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -8,7 +9,11 @@
from .evaluate_wmt import run_generate


text = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
text = ["When Liana Barrientos was 23 years old, she got married in Westchester County."]
translation = ["Als Liana Barrientos 23 Jahre alt war, heiratete sie in Westchester County."]

output_file_name = "output_t5_trans.txt"
score_file_name = "score_t5_trans.txt"

logging.basicConfig(level=logging.DEBUG)

@@ -19,10 +24,20 @@ class TestT5Examples(unittest.TestCase):
def test_t5_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
with tmp.open("w") as f:

tmp_source = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.hypo"
with tmp_source.open("w") as f:
f.write("\n".join(text))
testargs = ["evaluate_cnn.py", str(tmp), "output.txt", str(tmp), "score.txt"]

tmp_target = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.target"
with tmp_target.open("w") as f:
f.write("\n".join(translation))

testargs = ["evaluate_wmt.py", str(tmp_source), output_file_name, str(tmp_target), score_file_name]

with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path("output.txt").exists())
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-german-cased/README.md
@@ -1,5 +1,6 @@
---
language: german
license: mit
---

# 🤗 + 📚 dbmdz German BERT models
@@ -1,5 +1,6 @@
---
language: german
license: mit
tags:
- "historic german"
---
@@ -1,5 +1,6 @@
---
language: german
license: mit
tags:
- "historic german"
---
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-german-uncased/README.md
@@ -1,5 +1,6 @@
---
language: german
license: mit
---

# 🤗 + 📚 dbmdz German BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-cased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-uncased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-xxl-cased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-128k-cased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-128k-uncased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-cased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-uncased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/distilbert-base-turkish-cased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Distilled Turkish BERT model