1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -103,3 +103,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/xlmroberta
model_doc/flaubert
model_doc/bart
model_doc/t5
69 changes: 69 additions & 0 deletions docs/source/model_doc/t5.rst
@@ -0,0 +1,69 @@
T5
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress. If you see something strange,
file a `GitHub Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_.

Overview
~~~~~~~~~~~~~~~~~~~~
The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu.
Here is the abstract:

*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format.
Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.*

The authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_.

Tips
~~~~~~~~~~~~~~~~~~~~
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised
  and supervised tasks, each of which is cast as a sequence-to-sequence task.
  T5 therefore works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g. for translation: *translate English to German: ...*, for summarization: *summarize: ...*.
  For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_.
- For sequence-to-sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()`` (see the sketch below). The method takes care of feeding the encoded input to the decoder via the cross-attention layers and auto-regressively generating the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
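
Below is a minimal usage sketch of the prefix + ``generate()`` workflow described above. It is an illustrative example (not the library's reference documentation) and assumes the ``t5-base`` checkpoint and a PyTorch backend::

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    # The task prefix tells the multi-task checkpoint which task to perform.
    input_ids = tokenizer.encode(
        "translate English to German: The house is wonderful.", return_tensors="pt"
    )

    # generate() encodes the input once and decodes auto-regressively.
    output_ids = model.generate(input_ids)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Generation defaults (beam size, maximum length, etc.) may need to be adjusted per task via the arguments of ``generate()``.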


T5Config
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Config
:members:


T5Tokenizer
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Tokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary


T5Model
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5Model
:members:


T5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.T5ForConditionalGeneration
:members:


TFT5Model
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFT5Model
:members:


TFT5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFT5ForConditionalGeneration
:members:
4 changes: 0 additions & 4 deletions docs/source/pretrained_models.rst
@@ -275,7 +275,6 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | | FlauBERT large architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``bart-large`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
@@ -285,6 +284,3 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | ``bart-large-cnn`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters (same as base) |
| | | | bart-large base architecture finetuned on cnn summarization task |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+


.. <https://huggingface.co/transformers/examples.html>`__
11 changes: 7 additions & 4 deletions examples/ner/utils_ner.py
@@ -112,12 +112,15 @@ def convert_examples_to_features(
label_ids = []
for word, label in zip(example.words, example.labels):
word_tokens = tokenizer.tokenize(word)
tokens.extend(word_tokens)
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

# bert-base-multilingual-cased sometimes outputs "nothing" ([]) when calling tokenize with just a space.
if len(word_tokens) > 0:
tokens.extend(word_tokens)
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

# Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
special_tokens_count = 3 if sep_token_extra else 2
special_tokens_count = tokenizer.num_added_tokens()
if len(tokens) > max_seq_length - special_tokens_count:
tokens = tokens[: (max_seq_length - special_tokens_count)]
label_ids = label_ids[: (max_seq_length - special_tokens_count)]
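
A hedged illustration of why the `len(word_tokens) > 0` guard above is needed (this assumes the `bert-base-multilingual-cased` checkpoint can be downloaded; exact behaviour may vary by tokenizer):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Whitespace-only "words" can tokenize to an empty list; without the guard this
# would append a label id with no corresponding token.
print(tokenizer.tokenize(" "))  # typically prints []
```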
23 changes: 8 additions & 15 deletions examples/summarization/bart/README.md
@@ -1,13 +1,15 @@
### Get the CNN Data
### Get Preprocessed CNN Data
To be able to reproduce the authors' results on the CNN/Daily Mail dataset, you first need to download both the CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") into the same folder. Then uncompress the archives by running:

```bash
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz
```

This should make a directory called `cnn_dm/` with files like `test.source`.
To use your own data, copy that file format: each article to be summarized is on its own line.
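
For instance, a custom input file could be assembled along these lines (hypothetical article texts and paths):

```python
# Write one article per line to cnn_dm/test.source (contents are illustrative).
articles = ["First article text ...", "Second article text ..."]
with open("cnn_dm/test.source", "w") as f:
    f.write("\n".join(articles) + "\n")
```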

### Usage
### Evaluation
To create summaries for each article in the dataset, run:
```bash
python evaluate_cnn.py <path_to_test.source> cnn_test_summaries.txt
@@ -16,21 +18,12 @@ the default batch size, 8, fits in 16GB GPU memory, but may need to be adjusted


### Training



After downloading the CNN and Daily Mail datasets, preprocess the dataset:
```commandline
git clone https://github.com/artmatsak/cnn-dailymail
cd cnn-dailymail && python make_datafiles.py ../cnn/stories/ ../dailymail/stories/
```

Run the training script: `run_train.sh`

Run/modify `run_train.sh`

### Where is the code?
The core model is in `src/transformers/modeling_bart.py`. This directory only contains examples.

### (WIP) Rouge Scores
## (WIP) Rouge Scores

### Stanford CoreNLP Setup
```
10 changes: 7 additions & 3 deletions examples/summarization/bart/test_bart_examples.py
@@ -1,4 +1,5 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -8,6 +9,8 @@
from .evaluate_cnn import _run_generate


output_file_name = "output_bart_sum.txt"

articles = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]

logging.basicConfig(level=logging.DEBUG)
@@ -19,10 +22,11 @@ class TestBartExamples(unittest.TestCase):
def test_bart_cnn_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
tmp = Path(tempfile.gettempdir()) / "utest_generations_bart_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", str(tmp), "output.txt"]
testargs = ["evaluate_cnn.py", str(tmp), output_file_name]
with patch.object(sys, "argv", testargs):
_run_generate()
self.assertTrue(Path("output.txt").exists())
self.assertTrue(Path(output_file_name).exists())
os.remove(Path(output_file_name))
2 changes: 1 addition & 1 deletion examples/summarization/t5/README.md
@@ -1,4 +1,4 @@
***This script evaluates the multitask pre-trained checkpoint for ``t5-large`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a model fine-tuned on summarization, so the results here will be worse by approx. 0.5 ROUGE points***
***This script evaluates the multitask pre-trained checkpoint for ``t5-base`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a model fine-tuned on summarization, so the results here will be worse by approx. 0.5 ROUGE points***

### Get the CNN Data
First, you need to download the CNN data. It's about 400 MB and can be downloaded by
14 changes: 10 additions & 4 deletions examples/summarization/t5/evaluate_cnn.py
@@ -14,13 +14,13 @@ def chunks(lst, n):
yield lst[i : i + n]


def generate_summaries(lns, output_file_path, batch_size, device):
def generate_summaries(lns, output_file_path, model_size, batch_size, device):
output_file = Path(output_file_path).open("w")

model = T5ForConditionalGeneration.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained(model_size)
model.to(device)

tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer = T5Tokenizer.from_pretrained(model_size)

# update config with summarization specific params
task_specific_params = model.config.task_specific_params
@@ -61,6 +61,12 @@ def calculate_rouge(output_lns, reference_lns, score_path):

def run_generate():
parser = argparse.ArgumentParser()
parser.add_argument(
"model_size",
type=str,
help="T5 model size, either 't5-small', 't5-base' or 't5-large'. Defaults to base.",
default="t5-base",
)
parser.add_argument(
"input_path", type=str, help="like cnn_dm/test_articles_input.txt",
)
@@ -83,7 +89,7 @@ def run_generate():

source_lns = [x.rstrip() for x in open(args.input_path).readlines()]

generate_summaries(source_lns, args.output_path, args.batch_size, args.device)
generate_summaries(source_lns, args.output_path, args.model_size, args.batch_size, args.device)

output_lns = [x.rstrip() for x in open(args.output_path).readlines()]
reference_lns = [x.rstrip() for x in open(args.reference_path).readlines()]
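
With the added `model_size` argument, `generate_summaries` could also be called directly, roughly as follows (a sketch with hypothetical paths, assuming it is run from `examples/summarization/t5`):

```python
from evaluate_cnn import generate_summaries  # assumes evaluate_cnn.py is importable from the working directory

source_lns = ["New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
# Positional order follows the updated signature: lns, output_file_path, model_size, batch_size, device.
generate_summaries(source_lns, "output_t5_sum.txt", "t5-small", batch_size=8, device="cpu")
```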
14 changes: 10 additions & 4 deletions examples/summarization/t5/test_t5_examples.py
@@ -1,4 +1,5 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -8,6 +9,9 @@
from .evaluate_cnn import run_generate


output_file_name = "output_t5_sum.txt"
score_file_name = "score_t5_sum.txt"

articles = ["New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]

logging.basicConfig(level=logging.DEBUG)
@@ -19,11 +23,13 @@ class TestT5Examples(unittest.TestCase):
def test_t5_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
tmp = Path(tempfile.gettempdir()) / "utest_generations_t5_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", str(tmp), "output.txt", str(tmp), "score.txt"]
testargs = ["evaluate_cnn.py", "t5-small", str(tmp), output_file_name, str(tmp), score_file_name]
with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path("output.txt").exists())
self.assertTrue(Path("score.txt").exists())
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))
25 changes: 20 additions & 5 deletions examples/translation/t5/test_t5_examples.py
@@ -1,4 +1,5 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -8,7 +9,11 @@
from .evaluate_wmt import run_generate


text = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
text = ["When Liana Barrientos was 23 years old, she got married in Westchester County."]
translation = ["Als Liana Barrientos 23 Jahre alt war, heiratete sie in Westchester County."]

output_file_name = "output_t5_trans.txt"
score_file_name = "score_t5_trans.txt"

logging.basicConfig(level=logging.DEBUG)

@@ -19,10 +24,20 @@ class TestT5Examples(unittest.TestCase):
def test_t5_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
with tmp.open("w") as f:

tmp_source = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.hypo"
with tmp_source.open("w") as f:
f.write("\n".join(text))
testargs = ["evaluate_cnn.py", str(tmp), "output.txt", str(tmp), "score.txt"]

tmp_target = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.target"
with tmp_target.open("w") as f:
f.write("\n".join(translation))

testargs = ["evaluate_wmt.py", str(tmp_source), output_file_name, str(tmp_target), score_file_name]

with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path("output.txt").exists())
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-german-cased/README.md
@@ -1,5 +1,6 @@
---
language: german
license: mit
---

# 🤗 + 📚 dbmdz German BERT models
@@ -1,5 +1,6 @@
---
language: german
license: mit
tags:
- "historic german"
---
@@ -1,5 +1,6 @@
---
language: german
license: mit
tags:
- "historic german"
---
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-german-uncased/README.md
@@ -1,5 +1,6 @@
---
language: german
license: mit
---

# 🤗 + 📚 dbmdz German BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-cased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-uncased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-xxl-cased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md
@@ -1,5 +1,6 @@
---
language: italian
license: mit
---

# 🤗 + 📚 dbmdz BERT models
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-128k-cased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-128k-uncased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-cased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/bert-base-turkish-uncased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Turkish BERT model
1 change: 1 addition & 0 deletions model_cards/dbmdz/distilbert-base-turkish-cased/README.md
@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---

# 🤗 + 📚 dbmdz Distilled Turkish BERT model