
Conversation

Contributor

@patrickvonplaten patrickvonplaten commented Mar 24, 2020

Adds a TF 2.0 example for T5 summarization.

Adds a dataset download file via tensorflow_datasets and a ROUGE scorer.

The example is currently being tested with T5-large on GPU to see how the ROUGE scorer performs in comparison to the examples/summarization/bart ROUGE scorer.

max_length=max_length,
min_length=min_length,
early_stopping=True,
bos_token_id=tokenizer.pad_token_id,
Contributor Author

@patrickvonplaten patrickvonplaten Mar 24, 2020


@craffel -
I wanted to add an example for T5 summarization for the CNN/DM dataset.

These are the default params I use for T5 summarization. I saw in the paper that you use num_beams=4 as well, and from the code:
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/gin/beam_search.gin
and
https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/gin/sequence_lengths/cnn_dailymail_v002.gin, I assumed that you pad/cut to 512 tokens. transformers uses the "simple" length_penalty instead of the "google" length_penalty, so instead of alpha=0.6 I set the length penalty here to 2.0 (the default parameter for Bart) - do you think that is good? And regarding min_length and max_length, I am not sure whether these params are good either - they are copied from the Bart defaults.
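For concreteness, a rough sketch (not from this PR) of how the two length normalizations compare, assuming transformers' "simple" penalty divides the summed log-probability by `length ** length_penalty` while the "google" (GNMT-style) penalty used in the T5 codebase divides by `((5 + length) / 6) ** alpha`; the function names below are made up for illustration.

```python
# Sketch of the two beam-score length normalizations discussed above.
# Both penalties grow with hypothesis length, so both favor longer
# hypotheses relative to unnormalized beam search, but at different rates.

def simple_penalty(length: int, length_penalty: float = 2.0) -> float:
    """Normalization factor in the style of transformers' beam search."""
    return length ** length_penalty

def google_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT-style normalization factor (Wu et al., 2016)."""
    return ((5 + length) / 6) ** alpha

def normalized_score(sum_log_prob: float, length: int, penalty) -> float:
    """Divide an accumulated log-probability by a length penalty."""
    return sum_log_prob / penalty(length)

for n in (20, 60, 142):
    print(n, round(simple_penalty(n), 1), round(google_penalty(n), 3))
```

Because the two factors grow so differently, a given alpha for the "google" penalty has no exact equivalent length_penalty in the "simple" scheme, which is why the mapping above is a judgment call rather than a conversion.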

@craffel

Hey Patrick, thanks for making this. Some notes:

  • The released checkpoints are after multitask pre-training. You will get decent summarization results with them, but in order to get our SoTA-at-the-time summarization results you'll need to finetune the model further before generating from it. You can probably ignore that for the sake of an example, but it may be worth mentioning in the README.
  • We didn't play with the num beams or length penalty at all; we just set them to values that had been used in the past for machine translation and never tweaked them. So, your values are probably fine (especially since you are already going to be losing some performance without fine-tuning).
  • We don't have min length/max length functionality in our code; we just let the model choose whatever length. If BART uses this trick, it probably helps, and it seems somewhat model-agnostic so we can just use their values.
  • The model should work equivalently whether you pad to a max length or not. You should also be able to use a sequence length longer than 512 and it should work fine too.
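As a side note on how a min_length constraint can be enforced at all: one common mechanism (the idea behind transformers' generation code) is to mask the end-of-sequence logit until the hypothesis is long enough, so EOS can never be chosen early. A minimal stand-alone sketch, with hypothetical names; this is illustrative, not the library's actual implementation.

```python
import math

def mask_eos_below_min_length(logits, cur_len, min_length, eos_token_id):
    """Return a copy of `logits` where the EOS entry is set to -inf
    while the current hypothesis is shorter than `min_length`."""
    logits = list(logits)
    if cur_len < min_length:
        logits[eos_token_id] = -math.inf
    return logits

# Toy vocabulary of 4 tokens with EOS id 3: at step 2 of a min-length-5
# generation, the EOS logit is masked out.
logits = [0.1, 2.3, -0.5, 1.7]
masked = mask_eos_below_min_length(logits, cur_len=2, min_length=5, eos_token_id=3)
print(masked)
```

Once `cur_len` reaches `min_length`, the logits pass through unchanged and the model is free to end the sequence, which matches the "just let the model choose" behavior when min_length is 0.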

Contributor Author

Thanks a lot!

Install `files2rouge` following the instructions [here](https://github.com/pltrdy/files2rouge).
I also needed to run `sudo apt-get install libxml-parser-perl`.

@craffel

Any reason not to use a Python ROUGE scorer, e.g.
https://github.com/google-research/google-research/tree/master/rouge
That way you can just score within the evaluate_cnn.py file instead of having people run this separate code.

Contributor Author

Added the google-research ROUGE scorer :-). It works well, but it's not the nicest Python API for scoring a list of strings :D.
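For readers following along, the scorer works one (reference, prediction) pair at a time, so scoring a dataset means looping and averaging yourself - roughly as in the sketch below. `rouge1_f` here is a deliberately simplified unigram-overlap stand-in for the real scorer, not the actual google-research implementation, and all names are illustrative.

```python
from collections import Counter

def rouge1_f(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1 F1: unigram-overlap precision/recall harmonic mean."""
    ref, pred = Counter(reference.split()), Counter(prediction.split())
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def score_corpus(references, predictions):
    """Average a per-pair scorer over parallel lists of strings."""
    scores = [rouge1_f(r, p) for r, p in zip(references, predictions)]
    return sum(scores) / len(scores)

refs = ["the cat sat on the mat", "a quick brown fox"]
preds = ["the cat sat on a mat", "a quick brown dog"]
print(round(score_corpus(refs, preds), 3))
```

Wrapping the per-pair call this way is essentially what an evaluate_cnn.py-style script needs so that scoring happens inline rather than via a separate tool.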

Member

@thomwolf thomwolf left a comment

Ok for me, I'll let you take care of @craffel's comments.

Contributor

@sshleifer sshleifer left a comment

pending my comments

@patrickvonplaten
Contributor Author

> pending my comments

Very much down to share the summarization code in another PR!

@patrickvonplaten patrickvonplaten force-pushed the add_t5_summarization_example branch from 2cda9e8 to 5d1c4a6 on March 26, 2020 at 13:40
@patrickvonplaten
Contributor Author

The code quality test fails because of the unpinned isort library (see #3449).

Member

@thomwolf thomwolf left a comment

Ok for me!

@@ -0,0 +1,25 @@
***This script evaluates the [T5 Model](https://arxiv.org/pdf/1910.10683.pdf) ``t5-large`` on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a ``t5-large`` model fine-tuned on summarization, so that results will be slightly worse here***

  1. Suggest changing "the T5 Model t5-large" to "the multitask pre-trained checkpoint for t5-large".
  2. If you want to be more specific than "slightly", it looks like they are about 0.5 ROUGE points (on all ROUGE variants) lower than they would be with fine-tuning.

@thomwolf thomwolf merged commit e703e92 into huggingface:master Mar 26, 2020