Conversation

@patrickvonplaten (Contributor)

This PR:

  • adds T5 to the summarization pipeline.
  • adds warnings and better defaults to Bart/T5 summarization
  • removes unnecessary assert in generate() function
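
For context, here is a minimal usage sketch of what this enables once merged (the checkpoint name and generation kwargs below are illustrative, not taken from the PR):

from transformers import pipeline

# Summarization pipeline backed by a T5 checkpoint (t5-small used for illustration).
summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")

article = "New York (CNN) Some long news article to be summarized ..."
# Extra kwargs are forwarded to model.generate().
print(summarizer(article, max_length=140, min_length=20, do_sample=False))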

@codecov-io commented Mar 24, 2020

Codecov Report

Merging #3413 into master will increase coverage by 0.02%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##           master    #3413      +/-   ##
==========================================
+ Coverage   77.56%   77.58%   +0.02%     
==========================================
  Files         100      100              
  Lines       16970    16993      +23     
==========================================
+ Hits        13162    13184      +22     
- Misses       3808     3809       +1     
Impacted Files                           Coverage          Δ
src/transformers/modeling_utils.py       91.71% <ø>        (-0.02%) ⬇️
src/transformers/pipelines.py            73.05% <93.10%>   (+0.52%) ⬆️
src/transformers/modeling_tf_utils.py    84.44% <100.00%>  (+0.52%) ⬆️
src/transformers/tokenization_t5.py      95.89% <100.00%>  (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e392ba6...23778d1.

@sshleifer (Contributor) left a comment


From my perspective, looks good. Would love test coverage to avoid breaking it accidentally!

Contributor

In a previous PR, @thomwolf suggested that if a kwarg (like length_penalty) is on the config, it should only be exposed through generate_kwargs.

@patrickvonplaten (Contributor, Author)

I would favor (re-)defining kwargs like max_length, min_length, do_sample, and length_penalty in this function, because the defaults in the config are defaults for generate() in general, not good defaults for summarization. To me, pipelines are about being easy to use, so I think good general defaults for summarization should be defined here. Happy to discuss.
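
A rough sketch of the defaulting pattern being discussed (illustrative only; the helper name and the default values are not from this PR):

def _apply_summarization_defaults(generate_kwargs: dict) -> dict:
    # Fill in task-appropriate defaults without clobbering anything the
    # caller set explicitly; callers can still override every value.
    defaults = {"max_length": 140, "min_length": 20, "do_sample": False, "length_penalty": 2.0}
    for key, value in defaults.items():
        generate_kwargs.setdefault(key, value)
    return generate_kwargs

With setdefault, explicit user arguments always win over the pipeline's summarization defaults.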

@thomwolf (Member) commented Mar 25, 2020

Ok, that's quite an interesting case for T5 versus Bart: in Bart the pretrained models are task-specific, so it makes sense to have the task-specific generation hyper-parameters in the config, while for T5 the pretrained model is generic, so it would make more sense to have a way to specify task-specific generation HP in the config.

In general, I'm not a fan of having model-specific hyper-parameters in general classes like pipelines.

What I would propose is maybe the following for T5 (open to discussion):

  • have generic generation HP in the config, like now
  • also have a dict of task-specific generation HP in the config which can override the generic HP.

Example config dict:

{
    ....
    "max_length": 100,  # generic generation HP
    "length_penalty": 1.0,
    "task_specific_generation_params": {
        "summarization": {  # task id (e.g. name of the pipeline?)
            "max_length": 140,
            "length_penalty": 2.0,
        },
        "translation_en_to_de": {
            "max_length": 160,
            "length_penalty": 3.0,
        },
    },
}

What do you think?
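
For illustration only (not something in this PR), a pipeline could resolve the final generation parameters from such a config roughly like this; the helper name and attribute-access pattern are assumptions:

def resolve_generation_params(config, task: str) -> dict:
    # Start from the generic generation hyper-parameters on the config.
    params = {
        "max_length": getattr(config, "max_length", None),
        "length_penalty": getattr(config, "length_penalty", None),
    }
    # Let the task-specific block override the generic values when present.
    overrides = getattr(config, "task_specific_generation_params", {}) or {}
    params.update(overrides.get(task, {}))
    return {k: v for k, v in params.items() if v is not None}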

@patrickvonplaten (Contributor, Author)

Yeah, I like that idea!
Two questions:

  • Should we enforce that all parameters defined in task_specific_generation_params have to already be defined on the config, so that they can only override existing values?
  • Do we allow "summarization" and "translation" if the config has no task_specific_generation_params defined?

@sshleifer sshleifer changed the title Add t5 to pipelines Add t5 to pipeline(task='summarization') Mar 24, 2020
@patrickvonplaten (Contributor, Author)

Is there a reason why we would not optionally use pad_to_max_length here? I was not sure whether I could add it, but for summarization with batched input it is necessary. @LysandreJik @thomwolf @mfuntowicz
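
For context, this is the kind of batched encoding the comment refers to (a sketch; the checkpoint name is illustrative, and pad_to_max_length was the padding flag in the tokenizer API at the time):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
docs = ["first, fairly long article ...", "second, shorter article"]
# Without padding the two sequences have different lengths and cannot be
# stacked into a single batch tensor; pad_to_max_length pads them to equal length.
inputs = tokenizer.batch_encode_plus(docs, pad_to_max_length=True, return_tensors="pt")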

@thomwolf (Member) left a comment

Nice work. A few tweaks regarding handling task-specific generation HPs if you can do it (see my comments).

Comment on lines +1232 to +1243

Member

Ok for now but we'll probably refactor this in the future to have framework-agnostic ensure_tensor_on_device and get_tensor_length methods (maybe have a base class and framework-specific derived class for instance).
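
A sketch of the kind of split being suggested (class and method names here are hypothetical):

class FrameworkOps:
    # Framework-agnostic interface used by the pipeline.
    def ensure_tensor_on_device(self, tensor, device):
        raise NotImplementedError

    def get_tensor_length(self, tensor) -> int:
        raise NotImplementedError


class TorchOps(FrameworkOps):
    # PyTorch-specific implementation.
    def ensure_tensor_on_device(self, tensor, device):
        return tensor.to(device)

    def get_tensor_length(self, tensor) -> int:
        return tensor.shape[-1]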


      args = ["input_ids", "attention_mask"]

-     if not isinstance(self.model.config, (DistilBertConfig, XLMConfig, RobertaConfig, BartConfig)):
+     if not isinstance(self.model.config, (DistilBertConfig, XLMConfig, RobertaConfig, BartConfig, T5Config)):
Member

Ok, note that we can now remove inputs_for_model from pipelines since #3116.
Let's do that later, in another PR that cleans up pipelines.

@thomwolf thomwolf merged commit 9c683ef into huggingface:master Mar 26, 2020
@patrickvonplaten patrickvonplaten deleted the add_t5_to_pipelines branch March 26, 2020 10:37