[WIP] seq2seq example #3402
Conversation
As a non-blocking question, I do note that a lot of the examples use argparse to parse comparatively long lists of arguments. I've maintained the extant style in this PR to avoid causing noise and confusion. Would it be acceptable if I broke with this style and used a JSON file to store all the arguments for an experiment?
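For concreteness, here is a minimal sketch of what the JSON-config alternative might look like. The --config flag, the file layout, and the key names below are assumptions for illustration, not part of this PR:

import argparse
import json

# Hypothetical: read a single --config path instead of many individual flags.
parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, required=True,
                    help="Path to a JSON file holding all experiment arguments")
cli_args = parser.parse_args()

# Load every experiment argument from the JSON file.
with open(cli_args.config) as f:
    config = json.load(f)

# Assumed keys; in practice they would mirror the existing argparse flags.
model_name_or_path = config["model_name_or_path"]
num_train_epochs = config.get("num_train_epochs", 3)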
That sounds good. I'm still tweaking things on my end for accuracy and improved logic as I get more familiar with the code base here. I'll see if I can rebase onto #3383 by then, depending on my other workload. Feel free to reach out via Google Hangouts if you're comfortable.
Commits updated from 8e3e8d4 to b782352 (latest: "…mbeddings as input to decoder when no decoder_input_ids or decoder_inputs_embeds provided").
features = []
...
cls_token, sep_token, pad_token = cls_token = (
    tokenizer.cls_token,
Hi, I've been using your example to guide my own setup of a seq2seq model (thank you!). Is this line in your pull request a typo? I think it's just supposed to be cls_token, sep_token, pad_token = ( ...
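For reference, the corrected assignment being suggested would presumably read as follows; the second and third tuple elements are assumed from the surrounding tokenizer attributes:

# Unpack the three special tokens in a single assignment; the extra
# "= cls_token =" in the diff above appears to be a typo.
cls_token, sep_token, pad_token = (
    tokenizer.cls_token,
    tokenizer.sep_token,
    tokenizer.pad_token,
)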
    formatted_tokens += [pad_token_label_id]
    segment_ids += [cls_token_segment_id]
# gpt2 has no cls_token
elif model_type not in ["gpt2"]:
I believe this is supposed to be elif model_type in ['gpt2'] instead of "not in".
Sorry to answer only now! I'll soon add an Encoder-Decoder Google Colab that shows how to use seq2seq.
Thanks - fine to close. We've moved forward without using seq2seq due to poor overall accuracy given the scale of data we have in place.
This PR presents an example seq2seq use case, together with the bug fixes needed for it to run with reasonable accuracy.
The utils_seq2seq.py file defines the data format for training data, and run_seq2seq.py takes training, development, and test data and produces a model. The README.md explains how to run the toy problem. The specific toy problem used here is formatting a date string into the American style, which is deliberately trivial; on my local setup with GPUs, the example runs within 5 minutes. A production model would, of course, need to learn considerably more than this.
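As a rough illustration of the toy task (the exact file format consumed by utils_seq2seq.py is not reproduced here; these pairs are invented for illustration):

# Hypothetical source/target pairs for the toy problem: rewrite a date
# string into the American (MM/DD/YYYY) style.
toy_pairs = [
    ("24 March 2020", "03/24/2020"),
    ("2020-03-24", "03/24/2020"),
    ("March 24, 2020", "03/24/2020"),
]

for source_text, target_text in toy_pairs:
    print(f"{source_text} -> {target_text}")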
I welcome feedback on how to strengthen performance here and on the best way to expand test coverage.
This relies on a few bug fixes, which have been incorporated in this branch.
I strongly suspect that the input to the decoder in the PreTrainedEncoderDecoder class is incorrect as it stands in the code base, and commit 9fcf73a has a proposed fix. It doesn't make sense to feed the expected token ids to the decoder when the decoder needs to learn how to decode from the embeddings. (Incomplete understanding - will fix.)
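To make the terms in this discussion concrete, here is a generic PyTorch sketch, independent of the transformers PreTrainedEncoderDecoder API and of the proposed fix in 9fcf73a, showing where the encoder output and the target token ids each enter a standard seq2seq decoder during training:

import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src_ids = torch.randint(0, vocab_size, (1, 10))  # source token ids
tgt_ids = torch.randint(0, vocab_size, (1, 8))   # expected target token ids

# The encoder output reaches the decoder through cross-attention ("memory").
# During teacher-forced training the decoder's own input is the target
# sequence shifted right; at inference time it is whatever has been
# generated so far, never the full expected ids.
src_embeds = embed(src_ids)
decoder_in = embed(tgt_ids[:, :-1])  # shifted target as decoder input
out = model(src_embeds, decoder_in)  # shape (1, 7, d_model)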