Unofficial PyTorch implementation of
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning.
I pretrain ELECTRA-small from scratch and have successfully replicated the paper's results on GLUE.
Model | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE | Avg. |
---|---|---|---|---|---|---|---|---|---|
ELECTRA-Small | 54.6 | 89.1 | 83.7 | 80.3 | 88.0 | 79.7 | 87.7 | 60.8 | 78.0 |
ELECTRA-Small (mine) | 57.2 | 87.1 | 82.1 | 80.4 | 88.0 | 78.9 | 87.9 | 63.1 | 78.08 |
Results for models on the GLUE test set.
- You don't need to download and process datasets manually; the script takes care of that for you automatically. (Thanks to huggingface/nlp and huggingface/transformers)
- AFAIK, the closest reimplementation to the original one, taking care of many easily overlooked details (described below).
- AFAIK, the only one that has successfully validated itself by replicating the results in the paper.
- Comes with jupyter notebooks, in which you can explore the code and inspect the processed data.
- You don't need to download and preprocess anything yourself; all you need to do is run the training script.
Note: This project is actually for my personal research, so I didn't try to make it easy to use for all users, but I did try to make it easy to read and modify.
```
pip install fastai nlp transformers hugdatafast
```
- `python pretrain.py`
- Set `pretrained_checkpoint` in `finetune.py` to use the checkpoint you've pretrained and saved in `electra_pytorch/checkpoints/pretrain`.
- `python finetune.py` (with `do_finetune` set to `True`)
- Go to neptune, pick the best of the 10 runs for each task, and set `th_runs` in `finetune.py` according to the numbers in the names of the runs you picked.
- `python finetune.py` (with `do_finetune` set to `False`); this outputs predictions on the test set. You can then compress the `.tsv`s in `electra_pytorch/test_outputs/<group_name>/*.tsv` (see the sketch after this list) and send them to the GLUE site to get the test score.
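For reference, compressing the prediction files for submission can be done with a few lines of Python. This sketch is not part of the repo; the path below is a placeholder matching the output layout described further down, so replace `<group_name>` with the group name you configured.

```python
# Minimal sketch (not part of the repo) for zipping the GLUE prediction files.
from pathlib import Path
import zipfile

out_dir = Path("electra_pytorch/test_outputs/<group_name>")  # adjust to your group name
with zipfile.ZipFile(out_dir / "submission.zip", "w") as zf:
    for tsv in sorted(out_dir.glob("*.tsv")):
        zf.write(tsv, arcname=tsv.name)  # keep only the file name inside the zip
```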
-
I didn't use CLI arguments, so configure options enclosed within
MyConfig
in the python files to your needs before run them. (There're comments below it showing the options for vanilla settings) -
You will need a Neptune account and create a neptune project on the website to record GLUE finetuning results. Don't forget to replace
richarddwang/electra-glue
with your neptune project's name -
The python files
pretrian.py
,finetune.py
are in fact converted fromPretrain.ipynb
andFinetune_GLUE.ipynb
. You can also use those notebooks to explore ELECTRA training and finetuning.
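To illustrate the in-file configuration pattern (no CLI arguments), here is a hypothetical stand-in for `MyConfig`; the real class and the actual option names live at the top of `pretrain.py` / `finetune.py`, so the fields below (`device`, `size`, `seed`) are placeholders only.

```python
# Hypothetical stand-in, for illustration only: a dict with attribute access,
# mimicking how options are grouped in MyConfig inside pretrain.py / finetune.py.
class MyConfig(dict):
    def __getattr__(self, name):
        return self[name]
    def __setattr__(self, name, value):
        self[name] = value

# Edit values like these directly in the scripts before running them.
c = MyConfig({"device": "cuda:0", "size": "small", "seed": 11081})
print(c.device, c.size, c.seed)
```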
Below are the details of the original implementation/paper that are easy to overlook and that I have taken care of. I found these details indispensable for successfully replicating the results of the paper.
- Use the Adam optimizer without bias correction (bias correction is the default for the PyTorch and fastai Adam optimizers); a sketch follows this list.
- There is a bug in how the original implementation decays learning rates through layers. See `_get_layer_lrs`.
- Use gradient clipping.
- For the MRPC and STS tasks, it appends the same dataset with sentence1 and sentence2 swapped to the original dataset, and calls it "double_unordered" (sketched after this list).
- For pretraining data preprocessing, it concatenates and truncates sentences to fit the max length, and stops concatenating when it reaches the end of a document.
- For pretraining data preprocessing, it splits the text into sentence A and sentence B with some probability, and also changes the max length with some probability.
- For finetuning data preprocessing, it follows BERT's way of truncating the longer of sentence A and sentence B to fit the max length.
- The output layer is initialized with TensorFlow v1's default initialization, which is Xavier.
- It uses gumbel softmax to sample generations from the generator (sketched after this list).
- It doesn't mask the way BERT does: 85% of the selected positions are replaced with [MASK] and the remaining 15% are kept unchanged (sketched after this list).
- It doesn't do warmup and then linear decay separately, but does them together, which means the learning rate also decays while it is warming up (sketched after this list). See here.
- It uses a dropout and a linear layer for the GLUE output layer, not what `ElectraClassificationHead` uses.
- It didn't tie input and output embeddings for its generator, which is a common practice applied by many models.
- It ties not only the word/position/token-type embeddings but also the layer norm in the embedding layer of the generator and discriminator.
- All public ELECTRA checkpoints are actually ++ models. See this issue.
- It downscales the generator by hidden size, number of attention heads, and intermediate size, but not the number of layers.
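To make a few of the details above concrete, here are some sketches. First, an Adam update step without bias correction might look like the following; this is a sketch of the variant described above, not the optimizer code used in the repo.

```python
import torch

def adam_step_no_bias_correction(param, grad, exp_avg, exp_avg_sq,
                                 lr, betas=(0.9, 0.999), eps=1e-6):
    """One Adam update without the usual 1/(1 - beta^t) bias-correction factors."""
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)               # first moment
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    # Note: no debiasing of exp_avg / exp_avg_sq here.
    param.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-lr)
```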
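The "double_unordered" trick for MRPC/STS amounts to appending a swapped copy of every sentence pair. A minimal sketch on a list of dicts, with column names `sentence1`/`sentence2` assumed:

```python
def double_unordered(examples):
    """Append a copy of each example with sentence1 and sentence2 swapped."""
    swapped = [{**ex, "sentence1": ex["sentence2"], "sentence2": ex["sentence1"]}
               for ex in examples]
    return examples + swapped

# e.g. double_unordered([{"sentence1": "a", "sentence2": "b", "label": 1}])
```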
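Sampling the generator's outputs with gumbel softmax can be sketched as the Gumbel-max trick over the logits, which is equivalent to sampling from the softmax distribution (a sketch, not the repo's exact sampling code):

```python
import torch

def gumbel_sample(logits):
    """Sample token ids from `logits` of shape (..., vocab_size) via the Gumbel-max trick."""
    gumbel = torch.distributions.Gumbel(0.0, 1.0).sample(logits.shape).to(logits.device)
    return (logits + gumbel).argmax(dim=-1)
```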
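The 85%/15% masking rule above (replace with [MASK] or keep unchanged, no random-token replacement) can be sketched like this, ignoring special tokens and padding for brevity:

```python
import torch

def mask_tokens(input_ids, mask_token_id, select_prob=0.15, mask_prob=0.85):
    """Pick ~15% of positions as MLM targets; replace 85% of those with [MASK],
    keep the rest unchanged. Simplified: special tokens/padding are ignored."""
    is_selected = torch.rand(input_ids.shape, device=input_ids.device) < select_prob
    use_mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked_ids = input_ids.clone()
    masked_ids[is_selected & use_mask] = mask_token_id
    return masked_ids, is_selected  # is_selected marks the generator's prediction targets
```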
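Finally, the learning rate schedule detail (warmup and linear decay applied together) corresponds to a schedule roughly like this, again a sketch of the behavior described above rather than the repo's exact code:

```python
def lr_with_simultaneous_warmup_and_decay(step, total_steps, warmup_steps, base_lr):
    """Linear decay starts from step 0 and the warmup factor is multiplied on top,
    so the learning rate already decays while it is still warming up."""
    decay = 1.0 - step / total_steps
    warmup = min(1.0, step / warmup_steps)
    return base_lr * decay * warmup
```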
If you pretrain, finetune, and generate test results, `electra_pytorch` will generate these files for you:
```
project root
|
|── datasets
|   |── glue
|       |── <task>
|       ...
|
|── checkpoints
|   |── pretrain
|   |   |── <base_run_name>_<seed>_<percent>.pth
|   |   ...
|   |
|   |── glue
|       |── <group_name>_<task>_<ith_run>.pth
|       ...
|
|── test_outputs
    |── <group_name>
    |   |── CoLA.tsv
    |   ...
    |
    ...
```
```
@misc{clark2020electra,
    title={ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators},
    author={Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},
    year={2020},
    eprint={2003.10555},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```