
Bugs when fine tuning the gpt2 #12965

@yanan1116

Description


Transformers Version: 4.8.2
Torch Version: 1.8.0

I am using the official script to fine-tune GPT-2 on CSV files.
The script:
https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py

Train and validation file makeup:

# Shuffle the full augmented set for training; write a random 20% sample as validation.
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=1).to_csv(train_file, index=False)
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=0.2).to_csv(validation_file, index=False)

My shell command:

python -u ./run_clm_no_trainer.py \
                --num_train_epochs 7 \
                --train_file './fintune_csvs/stsa_train_finetune.csv' \
                --validation_file './fintune_csvs/stsa_test_finetune.csv'  \
                --model_name_or_path gpt2 \
                --per_device_train_batch_size 16 \
                --per_device_eval_batch_size 16 \
                --output_dir "./finetune_gpt2_stsa" \
                --preprocessing_num_workers 16 \
                --block_size 256 --overwrite_cache True

The CSV files contain a column named 'text' that is used for fine-tuning the model.
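
For reference, after tokenization the script concatenates all tokenized texts and splits them into fixed-length chunks. A rough paraphrase of that grouping step, as I understand the linked script (with block_size = 256 from my command):

def group_texts(examples):
    # Concatenate all tokenized texts, then cut them into chunks of block_size.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the short remainder so every chunk should be exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

So I would expect every example fed to the dataloader to already have length 256.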

However, it always fails with an error like the one below, which points to mismatched sequence lengths in the dataloader batches:

File "./run_clm_no_trainer.py", line 503, in
main()exts in chunks of 256 #12: 0%| | 0/1 [00:00<?, ?ba/s]
File "./run_clm_no_trainer.py", line 480, in main
for step, batch in enumerate(eval_dataloader):
File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in iter
for batch in super().iter():
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in next
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 256 at dim 1 (got 52)

The next time I ran it, it returned a similar error:

ValueError: expected sequence of length 168 at dim 1 (got 136)
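
As far as I can tell, the failure can be reproduced with default_data_collator alone whenever one batch contains input_ids of different lengths. A minimal sketch (not taken from the script, just to illustrate the collator behaviour):

from transformers import default_data_collator

# Two tokenized examples of different lengths ending up in the same batch.
features = [
    {"input_ids": list(range(256)), "attention_mask": [1] * 256},
    {"input_ids": list(range(52)), "attention_mask": [1] * 52},
]

try:
    default_data_collator(features)
except ValueError as err:
    # e.g. "expected sequence of length 256 at dim 1 (got 52)"
    print(err)

The collator calls torch.tensor on the list of per-example values, which cannot build a rectangular tensor from ragged lists.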

Then I modified the tokenizer call:

tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples[text_column_name], padding=True, truncation=True)

This seems to fix the problem. However, the generated texts are quite short after this change.
Any suggestions?
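
For what it's worth, an alternative I am considering (an untested sketch; tokenized_eval_dataset is a placeholder, and it assumes the examples only carry input_ids/attention_mask) is to drop the padding from tokenize_function and instead pad each batch dynamically with DataCollatorForLanguageModeling(mlm=False), which also builds the labels:

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Pads input_ids/attention_mask to the longest sequence in each batch and
# copies input_ids into labels, masking the padded positions with -100.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

eval_dataloader = DataLoader(
    tokenized_eval_dataset,  # placeholder: the tokenized eval split
    collate_fn=data_collator,
    batch_size=16,
)

Since this only pads to the longest example in each batch and never truncates, it might avoid the short generations I am seeing.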
