
Bugs when fine tuning the gpt2 #12965

@yanan1116

Description


Transformers Version: 4.8.2
Torch Version: 1.8.0

I am using the official script to fine-tune GPT-2 on CSV files.
The script:
https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py

Train and validation file makeup:

# Shuffle the full augmented set for training; write a random 20% sample as validation.
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=1).to_csv(train_file, index=False)
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=0.2).to_csv(validation_file, index=False)

My shell command:

python -u ./run_clm_no_trainer.py \
                --num_train_epochs 7 \
                --train_file './fintune_csvs/stsa_train_finetune.csv' \
                --validation_file './fintune_csvs/stsa_test_finetune.csv'  \
                --model_name_or_path gpt2 \
                --per_device_train_batch_size 16 \
                --per_device_eval_batch_size 16 \
                --output_dir "./finetune_gpt2_stsa" \
                --preprocessing_num_workers 16 \
                --block_size 256 --overwrite_cache True

The CSV files contain a column named 'text' that is used for fine-tuning the model.
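
For reference, after tokenization the script concatenates all tokenized texts and splits them into fixed-length chunks. A rough paraphrase of that grouping step, as I understand the linked script (with block_size = 256 from my command):

def group_texts(examples):
    # Concatenate all tokenized texts, then cut them into chunks of block_size.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the short remainder so every chunk should be exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

So I would expect every example fed to the dataloader to already have length 256.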

However, it always fails with an error like the one below, which points to mismatched sequence lengths in the dataloader batches:

File "./run_clm_no_trainer.py", line 503, in
main()exts in chunks of 256 #12: 0%| | 0/1 [00:00<?, ?ba/s]
File "./run_clm_no_trainer.py", line 480, in main
for step, batch in enumerate(eval_dataloader):
File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in iter
for batch in super().iter():
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in next
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 256 at dim 1 (got 52)

The next time I ran it, it returned a similar error:

ValueError: expected sequence of length 168 at dim 1 (got 136)
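
As far as I can tell, the failure can be reproduced with default_data_collator alone whenever one batch contains input_ids of different lengths. A minimal sketch (not taken from the script, just to illustrate the collator behaviour):

from transformers import default_data_collator

# Two tokenized examples of different lengths ending up in the same batch.
features = [
    {"input_ids": list(range(256)), "attention_mask": [1] * 256},
    {"input_ids": list(range(52)), "attention_mask": [1] * 52},
]

try:
    default_data_collator(features)
except ValueError as err:
    # e.g. "expected sequence of length 256 at dim 1 (got 52)"
    print(err)

The collator calls torch.tensor on the list of per-example values, which cannot build a rectangular tensor from ragged lists.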

Then I modified the tokenizer call:

tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples[text_column_name], padding=True, truncation=True)

This seems to fix the problem. However, the generated texts are quite short after this change.
Any suggestions?
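
For what it's worth, an alternative I am considering (an untested sketch; tokenized_eval_dataset is a placeholder, and it assumes the examples only carry input_ids/attention_mask) is to drop the padding from tokenize_function and instead pad each batch dynamically with DataCollatorForLanguageModeling(mlm=False), which also builds the labels:

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Pads input_ids/attention_mask to the longest sequence in each batch and
# copies input_ids into labels, masking the padded positions with -100.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

eval_dataloader = DataLoader(
    tokenized_eval_dataset,  # placeholder: the tokenized eval split
    collate_fn=data_collator,
    batch_size=16,
)

Since this only pads to the longest example in each batch and never truncates, it might avoid the short generations I am seeing.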
