Description
Transformers Version: 4.8.2
Torch Version: 1.8.0
I am using the official script to fine-tune GPT-2 on CSV files.
The script:
https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm_no_trainer.py
Train and validation file makeup:
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=1).to_csv(train_file, index=False)
df_train_ft_aug.rename(columns={'content': 'text'}).sample(frac=0.2).to_csv(validation_file, index=False)
My shell command:
python -u ./run_clm_no_trainer.py \
--num_train_epochs 7 \
--train_file './fintune_csvs/stsa_train_finetune.csv' \
--validation_file './fintune_csvs/stsa_test_finetune.csv' \
--model_name_or_path gpt2 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--output_dir "./finetune_gpt2_stsa" \
--preprocessing_num_workers 16 \
--block_size 256 --overwrite_cache True
where the CSV files contain a column named 'text' that is used for fine-tuning the model.
However, it always fails with errors like the one below, which suggest that the batches coming out of the eval dataloader contain sequences of different lengths:
File "./run_clm_no_trainer.py", line 503, in
main()exts in chunks of 256 #12: 0%| | 0/1 [00:00<?, ?ba/s]
File "./run_clm_no_trainer.py", line 480, in main
for step, batch in enumerate(eval_dataloader):
File "/usr/local/lib/python3.6/dist-packages/accelerate/data_loader.py", line 289, in iter
for batch in super().iter():
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in next
data = self._next_data()
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.6/dist-packages/transformers/data/data_collator.py", line 80, in default_data_collator
batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 256 at dim 1 (got 52)
The next time I run it, it returns a similar error:
ValueError: expected sequence of length 168 at dim 1 (got 136)
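For reference, the failure can be reproduced outside the script: default_data_collator (see line 80 in the traceback) stacks each feature key with torch.tensor, which only works if every example in the batch has the same sequence length. A minimal sketch, using made-up toy features that mimic a full 256-token block followed by a shorter leftover chunk:

from transformers import default_data_collator

# Two toy examples with different sequence lengths.
features = [
    {"input_ids": list(range(256)), "attention_mask": [1] * 256},
    {"input_ids": list(range(52)), "attention_mask": [1] * 52},
]

try:
    default_data_collator(features)
except ValueError as e:
    print(e)  # e.g. "expected sequence of length 256 at dim 1 (got 52)"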
Then I modified the tokenizer call's parameters:
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    return tokenizer(examples[text_column_name], padding=True, truncation=True)
This seems to fix the problem. However, the generated texts are quite short after this change.
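For reference, a minimal standalone sketch of what the padding=True change does (the example strings below are made up): within a single batched tokenizer call, every sequence is padded with the eos token up to the length of the longest sequence in that call, so all examples in a preprocessing batch end up with the same length.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

enc = tokenizer(
    ["a short text", "a somewhat longer text with quite a few more tokens"],
    padding=True,
    truncation=True,
)
print([len(ids) for ids in enc["input_ids"]])  # both padded to the same length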
Any suggestions?