Description
System Info
- `transformers` version: 4.29.2
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
- Jax version: 0.4.10
- JaxLib version: 0.4.10
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
- accelerate version: 0.19.0
- datasets version: 2.12.0
- peft version: 0.3.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import get_peft_model, LoraConfig, TaskType

model_name_or_path = "asi/gpt-fr-cased-small"


# Tokenize the reviews, padding/truncating to the model's maximum input length.
def preprocess_function(examples):
    return tokenizer(text=examples["review"],
                     truncation=True,
                     padding="max_length",
                     max_length=tokenizer.max_model_input_sizes["gpt2"])


# 900 train examples and 100 test examples from the allocine dataset.
trainset = load_dataset("allocine", split="train").remove_columns("label").select(range(900))
testset = load_dataset("allocine", split="test").remove_columns("label").select(range(900, 1000))

tokenizer_name_or_path = "asi/gpt-fr-cased-small"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
tokenizer.model_max_length = tokenizer.max_model_input_sizes["gpt2"]
if tokenizer.pad_token_id is None:
    # GPT-2-style tokenizers have no pad token; reuse the EOS token.
    tokenizer.pad_token_id = tokenizer.eos_token_id

trainset = trainset.map(preprocess_function,
                        remove_columns=trainset.features.keys(),
                        num_proc=32)
testset = testset.map(preprocess_function,
                      remove_columns=testset.features.keys(),
                      num_proc=32)

# LoRA configuration for a causal LM.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=12,
    lora_alpha=32,
    lora_dropout=0.15,
    fan_in_fan_out=True,
)

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
lora_model = get_peft_model(model, peft_config)

trainer = Trainer(
    model=lora_model,
    train_dataset=trainset,
    eval_dataset=testset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        auto_find_batch_size=True,
        fp16=True,
        num_train_epochs=2,
        learning_rate=2e-5,
        optim="adamw_torch",
        evaluation_strategy="steps",
        eval_delay=0,
        eval_steps=10,
        eval_accumulation_steps=1,
        logging_strategy="steps",
        logging_first_step=True,
        logging_steps=10,
        log_level="info",
        save_strategy="steps",
        save_steps=100,
        save_total_limit=10,
        output_dir="outputs",
    ),
)

trainer.train()
Expected behavior
Hello! The first log from the trainer looks accurate to me (Total optimization steps ≈ Num Epochs * Num examples / Total train batch size), but right after, the trainer doubles the total optimization steps for no reason. I also encountered a case where it doubled 4 times!
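For reference, here is the arithmetic behind the first banner below (a quick sanity check using the numbers the trainer prints; the variable names are mine, not from the script):

```python
import math

num_examples = 900          # size of the selected train split
num_epochs = 2
total_train_batch_size = 8  # as reported in the first banner

steps_per_epoch = math.ceil(num_examples / total_train_batch_size)  # 113
total_optimization_steps = steps_per_epoch * num_epochs
print(total_optimization_steps)  # 226, matching the first banner
```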
***** Running training *****
Num examples = 900
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 226
Number of trainable parameters = 442,368
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
***** Running training *****
Num examples = 900
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 450
Number of trainable parameters = 442,368
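The 450 steps in the second banner do not follow from the numbers it prints. This is just arithmetic on the logged values, not a diagnosis of the cause:

```python
import math

num_epochs = 2
print(450 // num_epochs)                  # 225 steps per epoch in the second banner

# 225 steps per epoch would correspond to 900 examples at a batch size of 4
# (or 1800 examples at a batch size of 8), yet the banner still reports
# 900 examples and a per-device batch size of 8.
print(math.ceil(900 / 8) * num_epochs)    # 226, what the printed numbers imply
print(math.ceil(900 / 4) * num_epochs)    # 450, what the trainer actually reports
```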