Trainer.train() initializing train multiple times for no apparent reason and doubling total optimization steps with LoRA #23762

@dechantoine

Description

System Info

  • transformers version: 4.29.2
  • Platform: Linux-5.15.107+-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): 2.12.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
  • Jax version: 0.4.10
  • JaxLib version: 0.4.10
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no
  • Accelerate version: 0.19.0
  • Datasets version: 2.12.0
  • PEFT version: 0.3.0

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import get_peft_model, LoraConfig, TaskType

model_name_or_path = "asi/gpt-fr-cased-small"

def preprocess_function(examples):
    return tokenizer(text=examples["review"],
                      truncation=True,
                      padding="max_length",
                      max_length=tokenizer.max_model_input_sizes["gpt2"])

trainset = load_dataset("allocine", split="train").remove_columns("label").select(range(900))
testset = load_dataset("allocine", split="test").remove_columns("label").select(range(900,1000))
tokenizer_name_or_path = "asi/gpt-fr-cased-small"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)

tokenizer.model_max_length = tokenizer.max_model_input_sizes["gpt2"]
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

trainset = trainset.map(preprocess_function,
                        remove_columns=trainset.features.keys(),
                        num_proc=32)
testset = testset.map(preprocess_function,
                      remove_columns=testset.features.keys(),
                      num_proc=32)

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, 
    inference_mode=False,
    r=12,
    lora_alpha=32,
    lora_dropout=0.15,
    
    fan_in_fan_out=True,
)

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
lora_model = get_peft_model(model, peft_config)

trainer = Trainer(
    model=lora_model, 
    train_dataset=trainset,
    eval_dataset=testset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),

    args=TrainingArguments(
        auto_find_batch_size = True,
        fp16=True,
        
        num_train_epochs = 2,
        learning_rate = 2e-5, 
        optim = "adamw_torch",
         
        evaluation_strategy = "steps",
        eval_delay = 0,
        eval_steps = 10,
        eval_accumulation_steps = 1,
        
        logging_strategy = "steps",
        logging_first_step = True,
        logging_steps=10, 
        log_level = "info",
        
        save_strategy = "steps",
        save_steps = 100,
        save_total_limit = 10,

        output_dir='outputs',
    ),
)

trainer.train()
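
A small diagnostic sketch (my addition, not part of the original reproduction): after training finishes, the Trainer's state records the step counts it actually used, so the doubling can also be checked programmatically rather than only in the console output.

# Diagnostic only (assumes the script above has run to completion):
# TrainerState keeps the planned and reached step counts.
print("max_steps planned by the Trainer:", trainer.state.max_steps)
print("global_step reached at the end:", trainer.state.global_step)
print("logged history entries:", len(trainer.state.log_history))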

Expected behavior

Hello! The first block of logs from the Trainer looks accurate to me (Total optimization steps = Num Epochs * Num examples // Total train batch size), but immediately afterwards the Trainer prints a second "Running training" block with double the total optimization steps, for no apparent reason. I have also encountered a case where it doubled four times!

***** Running training *****
  Num examples = 900
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 226
  Number of trainable parameters = 442,368
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
***** Running training *****
  Num examples = 900
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 450
  Number of trainable parameters = 442,368
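
For reference, a minimal sketch of the step arithmetic (my own check, using the values from the logs above): with 900 examples, 2 epochs, gradient accumulation of 1 and a per-device batch size of 8, the first count of 226 matches 2 * ceil(900 / 8). The second count of 450 instead matches 2 * ceil(900 / 4), i.e. what an effective batch size of 4 would give, even though the second log block still prints a batch size of 8. That would be consistent with auto_find_batch_size retrying at a smaller batch size, though I have not verified that this is what happens internally.

import math

# Values taken from the training logs above.
num_examples = 900
num_epochs = 2
grad_accum = 1

def total_optimization_steps(per_device_batch_size: int) -> int:
    # One optimizer update per batch; the final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_examples / per_device_batch_size) // grad_accum
    return steps_per_epoch * num_epochs

print(total_optimization_steps(8))  # 226 -> matches the first "Running training" block
print(total_optimization_steps(4))  # 450 -> matches the second block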
