
System kills the processes of llama2-70B fsdp finetune when loading the model #37664

@yuanwu2017

Description

System Info

transformers version: 4.52.0.dev0 (commit fee1190, latest main)

  • transformers version: 4.52.0.dev0
  • Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.30.2
  • Safetensors version: 0.5.3
  • Accelerate version: 1.7.0.dev0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - fsdp_config: {'fsdp_activation_checkpointing': False, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_reshard_after_forward': 'FULL_SHARD', 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': '', 'fsdp_use_orig_params': True, 'fsdp_version': 1}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • DeepSpeed version: 0.16.1+hpu.synapse.v1.20.0
  • PyTorch version (GPU?): 2.6.0+hpu_1.20.0-543.git4952fce (False)
  • Tensorflow version (GPU?): 2.15.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using distributed or parallel set-up in script?: yes (FSDP via accelerate, 8 processes)
  • Using HPU in script?: yes
  • HPU type: GAUDI2

Who can help?

@ArthurZucker @SunMarc @zach-huggingface

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce (scripts: https://github.com/yuanwu2017/llm-dbg/tree/main/finetune):

1. Run an 8-HPU FSDP finetune of llama2-70b:

accelerate launch --config_file hpu_config_fsdp.yaml run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-70b-hf \
    --dataset_name tatsu-lab/alpaca \
    --bf16 True \
    --output_dir ./olora \
    --max_seq_len 2048 \
    --gradient_checkpointing \
    --per_device_train_batch_size 5 \
    --save_strategy no \
    --learning_rate 0.0004 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "constant" \
    --logging_steps 1 \
    --dataset_concatenation \
    --do_train \
    --lora_rank 4 \
    --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
    --validation_split_percentage 4 \
    --fsdp auto_wrap \
    --fsdp_config ./fsdp_config.json \
    --num_train_epochs 2 \
    --eval_strategy epoch \
    --per_device_eval_batch_size 1 \
    --eval_delay 2 \
    --do_eval \
    --torch_compile \
    --gradient_accumulation_steps 2
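Note that the accelerate config above sets fsdp_cpu_ram_efficient_loading: True, so only one full copy of the weights should ever be materialized on the host. A minimal sketch of that intended behavior (my own illustration of the idea, not the transformers implementation):

```python
import torch
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM

# Hand-rolled equivalent of what fsdp_cpu_ram_efficient_loading=True is
# supposed to do: only rank 0 materializes real weights; the other ranks
# build a meta-device skeleton and later receive the weights via
# fsdp_sync_module_states when FSDP wraps the model.
rank = dist.get_rank() if dist.is_initialized() else 0
if rank == 0:
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16
    )
else:
    # Build the architecture without allocating real storage.
    config = AutoConfig.from_pretrained("meta-llama/Llama-2-70b-hf")
    with torch.device("meta"):
        model = AutoModelForCausalLM.from_config(config)
```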

The system kills the finetune processes.
In the latest code, low_cpu_mem_usage has been removed, so each of the 8 processes loads its own full copy of the model into CPU memory. Host memory is exhausted and the system's OOM killer terminates the finetune processes.
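For scale, a back-of-the-envelope estimate of what 8 full host-side copies of Llama-2-70B cost (my own arithmetic, assuming bf16 weights only):

```python
# Rough CPU RAM needed when every rank loads a full copy of the weights.
# Assumes Llama-2-70B in bf16 (2 bytes/param); ignores optimizer state,
# tokenizer, and activation memory.
params = 70e9
bytes_per_param = 2            # bf16
num_processes = 8              # one per HPU
per_copy_gib = params * bytes_per_param / 2**30
total_gib = num_processes * per_copy_gib
print(f"~{per_copy_gib:.0f} GiB per copy, ~{total_gib:.0f} GiB total")
# -> ~130 GiB per copy, ~1043 GiB total: far beyond typical host RAM,
#    so the kernel kills the training processes.
```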

Expected behavior

The finetune should run without errors. With transformers <= 4.50.3, the same finetune runs without error.
