System Info
- transformers version: 4.52.0.dev0 (fee1190, latest main)
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: 1.7.0.dev0
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- fsdp_config: {'fsdp_activation_checkpointing': False, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_reshard_after_forward': 'FULL_SHARD', 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': '', 'fsdp_use_orig_params': True, 'fsdp_version': 1}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- DeepSpeed version: 0.16.1+hpu.synapse.v1.20.0
- PyTorch version (GPU?): 2.6.0+hpu_1.20.0-543.git4952fce (False)
- Tensorflow version (GPU?): 2.15.1 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using distributed or parallel set-up in script?: yes (Accelerate FSDP, 8 processes)
- Using HPU in script?: yes
- HPU type: GAUDI2
Who can help?
@ArthurZucker @SunMarc @zach-huggingface
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
https://github.com/yuanwu2017/llm-dbg/tree/main/finetune
1. Run an 8-HPU FSDP LoRA fine-tune of Llama-2-70b:
accelerate launch --config_file hpu_config_fsdp.yaml run_lora_clm.py --model_name_or_path meta-llama/Llama-2-70b-hf --dataset_name tatsu-lab/alpaca --bf16 True --output_dir ./olora --max_seq_len 2048 --gradient_checkpointing --per_device_train_batch_size 5 --save_strategy no --learning_rate 0.0004 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --dataset_concatenation --do_train --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --validation_split_percentage 4 --fsdp auto_wrap --fsdp_config ./fsdp_config.json --num_train_epochs 2 --eval_strategy epoch --per_device_eval_batch_size 1 --eval_delay 2 --do_eval --torch_compile --gradient_accumulation_steps 2
2. The system kills the fine-tuning processes (CPU out of memory; details below).
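A quick way to confirm the CPU-memory blow-up (this helper is not part of the linked scripts; psutil and the RANK environment variable set by the launcher are assumptions) is to log each rank's resident memory around the model load:

import os
import psutil

def log_rss(tag: str) -> None:
    # Resident set size of the current process, in GiB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    rank = os.environ.get("RANK", "?")
    print(f"[rank {rank}] {tag}: RSS = {rss_gb:.1f} GiB", flush=True)

# Hypothetical placement inside run_lora_clm.py, around the model load:
#   log_rss("before from_pretrained")
#   model = AutoModelForCausalLM.from_pretrained(...)
#   log_rss("after from_pretrained")

If the analysis below is right, every rank's RSS grows by roughly the size of the full bf16 checkpoint (about 140 GB for a 70B-parameter model), not just rank 0's.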
In the latest code the low_cpu_mem_usage path has been removed, so each of the 8 processes loads a full copy of the model into CPU memory instead of only rank 0 doing so. CPU memory is exhausted and the system kills the fine-tuning processes.
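For reference, the memory-efficient load that fsdp_cpu_ram_efficient_loading: True is meant to provide boils down to the pattern sketched below (a minimal sketch of the general idea, not the actual transformers/accelerate code): only rank 0 materializes the weights on CPU, while the other ranks build the model on the meta device and receive the weights later when FSDP syncs module states.

import torch
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM

def load_for_fsdp(model_name: str):
    # Assumes torch.distributed has already been initialized by the launcher.
    if dist.get_rank() == 0:
        # Exactly one full copy of the weights lands in CPU RAM.
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    else:
        # Parameters are created on the meta device, so no CPU memory is
        # allocated for weights here; FSDP later broadcasts them from rank 0
        # (sync_module_states=True, matching the accelerate config above).
        config = AutoConfig.from_pretrained(model_name)
        with torch.device("meta"):
            model = AutoModelForCausalLM.from_config(config)
    return model

With the regression, every rank effectively takes the rank-0 branch, so an 8-process Llama-2-70b run needs roughly eight full copies of the checkpoint in CPU RAM at once.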

Expected behavior
The run should complete without errors. The same fine-tune works without error on transformers <= 4.50.3.