System Info
- transformers version: 4.52.0.dev0 (fee1190, latest main)
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: 1.7.0.dev0
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- fsdp_config: {'fsdp_activation_checkpointing': False, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_reshard_after_forward': 'FULL_SHARD', 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': '', 'fsdp_use_orig_params': True, 'fsdp_version': 1}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- DeepSpeed version: 0.16.1+hpu.synapse.v1.20.0
- PyTorch version (GPU?): 2.6.0+hpu_1.20.0-543.git4952fce (False)
- Tensorflow version (GPU?): 2.15.1 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using distributed or parallel set-up in script?: yes (Accelerate FSDP, 8 processes)
- Using HPU in script?: yes
- HPU type: GAUDI2
Who can help?
@ArthurZucker @SunMarc @zach-huggingface
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
https://github.com/yuanwu2017/llm-dbg/tree/main/finetune
1. Run an 8-HPU FSDP LoRA fine-tune of Llama-2-70b:
accelerate launch --config_file hpu_config_fsdp.yaml run_lora_clm.py --model_name_or_path meta-llama/Llama-2-70b-hf --dataset_name tatsu-lab/alpaca --bf16 True --output_dir ./olora --max_seq_len 2048 --gradient_checkpointing --per_device_train_batch_size 5 --save_strategy no --learning_rate 0.0004 --warmup_ratio 0.03 --lr_scheduler_type "constant" --logging_steps 1 --dataset_concatenation --do_train --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --validation_split_percentage 4 --fsdp auto_wrap --fsdp_config ./fsdp_config.json --num_train_epochs 2 --eval_strategy epoch --per_device_eval_batch_size 1 --eval_delay 2 --do_eval --torch_compile --gradient_accumulation_steps 2
2. The system kills the fine-tuning processes (CPU out of memory; details below).
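A quick way to confirm the CPU-memory blow-up (this helper is not part of the linked scripts; psutil and the RANK environment variable set by the launcher are assumptions) is to log each rank's resident memory around the model load:

import os
import psutil

def log_rss(tag: str) -> None:
    # Resident set size of the current process, in GiB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    rank = os.environ.get("RANK", "?")
    print(f"[rank {rank}] {tag}: RSS = {rss_gb:.1f} GiB", flush=True)

# Hypothetical placement inside run_lora_clm.py, around the model load:
#   log_rss("before from_pretrained")
#   model = AutoModelForCausalLM.from_pretrained(...)
#   log_rss("after from_pretrained")

If the analysis below is right, every rank's RSS grows by roughly the size of the full bf16 checkpoint (about 140 GB for a 70B-parameter model), not just rank 0's.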
In the latest code the low_cpu_mem_usage path has been removed, so each of the 8 processes loads a full copy of the model into CPU memory instead of only rank 0 doing so. CPU memory is exhausted and the system kills the fine-tuning processes.
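For reference, the memory-efficient load that fsdp_cpu_ram_efficient_loading: True is meant to provide boils down to the pattern sketched below (a minimal sketch of the general idea, not the actual transformers/accelerate code): only rank 0 materializes the weights on CPU, while the other ranks build the model on the meta device and receive the weights later when FSDP syncs module states.

import torch
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM

def load_for_fsdp(model_name: str):
    # Assumes torch.distributed has already been initialized by the launcher.
    if dist.get_rank() == 0:
        # Exactly one full copy of the weights lands in CPU RAM.
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    else:
        # Parameters are created on the meta device, so no CPU memory is
        # allocated for weights here; FSDP later broadcasts them from rank 0
        # (sync_module_states=True, matching the accelerate config above).
        config = AutoConfig.from_pretrained(model_name)
        with torch.device("meta"):
            model = AutoModelForCausalLM.from_config(config)
    return model

With the regression, every rank effectively takes the rank-0 branch, so an 8-process Llama-2-70b run needs roughly eight full copies of the checkpoint in CPU RAM at once.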

Expected behavior
The run should complete without errors. The same fine-tune works without error on transformers <= 4.50.3.