[BUG]: Continued pretraining with llama2 fails #4578

@zryowen123

Description

🐛 Describe the bug

I am pretraining llama2-70b with the code in examples/language/llama2. Running benchmark.py directly via gemini.sh works fine, but I want to do continued (incremental) pretraining starting from an already trained model. The training arguments are identical to those given in gemini.sh; the only change is the following code, which loads the existing model:
with init_ctx:
    # model = LlamaForCausalLM(config)                        # original: build the model from config only
    model = LlamaForCausalLM.from_pretrained(args.model_path)  # my change: load the trained checkpoint
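To make the change concrete, here is a stripped-down sketch of what my run effectively does. It is simplified and reconstructed from memory rather than copied verbatim from benchmark.py, and MODEL_PATH is just a placeholder for my checkpoint path:

from contextlib import nullcontext
from transformers import LlamaForCausalLM

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.lazy import LazyInitContext
from colossalai.nn.optimizer import HybridAdam
from colossalai.utils import get_current_device

MODEL_PATH = "/path/to/llama2-70b"  # placeholder for my trained checkpoint

colossalai.launch_from_torch(config={})
plugin = GeminiPlugin(precision="bf16", initial_scale=2**16)
booster = Booster(plugin=plugin)

# Same lazy-init pattern as the example; the only difference is that I load
# the trained weights instead of building the model from a config.
init_ctx = LazyInitContext(default_device=get_current_device()) if isinstance(plugin, GeminiPlugin) else nullcontext()
with init_ctx:
    # model = LlamaForCausalLM(config)                  # what the example does
    model = LlamaForCausalLM.from_pretrained(MODEL_PATH)  # what I changed it to

optimizer = HybridAdam(model.parameters(), lr=3e-4)
model, optimizer, *_ = booster.boost(model, optimizer)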
Running gemini.sh then fails with an OOM error:
outputs = model(**batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 247, in forward
    outputs = self.module(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 405, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 89, in forward
    return self.weight * hidden_states.to(input_dtype)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/colo_parameter.py", line 63, in __torch_function__
    new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 82, in pre_op
    ColoParamOpHookManager._trigger_pre_forward(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 63, in _trigger_pre_forward
    hook.pre_forward(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_hook.py", line 47, in pre_forward
    self.pre_op(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_hook.py", line 35, in pre_op
    self._chunk_manager.access_chunk(chunk)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/manager.py", line 110, in access_chunk
    self.__add_accessed_chunk(chunk)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/manager.py", line 246, in __add_accessed_chunk
    chunk.access_chunk()
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/chunk.py", line 359, in access_chunk
    self.__gather()
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/chunk.py", line 480, in __gather
    dist.all_gather(gather_list, self.cuda_shard, self.torch_pg)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2277, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 3; 79.33 GiB total capacity; 73.44 GiB already allocated; 469.81 MiB free; 77.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Could the developers please take a look at what is going wrong here?
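Separately, I have not yet tried the max_split_size_mb hint from the error message. If I understand it correctly, it has to be set before CUDA is initialized, e.g. at the very top of benchmark.py; the 512 value below is just my own guess, not something the example prescribes:

import os
# My assumption: cap the allocator's split size to reduce fragmentation,
# per the hint in the OOM message. Must run before torch initializes CUDA.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")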

Environment

No response
