Description
🐛 Describe the bug
I am using the code in examples/language/llama2 to pretrain llama2-70b. Running benchmark.py directly via gemini.sh works fine, but I want to do continued pretraining from an already-trained model. The training arguments are the same as those given in gemini.sh; I only changed the following code to load the existing model:
with init_ctx:
    # model = LlamaForCausalLM(config)
    model = LlamaForCausalLM.from_pretrained(args.model_path)
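(For context, init_ctx in benchmark.py is the lazy-init context. From memory it is set up roughly as below; the exact lines in the current script may differ:)

from contextlib import nullcontext
from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device

# Lazy init under the Gemini plugin, so parameters are not fully
# materialized on every rank at construction time.
init_ctx = (
    LazyInitContext(default_device=get_current_device())
    if isinstance(plugin, GeminiPlugin)
    else nullcontext()
)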
Then running gemini.sh raised an OOM error:
  outputs = model(**batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 247, in forward
    outputs = self.module(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
    layer_outputs = decoder_layer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 405, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 89, in forward
    return self.weight * hidden_states.to(input_dtype)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/colo_parameter.py", line 63, in __torch_function__
    new_args = ColoParamOpHookManager.pre_op(params, *args, *kwargs.values())
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 82, in pre_op
    ColoParamOpHookManager._trigger_pre_forward(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/tensor/param_op_hook.py", line 63, in _trigger_pre_forward
    hook.pre_forward(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_hook.py", line 47, in pre_forward
    self.pre_op(params)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/gemini_hook.py", line 35, in pre_op
    self._chunk_manager.access_chunk(chunk)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/manager.py", line 110, in access_chunk
    self.__add_accessed_chunk(chunk)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/manager.py", line 246, in __add_accessed_chunk
    chunk.access_chunk()
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/chunk.py", line 359, in access_chunk
    self.__gather()
  File "/opt/conda/lib/python3.8/site-packages/colossalai/zero/gemini/chunk/chunk.py", line 480, in __gather
    dist.all_gather(gather_list, self.cuda_shard, self.torch_pg)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2277, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 3; 79.33 GiB total capacity; 73.44 GiB already allocated; 469.81 MiB free; 77.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
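For reference, my suspicion is that from_pretrained materializes the full 70b weights on every rank before Gemini can shard them, which may be what runs out of memory. What I think I may need instead is to build the model lazily and load the checkpoint only after booster.boost(); a minimal sketch, using the names from my script and assuming Booster.load_model can read my checkpoint directory (not verified):

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

colossalai.launch_from_torch(config={})
booster = Booster(plugin=GeminiPlugin())

with init_ctx:
    # Build the model without loading weights; parameters stay lazy here.
    model = LlamaForCausalLM(config)

# Shard the model first, then load the trained checkpoint into the shards.
model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)
booster.load_model(model, args.model_path)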
Could the developers please take a look at what is going on here? Thanks.
Environment
No response