Description
🐛 Describe the bug
I ran BERT from Hugging Face with ZeRO enabled and got RuntimeError: CUDA error: an illegal memory access was encountered. The problem seems to be caused by initial_scale in config.py.
Traceback (most recent call last):
File "colossalai/run.py", line 463, in <module>
train(args)
File "colossalai/run.py", line 252, in train
trainer(model,
File "colossalai/run.py", line 127, in trainer
engine.backward(loss)
File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 163, in backward
ret = self.optimizer.backward(loss)
File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 169, in backward
self.model.backward(loss)
File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_model/sharded_model_v2.py", line 233, in backward
loss.backward()
File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f9d1dfa2d62 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c5f3 (0x7f9d6164f5f3 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f9d61650002 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f9d1df8c314 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29adb9 (0x7f9de496cdb9 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xae0c91 (0x7f9de51b2c91 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f9de51b2f92 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #8: <unknown function> + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #9: <unknown function> + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #10: <unknown function> + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #11: <unknown function> + 0x158415 (0x56473bab5415 in /home/paulzhang/miniconda3/bin/python)
frame #12: <unknown function> + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #13: <unknown function> + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #14: <unknown function> + 0x1592ac (0x56473bab62ac in /home/paulzhang/miniconda3/bin/python)
frame #15: <unknown function> + 0x158e77 (0x56473bab5e77 in /home/paulzhang/miniconda3/bin/python)
frame #16: <unknown function> + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #17: <unknown function> + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #18: <unknown function> + 0x176057 (0x56473bad3057 in /home/paulzhang/miniconda3/bin/python)
frame #19: PyDict_SetItemString + 0x61 (0x56473baf43c1 in /home/paulzhang/miniconda3/bin/python)
frame #20: PyImport_Cleanup + 0x9d (0x56473bb32aad in /home/paulzhang/miniconda3/bin/python)
frame #21: Py_FinalizeEx + 0x79 (0x56473bb64a49 in /home/paulzhang/miniconda3/bin/python)
frame #22: Py_RunMain + 0x183 (0x56473bb66893 in /home/paulzhang/miniconda3/bin/python)
frame #23: Py_BytesMain + 0x39 (0x56473bb66ca9 in /home/paulzhang/miniconda3/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7f9e409e50b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1e21c7 (0x56473bb3f1c7 in /home/paulzhang/miniconda3/bin/python)
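A note for anyone trying to reproduce this: CUDA kernel errors are reported asynchronously, so the Python frame in the traceback above (loss.backward()) may not be where the illegal access actually happened. Setting the CUDA_LAUNCH_BLOCKING environment variable to 1 makes kernel launches synchronous, which usually gives a more accurate stack trace. Below is a minimal sketch of doing that from Python. The helper name enable_sync_cuda_errors is hypothetical and not part of ColossalAI or PyTorch, and it assumes the call runs before torch is imported or any CUDA work starts.

```python
import os


def enable_sync_cuda_errors() -> str:
    """Force synchronous CUDA kernel launches for debugging.

    Hypothetical helper (not a ColossalAI or PyTorch API). It only sets an
    environment variable, so it must run before the first CUDA call;
    the safest place is at the very top of the training script, before
    `import torch`.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


if __name__ == "__main__":
    value = enable_sync_cuda_errors()
    print(f"CUDA_LAUNCH_BLOCKING={value}")
    # Import torch and start training only after this point.
```

The same effect can be had by exporting CUDA_LAUNCH_BLOCKING=1 in the shell before launching the script. Training will be slower with this setting, so it is only meant for locating the failing kernel.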
Environment
No response