[BUG]: RuntimeError: CUDA error: an illegal memory access was encountered #1337

@paulpaulzhang

🐛 Describe the bug

I ran BERT from Hugging Face with ZeRO and got "RuntimeError: CUDA error: an illegal memory access was encountered". The problem seems to be caused by initial_scale in config.py.
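For context, initial_scale is commonly the starting value of a dynamic loss scaler used in mixed-precision training. The sketch below is a toy, pure-Python illustration of that general technique, not ColossalAI's implementation: the class name, the helper simulate_fp16, and all defaults are assumptions made up for this example. It only shows why a large starting scale can overflow fp16 gradients on the first steps; whether that is related to the illegal memory access reported here is not established.

```python
import math

FP16_MAX = 65504.0  # largest finite value representable in IEEE fp16


def simulate_fp16(value: float) -> float:
    """Hypothetical helper: mimic fp16 overflow by saturating to +/-inf."""
    if abs(value) > FP16_MAX:
        return math.copysign(math.inf, value)
    return value


class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative names and defaults)."""

    def __init__(self, initial_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=1000, min_scale=1.0):
        self.scale = float(initial_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self._good_steps = 0

    def update(self, scaled_grads) -> bool:
        """Return True if the optimizer step should be applied."""
        overflow = any(not math.isfinite(g) for g in scaled_grads)
        if overflow:
            # Shrink the scale and skip this step.
            self.scale = max(self.scale * self.backoff_factor, self.min_scale)
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            # Grow the scale after a run of overflow-free steps.
            self.scale *= self.growth_factor
        return True


if __name__ == "__main__":
    raw_grad = 4.0
    for initial_scale in (2.0 ** 16, 2.0 ** 5):
        scaler = DynamicLossScaler(initial_scale=initial_scale)
        scaled = simulate_fp16(raw_grad * scaler.scale)
        applied = scaler.update([scaled])
        print(initial_scale, applied, scaler.scale)
```

In this toy setup a raw gradient of 4.0 overflows with a starting scale of 2**16 and the step is skipped, while a starting scale of 2**5 keeps the scaled gradient finite.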

Traceback (most recent call last):
  File "colossalai/run.py", line 463, in <module>
    train(args)
  File "colossalai/run.py", line 252, in train
    trainer(model,
  File "colossalai/run.py", line 127, in trainer
    engine.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 163, in backward
    ret = self.optimizer.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_optim/sharded_optim_v2.py", line 169, in backward
    self.model.backward(loss)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/colossalai/zero/sharded_model/sharded_model_v2.py", line 233, in backward
    loss.backward()
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f9d1dfa2d62 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1c5f3 (0x7f9d6164f5f3 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f9d61650002 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f9d1df8c314 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x29adb9 (0x7f9de496cdb9 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xae0c91 (0x7f9de51b2c91 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f9de51b2f92 in /home/paulzhang/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #8: <unknown function> + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #9: <unknown function> + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #10: <unknown function> + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #11: <unknown function> + 0x158415 (0x56473bab5415 in /home/paulzhang/miniconda3/bin/python)
frame #12: <unknown function> + 0x15893b (0x56473bab593b in /home/paulzhang/miniconda3/bin/python)
frame #13: <unknown function> + 0x193141 (0x56473baf0141 in /home/paulzhang/miniconda3/bin/python)
frame #14: <unknown function> + 0x1592ac (0x56473bab62ac in /home/paulzhang/miniconda3/bin/python)
frame #15: <unknown function> + 0x158e77 (0x56473bab5e77 in /home/paulzhang/miniconda3/bin/python)
frame #16: <unknown function> + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #17: <unknown function> + 0x158e60 (0x56473bab5e60 in /home/paulzhang/miniconda3/bin/python)
frame #18: <unknown function> + 0x176057 (0x56473bad3057 in /home/paulzhang/miniconda3/bin/python)
frame #19: PyDict_SetItemString + 0x61 (0x56473baf43c1 in /home/paulzhang/miniconda3/bin/python)
frame #20: PyImport_Cleanup + 0x9d (0x56473bb32aad in /home/paulzhang/miniconda3/bin/python)
frame #21: Py_FinalizeEx + 0x79 (0x56473bb64a49 in /home/paulzhang/miniconda3/bin/python)
frame #22: Py_RunMain + 0x183 (0x56473bb66893 in /home/paulzhang/miniconda3/bin/python)
frame #23: Py_BytesMain + 0x39 (0x56473bb66ca9 in /home/paulzhang/miniconda3/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7f9e409e50b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1e21c7 (0x56473bb3f1c7 in /home/paulzhang/miniconda3/bin/python)
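A note on reading the trace above: CUDA kernels are launched asynchronously, so the Python frame that reports the error (here loss.backward()) is not necessarily where the faulty kernel ran. A minimal sketch for getting a more accurate stack trace, assuming the variable is set at the very top of the training script before any CUDA work starts (the helper name cuda_launch_blocking_enabled is made up for this example):

```python
import os

# Force synchronous kernel launches so the error surfaces at the call that
# actually triggered it. This must be set before the first CUDA call; the
# safest place is before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", cuda_launch_blocking_enabled())
```

The same effect can be had by exporting CUDA_LAUNCH_BLOCKING=1 in the shell before whatever launch command was used. Training runs slower in this mode, so it is meant for debugging only.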

Environment

No response
