🐛 Describe the bug
I have no idea why I am getting the error "RuntimeError: CUDA error: no kernel image is available for execution on the device" while training the latent diffusion model on a super-resolution task.
I would really appreciate it if you could help me out.
Lightning config
trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 3
  precision: 16
  auto_select_gpus: false
  strategy:
    target: strategies.ColossalAIStrategy
    params:
      use_chunk: true
      enable_distributed_storage: true
      placement_policy: cuda
      force_outputs_fp32: true
  log_every_n_steps: 3
  logger: true
  default_root_dir: /tmp/diff_log/
logger_config:
  wandb:
    target: loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname
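For context, here is a minimal sketch of how the trainer section of the config above maps onto a Trainer object (an illustration on my part, assuming pytorch-lightning 1.8's built-in ColossalAIStrategy; the model and data built in main_ISP.py are omitted):

```python
# Sketch only: mirrors the "trainer" section of the YAML config above.
# Assumes pytorch-lightning 1.8.x, which ships ColossalAIStrategy.
import pytorch_lightning as pl
from pytorch_lightning.strategies import ColossalAIStrategy

strategy = ColossalAIStrategy(
    use_chunk=True,                   # chunk-based parameter management
    enable_distributed_storage=True,
    placement_policy="cuda",          # keep model data on the GPU
    force_outputs_fp32=True,
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    max_epochs=3,
    precision=16,
    strategy=strategy,
    log_every_n_steps=3,
    default_root_dir="/tmp/diff_log/",
)
# trainer.fit(model, data)  # model and data come from main_ISP.py
```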
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:248: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  rank_zero_warn(
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%|          | 0/42156 [00:00<?, ?it/s]
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
Summoning checkpoint.
[12/17/22 18:52:57] INFO colossalai - ProcessGroup - INFO:
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24 get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
File "/home/liuchaowei/ColossalAI/examples/images/diffusion/main_ISP.py", line 805, in
trainer.fit(model, data)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
self.fit_loop.run()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 81, in optimizer_step
optimizer.step()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 142, in step
ret = self.optim.step(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 143, in step
multi_tensor_applier(self.gpu_adam_op, self._dummy_overflow_buf, [g_l, p_l, m_l, v_l], group['lr'],
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/utils/multi_tensor_apply/multi_tensor_apply.py", line 35, in call
return op(self.chunk_size,
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from multi_tensor_apply at colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:111 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f3e974e120e in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0x21c67 (0x7f3e3abcfc67 in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #2: multi_tensor_adam_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float, float, float, float, int, int, int, float) + 0x2e9 (0x7f3e3abd0569 in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #3: + 0x1c211 (0x7f3e3abca211 in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #4: + 0x1819c (0x7f3e3abc619c in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
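Since the crash originates in ColossalAI's fused Adam kernel (the colossal_C extension called from hybrid_adam.py), not in the diffusion model or the Lightning loop, an isolated repro along these lines (hypothetical, not from the original report) should show the same error if the pre-built kernels do not match the GPU:

```python
# Hypothetical isolated repro: a single HybridAdam step on a tiny CUDA parameter.
# HybridAdam is the optimizer in the traceback above; for GPU parameters its
# step() dispatches to the fused multi-tensor Adam kernel in colossal_C.
import torch
from colossalai.nn.optimizer.hybrid_adam import HybridAdam

p = torch.nn.Parameter(torch.randn(16, device="cuda"))
p.grad = torch.randn_like(p)

opt = HybridAdam([p], lr=1e-3)
opt.step()  # expected to raise the same "no kernel image is available" error
            # if colossal_C was built for a different compute capability
print("fused Adam step succeeded")
```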
Environment
pytorch: 1.12.1
cuda: 11.3
pytorch-lightning: 1.8.0
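Not part of the original report, but "no kernel image is available for execution on the device" typically points to a compute-capability mismatch between this GPU and the CUDA binaries in use (the PyTorch 1.12.1 / CUDA 11.3 build and/or the pre-built colossal_C extension). A quick way to compare what the GPU needs with what the installed PyTorch build supports:

```python
# Diagnostic sketch: compare the GPU's compute capability with the
# architectures the installed PyTorch build was compiled for.
import torch

major, minor = torch.cuda.get_device_capability(0)
print("GPU:", torch.cuda.get_device_name(0))
print("Compute capability: sm_%d%d" % (major, minor))
print("PyTorch built with CUDA:", torch.version.cuda)
print("Supported arch list:", torch.cuda.get_arch_list())
# If sm_XY for this GPU is missing from the arch list (or from the list the
# colossal_C kernels were compiled with), those kernels cannot run on this device.
```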