Bug description
Running LearningRateFinder triggers teardown(), which moves the training epoch loop's results to "cpu" here.
The problem is that loop results are only moved to the device when they are registered for the first time here. This breaks the cumulated_batch_size reduction, which uses the device of the original value tensor from when it was first created. Because that tensor is still on the CPU when training starts for real after lr_find, we hit RuntimeError('No backend type associated with device type cpu').
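To illustrate the failure mode outside of Lightning, here is a hypothetical standalone sketch (not taken from this report): with a process group that only has the NCCL backend, an all_reduce on a CPU tensor fails in the same way, which is why the result tensors need to be back on the accelerator device before the sync.

```python
# Hypothetical illustration, assuming a single machine with at least one CUDA GPU.
import os
import torch
import torch.distributed as dist


def main() -> None:
    # Single-rank "cluster" just to bring up an NCCL-only process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)

    gpu_tensor = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_tensor)  # fine: NCCL has a backend for CUDA tensors

    cpu_tensor = torch.ones(1)  # like cumulated_batch_size after the lr_find teardown()
    try:
        dist.all_reduce(cpu_tensor)
    except RuntimeError as err:
        print(err)  # expected to be along the lines of the error in this report

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```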
For example, the issue occurs when using 2 GPU devices (see the logs below).
I'll submit a fix for review shortly.
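For reference, a minimal sketch of the kind of setup that can run into this; the module, data, and logged metric names are placeholders and not taken from this report. The relevant pieces are the 2-GPU DDP strategy, a metric logged with sync_dist=True, and the LearningRateFinder callback running before the actual fit:

```python
# Hypothetical reproduction sketch -- model and data are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateFinder


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.lr = 1e-3  # attribute that lr_find will tune

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch[0]).mean()
        # sync_dist=True eventually triggers the cross-rank reduction
        # of cumulated_batch_size.
        self.log("train_loss", loss, on_epoch=True, sync_dist=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch[0]).mean()
        self.log("val_loss", loss, on_epoch=True, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)


if __name__ == "__main__":
    train_data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    val_data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,
        strategy="ddp",
        max_epochs=2,
        # lr_find runs first; its teardown moves the loop results to the CPU,
        # and the subsequent training fails when reducing cumulated_batch_size.
        callbacks=[LearningRateFinder()],
    )
    trainer.fit(model=BoringModel(), train_dataloaders=train_data, val_dataloaders=val_data)
```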
What version are you seeing the problem on?
master
How to reproduce the bug
No response
Error messages and logs
train/0 [1]:-> s.trainer.fit(s.model, **kwargs)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(543)fit()
train/0 [1]:-> call._call_and_handle_interrupt(
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py(43)_call_and_handle_interrupt()
train/0 [1]:-> return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py(105)launch()
train/0 [1]:-> return function(*args, **kwargs)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(579)_fit_impl()
train/0 [1]:-> self._run(model, ckpt_path=ckpt_path)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(986)_run()
train/0 [1]:-> results = self._run_stage()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(1032)_run_stage()
train/0 [1]:-> self.fit_loop.run()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(205)run()
train/0 [1]:-> self.advance()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(363)advance()
train/0 [1]:-> self.epoch_loop.run(self._data_fetcher)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(139)run()
train/0 [1]:-> self.on_advance_end(data_fetcher)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(287)on_advance_end()
train/0 [1]:-> self.val_loop.run()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py(182)_decorator()
train/0 [1]:-> return loop_run(self, *args, **kwargs)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(142)run()
train/0 [1]:-> return self.on_run_end()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(254)on_run_end()
train/0 [1]:-> self._on_evaluation_epoch_end()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(336)_on_evaluation_epoch_end()
train/0 [1]:-> trainer._logger_connector.on_epoch_end()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(195)on_epoch_end()
train/0 [1]:-> metrics = self.metrics
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(234)metrics()
train/0 [1]:-> return self.trainer._results.metrics(on_step)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(483)metrics()
train/0 [1]:-> value = self._get_cache(result_metric, on_step)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(447)_get_cache()
train/0 [1]:-> result_metric.compute()
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(289)wrapped_func()
train/0 [1]:-> self._computed = compute(*args, **kwargs)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(251)compute()
train/0 [1]:-> cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py(342)reduce()
train/0 [1]:-> return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(172)_sync_ddp_if_available()
train/0 [1]:-> return _sync_ddp(result, group=group, reduce_op=reduce_op)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(222)_sync_ddp()
train/0 [1]:-> torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
train/0 [1]: /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/c10d_logger.py(72)wrapper()
train/0 [1]:-> return func(*args, **kwargs)
train/0 [1]:> /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py(1996)all_reduce()
train/0 [0]:RuntimeError('No backend type associated with device type cpu')
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
cc @carmocca