
Existing metric keys not moved to device after LearningRateFinder #19813

@clumsy


Bug description

Running LearningRateFinder triggers teardown() on the training epoch loop's results, which moves them to "cpu" here.

The problem is that loop results are only moved to the device when a key is registered for the first time, here. This breaks the cumulated_batch_size reduction, which uses the device the original value tensor had when it was first created. Because that tensor is still on "cpu" when training starts for real after lr_find, we hit RuntimeError('No backend type associated with device type cpu').

For example, the issue occurs when training on 2 GPU devices (see logs below).

I'll submit a fix for review shortly.
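The caching behavior described above can be illustrated with a minimal, self-contained sketch. Note this is a stand-in, not Lightning's actual implementation: ResultCollection, log, cpu, sync, and to_device below are hypothetical names, devices are plain strings, and sync stands in for the DDP all_reduce (NCCL has no backend for cpu tensors).

```python
class ResultCollection:
    """Caches one entry per metric key; the device is only set on first registration."""

    def __init__(self):
        self._store = {}  # key -> device the cached value lives on

    def log(self, key, device):
        # Bug: an existing key keeps whatever device it was created on.
        if key not in self._store:
            self._store[key] = device

    def cpu(self):
        # What teardown() after lr_find effectively does to the results.
        for key in self._store:
            self._store[key] = "cpu"

    def sync(self, key):
        # Stand-in for the DDP reduction: NCCL cannot reduce cpu tensors.
        if self._store[key] == "cpu":
            raise RuntimeError("No backend type associated with device type cpu")
        return self._store[key]


results = ResultCollection()
results.log("cumulated_batch_size", "cuda:0")  # created during lr_find
results.cpu()                                  # teardown after lr_find
results.log("cumulated_batch_size", "cuda:0")  # real training re-logs the key...
try:
    results.sync("cumulated_batch_size")       # ...but the cached entry is still on cpu
except RuntimeError as err:
    print(err)


# A fix along the lines the title suggests: move *existing* keys to the device too,
# instead of only setting the device on first registration.
def to_device(collection, device):
    for key in collection._store:
        collection._store[key] = device


to_device(results, "cuda:0")
print(results.sync("cumulated_batch_size"))
```

With the to_device step applied, the re-registered key is back on the accelerator and the reduction succeeds instead of raising.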

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

train/0 [1]:-> s.trainer.fit(s.model, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(543)fit()
train/0 [1]:-> call._call_and_handle_interrupt(
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py(43)_call_and_handle_interrupt()
train/0 [1]:-> return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py(105)launch()
train/0 [1]:-> return function(*args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(579)_fit_impl()
train/0 [1]:-> self._run(model, ckpt_path=ckpt_path)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(986)_run()
train/0 [1]:-> results = self._run_stage()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py(1032)_run_stage()
train/0 [1]:-> self.fit_loop.run()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(205)run()
train/0 [1]:-> self.advance()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py(363)advance()
train/0 [1]:-> self.epoch_loop.run(self._data_fetcher)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(139)run()
train/0 [1]:-> self.on_advance_end(data_fetcher)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py(287)on_advance_end()
train/0 [1]:-> self.val_loop.run()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py(182)_decorator()
train/0 [1]:-> return loop_run(self, *args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(142)run()
train/0 [1]:-> return self.on_run_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(254)on_run_end()
train/0 [1]:-> self._on_evaluation_epoch_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py(336)_on_evaluation_epoch_end()
train/0 [1]:-> trainer._logger_connector.on_epoch_end()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(195)on_epoch_end()
train/0 [1]:-> metrics = self.metrics
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py(234)metrics()
train/0 [1]:-> return self.trainer._results.metrics(on_step)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(483)metrics()
train/0 [1]:-> value = self._get_cache(result_metric, on_step)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(447)_get_cache()
train/0 [1]:-> result_metric.compute()
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(289)wrapped_func()
train/0 [1]:-> self._computed = compute(*args, **kwargs)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py(251)compute()
train/0 [1]:-> cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py(342)reduce()
train/0 [1]:-> return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(172)_sync_ddp_if_available()
train/0 [1]:-> return _sync_ddp(result, group=group, reduce_op=reduce_op)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py(222)_sync_ddp()
train/0 [1]:-> torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
train/0 [1]:  /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/c10d_logger.py(72)wrapper()
train/0 [1]:-> return func(*args, **kwargs)
train/0 [1]:> /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py(1996)all_reduce()
train/0 [0]:RuntimeError('No backend type associated with device type cpu')

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @carmocca


Labels

bug (Something isn't working), logging (Related to the `LoggerConnector` and `log()`), tuner, ver: 2.2.x
