Bug description
When logging statistics at the end of an epoch from within on_train_epoch_end, the statistics end up on the wrong progress bar.
Since there doesn't seem to be a way to tell Lightning or the TQDMProgressBar to retain the bar for each epoch, I've been forced to inject a new line after each epoch ends so as not to lose any of the valuable statistics in the console output.
The following is the output from a 3 epoch run:
Epoch 0: 100%|█████████████████████████| 938/938 [00:04<00:00, 206.06it/s, v_num=207]
Epoch 1: 100%|█████████████| 938/938 [00:04<00:00, 233.29it/s, v_num=207, loss=0.553]
Epoch 2: 100%|█████████████| 938/938 [00:04<00:00, 233.39it/s, v_num=207, loss=0.329]
`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|█████████████| 938/938 [00:04<00:00, 232.93it/s, v_num=207, loss=0.329]
- Loss for epoch 0 is incorrectly shown for epoch 1.
- Loss for epoch 1 is incorrectly shown for epoch 2.
- No logged loss at all is reported for epoch 0 or epoch 2.
If there is a proper way to retain the progress bar for each epoch that is different from what I'm doing, please let me know and this ticket can be closed. If not, hopefully a fix can be found.
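For context, the only alternative I've found is to subclass TQDMProgressBar and mark the training bar as persistent via tqdm's leave attribute. This is just a sketch (the PersistentProgressBar name is mine); I haven't verified that it actually keeps one bar per epoch on 2.1, nor that it fixes the misplaced statistics:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import TQDMProgressBar

class PersistentProgressBar(TQDMProgressBar):
    def init_train_tqdm(self):
        # Ask tqdm to leave the training bar on screen instead of clearing it.
        bar = super().init_train_tqdm()
        bar.leave = True
        return bar

trainer = pl.Trainer(max_epochs=3, callbacks=[PersistentProgressBar()])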
What version are you seeing the problem on?
v2.1
How to reproduce the bug
import torch
import torchvision
import pytorch_lightning as pl

class DemoNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(784, 10)
        self.batch_losses = []

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def training_step(self, batch: torch.Tensor, _):
        x, y = batch
        x = x.reshape(x.size(0), -1)
        yh = self.fc(x)
        loss = torch.nn.functional.cross_entropy(yh, y)
        self.batch_losses.append(loss)
        return loss

    def on_train_epoch_end(self):
        # Log the mean training loss for the epoch to the progress bar.
        loss = torch.stack(self.batch_losses).mean()
        self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        self.batch_losses.clear()
        # Workaround: print a new line so the finished epoch's bar is not overwritten.
        print("")

ds = torchvision.datasets.MNIST(root="dataset/", train=True, transform=torchvision.transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(dataset=ds, batch_size=64, shuffle=False)
trainer = pl.Trainer(max_epochs=3)
trainer.fit(DemoNet(), train_loader)

Error messages and logs
N/A
Environment
Current environment
Lightning:
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.1.3
  - torch: 2.1.2+cu118
  - torchaudio: 2.1.2+cu118
  - torchmetrics: 1.2.1
  - torchvision: 0.16.2+cu118
  - tqdm: 4.66.1

System:
  - OS: Windows
  - architecture:
    - 64bit
    - WindowsPE
  - processor: AMD64 Family 25 Model 97 Stepping 2, AuthenticAMD
  - python: 3.10.0
  - release: 10
  - version: 10.0.19045

CUDA:
  - GPU:
    - NVIDIA GeForce RTX 4090
  - available: True
  - version: 11.8
  - How you installed Lightning (conda, pip, source): pip
  - Running environment of LightningApp (e.g. local, cloud): local
More info
No response
cc @carmocca