stats logging in "on_train_epoch_end" ends up on wrong progress bar #19322

@jojje

Bug description

When logging statistics at the end of an epoch from within on_train_epoch_end, the statistics end up on the wrong progress bar.

Since there doesn't seem to be a configuration option that tells either Lightning or the TQDMProgressBar to retain the bar for each epoch, I've been forced to print a newline after each epoch ends so that the statistics are not lost from the console output.

The following is the output from a 3-epoch run:

Epoch 0: 100%|█████████████████████████| 938/938 [00:04<00:00, 206.06it/s, v_num=207]
Epoch 1: 100%|█████████████| 938/938 [00:04<00:00, 233.29it/s, v_num=207, loss=0.553]
Epoch 2: 100%|█████████████| 938/938 [00:04<00:00, 233.39it/s, v_num=207, loss=0.329]
`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|█████████████| 938/938 [00:04<00:00, 232.93it/s, v_num=207, loss=0.329]
  • Loss for epoch 0 is incorrectly shown on the bar for epoch 1.
  • Loss for epoch 1 is incorrectly shown on the bar for epoch 2.
  • No logged loss at all is shown on the bar for epoch 0, and the loss for epoch 2 is never shown.
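The off-by-one pattern above is what you would get if the progress bar took its final snapshot of the logged metrics before the module's on_train_epoch_end hook logged the new value. That ordering is only an assumption on my part, not something confirmed from the Lightning source. The toy model below uses no Lightning code at all; `simulate` and the placeholder labels `L0`, `L1`, `L2` (standing in for the per-epoch losses) are hypothetical names used purely for illustration.

```python
def simulate(epoch_losses):
    """Toy model of the suspected ordering; not Lightning's actual code.

    Assumption: at the end of each epoch the progress bar reads the logged
    metrics first, and only afterwards does the module's epoch-end hook log
    the loss for that epoch.
    """
    logged = {}  # metrics currently visible to the progress bar
    bars = []
    for epoch, loss in enumerate(epoch_losses):
        # Assumed step 1: the bar takes its final snapshot for this epoch.
        bars.append((epoch, dict(logged)))
        # Assumed step 2: the module's hook logs this epoch's loss (too late).
        logged["loss"] = loss
    return bars


# Placeholder labels instead of real numbers: L0 is the loss of epoch 0, etc.
bars = simulate(["L0", "L1", "L2"])
for epoch, postfix in bars:
    print(f"Epoch {epoch}: {postfix}")
```

Under that assumption the bar for epoch 0 has no loss, the bar for epoch 1 carries the loss of epoch 0, and the loss of epoch 2 is never displayed, which matches the output above.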

If there is a proper way to retain the progress bar for each epoch that is different from what I'm doing, then please let me know and this ticket can then be closed. If not, hopefully a fix can be found.

What version are you seeing the problem on?

v2.1

How to reproduce the bug

import torch
import torchvision
import pytorch_lightning as pl

class DemoNet(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(784, 10)
        self.batch_losses = []

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

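    def training_step(self, batch, _):
        x, y = batch
        x = x.reshape(x.size(0), -1)  # flatten each 28x28 image into 784 features
        yh = self.fc(x)
        loss = torch.nn.functional.cross_entropy(yh, y)
        # Detach so the stored losses do not keep every batch's autograd graph alive.
        self.batch_losses.append(loss.detach())
        return loss

    def on_train_epoch_end(self):
        # Log the mean loss over the epoch that just finished.
        loss = torch.stack(self.batch_losses).mean()
        self.log('loss', loss, on_step=False, on_epoch=True, prog_bar=True)
        self.batch_losses.clear()
        print("")  # workaround: newline so the next epoch does not overwrite this bar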

ds = torchvision.datasets.MNIST(root="dataset/", train=True, transform=torchvision.transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(dataset=ds, batch_size=64, shuffle=False)
trainer = pl.Trainer(max_epochs=3)
trainer.fit(DemoNet(), train_loader)

Error messages and logs

N/A

Environment

Current environment
  • Lightning:

    • lightning-utilities: 0.9.0
    • pytorch-lightning: 2.1.3
    • torch: 2.1.2+cu118
    • torchaudio: 2.1.2+cu118
    • torchmetrics: 1.2.1
    • torchvision: 0.16.2+cu118
    • tqdm: 4.66.1
  • System:

    • OS: Windows
    • architecture:
      • 64bit
      • WindowsPE
    • processor: AMD64 Family 25 Model 97 Stepping 2, AuthenticAMD
    • python: 3.10.0
    • release: 10
    • version: 10.0.19045
  • CUDA:

    • GPU:
      • NVIDIA GeForce RTX 4090
    • available: True
    • version: 11.8
  • How you installed Lightning(conda, pip, source): pip

  • Running environment of LightningApp (e.g. local, cloud): local

More info

No response

cc @carmocca

Labels

bug (Something isn't working), logging (Related to the `LoggerConnector` and `log()`), ver: 2.1.x
