
Progress bar is broken when loading trainer state from checkpoint #20603

@JLenzy

Bug description

I am using Lightning in conjunction with the MosaicML streaming library, which provides stateful dataloaders for resuming mid-epoch training. I am therefore passing the train/validation dataloaders to the trainer manually, as opposed to using a datamodule. Since I also want to resume optimizer state etc., I pass in the checkpoint as well. My training is run as:

trainer.fit(
    model=lightning_model,
    train_dataloaders=train_dataloader,
    val_dataloaders=validation_dataloader,
    ckpt_path=args.ckpt,
)

Note that at this stage, if resuming, I have already created my dataloaders and restored them from their state dicts.
I have confirmed that len(dataloader) still returns the correct value, indicating exactly how many steps are in the epoch.
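The resumption behaviour described above can be sketched with a toy stateful loader (plain Python; `StatefulLoader`, its methods, and the batch counts are illustrative assumptions, not the MosaicML API):

```python
class StatefulLoader:
    """Toy stand-in for a stateful dataloader that can resume mid-epoch."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.next_batch = 0  # position within the current epoch

    def __len__(self):
        # Length reports the full epoch size, regardless of resume position.
        return self.num_batches

    def __iter__(self):
        while self.next_batch < self.num_batches:
            yield self.next_batch
            self.next_batch += 1

    def state_dict(self):
        return {"next_batch": self.next_batch}

    def load_state_dict(self, state):
        self.next_batch = state["next_batch"]


# Fresh loader, restored from a mid-epoch checkpoint taken at batch 25.
loader = StatefulLoader(num_batches=100)
loader.load_state_dict({"next_batch": 25})

assert len(loader) == 100  # len() still reports the full epoch length
```

So a loader restored to step 25 still answers len() with the full epoch length, which is why the trainer could in principle display a known total.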

However, when resuming, for example from step 25, I see the following in the progress bar:
25/?
So the trainer has (correctly) deduced that the checkpoint is resuming from a global step of 25, but it is no longer calling len(dataloader) to determine how many steps remain.
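What the bar could show instead can be sketched as follows (plain Python; `progress_label` and its parameters are a hypothetical helper, not Lightning's actual progress-bar code):

```python
def progress_label(global_step, dataloader, steps_before_epoch=0):
    """Render 'done/total' for the current epoch; fall back to '?' only
    when the dataloader genuinely has no length."""
    done = global_step - steps_before_epoch
    try:
        total = len(dataloader)  # query the (restored) dataloader
    except TypeError:            # pure iterables may not define __len__
        return f"{done}/?"
    return f"{done}/{total}"


# Resuming from global step 25 in the first epoch of a 100-batch loader:
print(progress_label(25, range(100)))  # -> 25/100
```

Since the restored dataloader still defines len(), the '?' branch would never be hit in this setup.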

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Error messages and logs

No response

Environment

No response

More info

No response

cc @lantiga
