Description
Bug description
I am using Lightning together with the MosaicML streaming library, which provides stateful dataloaders so that training can resume mid-epoch. I am therefore passing the train/validation dataloaders to the trainer manually rather than through a datamodule. Since I also want to resume the optimizer state etc., I pass in the checkpoint as well. My training is therefore run as:
trainer.fit(
    model=lightning_model,
    train_dataloaders=train_dataloader,
    val_dataloaders=validation_dataloader,
    ckpt_path=args.ckpt
)
Note that at this stage, if resuming, I have already created my dataloaders and loaded their state dicts into them.
I have confirmed that the dataloader is still returning len(dataloader) correctly, indicating exactly how many steps are in the epoch.
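To make that check concrete, here is a minimal sketch of the behaviour I am relying on. `ToyStatefulLoader` is a hypothetical stand-in written for illustration, not the streaming library's actual class: after loading a saved state it continues from the saved position, while `len()` still reports the full number of steps in the epoch.

```python
class ToyStatefulLoader:
    """Hypothetical stand-in for a resumable dataloader (illustration only)."""

    def __init__(self, num_batches):
        self.num_batches = num_batches
        self.batches_seen = 0  # position within the current epoch

    def __len__(self):
        # Full number of steps in the epoch, independent of the resume position.
        return self.num_batches

    def __iter__(self):
        # Continue from the saved position instead of restarting at 0.
        while self.batches_seen < self.num_batches:
            batch = self.batches_seen
            self.batches_seen += 1
            yield batch
        self.batches_seen = 0  # start the next epoch from the beginning

    def state_dict(self):
        return {"batches_seen": self.batches_seen}

    def load_state_dict(self, state):
        self.batches_seen = state["batches_seen"]


loader = ToyStatefulLoader(num_batches=100)
loader.load_state_dict({"batches_seen": 25})  # simulate resuming at step 25

epoch_length = len(loader)   # still the full epoch length
remaining = list(loader)     # only the batches left in this epoch
```

In this toy model the epoch length stays at 100 after the state is restored, and iteration yields only the 75 remaining batches, starting at batch 25.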
However, when resuming, for example from step 25, I see the following in the progress bar:
25/?
So it seems that the trainer has (correctly) deduced that the checkpoint resumes from a global step of 25, but is no longer calling len(dataloader) to determine how many steps the epoch contains.
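As an illustration of the symptom only, the sketch below shows how a progress label typically ends up as `25/?`: the total is rendered as `?` when the length of the loader is unknown or never queried. `progress_label`, `SizedLoader` and `unsized_loader` are hypothetical names for this example; this is not Lightning's progress bar code.

```python
def progress_label(current_step, loader):
    """Return 'current/total', or 'current/?' when the total is unknown."""
    try:
        total = len(loader)
    except TypeError:
        # Objects without __len__ (e.g. generators) have no known total.
        total = None
    shown_total = total if total is not None else "?"
    return f"{current_step}/{shown_total}"


class SizedLoader:
    """Hypothetical loader that knows its epoch length."""

    def __len__(self):
        return 100

    def __iter__(self):
        return iter(range(100))


def unsized_loader():
    """Hypothetical loader with no __len__ (a plain generator)."""
    yield from range(100)


label_sized = progress_label(25, SizedLoader())
label_unsized = progress_label(25, unsized_loader())
```

Here `label_sized` is `25/100`, which is what I would expect to see when resuming, while `label_unsized` is `25/?`, which matches what I actually see even though my dataloader does implement `len()`.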
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
No response
Environment
No response
More info
No response
cc @lantiga