The trainer's global step and current epoch don't change #15970
Unanswered
Dee-Ma asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment · 3 replies
-
Hi, PyTorch Lightning is not intended to be used like this currently. The counters for steps and epochs are updated inside our loop API, so if you don't call fit, you bypass those updates. Why are you writing the loops yourself?
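For context, a minimal sketch of the intended usage (the toy model and data here are assumptions, not from the thread): when trainer.fit() drives training, the loop API advances the counters on its own.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    """Toy module; stands in for whatever the question's model does."""
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

loader = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8)
trainer = pl.Trainer(max_epochs=2)
trainer.fit(LitModel(), loader)

# The fit loop has advanced the counters (exact values depend on the PL
# version; with one optimizer step per batch this is 8 x 2 = 16 steps).
print(trainer.global_step, trainer.current_epoch)
```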
-
Hi,

I am using PyTorch Lightning to train a model; the code follows the pattern sketched below. Instead of calling trainer.fit() to run the training automatically, we run our train_step manually and separately invoke the callbacks' inner functions (for example, on_train_batch_end(), on_train_epoch_end(), etc.) to make the callbacks work by hand (something like a 'debug' mode).

After running the code this way, trainer.global_step always equals 0 and trainer.current_epoch doesn't change either during training. As a result, the callbacks (for example, ModelCheckpoint) do not work properly.

Could I know whether trainer.global_step will never change if we don't run trainer.fit()? Is there a way to set the value of trainer.global_step manually? I am also wondering whether the callbacks can only be used with trainer.fit(), meaning the code should never be written like this when using PyTorch Lightning?

Many thanks for the help!
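Here is a minimal, self-contained sketch of the manual pattern described above (all names are placeholders, not the question's actual code): the steps run and the loss decreases, but the Trainer's counters never move because fit() is never called.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

model = LitModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
trainer = pl.Trainer(max_epochs=2)  # created, but fit() is never called

for epoch in range(2):
    for batch_idx, batch in enumerate(loader):
        loss = model.training_step(batch, batch_idx)  # manual "train step"
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # ...here the callbacks' hooks (on_train_batch_end(), etc.) would
        # also be invoked by hand...

# Only fit()'s loops update the counters, so both remain at their defaults:
print(trainer.global_step, trainer.current_epoch)  # 0 0
```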