accumulate_grad_batches argument changes trainer.global_step #9179
Replies: 4 comments 3 replies
- Maybe I've run into the same problem; would you mind sharing your callbacks?
- Not sure if this is related, but I'm running some experiments where 10x gradient accumulation leads to a 10x slowdown (as shown in the TensorBoard logs), which could be due to a change in the way steps are counted rather than an actual slowdown.
- I've run into the same issue. Does anyone have a nice way of handling this?
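One possible workaround (a sketch, not taken from the thread) is to have the callback count batches itself instead of relying on `trainer.global_step`; note that the exact hook signature differs between Lightning versions, and some releases also pass a `dataloader_idx` argument:

```python
import pytorch_lightning as pl


class BatchCounterCallback(pl.Callback):
    """Hypothetical callback that tracks training progress by counting
    batches itself, so the count is unaffected by accumulate_grad_batches."""

    def __init__(self):
        self.batches_seen = 0

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Runs once per batch, unlike trainer.global_step, which only
        # advances once per optimizer step.
        self.batches_seen += 1
```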
- @duskvirkus I think this is expected behavior, in the sense that `global_step` means how many optimizer steps have happened so far.
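Put differently (a minimal sketch of that relationship, assuming a single optimizer and a batch count that divides evenly into accumulation windows):

```python
# With accumulate_grad_batches = k, the optimizer steps once every k batches,
# so trainer.global_step advances k times more slowly than the batch count.
accumulate_grad_batches = 4    # hypothetical setting
batches_processed = 400        # hypothetical number of training batches seen

expected_global_step = batches_processed // accumulate_grad_batches
print(expected_global_step)  # 100 optimizer steps, i.e. what global_step reports
```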
- Not sure if this is a bug, but I came across the side effect that using `accumulate_grad_batches` changes the way `trainer.global_step` is counted. I noticed it while trying to debug a custom callback that uses `trainer.global_step` to keep track of training progress; if there's a "better" way to go about this, let me know.
Not that hard to fix the problem now that I know what's happening, but I figured I'd bring it up in case I'm going about things wrong or it's a bug.
Toy example: run the same training once without `accumulate_grad_batches` and once with it, and compare the resulting `trainer.global_step` (a sketch is below).
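A sketch of what such a toy example might look like (not the original post's code; the `TinyModel` and the expected step counts are illustrative assumptions, and exact behavior can vary between Lightning versions):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def run(accumulate_grad_batches):
    # 64 samples with batch_size=8 -> 8 batches per epoch.
    dataset = TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    loader = DataLoader(dataset, batch_size=8)
    trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=accumulate_grad_batches)
    trainer.fit(TinyModel(), loader)
    return trainer.global_step


# Since global_step counts optimizer steps, one would expect roughly:
print("without accumulation:", run(1))  # 8 batches -> 8 optimizer steps
print("with accumulate_grad_batches=4:", run(4))  # 8 batches -> 2 optimizer steps
```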