What is trainer/global_step in wandb logging? #20377
Replies: 3 comments 5 replies
Now that it's past the first epoch, it's clear that it wasn't the logger "falling behind". The second epoch started at approximately global step 7049, and according to the progress bar there are 27907 batches per epoch. So, again, a factor of four. Is this a bug?
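One plausible explanation for that factor of four (an assumption; the thread never shows the Trainer config) is gradient accumulation: trainer/global_step advances once per optimizer step, i.e. once per accumulate_grad_batches batches. A quick sketch of the arithmetic:

```python
import math

# Hypothetical sketch: if the factor-of-four gap comes from gradient
# accumulation, trainer/global_step advances once per optimizer step,
# i.e. once per `accumulate_grad_batches` batches.

def steps_per_epoch(batches_per_epoch, accumulate_grad_batches):
    # Lightning performs one optimizer step per full accumulation window,
    # plus one for any leftover partial window at the end of the epoch.
    return math.ceil(batches_per_epoch / accumulate_grad_batches)

# With the 27907 batches/epoch from the progress bar and a hypothetical
# accumulation factor of 4, one epoch yields 6977 optimizer steps, in
# the rough neighborhood of the ~7049 observed at the start of epoch two.
print(steps_per_epoch(27907, 4))
```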
Yeah, I really don't get how wandb comes up with its steps either.
I figured out what trainer/global_step is! It is exactly as they say: the total number of training optimization steps taken. The problem isn't with PyTorch Lightning; the problem is with W&B's "step", which is probably why Lightning created its own step metric. I calculated the number of steps my training was supposed to take, and it exactly matched trainer/global_step, whereas the W&B step was way off base (about 1/10 of the actual number of steps). In addition, I changed my code in a way that, by my calculation, shouldn't change the total number of optimization steps; despite this, it halved the number of W&B "steps" but kept trainer/global_step constant! It is possible that Lightning internally calls W&B in some way that mangles W&B's step counter. Regardless, trainer/global_step should be trusted and the W&B step shouldn't. This is with pytorch_lightning==2.4.0 and wandb==0.19.11.

P.S. I think this discussion is on the same topic: #8007
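The drift described above can be reproduced in miniature without either library. By default, each wandb.log() call advances W&B's internal step by one, while Lightning only logs every log_every_n_steps optimizer steps (50 by default), so the two counters diverge. A minimal sketch, where FakeRun is a stand-in mimicking W&B's documented default behavior, not the real API:

```python
class FakeRun:
    """Stand-in mimicking W&B's default step behavior: each log() call
    advances the internal step by one unless a step is passed explicitly."""
    def __init__(self):
        self.step = 0

    def log(self, data, step=None):
        if step is None:
            self.step += 1  # default: one W&B step per log call
        else:
            self.step = step

run = FakeRun()

# Suppose 500 optimizer steps occur, but metrics are logged only every
# 50 steps (Lightning's default log_every_n_steps). W&B's counter then
# ends at 10 while trainer/global_step reads 500:
for global_step in range(50, 501, 50):
    run.log({"loss": 0.0, "trainer/global_step": global_step})

print(run.step)  # 10 log calls -> W&B step 10
```

This is also why plotting against the logged trainer/global_step metric, rather than W&B's own step axis, gives the numbers you expect.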
What does the value trainer/global_step mean in wandb logging? I am not using distributed training, and I am only looking during the first epoch. I would think that this is the number of batches processed, but it doesn't seem like it. What is it supposed to be? For example, the latest logged value is 1199, but my progress bar shows the current batch is 5000. I thought maybe the wandb logger is lagging a bit, but I doubt it is lagging that much. So where is this factor of about four difference coming from?
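One way to read that counter: trainer/global_step counts optimizer steps, not batches, so any setting that spreads one optimizer step across several batches makes it trail the batch index. Gradient accumulation is one hypothetical cause (the config isn't shown in the thread); a minimal simulation:

```python
def simulate_global_step(num_batches, accumulate_grad_batches):
    """Count optimizer steps the way Lightning does: one per completed
    accumulation window, so global_step trails the batch index."""
    global_step = 0
    for batch_idx in range(num_batches):
        if (batch_idx + 1) % accumulate_grad_batches == 0:
            global_step += 1
    return global_step

# 5000 batches with a hypothetical accumulation factor of 4 gives 1250
# optimizer steps -- the same order as the 1199 logged vs. batch 5000
# described above.
print(simulate_global_step(5000, 4))
```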
To make things concrete, I took these two screenshots at roughly the same time: