Understanding epoch metrics: acc_train_epoch does not appear to be the average of acc_train_step #14474
Unanswered
cfhammill asked this question in: Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 3 replies
-
It should be the same: normally the epoch value is just a weighted average of the step values w.r.t. the batch size, so if acc_train_step is mostly 1, the epoch value must be close to 1 too. Could you share a reproducible example so we can look into this issue?
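To make the reply's claim concrete, here is a minimal pure-Python sketch of a batch-size-weighted average (an illustration of the stated reduction, not Lightning's actual internals):

```python
# Sketch: batch-size-weighted average of per-step metric values.
def epoch_average(step_values, batch_sizes):
    """Return sum(v_i * n_i) / sum(n_i)."""
    weighted = sum(v * n for v, n in zip(step_values, batch_sizes))
    return weighted / sum(batch_sizes)

# If every step reports accuracy 1.0, the weighted average is 1.0
# regardless of how the batch sizes vary:
print(epoch_average([1.0, 1.0, 1.0], [32, 32, 8]))  # 1.0
```

The batch-size weighting only matters when the last batch of an epoch is smaller than the rest; with identical step values it changes nothing.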
-
Hi lightning devs and users,
I’m using Lightning to train some models for work, and I’m having trouble understanding how the epoch-level metrics are aggregated and computed. In my model, acc_train_step hits perfect accuracy and maintains it for thousands of steps, whereas acc_train_epoch stays below 0.7. From reading the documentation I would expect acc_train_epoch to be the average of acc_train_step over the steps of the epoch, but then shouldn’t acc_train_epoch be 1 as well?
Can someone help me understand why these two graphs would be so different?
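For scale, a quick back-of-the-envelope check (made-up numbers, not my actual run) shows how an epoch average over all steps can sit near 0.7 even when the step metric has been pinned at 1.0 for hundreds of consecutive steps:

```python
# Illustration (made-up numbers): the epoch metric averages *every*
# step in the epoch, so early low-accuracy steps drag it down even
# after acc_train_step has reached 1.0 and stayed there.
step_accs = [0.1] * 300 + [1.0] * 700   # 1.0 for the last 700 steps
epoch_acc = sum(step_accs) / len(step_accs)
print(round(epoch_acc, 4))  # 0.73
```

That said, this only explains the gap if the epochs are long enough that the climb to 1.0 happens within a single epoch.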
I’m using pytorch-lightning 1.6.3 on Python 3.9.10.
Thanks!
extra details:
I’m training with the dp parallel strategy on 4 GPUs,
and my training_step looks like:
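One dp-specific subtlety that could contribute: each batch is split across the 4 GPUs, and a plain mean of per-shard accuracies differs from the pooled batch accuracy whenever the shards are uneven. A toy illustration with made-up numbers (not the asker's setup):

```python
# Hedged illustration: dp splits each batch across GPUs; a plain mean
# of per-shard accuracies ("mean of means") differs from the pooled
# accuracy when shard sizes are unequal.
shard_correct = [8, 2]   # correct predictions on each GPU shard
shard_sizes = [10, 2]    # uneven shard sizes
mean_of_means = sum(c / n for c, n in zip(shard_correct, shard_sizes)) / len(shard_sizes)
pooled = sum(shard_correct) / sum(shard_sizes)
print(mean_of_means)     # 0.9
print(round(pooled, 4))  # 0.8333
```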