Simple way to diagnose per-epoch slowness? #16519
Unanswered
turian asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0 comments
This has come up for me and other users before. If there is stalling between epochs, is there a simple way to diagnose where Lightning is spending time and introducing overhead?
I've seen possible fixes discussed before, but they require fiddling around and don't necessarily give visibility. What I'm asking for is simply how to see where the time is going.
In some cases this might be mitigated using, for example, trainer.check_val_every_n_epoch > 1 and (now deprecated?) reload_dataloaders_every_epoch. But it would be nice to have one or two lines that instrument profiling / diagnostics of Lightning's logging, dataloader, or whatever other overhead, so users actually understand what is slow and why. That makes it easier to fix the root cause.
Related:
#10660
#10389
#2367
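To make the request concrete, here is a rough, library-agnostic sketch of the kind of per-section timing I mean. It is not Lightning code: SectionTimer, run_epochs, and the section names are hypothetical, and the sleeps stand in for real work. Lightning's Trainer does accept a profiler argument (for example profiler="simple"), which may be the closest built-in, though I can't say whether it covers the between-epoch overhead described above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time per named section.

    Hypothetical helper for illustration only; not a Lightning API.
    """

    def __init__(self):
        self.totals = defaultdict(float)  # section name -> total seconds
        self.counts = defaultdict(int)    # section name -> number of calls

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped code raises.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Return (name, total_seconds, calls), slowest section first.
        rows = sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, total, self.counts[name]) for name, total in rows]


def run_epochs(timer, num_epochs=2, num_batches=3):
    """Toy loop; the section names are assumptions, not Lightning hook names."""
    for _ in range(num_epochs):
        with timer.section("dataloader_setup"):
            batches = list(range(num_batches))  # stand-in for building a dataloader
        for _batch in batches:
            with timer.section("training_step"):
                time.sleep(0.001)  # stand-in for forward/backward
        with timer.section("validation"):
            time.sleep(0.002)  # stand-in for a validation pass


if __name__ == "__main__":
    timer = SectionTimer()
    run_epochs(timer)
    for name, total, count in timer.report():
        print(f"{name:<20} {total:8.4f}s over {count} calls")
```

Something with this shape, wrapped around the epoch boundary (dataloader reload, validation, logging, checkpointing), would show directly which part is stalling.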