In my training loop I use tqdm to iterate over batches, and I update the tqdm status bar with a summary of metrics (loss, acc, etc.). Looking at a trace of the 3rd epoch, I see a lot of time spent in `MemcpyD2H`. I realize this is just waiting for CUDA streams to complete, but if I disable the status bar update (so tqdm just logs `it/s`) I see a 15% performance improvement (77ms/batch -> 66ms/batch). I only sometimes care about live metrics, so I'd rather not pay this penalty on every step. Is this a common problem? Is there a way to lazily copy results back to the host?
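A minimal sketch of the pattern described in the question, assuming a PyTorch-style loop (the thread doesn't name the framework); `train_step` and the synthetic `loader` are hypothetical stand-ins for the real training code:

```python
import torch
from tqdm import tqdm

def train_step(x):
    # Hypothetical placeholder: any work that leaves the loss on the GPU.
    return (x ** 2).mean()

loader = [torch.randn(1024, device="cuda") for _ in range(100)]

pbar = tqdm(loader)
for x in pbar:
    loss = train_step(x)
    # .item() forces a blocking device-to-host copy (the MemcpyD2H in the
    # trace): the host stalls here until all queued CUDA work has finished.
    pbar.set_postfix(loss=f"{loss.item():.4f}")
```

The copy itself is tiny; the cost is the synchronization it implies, since the host must wait for every queued kernel before it can read the value.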
Replies: 1 comment

-

I'm also curious if there's a clever way to do something here. Usually what I do is just log a summary every N steps, so that its effect on runtime is negligible. For my wandb logging I have a callback which aggregates metrics, but that's probably overkill for this.
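The every-N-steps idea might look like the sketch below, reusing the hypothetical `train_step` and `loader` from the snippet above (`N = 50` is an arbitrary choice). Accumulating the running loss on-device means the blocking device-to-host copy happens once per N steps instead of once per step:

```python
N = 50  # sync with the host only once every N steps

pbar = tqdm(loader)
running = torch.zeros((), device="cuda")  # accumulates on-device, no sync
for step, x in enumerate(pbar):
    loss = train_step(x)
    running += loss.detach()
    if (step + 1) % N == 0:
        # One MemcpyD2H per N steps; the status bar lags by at most N steps.
        pbar.set_postfix(loss=f"{(running / N).item():.4f}")
        running.zero_()
```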