I'm trying to understand how performant my code is, and I thought I'd use `jax.profiler.trace` to identify any bottlenecks. But the results I'm seeing so far don't really tell me much. As suggested by the docs, I've used the context manager around a training loop:
```python
import jax

with jax.profiler.trace(log_dir='/scratch/shyamss/tmp/tensorboard'):
    # Training function
    model.train(..., n_iter)
```
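In case it helps, here's a minimal self-contained version of the setup (my actual `model.train` isn't shown above, so the toy `train_step` below is just a stand-in for illustration):

```python
import jax
import jax.numpy as jnp

@jax.jit
def train_step(params, batch):
    # Toy stand-in for a single optimization step.
    return params - 0.01 * jnp.tanh(batch @ params)

params = jnp.ones((512, 512))
batch = jnp.ones((512, 512))

with jax.profiler.trace(log_dir='/scratch/shyamss/tmp/tensorboard'):
    for step in range(10):
        params = train_step(params, batch)
    # Block so the device work finishes (and is captured) before the trace ends.
    params.block_until_ready()
```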
When I open the log dir in TensorBoard, I don't see any details on the overview page. The pages that do seem to have some detail are `trace_viewer`, which I don't know how to interpret, and `tensorflow_stats`. One thing that did concern me is that the `tensorflow_stats` page reports 40.8% of device time as Thunk and 59.2% as IDLE. Here's some of the exported data from that page:
```csv
Rank,Host/device,Type,Operation,#Occurrences,Total time (us),Avg. time (us),Total self-time (us),Avg. self-time (us),Total self-time on Device (%),Cumulative total-self time on Device (%),Total self-time on Host (%),Cumulative total-self time on Host (%)
1,Device,IDLE,IDLE,0,562120574.823,0.0,562120574.823,0.0,0.5921957074522451,0.5921957074522451,0.0,0.0
2,Device,Thunk,Thunk,88150,387093625.397,4391.306016982417,387093625.397,4391.306016982417,0.4078042925477548,1.0,0.0,0.0
3,Host,IDLE,IDLE,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
```
Does this mean that the GPU is doing nothing for more than half the time? `nvidia-smi` seemed to consistently indicate that utilization was at 99-100%. Is there something I'm not understanding here?
I'd greatly appreciate any suggestions / links to resources I should look at so that I can understand the issue better.
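As a follow-up: would it make sense to cross-check with a plain wall-clock timing of the steps, blocking on the result so JAX's async dispatch doesn't hide the device time? Something like this rough sketch, reusing the toy `train_step`, `params`, and `batch` from the snippet above (placeholders for my actual code):

```python
import time

# Warm-up call so compilation time isn't included in the measurement.
params = train_step(params, batch)
params.block_until_ready()

start = time.perf_counter()
for _ in range(100):
    params = train_step(params, batch)
# Block on the final result; otherwise only dispatch time is measured.
params.block_until_ready()
print(f"avg step time: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")
```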