I'm trying to understand how performant my code is, and I thought I'd use `jax.profiler.trace` to identify any bottlenecks. But the results I'm seeing so far don't really tell me much. As suggested by the docs, I've used the context manager around a training loop:
```python
import jax

with jax.profiler.trace(log_dir='/scratch/shyamss/tmp/tensorboard'):
    # Training function
    model.train(..., n_iter)
```
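In case it helps, here's a minimal self-contained version of the setup (my actual `model.train` isn't shown above, so the toy `train_step` below is just a stand-in for illustration):

```python
import jax
import jax.numpy as jnp

@jax.jit
def train_step(params, batch):
    # Toy stand-in for a single optimization step.
    return params - 0.01 * jnp.tanh(batch @ params)

params = jnp.ones((512, 512))
batch = jnp.ones((512, 512))

with jax.profiler.trace(log_dir='/scratch/shyamss/tmp/tensorboard'):
    for step in range(10):
        params = train_step(params, batch)
    # Block so the device work finishes (and is captured) before the trace ends.
    params.block_until_ready()
```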
When I open the log dir in TensorBoard, I don't see any details on the overview page. The pages that do seem to have some detail are `trace_viewer`, which I don't know how to interpret, and `tensorflow_stats`. One thing that did concern me is that the `tensorflow_stats` page reports 40.8% of device time as Thunk and 59.2% as IDLE. Here's some of the exported data from that page:
```csv
Rank,Host/device,Type,Operation,#Occurrences,Total time (us),Avg. time (us),Total self-time (us),Avg. self-time (us),Total self-time on Device (%),Cumulative total-self time on Device (%),Total self-time on Host (%),Cumulative total-self time on Host (%)
1,Device,IDLE,IDLE,0,562120574.823,0.0,562120574.823,0.0,0.5921957074522451,0.5921957074522451,0.0,0.0
2,Device,Thunk,Thunk,88150,387093625.397,4391.306016982417,387093625.397,4391.306016982417,0.4078042925477548,1.0,0.0,0.0
3,Host,IDLE,IDLE,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
```
Does this mean that the GPU is doing nothing for more than half the time? `nvidia-smi` seemed to consistently indicate that utilization was at 99-100%. Is there something I'm not understanding here?
I'd greatly appreciate any suggestions / links to resources I should look at so that I can understand the issue better.
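As a follow-up: would it make sense to cross-check with a plain wall-clock timing of the steps, blocking on the result so JAX's async dispatch doesn't hide the device time? Something like this rough sketch, reusing the toy `train_step`, `params`, and `batch` from the snippet above (placeholders for my actual code):

```python
import time

# Warm-up call so compilation time isn't included in the measurement.
params = train_step(params, batch)
params.block_until_ready()

start = time.perf_counter()
for _ in range(100):
    params = train_step(params, batch)
# Block on the final result; otherwise only dispatch time is measured.
params.block_until_ready()
print(f"avg step time: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")
```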