-
Here is the Perfetto trace. I don't understand why there are large gaps between the kernels, why so much computation is done on the host, or why the MemcpyH2D takes as long as 60 microseconds. The input data is only 336 kB, far less than the ~1875 kB that could theoretically be transferred in 60 microseconds at PCIe 4.0's 32 GB/s.
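As a sanity check, I can time the H2D copy on its own with jax.device_put (sketch below, with a placeholder input of the same size as the real one). My understanding is that small transfers like this are usually latency-bound rather than bandwidth-bound, so the 60 microseconds may be mostly fixed per-copy overhead rather than a sign that PCIe bandwidth is the bottleneck.

```python
import time
import numpy as np
import jax

# Placeholder input standing in for the real 336 kB tensor (~86k float32 values).
x_host = np.ones((86_016,), dtype=np.float32)

gpu = jax.devices("gpu")[0]

# Time the H2D copy in isolation; block_until_ready ensures the asynchronous
# transfer has actually finished before the clock stops.
t0 = time.perf_counter()
x_dev = jax.device_put(x_host, gpu)
x_dev.block_until_ready()
print(f"H2D copy: {(time.perf_counter() - t0) * 1e6:.1f} us")

# Reusing x_dev across jitted calls keeps the copy out of the per-run time;
# the kernels then read an input that already lives in GPU memory.
```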
-
I have a program where each jitted single-core CPU run takes 400 microseconds and each jitted GPU run takes 300 microseconds, timed after a 5-iteration warmup loop. The GPU run is therefore only about 25% faster than the single-core CPU run. The program is imported into JAX from an ONNX model using jaxonnxruntime.
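To make the setup concrete, here is a minimal stand-in for the kind of measurement I mean (the model below is just a placeholder, not the imported ONNX graph; I believe block_until_ready is needed so the asynchronous GPU dispatch is actually included in the timing):

```python
import time
import jax
import jax.numpy as jnp

# Placeholder for the jitted function produced from the ONNX model; the real
# one takes the model's actual inputs.
@jax.jit
def model_fn(x):
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((256, 256), dtype=jnp.float32)

# 5-iteration warmup, so compilation and first-run costs are excluded.
for _ in range(5):
    model_fn(x).block_until_ready()

# Time steady-state runs; block_until_ready matters because JAX dispatch is
# asynchronous and the Python call returns before the GPU work finishes.
n = 100
t0 = time.perf_counter()
for _ in range(n):
    model_fn(x).block_until_ready()
print(f"mean per run: {(time.perf_counter() - t0) / n * 1e6:.1f} us")
```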
I did some rough profiling of one inference using jax.profiler.trace() and saw what seemed to be many kernel launches interspersed with host activity. Could it be that the host activity is causing the slowness? How could I debug this and get more things to run on the GPU instead of the host? Is there a better way to import ONNX into JAX than jaxonnxruntime? Is there a way to show the JAX code corresponding to each kernel in the trace? The kernel names don't tell me much. Or are there other avenues to try for understanding and debugging the slowness?
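One thing I am considering, sketched below with placeholder names (not the real imported model), is wrapping regions in jax.named_scope and capturing a trace with jax.profiler.trace; as far as I understand, the scope names get attached to the ops created inside them and show up in the HLO and, usually, in the profiler UI:

```python
import jax
import jax.numpy as jnp

# Placeholder model standing in for the function produced by jaxonnxruntime.
@jax.jit
def model_fn(x):
    # named_scope labels the ops traced inside it, so they can be recognized
    # in the lowered HLO and in the profiler trace.
    with jax.named_scope("preprocess"):
        x = (x - x.mean()) / (x.std() + 1e-6)
    with jax.named_scope("matmul_block"):
        y = jnp.tanh(x @ x.T)
    return y

x = jnp.ones((256, 256), dtype=jnp.float32)
model_fn(x).block_until_ready()  # compile / warm up before tracing

# Capture a trace of a few steady-state runs; view the result in TensorBoard's
# Profile tab or open it in Perfetto.
with jax.profiler.trace("/tmp/jax-trace"):
    for _ in range(10):
        model_fn(x).block_until_ready()
```

Printing the compiled program with jax.jit(model_fn).lower(x).compile().as_text() might also help: if it is a long list of small ops rather than a few large fusions, per-kernel launch overhead and host-side dispatch could plausibly dominate a 300-microsecond run.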