You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[TUTORIAL] Measuring performance in persistent kernels tutorial in stable thermal state (triton-lang#5042)
Following the nvidia's recipe for measuring performance in
09-persistent-matmul.py tutorial: get system into a stable thermal state
by using long warmup run, then do 1000 runs of benchmark.
We couldn't done it in the beginning because creating and passing TMA
descriptors was creating GPU bubble that allowed GPU to cool down, thus
not reaching equilibrium, skewing TMA kernel results towards unfair
higher scores. With changes around passing descriptors via grid
constants I see results very close to the version with descriptor
re-use, so we can now use this methodology and get correct benchmarking
results.
Example cmd line for measuring perf of fp8 matmul across K=[512, 8192]:
`python 09-persistent-matmul.py --prec fp8 --K_range 512 8192`
0 commit comments