Add inductor_benchmarker as latency measurement option #333
base: main
Conversation
Force-pushed from a83ce70 to 37509ff
Besides that, I recall @xuzhao9 mentioned somewhere that we can use Kineto's trace to sum only the GPU time.
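For reference, a minimal sketch (not from this thread) of summing only GPU kernel time from a Kineto/torch.profiler trace; the callable `fn`, the iteration count, and the use of `device_time_total` (named `cuda_time_total` on older PyTorch versions) are assumptions here.

```python
# Sketch: sum device-side kernel time from a torch.profiler (Kineto) trace.
# `fn` and the iteration count are placeholders, not from the PR.
import torch
from torch.profiler import profile, ProfilerActivity

def kineto_gpu_time_ms(fn, iters=100):
    fn()  # warmup
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn()
        torch.cuda.synchronize()
    # device_time_total aggregates GPU-side time (microseconds) per event key;
    # on older PyTorch versions this attribute is cuda_time_total.
    total_us = sum(evt.device_time_total for evt in prof.key_averages())
    return total_us / 1e3 / iters  # average milliseconds per iteration
```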
I'm not sure about this change. Will this benchmark function change the results for pure handwritten Triton kernels? If it matches the results of triton.testing.do_bench for other Triton kernels, I feel we should just use this new function. If not, we need to understand what the differences are and see how to fix them. What do you think? @xuzhao9
https://github.com/triton-lang/triton/blob/37f265932b68868021b2fade6354b44e613dc124/python/triton/testing.py#L163
@FindHao benchmarking.benchmark_gpu seems to be better for benchmarks than triton.testing.do_bench. Please check this script for the comparison: P1908752761
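The paste P1908752761 is internal; below is a hedged sketch of what such a side-by-side comparison could look like. The test workload, tensor sizes, and the `torch._inductor.runtime.benchmarking` import path (which may differ across PyTorch versions) are assumptions, not the actual script.

```python
# Sketch: compare triton.testing.do_bench against Inductor's benchmark_gpu
# on the same callable. The matmul workload here is a placeholder.
import torch
import triton.testing
from torch._inductor.runtime.benchmarking import benchmarker

x = torch.randn(4096, 4096, device="cuda")
fn = lambda: torch.mm(x, x)

do_bench_ms = triton.testing.do_bench(fn)         # Triton's default timer
benchmark_gpu_ms = benchmarker.benchmark_gpu(fn)  # Inductor's CUDA-event timer

print(f"do_bench:      {do_bench_ms:.4f} ms")
print(f"benchmark_gpu: {benchmark_gpu_ms:.4f} ms")
```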
This sounds good! It would be great if we had a PyTorch utility that benchmarks only GPU time. We can converge all kernel benchmarks to use it if possible.
Can you clarify what the differences are between these two implementations? What I mean is: if you claim that the Dynamo overhead is completely hidden in real runs, I am willing to trust that. However, we still need to verify that the results from benchmark_gpu are consistent with those from do_bench on other handwritten Triton kernels. Otherwise, measuring only the pure GPU time may be misleading, since it could overlook real Python or Triton runtime overhead.
Would we be open to having a mode that measures only GPU time? I believe for real-world model use cases, the kernel launch or Python overhead is usually hidden by the previous kernel (assuming no bubble), so having a mode that only measures GPU time is useful and reflects that scenario too.
Try this: Line 15 in a404ea7
Is this what you need?
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair to all backends. For example, it includes torch.compile Dynamo invocation overhead and hence doesn't reflect the real-world model use case, where Dynamo overhead is usually hidden. I also opened a PR to use this timing measurement function on the TritonBench side: meta-pytorch/tritonbench#333. But regardless of whether that PR can land, I think we should enhance Inductor's benchmark_gpu to match do_bench's features, to make it easier for people to migrate. Pull Request resolved: #160921 Approved by: https://github.com/BoyuanFeng
Due to the limitations of Triton's do_bench/do_bench_cudagraph, it is beneficial to have the Inductor benchmarker and more latency measurement modes (e.g. power-limit-aware benchmarking) in the future.
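A hypothetical sketch of how such a latency-measurement option could dispatch between the existing timers; the mode names and the `measure_latency` helper are illustrative only, not TritonBench's actual interface.

```python
# Hypothetical dispatcher: the mode names and this helper are illustrative,
# not TritonBench's real option handling.
import triton.testing
from torch._inductor.runtime.benchmarking import benchmarker

def measure_latency(fn, mode: str = "do_bench") -> float:
    """Return latency in milliseconds using the requested measurement mode."""
    if mode == "do_bench":
        return triton.testing.do_bench(fn)
    if mode == "do_bench_cudagraph":
        return triton.testing.do_bench_cudagraph(fn)
    if mode == "inductor_benchmarker":
        return benchmarker.benchmark_gpu(fn)
    raise ValueError(f"unknown latency measurement mode: {mode}")
```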
x.grad = None

# Measure only the function execution time
ms_time = benchmarker.benchmark_gpu(fn)
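For context, a self-contained sketch of the measurement path in that hunk, assuming a backward-pass benchmark; `x`, the loss, and the shapes are placeholders rather than the PR's actual code.

```python
# Sketch of the hunk above in runnable form; the workload is a placeholder.
import torch
from torch._inductor.runtime.benchmarking import benchmarker

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)

def fn():
    (x * x).sum().backward()  # builds a fresh graph on each call

x.grad = None                            # reset gradients before timing
ms_time = benchmarker.benchmark_gpu(fn)  # CUDA-event timed GPU latency in ms
print(f"{ms_time:.4f} ms")
```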
I am wondering, does the Inductor benchmarker use CUDA graphs?
As discussed in https://fb.workplace.com/groups/257735836456307/posts/967097558853461/?comment_id=971504168412800&reply_comment_id=971545435075340, the normal triton.testing.do_bench measurement includes torch.compile Dynamo invocation overhead and doesn't reflect the real-world model use case, where Dynamo overhead is usually hidden. This PR adds an option to use the Inductor benchmarker as the timing measurement tool, which uses CUDA events for timing and thus more accurately measures only the CUDA kernel runtime.
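For reference, a minimal sketch of the CUDA-event timing pattern the description refers to; this illustrates the general approach, not the Inductor benchmarker's exact implementation.

```python
# Minimal CUDA-event timing sketch (general pattern, not the benchmarker code).
import torch

def cuda_event_time_ms(fn, iters=100):
    fn()  # warmup / trigger any lazy compilation
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time returns milliseconds
```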