Use iteration count instead of time for parameters warmup and rep of do_bench* functions for benchmarks
#2256
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
[Resolved review comments on benchmarks/triton_kernels_benchmark/flash_attention_fwd_benchmark.py (outdated)]
Signed-off-by: Anatoly Myachev <[email protected]>
@ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will post the exact figures here after https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 finishes.
Are you aware of the effect where, as the number of runs increases, the average time gets noticeably worse? I don't know what to do about this slowdown, but I still think the idea of running multiple times is worth it (>3, since only in that case the "*-CV" column will not be NaN).
I think that when the warmup runs "too many times" the GPU may start heating up and then throttle its frequency down, so when the timed runs start the performance is reduced. That means we are better off not increasing rep/warmup to the point where we see performance degradation in the benchmarks.
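A hypothetical way to sanity-check this (not part of the PR): compare per-iteration timings from the beginning and the end of a long run; a clearly slower tail would point at the device slowing down rather than at measurement noise. `kernel` and `sync` below are placeholders for the benchmarked function and a device synchronization call.

```python
import statistics
import time

def throttling_check(kernel, sync, n_iters=200):
    """Time n_iters runs of `kernel` and compare the first and last quarter."""
    times_ms = []
    for _ in range(n_iters):
        sync()
        start = time.perf_counter()
        kernel()
        sync()
        times_ms.append((time.perf_counter() - start) * 1e3)
    head = statistics.mean(times_ms[: n_iters // 4])
    tail = statistics.mean(times_ms[-(n_iters // 4):])
    return head, tail  # tail noticeably larger than head suggests throttling
```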
etiotto left a comment:
I do not think we should increase the number of repetitions too much. Going from 10 to 600 repetitions is a huge increase.
The kernel timing distribution should be a normal (Gaussian) curve. We only need to run the benchmark enough times to approximate a Gaussian "bell" curve. From https://www.scribbr.com/statistics/central-limit-theorem/#:~:text=By%20convention%2C%20we%20consider%20a,if%20the%20population%20is%20normal. it looks like 30 is the number of reps we should use.
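For reference, a minimal sketch (assumed helper, not code from this PR) of what a CV-style summary over ~30 timed repetitions would compute; `times_ms` stands in for the per-iteration timings collected by a do_bench variant.

```python
import statistics

def summarize(times_ms):
    """Return mean, standard deviation, and coefficient of variation (CV = std / mean)."""
    mean = statistics.mean(times_ms)
    std = statistics.stdev(times_ms)
    return mean, std, std / mean

# Illustrative numbers only: 30 repetitions of a kernel that takes ~2.0 ms.
times_ms = [2.0 + 0.05 * ((i * 7) % 5 - 2) for i in range(30)]
print(summarize(times_ms))
```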
v = torch.randn((Z, H, N_CTX, D_HEAD), device='xpu', dtype=dtype)
sm_scale = 0.125
quantiles = [0.5, 0.0, 1.0]
warmup, rep = 10, 600
From 10 to 600 times? Way too many repetitions. It is going to slow down the time it takes to run the benchmarks too much.
> From 10 to 600 times? Way too many repetitions. It is going to slow down the time it takes to run the benchmarks too much.
This value is measured in milliseconds and is needed for some test combinations where one run takes more than 100 ms.
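To spell out the arithmetic (numbers are illustrative): with the time-based interpretation, rep = 600 means 600 ms, and a configuration whose single run takes ~100 ms ends up with only a handful of timed iterations.

```python
# Illustrative only: how a millisecond-based warmup/rep turns into iteration
# counts in an upstream-style do_bench (see the hunk quoted further down).
estimate_ms = 100        # one run of a heavy FA configuration, ~100 ms
warmup, rep = 10, 600    # interpreted as milliseconds

n_warmup = max(1, int(warmup / estimate_ms))  # -> 1 warmup iteration
n_repeat = max(1, int(rep / estimate_ms))     # -> 6 timed iterations
print(n_warmup, n_repeat)
```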
If we revert #2142 so that rep is the number of iterations, then the problem of NaNs in CV is gone?
@whitneywhtsang Most likely yes. However, I made a change to keep the parameters compatible with what upstream Triton's do_bench expects.
I also see the benefit of being more similar to upstream Triton, but rep meaning the number of iterations is more intuitive IMO.
I agree. To me "rep" should really mean the number of repetitions of the kernel after warmup. I think we can diverge from upstream here and then try to upstream our changes.
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
# compute number of warmup and repeat
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))
There is no point in calculating the number of iterations from the expected time of one iteration, since the required number of iterations is requested by the user directly.
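A minimal sketch of the alternative being argued for here (assumed variable values, not the exact patch): the user-supplied values are treated as iteration counts and used as-is, instead of being derived from the estimated time of one iteration.

```python
# Hypothetical count-based variant of the hunk above: the user-supplied
# values are already iteration counts, so no division by estimate_ms is needed.
warmup, rep = 10, 30   # requested iteration counts (illustrative)
n_warmup = max(1, int(warmup))
n_repeat = max(1, int(rep))
print(n_warmup, n_repeat)
```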
# compute warmup and repeat times
warmup_time = n_warmup * estimate_ms
rep_time = n_repeat * estimate_ms
I translate the parameters into those that upstream (triton_do_bench) understands.
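A minimal sketch of that translation (assumed wrapper name, and estimate_ms is taken as a parameter here although the real code measures it internally): iteration counts requested by the caller are converted back into the millisecond-based warmup/rep that upstream triton.testing.do_bench expects.

```python
from triton.testing import do_bench as triton_do_bench

def do_bench_by_count(fn, n_warmup, n_repeat, estimate_ms, **kwargs):
    """Translate iteration counts into the time-based parameters of upstream do_bench."""
    warmup_time = n_warmup * estimate_ms  # milliseconds spent warming up
    rep_time = n_repeat * estimate_ms     # milliseconds spent on timed runs
    return triton_do_bench(fn, warmup=warmup_time, rep=rep_time, **kwargs)
```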
Changed the title: "Use iteration count instead of time for parameters warmup and rep for FA benchmark" → "Use iteration count instead of time for parameters warmup and rep of do_bench* functions for benchmarks"
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@etiotto @whitneywhtsang ready for review
Please update the PR description.
Closes #2255
Partially reusing the changes that were removed in #2142 (namely, the part related to using iteration count instead of time) solves the problem of not having enough data for the "CV" column.
Old comments
CI status:
UPD: For some reason this greatly affects the mean time. However, if I reduce warmup, the mean does not deteriorate as much.