
Conversation

@ianayl (Contributor) commented Oct 3, 2025

Benchmark script changes in this PR are NFC.

Bit of a long overdue change: wall time has proven not to be a good metric for stable performance numbers. This PR changes the SYCL benchmarking CI regression filters to fail only on severe regressions in tests that measure CPU instruction count.

This PR does not affect the behavior of benchmarking scripts. It only affects the behavior of the CI w.r.t. regression failures for SYCL: it does not concern regressions detected in UR or L0.

@ianayl ianayl requested review from a team as code owners October 3, 2025 21:07
@ianayl ianayl changed the title [CI] Fail benchmark CI only onregressions in CPU instruction count [CI] Fail benchmark CI only on regressions in CPU instruction count Oct 3, 2025
@sarnex (Contributor) left a comment

no flags, leaving to the benchmark team to do an in-depth review

@ianayl (Contributor, Author) commented Oct 3, 2025

Hey @uditagarwal97, just a friendly ping for awareness

@uditagarwal97 (Contributor) commented:

> Hey @uditagarwal97, just a friendly ping for awareness

(1) I assume CPU instruction count to have less noise, is that the case? Do you have any standard deviation numbers?
(2) What's the current regression threshold above which the benchmarking workflow will fail in CI? Shouldn't that threshold also change with this PR?

@ianayl (Contributor, Author) commented Oct 3, 2025

> (1) I assume CPU instruction count to have less noise, is that the case? Do you have any standard deviation numbers?

Unfortunately I don't have standard deviation numbers or hard statistics, but most noise-induced regressions that I've seen in CI (since enabling CPU instruction count) have been in wall-time results anyway.

But if you look at benchmark results from "CPU count" runs, they are remarkably more stable than our timed runs, even on visual inspection.
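To put a number on that stability, one quick check is to compare the coefficient of variation (standard deviation as a fraction of the mean) of repeated runs for each metric. A minimal sketch, with made-up sample data purely for illustration (not actual CI results):

```python
import statistics

def relative_stddev(samples):
    """Coefficient of variation: sample stddev as a fraction of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical repeated runs of one benchmark: wall times in ms,
# and CPU instruction counts from the same workload.
wall_times = [10.2, 11.8, 9.7, 12.4, 10.9]
instr_counts = [1_000_120, 1_000_098, 1_000_131, 1_000_104, 1_000_117]

print(f"wall time noise:   {relative_stddev(wall_times):.3%}")
print(f"instr count noise: {relative_stddev(instr_counts):.3%}")
```

With data of this shape, the instruction-count noise comes out orders of magnitude below the wall-time noise, which is the intuition behind filtering on instruction count only.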

> (2) What's the current regression threshold above which the benchmarking workflow will fail in CI? Shouldn't that threshold also change with this PR?

The common "definition" of a regression used around Intel is 5%; I don't think our wall-time results are anywhere near stable enough for that threshold, but I think 5% is still a reasonable regression threshold for instruction counts.
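The filter this PR describes can be sketched as follows. The field names and result schema below are illustrative only, not the actual benchmark-script API; the real change lives in the SYCL benchmarking CI scripts, with the threshold configurable in options.py.

```python
# Hypothetical sketch of the CI filter: only results whose metric is a CPU
# instruction count may fail the job, and only past the regression threshold.
REGRESSION_THRESHOLD = 0.05  # the 5% "definition" of a regression

def should_fail(result):
    """Return True only for a severe CPU-instruction-count regression."""
    if result["unit"] != "instructions":
        return False  # wall-time (and other) regressions no longer fail CI
    delta = (result["value"] - result["baseline"]) / result["baseline"]
    return delta > REGRESSION_THRESHOLD

results = [
    {"name": "submit_instr", "unit": "instructions",
     "value": 1_060_000, "baseline": 1_000_000},   # +6% -> fails
    {"name": "submit_time", "unit": "ms",
     "value": 15.0, "baseline": 10.0},             # +50%, but ignored
]
failures = [r["name"] for r in results if should_fail(r)]
print(failures)  # only the instruction-count regression is reported
```

Lowering the threshold to 3%, as discussed below, would then be a one-line change to `REGRESSION_THRESHOLD`.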

@PatKamin (Contributor) commented Oct 6, 2025

> The common "definition" of a regression used around Intel is 5%; I don't think our wall-time results are anywhere near that metric, but I think 5% is still a reasonable regression threshold.

Given that CPU count results are very stable, I feel this threshold could be lowered, perhaps to 3%. But such a threshold change ultimately should be applied only after analyzing historical data. @ianayl, do you think you could verify the lowered threshold with available data?

@ianayl (Contributor, Author) commented Oct 6, 2025

> @ianayl, do you think you could verify the lowered threshold with available data?

I think I could, but I'm not sure if I have enough time to get around to this. Either way, we can probably have this conversation after the PR merges: lowering the threshold is a simple tweak in options.py.

@PatKamin (Contributor) commented Oct 6, 2025

> @ianayl, do you think you could verify the lowered threshold with available data?

> I think I could, but I'm not sure if I have enough time to get around to this. Either way, we can probably have this conversation after the PR merges: lowering the threshold is a simple tweak in options.py.

OK, that might be another PR. Even without changing the threshold now, we'll already see gains from this PR by reducing the amount of noise in redundancy reports. LGTM.

4 participants