
Conversation

@ianayl (Contributor) commented Oct 3, 2025

Benchmark script changes in this PR are NFC.

Bit of a long overdue change: wall time has proven not to be a good metric for stable performance numbers. This PR changes the SYCL benchmarking CI regression filters to fail only on severe regressions in tests that measure CPU instruction count.

This PR does not affect the behavior of benchmarking scripts. It only affects the behavior of the CI w.r.t. regression failures for SYCL: it does not concern regressions detected in UR or L0.

@ianayl ianayl requested review from a team as code owners October 3, 2025 21:07
@ianayl ianayl changed the title [CI] Fail benchmark CI only onregressions in CPU instruction count [CI] Fail benchmark CI only on regressions in CPU instruction count Oct 3, 2025
@sarnex (Contributor) left a comment

no flags, leaving to the benchmark team to do an in-depth review

@ianayl (Contributor, Author) commented Oct 3, 2025

Hey @uditagarwal97, just a friendly ping for awareness

@uditagarwal97 (Contributor) commented:

> Hey @uditagarwal97, just a friendly ping for awareness

(1) I assume CPU instruction count to have less noise, is that the case? Do you have any standard deviation numbers?
(2) What's the current regression threshold above which the benchmarking workflow will fail in CI? Shouldn't that threshold also change with this PR?

@ianayl (Contributor, Author) commented Oct 3, 2025

> (1) I assume CPU instruction count to have less noise, is that the case? Do you have any standard deviation numbers?

Unfortunately I don't have standard deviation numbers or hard statistics, but most noise-induced regressions that I've seen in CI (since enabling CPU instruction count) have been in wall-time results anyway.

But if you look at benchmark results from "CPU count" runs, they are remarkably more stable than our timed runs, even on visual inspection.
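To put a number on that stability, one quick check is to compare the coefficient of variation (standard deviation as a fraction of the mean) of repeated runs for each metric. A minimal sketch, with made-up sample data purely for illustration (not actual CI results):

```python
import statistics

def relative_stddev(samples):
    """Coefficient of variation: sample stddev as a fraction of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical repeated runs of one benchmark: wall times in ms,
# and CPU instruction counts from the same workload.
wall_times = [10.2, 11.8, 9.7, 12.4, 10.9]
instr_counts = [1_000_120, 1_000_098, 1_000_131, 1_000_104, 1_000_117]

print(f"wall time noise:   {relative_stddev(wall_times):.3%}")
print(f"instr count noise: {relative_stddev(instr_counts):.3%}")
```

With data of this shape, the instruction-count noise comes out orders of magnitude below the wall-time noise, which is the intuition behind filtering on instruction count only.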

> (2) What's the current regression threshold above which the benchmarking workflow will fail in CI? Shouldn't that threshold also change with this PR?

The common "definition" of a regression used around Intel is 5%; I don't think our wall-time results are anywhere near stable enough for that threshold, but I think 5% is still a reasonable regression threshold for instruction counts.
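The filter this PR describes can be sketched as follows. The field names and result schema below are illustrative only, not the actual benchmark-script API; the real change lives in the SYCL benchmarking CI scripts, with the threshold configurable in options.py.

```python
# Hypothetical sketch of the CI filter: only results whose metric is a CPU
# instruction count may fail the job, and only past the regression threshold.
REGRESSION_THRESHOLD = 0.05  # the 5% "definition" of a regression

def should_fail(result):
    """Return True only for a severe CPU-instruction-count regression."""
    if result["unit"] != "instructions":
        return False  # wall-time (and other) regressions no longer fail CI
    delta = (result["value"] - result["baseline"]) / result["baseline"]
    return delta > REGRESSION_THRESHOLD

results = [
    {"name": "submit_instr", "unit": "instructions",
     "value": 1_060_000, "baseline": 1_000_000},   # +6% -> fails
    {"name": "submit_time", "unit": "ms",
     "value": 15.0, "baseline": 10.0},             # +50%, but ignored
]
failures = [r["name"] for r in results if should_fail(r)]
print(failures)  # only the instruction-count regression is reported
```

Lowering the threshold to 3%, as discussed below, would then be a one-line change to `REGRESSION_THRESHOLD`.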

@PatKamin (Contributor) commented Oct 6, 2025

> The common "definition" of a regression used around Intel is 5%; I don't think our wall-time results are anywhere near that metric, but I think 5% is still a reasonable regression threshold.

Given that CPU count results are very stable, I feel this threshold could be lowered, perhaps to 3%. But such a threshold change ultimately should be applied only after analyzing historical data. @ianayl, do you think you could verify the lowered threshold with available data?

@ianayl (Contributor, Author) commented Oct 6, 2025

> @ianayl, do you think you could verify the lowered threshold with available data?

I think I could, but I'm not sure if I have enough time to get around to this. Either way, we can probably have this conversation after the PR merges: lowering the threshold is a simple tweak in options.py.

@PatKamin (Contributor) commented Oct 6, 2025

> @ianayl, do you think you could verify the lowered threshold with available data?

> I think I could, but I'm not sure if I have enough time to get around to this. Either way, we can probably have this conversation after the PR merges: lowering the threshold is a simple tweak in options.py.

OK, that might be another PR. Even without changing the threshold now, we'll already see gains from this PR by reducing the amount of noise in redundancy reports. LGTM.

4 participants