[CI] Fail benchmark CI only on regressions in CPU instruction count #20277
base: sycl
No flags; leaving the in-depth review to the benchmark team.
Hey @uditagarwal97, just a friendly ping for awareness.
(1) I assume CPU instruction count has less noise. Is that the case? Do you have any standard deviation numbers?
Unfortunately, I don't have standard deviation numbers or hard statistics, but most of the noise-induced regressions I've seen in CI (since enabling CPU instruction count) have been in wall-time results anyway. If you look at benchmark results from "CPU count" runs, they are visibly more stable than our timed runs.
The common "definition" of a regression used around Intel is 5%; I don't think our wall-time results are anywhere near that metric, but I think 5% is still a reasonable regression threshold.
Given that CPU count results are very stable, I feel this threshold could be lowered, perhaps to 3%. But such a threshold change should ultimately be applied only after analyzing historical data. @ianayl, do you think you could verify the lowered threshold with available data?
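A rough way to check a candidate threshold against historical data is sketched below. It is only an illustration, assuming the history for one benchmark is a list of measurements from runs with no real performance change; the function names are hypothetical and are not taken from the benchmark scripts.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Relative spread of a series: sample standard deviation over mean."""
    return stdev(samples) / mean(samples)


def false_positive_rate(samples, threshold):
    """Fraction of consecutive run pairs that would be flagged as a regression.

    Assumes lower is better and that the underlying performance did not
    change, so every flagged pair is a false positive caused by noise.
    """
    pairs = list(zip(samples, samples[1:]))
    flagged = sum(1 for prev, cur in pairs if (cur - prev) / prev > threshold)
    return flagged / len(pairs)
```

If `false_positive_rate(history, 0.03)` stays near zero across the CPU instruction count benchmarks, a 3% threshold is plausibly safe; comparing `coefficient_of_variation` between the CPU count and wall-time series would also give the standard deviation numbers asked about earlier.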
I think I could, but I'm not sure if I have enough time to get around to this. Either way, we can probably have this conversation after the PR merges: lowering the threshold is a simple tweak in options.py.
OK, that might be another PR. Even without changing the threshold now, this PR already gives us gains by reducing the amount of noise in redundancy reports. LGTM.
Benchmark script changes in this PR are NFC.
Bit of a long-overdue change: wall time has proven not to be a good metric for stable performance numbers. This PR changes the SYCL benchmarking CI regression filters to fail only on severe regressions in tests that measure CPU instruction count.
This PR does not affect the behavior of benchmarking scripts. It only affects the behavior of the CI w.r.t. regression failures for SYCL: it does not concern regressions detected in UR or L0.
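For illustration only, the filtering behavior described above could look roughly like the sketch below. The `Result` type, the `"cpu_instructions"` and `"wall_time"` metric labels, and the function names are assumptions made for this example and do not reflect the actual CI code; the 5% default comes from the review discussion.

```python
from dataclasses import dataclass

# 5% regression threshold, as mentioned in the review discussion.
REGRESSION_THRESHOLD = 0.05


@dataclass
class Result:
    """One benchmark comparison (hypothetical structure for this sketch)."""
    name: str
    metric: str      # assumed labels: "cpu_instructions" or "wall_time"
    baseline: float  # reference value; lower is better
    current: float   # value measured in this CI run


def relative_change(result):
    """Positive values mean the current run is worse than the baseline."""
    return (result.current - result.baseline) / result.baseline


def failing_regressions(results, threshold=REGRESSION_THRESHOLD):
    """Return only the regressions that should fail CI.

    Wall-time results are ignored for pass/fail purposes; only CPU
    instruction count results above the threshold are returned.
    """
    return [
        r for r in results
        if r.metric == "cpu_instructions" and relative_change(r) > threshold
    ]
```

With this shape, a noisy wall-time result can still be reported, but CI fails only when `failing_regressions(...)` is non-empty.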