Update how we calculate correctness score and performance score #107
Conversation
I'm a bit ambivalent about the performance score, but here's my take. Generally, when you are looking at these numbers, you want to incentivize people to make them go down. I would either make FAIL_FACTOR bigger (possibly 2.0 or larger) to emphasize correctness, or just make them separate by having the performance score not include failed tests.
As an overall score I would recommend using fast_p from kernelbench https://arxiv.org/html/2502.10517v1 with aten as the reference. The dict that gets produced in #92 should make this much easier.
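For reference, a minimal sketch of how such a fast_p-style metric can be computed, assuming per-problem (correctness, speedup-vs-reference) pairs; the function name, the input shape, and the strict > comparison are illustrative assumptions, not KernelBench's actual code:

# Sketch only: fraction of problems whose kernel is correct AND more than
# p times faster than the reference (e.g. the aten baseline).
def fast_p(results, p=1.0):
    # results: iterable of (is_correct: bool, speedup_vs_reference: float)
    results = list(results)
    if not results:
        return 0.0
    return sum(ok and speedup > p for ok, speedup in results) / len(results)

# Toy usage with invented numbers:
print(fast_p([(True, 1.4), (True, 0.7), (False, 2.0)], p=1.0))  # -> 0.333...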
I am a bit confused about the correctness changes.
Also, if you want to get more granular about scoring, I think PR #92 may help once it gets merged.
For fastp, p=0.8 by default.
lmk when you're ready for another review
@msaroufim Thanks! It's ready now.
New commits:
BackendBench/scripts/main.py (outdated)
"--p", | ||
default=1.0, | ||
type=float, | ||
help="Performance score threshold for perf@p score calculation", |
add a comment on whether increasing this number is more or less stringent; it's something that regularly trips people up
Added.
@click.option(
"--p",
default=1.0,
type=float,
help=(
"Performance score threshold for perf@p score calculation"
"Note: Increasing this value makes the threshold more stringent, "
"requiring a higher speedup to meet the performance criteria."
)
)
    overall_correctness.tolist(), overall_performance.tolist(), p
)

assert torch.allclose(
Just add a comment about why the perf@p score is calculated subtly differently, but the test here works. Otherwise, it looks like we can just swap out the two scores.
(something like "after averages are calculated it's the same" suffices tbh)
Added.
# Note: The perf@p score calculation here differs subtly from the original fastp score in
# kernel bench. The original fastp score filters correct samples first, then averages.
# Here, perf@p averages first, then filters correct samples. Despite this difference,
# both methods produce equivalent results after averaging, so the test remains valid.
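As a toy illustration of that note (numbers invented, not from the benchmark): once each op has been reduced to a single correctness flag and a single averaged speedup, applying the correctness filter before or after the speedup threshold yields the same score, since the final number is just the fraction of ops satisfying both conditions.

# Hypothetical per-op aggregates: "ok" = all tests for the op passed,
# "s" = the op's averaged speedup over the baseline.
ops = [(True, 1.3), (True, 0.9), (False, 2.0), (True, 1.1)]
p = 1.0

# Filter on correctness first, then apply the speedup threshold ...
filter_first = sum(s > p for ok, s in ops if ok) / len(ops)
# ... or apply the threshold first, then filter on correctness:
threshold_first = sum(ok for ok, s in ops if s > p) / len(ops)

assert filter_first == threshold_first == 0.5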
@@ -243,7 +257,11 @@ def cli(
    results = evaluator.get_results()

    for result in results:
-       correctness_score = result.correctness_score
+       correctness_score = all(
Can you also add a perf@p on the operator level as well in test_data? (I'd just do this in eval.py)
I'm not sure how perf@p works on the operator level. Currently, perf@p is defined as "the ratio of ops that are both correct and have a speedup greater than p." Applying this metric at the operator level may require adjusting the definition.
lgtm! A few minor comments, but thanks for working with us to get the scoring right :)
Thanks! I'm not entirely sure how perf@p applies at the operator level, but I'm happy to discuss it further and submit a follow-up PR to address that.
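If an operator-level variant ever becomes useful, one possible adaptation (purely hypothetical, not what this PR implements; the key names are assumptions) would be to score each op by the fraction of its own tests that both pass and beat the threshold:

def op_level_perf_at_p(test_results, p=1.0):
    # test_results: iterable of per-test dicts with assumed keys
    # "correct" (bool) and "speedup" (float, relative to the baseline).
    results = list(test_results)
    if not results:
        return 0.0
    hits = sum(1 for t in results if t["correct"] and t["speedup"] > p)
    return hits / len(results)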
@@ -209,7 +220,14 @@ def cli(
        test.correctness_tests,
        test.performance_tests,
    )
-   overall_correctness.append(correctness)
+   overall_correctness.append(
You're calculating the correctness score here as an aggregate of all the tests per op. Therefore, we are calculating perf@p at the level of per-op aggregates rather than on individual tests, as KernelBench does.
I believe KernelBench's fastp is at the same level as ours, because one task in KernelBench corresponds to one operation in BackendBench.
For each task in KernelBench, they verify the correctness of the generated kernel by comparing it against the reference PyTorch operators multiple times with randomized inputs, and then measure speedup over multiple runs. The final fastp metric is calculated at the kernel level rather than for individual runs. This matches BackendBench, where we verify the correctness of an operation by running a series of correctness tests and compute speedup by averaging performance results across multiple tests.
Following the discussion in #97, I have updated how we calculate the correctness score and the performance score.
Specifically, eval_correctness now returns three values: return correct == total, correct, total. This indicates whether the operation passes all tests, along with the number of correct tests and the total number of tests. For eval_performance, I introduced a FAIL_FACTOR that increases the test time as a penalty when a test fails; currently, this factor is set to 1.1.
The above changes have been reverted. Below are the new commits:
Edit:
For the correctness score calculation, I now use op_test_data, introduced in #92, and judge whether an op is correct with all(data["correctness_score"] for data in op_test_data.values()). The correctness score is computed as the ratio of correct ops.
For the performance score calculation, the original performance score remains unchanged. I have introduced a new metric, perf_at_p (written perf@p), similar to fastp from KernelBench. The perf_at_p score is the ratio of ops that are both correct and have a speedup greater than p. For example, if p == 0, the perf_at_p score is the same as the correctness score; if p == 1, it reflects the ratio of correct ops that are also faster than the baseline.
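For concreteness, here is a minimal sketch of the two scores as described above. The outer {op: {test_id: {...}}} structure, the "speedup" key, and the arithmetic-mean aggregation are assumptions for illustration; the actual logic lives in main.py/eval.py and may differ in detail.

def compute_scores(per_op_test_data, p=1.0):
    # per_op_test_data: assumed mapping op -> {test_id: {"correctness_score": bool,
    # "speedup": float}}, loosely modeled on the op_test_data dict from #92.
    if not per_op_test_data:
        return 0.0, 0.0
    per_op_correct = []
    per_op_speedup = []
    for op_test_data in per_op_test_data.values():
        # An op counts as correct only if all of its correctness tests pass.
        per_op_correct.append(
            all(data["correctness_score"] for data in op_test_data.values())
        )
        # One speedup number per op (shown here as an arithmetic mean).
        per_op_speedup.append(
            sum(data["speedup"] for data in op_test_data.values()) / len(op_test_data)
        )
    n = len(per_op_correct)
    correctness = sum(per_op_correct) / n
    # perf@p: ratio of ops that are both correct and faster than p.
    perf_at_p = sum(ok and s > p for ok, s in zip(per_op_correct, per_op_speedup)) / n
    return correctness, perf_at_p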
Questions:
@Laurawly Hi Laura! Does the perf_at_p score make sense to you? Do you have any suggestions for other performance metrics that would be useful here? Thanks!