Update how we calculate correctness score and performance score #107
Conversation
I'm a bit ambivalent about the performance score, but here's my take. Generally, when you are looking at these numbers, you want to incentivize people to make them go down. I would either make FAIL_FACTOR bigger (possibly 2.0 or larger) to emphasize correctness, or just make them separate by having the performance score not include failed tests.
As an overall score I would recommend using fast_p from kernelbench https://arxiv.org/html/2502.10517v1 with aten as the reference. The dict that gets produced in #92 should make this much easier.
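For reference, a minimal sketch of how such a fast_p-style metric can be computed, assuming per-problem (correctness, speedup-vs-reference) pairs; the function name, the input shape, and the strict > comparison are illustrative assumptions, not KernelBench's actual code:

# Sketch only: fraction of problems whose kernel is correct AND more than
# p times faster than the reference (e.g. the aten baseline).
def fast_p(results, p=1.0):
    # results: iterable of (is_correct: bool, speedup_vs_reference: float)
    results = list(results)
    if not results:
        return 0.0
    return sum(ok and speedup > p for ok, speedup in results) / len(results)

# Toy usage with invented numbers:
print(fast_p([(True, 1.4), (True, 0.7), (False, 2.0)], p=1.0))  # -> 0.333...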
I am a bit confused about the correctness changes.
Also, if you want to get more granular about scoring, I think PR #92 may help once it gets merged.
For fastp, p=0.8 by default.
lmk when you're ready for another review
@msaroufim Thanks! It's ready now.
New commits:
BackendBench/scripts/main.py (outdated)
"--p", | ||
default=1.0, | ||
type=float, | ||
help="Performance score threshold for perf@p score calculation", |
add a comment on whether increasing this number is more or less stringent; it's something that regularly trips people up
Added.
@click.option(
"--p",
default=1.0,
type=float,
help=(
"Performance score threshold for perf@p score calculation"
"Note: Increasing this value makes the threshold more stringent, "
"requiring a higher speedup to meet the performance criteria."
)
)
    overall_correctness.tolist(), overall_performance.tolist(), p
)

assert torch.allclose(
Just add a comment about why the perf@p score is calculated subtly differently, but the test here works. Otherwise, it looks like we can just swap out the two scores.
(something like "after averages are calculated it's the same" suffices tbh)
Added.
# Note: The perf@p score calculation here differs subtly from the original fastp score in
# kernel bench. The original fastp score filters correct samples first, then averages.
# Here, perf@p averages first, then filters correct samples. Despite this difference,
# both methods produce equivalent results after averaging, so the test remains valid.
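As a toy illustration of that note (numbers invented, not from the benchmark): once each op has been reduced to a single correctness flag and a single averaged speedup, applying the correctness filter before or after the speedup threshold yields the same score, since the final number is just the fraction of ops satisfying both conditions.

# Hypothetical per-op aggregates: "ok" = all tests for the op passed,
# "s" = the op's averaged speedup over the baseline.
ops = [(True, 1.3), (True, 0.9), (False, 2.0), (True, 1.1)]
p = 1.0

# Filter on correctness first, then apply the speedup threshold ...
filter_first = sum(s > p for ok, s in ops if ok) / len(ops)
# ... or apply the threshold first, then filter on correctness:
threshold_first = sum(ok for ok, s in ops if s > p) / len(ops)

assert filter_first == threshold_first == 0.5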
@@ -243,7 +257,11 @@ def cli(
    results = evaluator.get_results()

    for result in results:
-       correctness_score = result.correctness_score
+       correctness_score = all(
Can you also add a perf@p on the operator level as well in test_data? (I'd just do this in eval.py)
I'm not sure how perf@p works on the operator level. Currently, perf@p is defined as "the ratio of ops that are both correct and have a speedup greater than p." Applying this metric at the operator level may require adjusting the definition.
lgtm! A few minor comments, but thanks for working with us to get the scoring right :)
Thanks! I'm not entirely sure how perf@p applies at the operator level, but I'm happy to discuss it further and submit a follow-up PR to address that.
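If an operator-level variant ever becomes useful, one possible adaptation (purely hypothetical, not what this PR implements; the key names are assumptions) would be to score each op by the fraction of its own tests that both pass and beat the threshold:

def op_level_perf_at_p(test_results, p=1.0):
    # test_results: iterable of per-test dicts with assumed keys
    # "correct" (bool) and "speedup" (float, relative to the baseline).
    results = list(test_results)
    if not results:
        return 0.0
    hits = sum(1 for t in results if t["correct"] and t["speedup"] > p)
    return hits / len(results)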
@@ -209,7 +220,14 @@ def cli(
        test.correctness_tests,
        test.performance_tests,
    )
-   overall_correctness.append(correctness)
+   overall_correctness.append(
You're calculating the correctness score here as an aggregate of all the tests per op. Therefore, we are calculating perf@p at the level of per-op aggregates rather than on individual tests, as KernelBench does.
I believe KernelBench's fastp is at the same level as ours, because one task in KernelBench corresponds to one operation in BackendBench.
For each task in KernelBench, they verify the correctness of the generated kernel by comparing it against the reference PyTorch operators multiple times with randomized inputs, and then measure speedup over multiple runs. The final fastp metric is calculated at the kernel level rather than for individual runs. This matches BackendBench, where we verify the correctness of an operation by running a series of correctness tests and compute speedup by averaging performance results across multiple tests.
Following the discussion in #97, I have updated how we calculate the correctness score and the performance score.
Specifically, eval_correctness now returns three values: return correct == total, correct, total. This indicates whether the operation passes all tests, along with the number of correct tests and the total number of tests. For eval_performance, I introduced a FAIL_FACTOR that increases the test time as a penalty when a test fails; currently, this factor is set to 1.1.
The above changes have been reverted. Below are the new commits:
Edit:
For the correctness score calculation, I now use op_test_data, introduced in #92, and judge whether an op is correct with all(data["correctness_score"] for data in op_test_data.values()). The correctness score is computed as the ratio of correct ops.
For the performance score calculation, the original performance score remains unchanged. I have introduced a new metric, perf_at_p (written perf@p), similar to fastp from KernelBench. The perf_at_p score is the ratio of ops that are both correct and have a speedup greater than p. For example, if p == 0, the perf_at_p score is the same as the correctness score; if p == 1, it reflects the ratio of correct ops that are also faster than the baseline.
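For concreteness, here is a minimal sketch of the two scores as described above. The outer {op: {test_id: {...}}} structure, the "speedup" key, and the arithmetic-mean aggregation are assumptions for illustration; the actual logic lives in main.py/eval.py and may differ in detail.

def compute_scores(per_op_test_data, p=1.0):
    # per_op_test_data: assumed mapping op -> {test_id: {"correctness_score": bool,
    # "speedup": float}}, loosely modeled on the op_test_data dict from #92.
    if not per_op_test_data:
        return 0.0, 0.0
    per_op_correct = []
    per_op_speedup = []
    for op_test_data in per_op_test_data.values():
        # An op counts as correct only if all of its correctness tests pass.
        per_op_correct.append(
            all(data["correctness_score"] for data in op_test_data.values())
        )
        # One speedup number per op (shown here as an arithmetic mean).
        per_op_speedup.append(
            sum(data["speedup"] for data in op_test_data.values()) / len(op_test_data)
        )
    n = len(per_op_correct)
    correctness = sum(per_op_correct) / n
    # perf@p: ratio of ops that are both correct and faster than p.
    perf_at_p = sum(ok and s > p for ok, s in zip(per_op_correct, per_op_speedup)) / n
    return correctness, perf_at_p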
Questions:
@Laurawly Hi Laura! Does the perf_at_p score make sense to you? Do you have any suggestions for other performance metrics that would be useful here? Thanks!