Conversation

@jiannanWang (Contributor) commented Aug 21, 2025

Following the discussion in #97, I have updated how we calculate the correctness score and the performance score.

Specifically, eval_correctness now returns three values: correct == total, correct, total. This indicates whether the operation passes all tests, along with the number of passing tests and the total number of tests.

For eval_performance, I introduced a FAIL_FACTOR that inflates a test's measured time as a penalty when the test fails. Currently, this factor is set to 1.1.
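
A minimal sketch of the behavior described in the two paragraphs above (these changes were later reverted, as noted below); the per-test result shapes here are assumptions for illustration, not the actual BackendBench test interface:

FAIL_FACTOR = 1.1  # penalty multiplier applied to a failing test's measured time

def eval_correctness(test_passes):
    # test_passes: list of booleans, one per correctness test.
    correct = sum(test_passes)
    total = len(test_passes)
    return correct == total, correct, total

def eval_performance(perf_results):
    # perf_results: list of (elapsed_seconds, passed) tuples, one per performance test.
    # A failing test's time is inflated by FAIL_FACTOR as a penalty.
    times = [t * (1.0 if passed else FAIL_FACTOR) for t, passed in perf_results]
    return sum(times) / len(times)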

The above changes have been reverted. Below are the new commits:

Edit:
For the correctness score calculation, I now use the op_test_data introduced in #92: an op is considered correct if all of its tests pass, i.e. all(data["correctness_score"] for data in op_test_data.values()). The correctness score is computed as the ratio of correct ops.

For the performance score calculation, the original performance score remains unchanged. I have introduced a new metric, perf_at_p (written as perf@p), similar to fastp from KernelBench. The perf@p score is the ratio of ops that are both correct and have a speedup greater than p. For example, if p == 0, the perf@p score is the same as the correctness score; if p == 1, it reflects the ratio of ops that are correct and also faster than the baseline.
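
To make the two definitions above concrete, here is a rough sketch (the shape of the per-op results dict and its field names are assumptions for illustration, not the exact structures used in this PR):

def correctness_score(op_results):
    # op_results: {op_name: {"correct": bool, "speedup": float}}
    # An op counts as correct only if all of its tests passed.
    return sum(r["correct"] for r in op_results.values()) / len(op_results)

def perf_at_p(op_results, p=1.0):
    # Ratio of ops that are both correct and more than p times faster than the baseline.
    hits = sum(r["correct"] and r["speedup"] > p for r in op_results.values())
    return hits / len(op_results)

With p == 0 every correct op clears the speedup bar (assuming positive speedups), so perf@p reduces to the correctness score; with p == 1 only correct ops that actually beat the baseline count.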

Questions:
@Laurawly Hi Laura! Does the perf_at_p score make sense to you? Do you have any suggestions for other performance metrics that would be useful here? Thanks!

The meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Aug 21, 2025.
@PaliC (Contributor) left a comment

I'm a bit ambivalent about the performance score, but here's my take. Generally, when you are looking at these numbers, you want to incentivize people to make them go down. I would either make FAIL_FACTOR bigger (possibly 2.0 or larger) to emphasize correctness, or just make them separate by having the performance score not include failed tests.

As an overall score, I would recommend using fast_p from KernelBench (https://arxiv.org/html/2502.10517v1) with aten as the reference. The dict that gets produced in #92 should make this much easier.

I am a bit confused about the correctness changes.

Also, if you want to get more granular about scoring, I think PR #92 may help once it gets merged.

@jiannanWang (Contributor, Author) commented Aug 22, 2025

For fastp, p=0.8 by default

Running uv run python BackendBench/scripts/main.py --suite opinfo --backend aten:

correctness score (mean pass rate over all operators): 0.99
performance score (geomean speedup over all operators): nan
fastp score (rate of correct samples with a speedup greater than p): 0.00

Running uv run python BackendBench/scripts/main.py --suite torchbench --backend aten --ops "add,sub,mul,div,relu":

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): 1.00
fastp score (rate of correct samples with a speedup greater than p): 1.00

Edit:
The default p is now set to 1.0.
The score has been renamed to perf_at_p (written as perf@p).

@jiannanWang requested a review from PaliC on August 22, 2025 at 23:01.
@msaroufim (Member)

lmk when you're ready for another review

@jiannanWang (Contributor, Author)

@msaroufim Thanks! It's ready now.

@jiannanWang (Contributor, Author)

New commits:

  1. Move the perf_at_p score calculation to eval.py and its test to test_eval.py.
  2. Fix the multiprocessing branch by replacing op_test_data with result.test_data.
    Run uv run python BackendBench/scripts/main.py --backend aten --suite opinfo --num-workers 8
    Results:
Logs:
[2025-08-25 15:28:41][INFO][multiprocessing_eval.py] Collected 356 results out of 356 tasks
[2025-08-25 15:28:41][INFO][multiprocessing_eval.py] Shutting down multiprocessing evaluator...
[2025-08-25 15:28:43][INFO][multiprocessing_eval.py] Multiprocessing evaluator shutdown complete
correctness score (mean pass rate over all operators): 0.99
performance score (geomean speedup over all operators): nan
perf@p score (rate of correct samples with a speedup greater than p, p=1.0): 0.00

@jiannanWang requested review from msaroufim and PaliC on August 25, 2025 at 22:35.
"--p",
default=1.0,
type=float,
help="Performance score threshold for perf@p score calculation",
Member

Add a comment on whether increasing this number is more or less stringent; it's something that regularly trips people up.

Contributor Author

Added.

@click.option(
    "--p",
    default=1.0,
    type=float,
    help=(
        "Performance score threshold for perf@p score calculation"
        "Note: Increasing this value makes the threshold more stringent, "
        "requiring a higher speedup to meet the performance criteria."
    )
)
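
As a usage note (the flag is the one quoted above; exact defaults may differ after merge), raising --p makes the metric stricter. For example, running

uv run python BackendBench/scripts/main.py --suite torchbench --backend aten --ops "add,sub,mul,div,relu" --p 1.2

counts only ops that are both correct and more than 1.2x faster than the baseline toward perf@p, so the score can only stay the same or drop as --p increases.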

overall_correctness.tolist(), overall_performance.tolist(), p
)

assert torch.allclose(
Contributor

Just add a comment about why the perf@p score is calculated subtly differently, but the test here works. Otherwise, it looks like we can just swap out the two scores.

(Something like "after the averages are calculated, it's the same" would suffice, tbh.)

Contributor Author

Added.

                # Note: The perf@p score calculation here differs subtly from the original fastp score in
                # kernel bench. The original fastp score filters correct samples first, then averages.
                # Here, perf@p averages first, then filters correct samples. Despite this difference,
                # both methods produce equivalent results after averaging, so the test remains valid.

@@ -243,7 +257,11 @@ def cli(
results = evaluator.get_results()

for result in results:
correctness_score = result.correctness_score
correctness_score = all(
Contributor

Can you also add a perf@p at the operator level in test_data? (I'd just do this in eval.py.)

Contributor Author

I'm not sure how perf@p works on the operator level. Currently, perf@p is defined as "the ratio of ops that are both correct and have a speedup greater than p." Applying this metric at the operator level may require adjusting the definition.

@PaliC (Contributor) left a comment

lgtm! A few minor comments, but thanks for working with us to get the scoring right :)

@jiannanWang (Contributor, Author)

Thanks! I'm not entirely sure how perf@p applies at the operator level, but I'm happy to discuss it further and submit a follow-up PR to address that.

@jiannanWang merged commit f2b685a into main on Aug 26, 2025 (3 checks passed).
@@ -209,7 +220,14 @@ def cli(
test.correctness_tests,
test.performance_tests,
)
overall_correctness.append(correctness)

overall_correctness.append(
Contributor

You're calculating the correctness score here as an aggregate of all the tests per op. Therefore, we are calculating perf@p at the level of per-op aggregates rather than at the level of individual tests, which is what KernelBench does.

Contributor Author

I believe KernelBench's fastp is at the same level as ours, because one task in KernelBench corresponds to one operation in BackendBench.

For each task, KernelBench verifies the correctness of the generated kernel by comparing it against the reference PyTorch operators multiple times with randomized inputs, then measures speedup over multiple runs. The final fastp metric is computed at the kernel level rather than for individual runs. This matches BackendBench, where we verify an operation's correctness by running a series of correctness tests and compute its speedup by averaging performance results across multiple tests.
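
For concreteness, a small sketch of the per-op aggregation described above (assumed data shapes, with a geomean matching the "geomean speedup" wording elsewhere in this thread; not the actual BackendBench code):

import math

def per_op_aggregates(test_results):
    # test_results: list of (passed: bool, speedup: float) pairs for one op's tests.
    # The op is correct only if every test passed; speedups are assumed positive.
    correct = all(passed for passed, _ in test_results)
    speedup = math.exp(sum(math.log(s) for _, s in test_results) / len(test_results))
    return correct, speedup

perf@p is then computed over these per-op (correct, speedup) pairs, i.e. at the same per-op level as KernelBench's per-task fastp.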
