@PaliC PaliC commented Aug 18, 2025

Most of the fix for: #55

The goal of this PR is to actually create the full log of results (as a dict/JSON). By default this is not saved. In a later PR I'm planning to change save_verbose_results to save things in a more useful format. Right now I am thinking of a summary CSV of ops/tests + correctness, with these verbose entries splayed into a directory.

The idea here is to collect the following stats for every single test we do: correctness_score, benchmark_time, speedup, correctness_errors, absolute_error, and relative_error. Both absolute_error and relative_error are calculated as means (should we do maxes instead?). We also do not calculate those stats if the outputs are sparse tensors, due to complexity / memory constraints.
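To make the shape of the verbose log concrete, here is a hypothetical sketch of a single entry. The field names come from the list above; the op name, key layout, and all values are illustrative only, not the actual output format:

```python
# Hypothetical sketch of one verbose result entry. Field names are taken
# from the PR description; the "op" key and all values are illustrative.
example_entry = {
    "op": "aten.add.Tensor",     # assumed key name, not from the PR
    "correctness_score": 1.0,
    "benchmark_time": 0.00042,   # illustrative timing value
    "speedup": 1.8,
    "correctness_errors": "",
    "absolute_error": 3.1e-5,    # mean over elements, per the PR
    "relative_error": 1.2e-4,    # mean over elements, per the PR
}
```

The linked gists show what the real entries look like for each suite/backend combination.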

An example of the output is here for flaggems + opinfo

https://gist.github.com/PaliC/4dade4f874b6f39447b368ecdbab6e7d

repro:

python BackendBench/scripts/main.py --suite opinfo --backend flag_gems --output-path logs/output_opinfo_flaggem.json

And to show performance this is torchbench + aten

https://gist.github.com/PaliC/4c166eaf0bfb50364421f46d599dd961

repro:

with-proxy python BackendBench/scripts/main.py --suite opinfo --backend flag_gems --output-path logs/output_opinfo_flaggem.json

Actual Changes

There are two significant changes in this PR: 1) changing the allclose function, and 2) adding compute_errors.

Originally the allclose function was as follows:

```python
def allclose(a, b):
    if isinstance(a, torch.Tensor):
        torch.testing.assert_close(a, b, equal_nan=True, atol=1e-2, rtol=1e-2)
        return True
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError(f"Length mismatch: {len(a)} vs {len(b)}")
        return all(allclose(x, y) for x, y in zip(a, b))
    return a == b
```

and it was always wrapped in a try/except block. In practice, the way it signaled that tensors were not close was through the failing assertion; only when the inputs were plain floats (or similar) could it actually return False. This was effectively a bug, as we never really got a true boolean out of allclose. The new logic bakes the try/except into allclose itself and has a helper function (_all_close) raise in the cases where the old allclose would return False or fail the assertion. I also got rid of the recursion, as I actually hit recursion-depth errors while testing.
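The new structure can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: a real version would call torch.testing.assert_close on tensor leaves, while this sketch only handles numbers and containers so it stays self-contained:

```python
import math

def _all_close(a, b, atol=1e-2, rtol=1e-2):
    # Iterative traversal with an explicit stack instead of recursion,
    # to avoid recursion-depth errors on deeply nested outputs.
    # A real implementation would call torch.testing.assert_close on
    # tensor leaves; this sketch only compares numbers and containers.
    stack = [(a, b)]
    while stack:
        x, y = stack.pop()
        if isinstance(x, (list, tuple)):
            if len(x) != len(y):
                raise ValueError(f"Length mismatch: {len(x)} vs {len(y)}")
            stack.extend(zip(x, y))
        elif isinstance(x, float) or isinstance(y, float):
            if not math.isclose(x, y, rel_tol=rtol, abs_tol=atol):
                raise ValueError(f"Value mismatch: {x} vs {y}")
        elif x != y:
            raise ValueError(f"Value mismatch: {x} vs {y}")

def allclose(a, b):
    # The try/except is baked in here, so callers get a real boolean
    # instead of having to catch the assertion themselves.
    try:
        _all_close(a, b)
        return True
    except Exception:
        return False
```

Because the traversal uses an explicit work stack, deeply nested outputs no longer blow the recursion limit.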

This comes with the added benefit that the logs no longer contain a bunch of tensor-mismatch assertion errors.

compute_errors is new logic that gives the user a relative and an absolute error to look at. It should be purely additive to the code and should not affect performance. I added some unit tests to convince folks things work.
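The mean-based errors described above can be sketched like this. This is a plain-Python illustration over flat sequences of numbers, not the actual implementation, which operates on tensors and skips sparse outputs; the eps guard against division by zero is an assumption of this sketch:

```python
def compute_errors(ref, res, eps=1e-12):
    # Illustrative sketch: mean absolute and mean relative error
    # between a reference output and a result, per the PR description
    # that both errors are calculated as means.
    abs_errs = [abs(r - s) for r, s in zip(ref, res)]
    rel_errs = [abs(r - s) / (abs(r) + eps) for r, s in zip(ref, res)]
    absolute_error = sum(abs_errs) / len(abs_errs)
    relative_error = sum(rel_errs) / len(rel_errs)
    return absolute_error, relative_error
```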

The return values of eval_one_op, eval_correctness, and eval_performance are also changed to expose the information verbose_dict needs.

Testing

For testing, I ran the following commands on this branch and on main and verified that the outputs/logs are the same (apart from timestamps and tensor-mismatch errors):

python BackendBench/scripts/main.py --suite torchbench --backend aten --topn 3
python BackendBench/scripts/main.py --suite opinfo --backend aten
python BackendBench/scripts/main.py --suite opinfo --backend flag_gems

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 18, 2025
@PaliC PaliC marked this pull request as draft August 18, 2025 23:27
@PaliC PaliC marked this pull request as ready for review August 21, 2025 02:31
@msaroufim msaroufim merged commit 028fb33 into main Aug 22, 2025
3 checks passed