@PaliC PaliC commented Aug 18, 2025

Most of the fix for: #55

The goal of this PR is to actually create the full log of results (as a dict/JSON). By default this is not saved. In a later PR I'm planning to change save_verbose_results to save things in a more useful format. Right now I am thinking of a summary CSV of ops/tests + correctness, with these verbose entries splayed into a directory.

The idea here is to collect the following stats for every single test we do: correctness_score, benchmark_time, speedup, correctness_errors, absolute_error, and relative_error. Both absolute_error and relative_error are calculated as means (should we do maxes instead?). We also do not calculate those stats if the outputs are sparse tensors, due to complexity / memory constraints.
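To make the shape of the verbose log concrete, here is a hypothetical sketch of a single entry. The field names come from the list above; the op name, key layout, and all values are illustrative only, not the actual output format:

```python
# Hypothetical sketch of one verbose result entry. Field names are taken
# from the PR description; the "op" key and all values are illustrative.
example_entry = {
    "op": "aten.add.Tensor",     # assumed key name, not from the PR
    "correctness_score": 1.0,
    "benchmark_time": 0.00042,   # illustrative timing value
    "speedup": 1.8,
    "correctness_errors": "",
    "absolute_error": 3.1e-5,    # mean over elements, per the PR
    "relative_error": 1.2e-4,    # mean over elements, per the PR
}
```

The linked gists show what the real entries look like for each suite/backend combination.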

An example of the output is here for flaggems + opinfo

https://gist.github.com/PaliC/4dade4f874b6f39447b368ecdbab6e7d

repro:

python BackendBench/scripts/main.py --suite opinfo --backend flag_gems --output-path logs/output_opinfo_flaggem.json

And to show performance this is torchbench + aten

https://gist.github.com/PaliC/4c166eaf0bfb50364421f46d599dd961

repro:

with-proxy python BackendBench/scripts/main.py --suite opinfo --backend flag_gems --output-path logs/output_opinfo_flaggem.json

Actual Changes

There are two significant changes in this PR: 1) changing the allclose function, and 2) adding compute_errors.

Originally the allclose function was as follows:

```python
def allclose(a, b):
    if isinstance(a, torch.Tensor):
        torch.testing.assert_close(a, b, equal_nan=True, atol=1e-2, rtol=1e-2)
        return True
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError(f"Length mismatch: {len(a)} vs {len(b)}")
        return all(allclose(x, y) for x, y in zip(a, b))
    return a == b
```

and it was always wrapped in a try/except block. In practice, the way it signaled that tensors were not close was through the failing assertion; only when the inputs were plain floats (or similar) could it actually return False. This was effectively a bug, as we never really got a true boolean out of allclose. The new logic bakes the try/except into allclose itself and has a helper function (_all_close) raise in the cases where the old allclose would return False or fail the assertion. I also got rid of the recursion, as I actually hit recursion-depth errors while testing.
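The new structure can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: a real version would call torch.testing.assert_close on tensor leaves, while this sketch only handles numbers and containers so it stays self-contained:

```python
import math

def _all_close(a, b, atol=1e-2, rtol=1e-2):
    # Iterative traversal with an explicit stack instead of recursion,
    # to avoid recursion-depth errors on deeply nested outputs.
    # A real implementation would call torch.testing.assert_close on
    # tensor leaves; this sketch only compares numbers and containers.
    stack = [(a, b)]
    while stack:
        x, y = stack.pop()
        if isinstance(x, (list, tuple)):
            if len(x) != len(y):
                raise ValueError(f"Length mismatch: {len(x)} vs {len(y)}")
            stack.extend(zip(x, y))
        elif isinstance(x, float) or isinstance(y, float):
            if not math.isclose(x, y, rel_tol=rtol, abs_tol=atol):
                raise ValueError(f"Value mismatch: {x} vs {y}")
        elif x != y:
            raise ValueError(f"Value mismatch: {x} vs {y}")

def allclose(a, b):
    # The try/except is baked in here, so callers get a real boolean
    # instead of having to catch the assertion themselves.
    try:
        _all_close(a, b)
        return True
    except Exception:
        return False
```

Because the traversal uses an explicit work stack, deeply nested outputs no longer blow the recursion limit.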

This comes with the added benefit that the logs no longer contain a bunch of tensor-mismatch assertion errors.

compute_errors is new logic that gives the user a relative and an absolute error to look at. It should be purely additive to the code and should not affect performance. I added some unit tests to convince folks things work.
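The mean-based errors described above can be sketched like this. This is a plain-Python illustration over flat sequences of numbers, not the actual implementation, which operates on tensors and skips sparse outputs; the eps guard against division by zero is an assumption of this sketch:

```python
def compute_errors(ref, res, eps=1e-12):
    # Illustrative sketch: mean absolute and mean relative error
    # between a reference output and a result, per the PR description
    # that both errors are calculated as means.
    abs_errs = [abs(r - s) for r, s in zip(ref, res)]
    rel_errs = [abs(r - s) / (abs(r) + eps) for r, s in zip(ref, res)]
    absolute_error = sum(abs_errs) / len(abs_errs)
    relative_error = sum(rel_errs) / len(rel_errs)
    return absolute_error, relative_error
```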

The return values of eval_one_op, eval_correctness, and eval_performance are also changed to expose the information verbose_dict needs.

Testing

For testing, I ran the following commands on this branch and on main and verified that the outputs/logs are the same (apart from timestamps and tensor-mismatch errors):

python BackendBench/scripts/main.py --suite torchbench --backend aten --topn 3
python BackendBench/scripts/main.py --suite opinfo --backend aten
python BackendBench/scripts/main.py --suite opinfo --backend flag_gems

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 18, 2025
@PaliC PaliC marked this pull request as draft August 18, 2025 23:27
@PaliC PaliC marked this pull request as ready for review August 21, 2025 02:31
@msaroufim msaroufim merged commit 028fb33 into main Aug 22, 2025
3 checks passed