Add verbose mode #92
Merged
msaroufim reviewed Aug 21, 2025
msaroufim approved these changes Aug 22, 2025
Most of the fix for: #55
The goal of this PR is to create the full log of results (as a dict/JSON). By default this is not saved. In a later PR I'm planning to change `save_verbose_results` to save things in a more useful format; right now I am thinking a summary CSV of ops/tests plus correctness, with the verbose entries splayed into a directory (`bench`). The idea here is to collect the following stats for every single test we run: `correctness_score`, `benchmark_time`, `speedup`, `correctness_errors`, `absolute_error`, and `relative_error`. Both `absolute_error` and `relative_error` are calculated as means (should we use maxes instead?). We also do not calculate those stats when the outputs are sparse tensors, due to complexity / memory constraints.

An example of the output for flaggems + opinfo is here:
https://gist.github.com/PaliC/4dade4f874b6f39447b368ecdbab6e7d
repro:
And to show performance, this is torchbench + aten:
https://gist.github.com/PaliC/4c166eaf0bfb50364421f46d599dd961
repro:
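Based on the stats listed above, a single verbose entry might look roughly like this. This is a hypothetical sketch, not the real schema: the value types, the `speedup` baseline, and the comments are assumptions.

```python
# A hypothetical shape for one verbose entry; only the six field names
# come from the PR description — everything else is an assumption.
verbose_entry = {
    "correctness_score": 1.0,   # fraction of correctness checks passed
    "benchmark_time": 0.0123,   # measured time for the benchmarked op
    "speedup": 1.8,             # relative to some reference baseline
    "correctness_errors": [],   # errors captured during correctness checking
    "absolute_error": 3.2e-6,   # mean absolute error vs. the reference output
    "relative_error": 1.1e-7,   # mean relative error vs. the reference output
}
```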
Actual Changes
There are two significant changes in this PR: 1) changing the `allclose` function, and 2) adding `compute_errors`.
Originally, the `allclose` function was always wrapped in a try/except block. In practice, it signaled that tensors were not close by raising a failing assertion; the only inputs for which it actually returned False were plain floats and the like. This was a bug, since we never really got the true return value out of `allclose`. The new logic bakes the try/except into `allclose` itself and has a helper function (`_all_close`) raise in exactly the cases where the old `allclose` would return False or fail the assertion. I also got rid of the recursion, as I actually hit recursion-depth errors while testing. This comes with the added benefit that logs no longer contain a bunch of tensor-mismatch errors.
`compute_errors` brings in logic that gives the user a relative and an absolute error to look at. It should be purely additive to the code and should not affect performance. I added some unit tests to convince folks that it works.
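The mean-based errors described above could be computed roughly like this. A hypothetical sketch over flat sequences of numbers, assuming standard mean absolute/relative error formulas and an `eps` guard I added; the real function operates on tensors and skips sparse outputs:

```python
def compute_errors(ref, res, eps=1e-12):
    """Return (mean absolute error, mean relative error) of res vs. ref.

    eps guards against division by zero for zero-valued reference entries
    (an assumption; the real handling may differ).
    """
    abs_errs = [abs(r - s) for r, s in zip(ref, res)]
    rel_errs = [e / (abs(r) + eps) for e, r in zip(abs_errs, ref)]
    absolute_error = sum(abs_errs) / len(abs_errs)
    relative_error = sum(rel_errs) / len(rel_errs)
    return absolute_error, relative_error
```

Whether means or maxes are the right aggregation is exactly the open question raised above; swapping `sum(...)/len(...)` for `max(...)` would give the worst-case variant.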
The return values of `eval_one_op`, `eval_correctness`, and `eval_performance` are also changed to expose the relevant information for the verbose dict.

Testing
For testing, I ran the following commands on this branch and on main, and verified that the outputs / logs are the same (apart from timestamps and tensor-mismatch errors):