
[utils] Remove useless compare.py output #274


Open
tomershafir wants to merge 1 commit into main from utils-compare-remove-useless-output

Conversation

@tomershafir (Contributor) commented Aug 5, 2025

The last part of the output produced by `print(d.describe())` aggregates numbers from different programs and doesn't statistically make sense, making it pure noise.

Next, I plan to support quantile merging and stddev for the mean.

@tomershafir force-pushed the utils-compare-remove-useless-output branch from 57ad7a4 to 2bb04cd on August 5, 2025 at 12:29
@jcohen-apple

Could you make this controllable with a flag? My only concern is that if there are downstream projects / CI jobs that somehow rely on parsing those metrics, they could break if they just disappear. Either make them make sense or have a flag to silence the printout IMO.

@tomershafir (Contributor, Author)

Making these metrics make sense is not feasible without breaking the format; it would require inserting another dimension for the workload. Also, such a flag would be ugly. Shouldn't we assume people don't use it because it's garbage?

@guy-david (Contributor)

I'm not too familiar with what's actually being removed; can you share the output? If it can be considered a debug print or worse, then omitting it sounds fine to me, but otherwise we could add a flag like --minimal-names that controls this aspect of the output.
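For reference, a minimal sketch of what a silencing flag could look like, assuming compare.py's existing argparse setup; the flag name `--no-aggregate-stats` and the placeholder data are hypothetical, not existing options of the script:

```python
# Hypothetical sketch: gate the describe() printout behind a flag.
# Neither the flag name nor the option exists in the current compare.py.
import argparse

import pandas as pd

parser = argparse.ArgumentParser(description="Compare benchmark result files.")
parser.add_argument(
    "--no-aggregate-stats",
    action="store_true",
    help="Suppress the aggregate statistics block printed via d.describe().",
)
args = parser.parse_args()

# "d" stands for the merged results DataFrame compare.py builds elsewhere;
# toy data here just to keep the sketch runnable.
d = pd.DataFrame({"exec_time": [0.5, 1.2, 3.4]})

if not args.no_aggregate_stats:
    print(d.describe())
```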

@tomershafir (Contributor, Author) commented Aug 11, 2025

It's the last part of the output (not debug output), for example:

           exec_time                             compile_time                                   size                           
l/r              lhs            rhs         diff          lhs          rhs        diff           lhs           rhs         diff
count  4310.000000    4310.000000    4310.000000  3463.000000  3463.000000  483.000000  3.463000e+03  3.463000e+03  3463.000000
mean   457.936342     458.528907     0.001676     0.103256     0.103186     0.000674    7.371460e+04  7.371937e+04 -0.000023   
std    10880.264844   10887.120051   0.069010     0.862819     0.862152     0.016053    9.972944e+04  9.976209e+04  0.002889   
min    0.000500       0.000500      -0.302439     0.000000     0.000000    -0.057416    1.728800e+04  1.728800e+04 -0.160348   
25%    0.000500       0.000500      -0.000044     0.000000     0.000000    -0.007525    3.407200e+04  3.407200e+04  0.000000   
50%    0.007200       0.007200       0.000000     0.000000     0.000000     0.000000    5.074400e+04  5.074400e+04  0.000000   
75%    0.051861       0.052006       0.000092     0.000000     0.000000     0.008551    1.048240e+05  1.048240e+05  0.000000   
max    321317.722681  321886.186324  1.904244     24.240900    24.227100    0.060109    2.704328e+06  2.704328e+06  0.042861 

The problem I have with a flag is that this output doesn't make sense at all. If there is a single benchmark, there is only a single value that doesn't require statistical aggregation. If there are multiple benchmarks, this output means nothing.

@tomershafir (Contributor, Author)

@llvm/pr-subscribers-testing-tools maybe you can provide insights?

@MatzeB (Contributor) commented Aug 11, 2025

> If there are multiple benchmarks, this output means nothing.

This seems a bit of a strong statement... While you are probably right that the "mean" value does not make statistical sense, the "count", "min", "max", and quantile aggregates seem sensible to me (especially in the default mode of compare.py, where only a couple of rows at the beginning and end of the data are shown), and one would have to debate whether they are worthwhile enough to show (you can probably convince me to hide them by default).

It's a bit unfortunate that the describe() function in pandas comes as-is with nearly no way to modify it, so if you want different aggregates you are forced to implement similar functionality from scratch yourself...

That said, I would be happy to see some actual development happen on compare.py and more appropriate aggregates (harmonic mean?)... Would it make sense to wait for the improved aggregates/statistics before landing this? (LGTM when replaced with better aggregates.)
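As a possible middle ground, here is a minimal sketch of replacing the blanket `print(d.describe())` with hand-picked aggregates (count, min, max, quantiles), assuming `d` is the merged numeric DataFrame compare.py already builds; the helper name and the toy data are invented for illustration:

```python
# Sketch only: print selected aggregates instead of the full describe() block.
# "d" is assumed to be the merged numeric results DataFrame; the function name
# print_selected_aggregates is hypothetical.
import pandas as pd

def print_selected_aggregates(d: pd.DataFrame) -> None:
    # Keep only the aggregates that remain meaningful across different programs.
    basic = d.agg(["count", "min", "max"])
    quantiles = d.quantile([0.25, 0.50, 0.75])
    quantiles.index = ["25%", "50%", "75%"]
    print(pd.concat([basic, quantiles]))

# Example with toy data standing in for the real results.
example = pd.DataFrame({"exec_time": [0.5, 1.2, 3.4], "size": [1024, 2048, 4096]})
print_selected_aggregates(example)
```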

@MatzeB (Contributor) commented Aug 11, 2025

FWIW: I wouldn't worry about CI too much... I meant for this script to be used by humans first! I'm not immediately aware of any CI depending on it, and if there are, I'd be happy to argue that they should read the lit JSON files directly (or that we add a second script that does a more low-level conversion from lit JSON to something easy to post-process like CSV/TSV files).
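A rough sketch of the kind of lit-JSON-to-CSV helper mentioned above; it assumes lit's JSON output (`--output`) exposes a top-level `tests` list whose entries carry `name` and `metrics` fields, which should be treated as an assumption to verify against real output rather than a spec:

```python
# Sketch of a low-level lit JSON -> CSV conversion; the assumed JSON layout
# (a top-level "tests" list with "name" and "metrics" per test) is my reading
# of lit's --output format, not a guarantee. The function name is hypothetical.
import csv
import json

def lit_json_to_csv(json_path: str, csv_path: str) -> None:
    with open(json_path) as f:
        data = json.load(f)
    tests = data.get("tests", [])
    # Take the union of metric names so every row shares the same columns.
    metric_names = sorted({m for t in tests for m in t.get("metrics", {})})
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["name"] + metric_names)
        for t in tests:
            metrics = t.get("metrics", {})
            writer.writerow([t["name"]] + [metrics.get(m, "") for m in metric_names])

# Usage (hypothetical file names):
# lit_json_to_csv("results.json", "results.csv")
```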

@tomershafir (Contributor, Author)

Yea, I guess I expressed that too strongly. Sounds good to me; I'll try to improve the tool first and re-evaluate this patch after that.
