[utils] Remove useless compare.py output #274
Conversation
The last part of the output, produced by `print(d.describe())`, aggregates numbers from different programs and doesn't statistically make sense, making it pure noise. Next, I plan to support quantile merging and stddev for the mean.
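For context, a minimal sketch of the kind of trailing summary in question. The DataFrame, its column name, and the numbers below are hypothetical illustrations, not taken from compare.py:

```python
import pandas as pd

# Hypothetical results table: one row per benchmark program. The
# column name mirrors typical compare.py metrics but is an assumption.
d = pd.DataFrame(
    {"Exec_Time": [0.12, 3.40, 45.7]},
    index=["prog_a", "prog_b", "prog_c"],
)

# describe() pools all programs into one distribution, so its "mean"
# averages runtimes of unrelated workloads: the value the PR author
# calls statistically meaningless.
print(d.describe())
```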
Force-pushed from 57ad7a4 to 2bb04cd.
Could you make this controllable with a flag? My only concern is that if there are downstream projects / CI jobs that somehow rely on parsing those metrics, they could break if the metrics just disappear. Either make them make sense or add a flag to silence the printout, IMO.
Making these metrics make sense is not feasible without breaking the format; it would require inserting another dimension for the workload. Also, such a flag would be ugly. Shouldn't we assume people don't use it, because it's garbage?
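To illustrate the "extra dimension" mentioned here: per-program statistics only become meaningful once each program contributes several samples, which forces a grouped table. A minimal sketch with hypothetical column names and numbers, not compare.py's actual schema:

```python
import pandas as pd

# Hypothetical: several timed runs per program, i.e. the extra
# dimension the current flat per-program table lacks.
d = pd.DataFrame(
    {
        "Program": ["prog_a", "prog_a", "prog_b", "prog_b"],
        "Exec_Time": [0.12, 0.13, 3.40, 3.52],
    }
)

# Per-program summaries are statistically sound, but the grouped
# result no longer fits the existing single-table output format.
print(d.groupby("Program")["Exec_Time"].describe())
```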
I'm not too familiar with what's actually being removed; can you share the output? If it can be considered a debug print or worse, then omitting it sounds fine to me, but otherwise we could add a flag like
It's the last part of the output (non-debug), for example:
The problem I have with a flag is that this output doesn't make sense at all. If there is a single benchmark, there is only a single value, which doesn't require statistical aggregation. If there are multiple benchmarks, this output means nothing.
@llvm/pr-subscribers-testing-tools maybe you can provide insights?
This seems a bit of a strong statement... While you are probably right that the "mean" value does not make statistical sense, the "count", "min", "max", and quantile aggregates seem sensible to me (especially in the default mode of compare.py, where only a couple of rows at the beginning and end of the data are shown), and you would have to debate whether they are worthwhile enough to show (you can probably convince me to hide them by default). It's a bit unfortunate that the describe() function in pandas comes as-is with nearly no way to modify it, so if you want different aggregates you are forced to implement similar functionality from scratch yourself... That said, I would be happy to see some actual development happen on compare.py and more appropriate aggregates (harmonic mean?). Would it make sense to wait for the improved aggregates/statistics before landing this? (LGTM when replaced with better aggregates)
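If describe() were replaced rather than dropped, pandas' DataFrame.agg accepts arbitrary callables, so a custom summary (including the harmonic mean floated above) is straightforward. A hedged sketch, with hypothetical data and a hand-rolled helper:

```python
import numpy as np
import pandas as pd

def hmean(s: pd.Series) -> float:
    # Harmonic mean, the aggregate suggested in the review;
    # scipy.stats.hmean computes the same thing.
    return float(len(s) / np.sum(1.0 / s))

d = pd.DataFrame({"Exec_Time": [0.12, 3.40, 45.7]})  # hypothetical data

# DataFrame.agg takes arbitrary callables, so a hand-rolled summary
# can stand in for describe() when different aggregates are wanted.
print(d.agg(["count", "min", "max", "median", hmean]))
```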
FWIW: I wouldn't worry about CI too much... I meant for this script to be used by humans first! I'm not immediately aware of any CI depending on it, and if there is, I'd be happy to argue that it should read the lit JSON files directly (or that we add a second script that does a more low-level conversion from lit JSON to something easy to post-process, like CSV/TSV files).
Yeah, I guess I expressed that too strongly. Sounds good to me; I'll try to improve the tool first and re-evaluate this patch after that.