Add viz.py visualization tool #286
Conversation
This tool is extracted from some early work I did in a Jupyter notebook to analyze and compare results relative to a baseline benchmark run. The generated artifact is a single HTML file with both tabular and graphic data comparing benchmark results from each benchmark JSON input.
```jinja
{% set stats = benchmark.stats if prefix == baseline else benchmark.stats.relative[prefix] %}
{% if prefix == baseline %}
{% set class = "inconsistent" if stats.cv > 5 else "" %}
<td class="{{ class }}">{{ "%.2f"|format(stats.p25) }} +/- {{ "%.2f"|format(stats.cv) }}%</td>
```
todo: there's no reason to format cycles (p25) as a float here; it will always be an integer.
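As a rough illustration (a hypothetical helper with assumed field names, not the actual viz.py code), the statistics could be computed so that p25 stays integral before it ever reaches the template, e.g. with numpy:

```python
# Hypothetical helper, not the actual viz.py code: compute per-benchmark cycle
# statistics so that p25 can be rendered as an integer in the template.
import numpy as np

def cycle_stats(cycles: list[int]) -> dict:
    arr = np.asarray(cycles, dtype=np.int64)
    return {
        # method="lower" keeps the quartile on an observed sample, so it stays integral
        "p25": int(np.percentile(arr, 25, method="lower")),
        # coefficient of variation as a percentage, used for the "inconsistent" flag
        "cv": float(100.0 * arr.std(ddof=1) / arr.mean()),
    }
```

The template line above could then use `{{ "%d"|format(stats.p25) }}` instead of the `%.2f` float format.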
Super exciting to see some movement here! Regarding your interesting decisions:
Using CV for determining whether a benchmark is too noisy seems fine by me, but using it for statistical significance doesn't seem like the right approach to me. I think we should continue to use effect size confidence intervals for the latter (and for displaying the results to users).
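To make that distinction concrete, here is a rough sketch (Python with scipy; the function name, the 5% CV cutoff, and the 95% level are assumptions, not sightglass's actual analysis) where CV only gates noisiness and a confidence interval on the effect size decides significance:

```python
# Rough sketch, not sightglass's actual analysis: CV is only a noisiness filter,
# and a confidence interval on the difference of means decides significance.
import numpy as np
from scipy import stats

def compare(baseline, candidate, cv_limit=5.0, confidence=0.95):
    b, c = np.asarray(baseline, dtype=float), np.asarray(candidate, dtype=float)

    # Coefficient of variation (%): flag runs that are too noisy to trust at all.
    too_noisy = any(100.0 * x.std(ddof=1) / x.mean() > cv_limit for x in (b, c))

    # Effect size (difference of means) with a conservative t-based interval.
    diff = c.mean() - b.mean()
    se = np.sqrt(b.var(ddof=1) / len(b) + c.var(ddof=1) / len(c))
    dof = min(len(b), len(c)) - 1  # conservative degrees of freedom
    half_width = stats.t.ppf((1 + confidence) / 2, dof) * se

    return {
        "too_noisy": too_noisy,
        "effect": diff,
        "ci": (diff - half_width, diff + half_width),
        # Significant only if the interval around the effect size excludes zero.
        "significant": abs(diff) > half_width,
    }
```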
Seems okay in principle. My one concern in practice: will this mean that we require ~4x more benchmark runs to get a large enough bottom-quartile sample size to feed into the rest of the analyses?
👍 (and/or instruction counts, if we want an even-more-stable measure)
big 👍 Backing up a bit: the emitted page looks great for analyzing a particular run against a baseline; the other angle is comparing results over time across nightly runs. Are you thinking about that angle at all? Not that you have to do everything all at once before anything can land or anything like that, but I'm just trying to suss out your thoughts and plans here. Double checking: have you read our historical RFCs on benchmarking?
They should provide some nice context on what our north stars (which we unfortunately haven't had enough resources to invest in implementing very much thus far) have historically been. Not saying we can't revisit anything there if we have good motivation to do so, but we should deviate intentionally and with good motivation, not accidentally because we forgot about something that we previously thought about.

Finally, another neat potential future improvement to this comparing-two-runs view: it would be super cool to have graphs of the normalized-to-the-baseline results, filtered for only statistically significant benchmarks, and ideally with confidence-interval bars overlaid on top.
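Something vaguely like the following rough matplotlib sketch, where the benchmark names and numbers are made-up placeholders rather than real results:

```python
# Rough sketch only: made-up placeholder numbers, not real benchmark results.
import matplotlib.pyplot as plt

# (benchmark, mean runtime normalized to the baseline, CI half-width, significant?)
results = [
    ("pulldown-cmark", 1.08, 0.03, True),
    ("bz2",            1.01, 0.04, False),   # not significant: filtered out below
    ("spidermonkey",   0.93, 0.05, True),
]

significant = [r for r in results if r[3]]
names  = [r[0] for r in significant]
means  = [r[1] for r in significant]
errors = [r[2] for r in significant]

fig, ax = plt.subplots()
ax.bar(names, means, yerr=errors, capsize=4)
ax.axhline(1.0, linestyle="--", linewidth=1)  # baseline = 1.0
ax.set_ylabel("runtime relative to baseline")
ax.set_title("statistically significant changes vs. baseline")
fig.tight_layout()
fig.savefig("relative-results.png")
```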
Oh, one final note: are you leveraging a lot of Python libraries whose functionality would otherwise be hard to replicate in Rust here? Ideally, this functionality would be built into the existing (Rust-based) sightglass tooling.
I took a brief look at this; for this prototype I just ported from my notebook, but porting to Rust should be doable. Polars should provide everything pandas does for working with data frames (not strictly necessary, but nice for the stats work). Plotting doesn't have as nice a story, but for the HTML case it should be easy enough to emit the data and tie it together with one of the JS charting libraries.
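For a sense of how small the data-frame surface actually is, the core of the comparison is roughly the following (a sketch with assumed column names, not the notebook or viz.py code), which should map onto Polars' group_by/agg/join fairly directly:

```python
# Sketch with assumed column names ("suite", "engine", "cycles"); the real
# notebook/viz.py column names and details may differ.
import pandas as pd

def relative_to_baseline(df: pd.DataFrame, baseline: str) -> pd.DataFrame:
    # Per-benchmark, per-engine summary statistics.
    summary = (
        df.groupby(["suite", "engine"])["cycles"]
          .agg(p25=lambda s: s.quantile(0.25),
               cv=lambda s: 100.0 * s.std() / s.mean())
          .reset_index()
    )
    # Join each engine's numbers against the baseline engine's numbers.
    base = summary.loc[summary["engine"] == baseline, ["suite", "p25"]]
    merged = summary.merge(base, on="suite", suffixes=("", "_baseline"))
    merged["relative_p25"] = merged["p25"] / merged["p25_baseline"]
    return merged
```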
Yeah, I have this in mind; I think it's still an open question how many issues we'll have in comparing benchmark results for nightly runs across different days. Right now I'm thinking we do a nightly run that:
This gives us some immediately useful data and lets us build up a window of data that we can then use to determine whether it's useful for comparison over time/git-sha.
Thank you for these links; I hadn't seen these and there are some great details, especially related to the statistical methods mentioned. I may follow up with some specific questions once I've done a more thorough review -- as you note, we haven't (and probably won't have) time to deliver all of the stated goals, but I'm optimistic we can deliver something like this near-term that hopefully gives us:
In part, I think even imperfect (so long as it isn't misleading) but more accessible data is a significant improvement that will drive interest and further improvements.
Yeah, I think that, without a dedicated, non-virtual server (and someone to sysadmin it), we won't be able to use cycles or anything like that; we would have to use instruction counts for this. Additionally, I think this is where we would need to start saving the raw data results of nightly runs somewhere (handwaves) rather than building 60. So clearly, this historical, nightly view is future work.
I haven't used it before but there is also https://crates.io/crates/plotters, fwiw. Whatever is easiest for you though (JS on client side vs Rust at report generation time) should be all good.
Agreed. Big fan of incrementalism. Just want to be sure that we are actually incrementally moving towards our north star, and not moving to local maxima that we then have to throw away. I think the most important factor here is, as you say, not being misleading, and the best way to do that is to be intentional with which statistical analyses we employ.
This would be super sweet; I would absolutely love to have this all wired up and published. A couple of tiny nitpicks/suggestions, but I don't feel super strongly about them if you have different intuitions:
I did some quick hacking to see how the Rust ecosystem for data science stuff is to work with, with the idea of having something like a sightglass subcommand.
@posborne, I read through this today as well as all the comments and I just wanted to add an encouraging note; @fitzgen has already said all the important points. I would just add that:
I've got a little bit more work to do to get it to a PR, but I now have a sightglass subcommand in Rust that reproduces what the Python script does and is working fine; I'm going to close this PR in favor of that future tool, which should ease use and maintenance.

Attached is an example run (mostly to get some data). The variants benchmarked here are:
- `-W epoch-interrupts=y`
- `-W epoch-interrupts=y -W fuel=999999999999999`
- `-W all-proposals=y`

Some interesting decisions here (not CSS bikeshedding) include:
Data here was from an x86_64 Linux machine that was not fully tuned for perfectly consistent behavior.
I think the tool's output is probably what I would look to publish on each nightly run, though I find it useful in isolation as well. As demonstrated by this input data, it's often useful to be able to compare more than two engine versions (with different flags to modify the config, etc.).
The output is a single HTML file, gzipped since GitHub doesn't allow attaching .html files directly: benchmark-results.html.gz.
Some screenshots as well, for quick reference.
Overview Table / Links:

Individual Benchmark Viz: libsodium-box_easy2 is an example of a benchmark we should probably just not use, as you can see the large standard deviation across all of the runs.
