
Conversation

@posborne
Collaborator

This tool is extracted from some early work I did in a Jupyter notebook to analyze and compare results relative to a baseline benchmark run. The generated artifact is a single HTML file with both tabular and graphic data comparing benchmark results from each benchmark JSON input.


Attached is an example run (mostly to get some data). The variants benchmarked here are:

Some interesting decisions here (not css bikeshedding) include:

  • Use of CV (Coefficient of Variation) as central measure to determine both if a benchmark is too noisy to be useful and whether delta from baseline is statistically significant.
  • Use of lower 25th percentile as the reference measure between benchmarks; the assumption is that most interference on benchmark runs will result in increased time and, for this specific case, something closer to the minimum time is a better measure. There's an argument to be made for using the min, but that feels much riskier (see the sketch after this list).
  • Using cycles rather than time as the base measure (should be helping with freq. scaling impacts)
  • Generating output as single html with everything embedded.

Data here was from an x86_64 Linux machine, but not fully tuned for perfectly consistent behavior.
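
For reference, here's a minimal sketch of how those first two measures combine (not the tool's actual code; it just assumes a plain list of per-iteration cycle counts for one benchmark/variant):

    # Sketch only: CV and lower-quartile cycles for one benchmark's samples.
    # The 5% CV cutoff mirrors the "inconsistent" threshold in the report template.
    import statistics

    def summarize(cycles: list[int]) -> dict:
        mean = statistics.mean(cycles)
        cv = 100.0 * statistics.stdev(cycles) / mean   # coefficient of variation, percent
        p25 = statistics.quantiles(cycles, n=4)[0]     # lower quartile: the reference measure
        return {"p25": p25, "cv": cv, "too_noisy": cv > 5.0}

    def delta_vs_baseline(candidate_p25: float, baseline_p25: float) -> float:
        # Relative change of the candidate's p25 against the baseline's, in percent.
        return 100.0 * (candidate_p25 - baseline_p25) / baseline_p25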

I think the tool's output is probably what I would be looking to use as the output we could publish on each nightly run, though I find it useful in isolation as well. As demonstrated by this input data, it's often useful to be able to compare more than two engine versions (with different flags to modify the config, etc.).

Output is a single HTML file, gzipped since GitHub doesn't allow attaching .html directly:
benchmark-results.html.gz

Some screenshots as well, for quick reference.

Overview Table / Links:

Individual Benchmark Viz; libsodium-box_easy2 is an example of a benchmark we should probably just not use, as you can see the large standard deviation across all variants.

{% set stats = benchmark.stats if prefix == baseline else benchmark.stats.relative[prefix] %}
{% if prefix == baseline %}
{% set class = "inconsistent" if stats.cv > 5 else "" %}
<td class="{{ class }}">{{ "%.2f"|format(stats.p25) }} +/- {{ "%.2f"|format(stats.cv) }}%</td>
@posborne
Collaborator Author


todo: no reason to format cycles (p25) as a float here; it will always be an integer.

@fitzgen
Member

fitzgen commented May 1, 2025

Super exciting to see some movement here!

Regarding your interesting decisions:

Use of CV (Coefficient of Variation) as central measure to determine both if a benchmark is too noisy to be useful and whether delta from baseline is statistically significant.

Using CV for determining whether a benchmark is too noisy seems fine by me, but using it for statistical significance doesn't seem like the right approach to me. I think we should continue to use effect size confidence intervals for the latter (and for displaying the results to users).
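
For concreteness, the kind of check I mean looks roughly like this (a bootstrap sketch of an effect-size interval on the candidate/baseline ratio; not a prescription for the exact method, nor necessarily what sightglass does today):

    # Sketch: bootstrap a confidence interval for the ratio of candidate to
    # baseline cycles; call it significant only if the interval excludes 1.0.
    import random
    import statistics

    def bootstrap_ratio_ci(candidate, baseline, iters=10_000, alpha=0.05):
        ratios = []
        for _ in range(iters):
            c = statistics.mean(random.choices(candidate, k=len(candidate)))
            b = statistics.mean(random.choices(baseline, k=len(baseline)))
            ratios.append(c / b)
        ratios.sort()
        lo = ratios[int(alpha / 2 * iters)]
        hi = ratios[int((1 - alpha / 2) * iters) - 1]
        return lo, hi, (hi < 1.0 or lo > 1.0)  # (lower, upper, significant?)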

Use of lower 25th percentile as the reference measure between benchmarks; the assumption is that most interference on benchmark runs will result in increased time and, for this specific case, something closer to the minimum time is a better measure. There's an argument to be made for using the min, but that feels much riskier.

Seems okay in principle. My one concern in practice: will this mean that we require ~4x more benchmark runs to get a large enough bottom-quartile sample size to feed into the rest of the analyses?

Using cycles rather than time as the base measure (should be helping with freq. scaling impacts)

👍 (and/or instruction counts, if we want an even-more-stable measure)

Generating output as single html with everything embedded.

big 👍


Backing up a bit: the emitted page looks great for analyzing a particular run against a baseline (e.g. feature branch versus main). What we've always discussed for the nightly-benchmarking visualization was to compare many runs (not just two) and plot them over time, similar to Are We Fast Yet: https://arewefastyet.com/win11/benchmarks/overview?numDays=60

Are you thinking about that angle at all? Not that you have to do everything all at once before anything can land or anything like that, but I'm just trying to suss out your thoughts and plans here.

Double checking: have you read our historical RFCs on benchmarking?

They should provide some nice context on what our north stars have historically been (ones we unfortunately haven't had enough resources to invest much in implementing thus far). Not saying we can't revisit anything there if we have good motivation to do so, but we should deviate intentionally and with good motivation, not accidentally because we forgot about something we had previously thought through.


Finally, another neat potential future improvement to this comparing-two-runs view: it would be super cool to have graphs of the normalized-to-the-baseline results, filtered for only statistically significant benchmarks, and ideally with confidence-interval bars overlaid on top. Something vaguely like this:

                                   Execution

               2.0 |
normalized         |      #
cycles         1.0 | - - -#- - - - - - - - - - - - - -#- - - - -
(lower is          |      #               #           #
 better)       0.0 |      #               #           #
                   +--------------------------------------------
                     spidermonkey   pulldown-cmark   bz2   ...

                                  benchmarks
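
(In the HTML report this could be a real chart, of course; e.g. a few lines of matplotlib roughly like the following, where the benchmark names come from the mock-up above and the numbers are placeholders, not real results.)

    # Placeholder data only; the point is the shape of the chart: bars of
    # baseline-normalized cycles with confidence-interval error bars and a
    # dashed line at 1.0 for "no change".
    import matplotlib.pyplot as plt

    benchmarks = ["spidermonkey", "pulldown-cmark", "bz2"]
    normalized = [1.8, 0.6, 1.1]          # made-up candidate/baseline ratios
    ci_half_width = [0.15, 0.05, 0.20]    # made-up confidence-interval half-widths

    fig, ax = plt.subplots()
    ax.bar(benchmarks, normalized, yerr=ci_half_width, capsize=4)
    ax.axhline(1.0, linestyle="--", color="gray")
    ax.set_ylabel("normalized cycles (lower is better)")
    ax.set_title("Execution")
    fig.savefig("normalized-execution.png")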

@fitzgen
Member

fitzgen commented May 1, 2025

Oh one final note: are you leveraging a lot of python libraries whose functionality would otherwise be hard to replicate in Rust here? Ideally, this functionality would be built into the existing (Rust-based) sightglass CLI executable...

@posborne
Collaborator Author

posborne commented May 1, 2025

Oh one final note: are you leveraging a lot of python libraries whose functionality would otherwise be hard to replicate in Rust here? Ideally, this functionality would be built into the existing (Rust-based) sightglass CLI executable...

I took a brief look at this and, for this prototype, just ported from my notebook, but porting to Rust should be doable. Polars should provide everything pandas does for working with data frames (not strictly necessary, but nice for the stats). Plotting doesn't have as nice a story, but for the HTML case it should be easy enough to emit the data and tie it together with one of the JS charting libraries.
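
To give a sense of it, the dataframe side is only a few lines; here's a rough sketch using the polars Python API (the Rust crate has equivalent group-by/aggregation operations), with made-up column names rather than the actual sightglass schema:

    # Sketch with placeholder columns/values: per-(benchmark, engine) p25 and CV.
    import polars as pl

    df = pl.DataFrame({
        "benchmark": ["bz2", "bz2", "bz2", "bz2"],
        "engine":    ["main", "main", "candidate", "candidate"],
        "cycles":    [1000, 1040, 930, 950],
    })

    summary = df.group_by(["benchmark", "engine"]).agg(
        pl.col("cycles").quantile(0.25).alias("p25"),
        (pl.col("cycles").std() / pl.col("cycles").mean() * 100).alias("cv_pct"),
    )
    print(summary)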

@posborne
Collaborator Author

posborne commented May 1, 2025

Backing up a bit: the emitted page looks great for analyzing a particular run against a baseline (e.g. feature branch versus main). What we've always discussed for the nightly-benchmarking visualization was to compare many runs (not just two) and plot them over time, similar to Are We Fast Yet: https://arewefastyet.com/win11/benchmarks/overview?numDays=60

Are you thinking about that angle at all? Not that you have to do everything all at once before anything can land or anything like that, but I'm just trying to suss out your thoughts and plans here.

Yeah, I have this in mind; I think it's still an open question of how many issues we'll have in comparing benchmark results for nightly runs across different days. Right now I'm thinking we do a nightly run that:

  1. Builds the engine off main and a couple previous releases.
  2. Compares these engine versions (maybe using current tagged release as baseline).
  3. Records the run data JSON for future use. Generating the HTML for access, and storing the raw JSON data, would probably be done by pushing to a gh-pages branch (or whatever we want to name it).

This gives us some immediately useful data and lets us build up a window of data that we can then use to determine whether it's useful for comparison over time/git-sha.

@posborne
Collaborator Author

posborne commented May 1, 2025

Double checking: have you read our historical RFCs on benchmarking?
...

Thank you for these links, I hadn't seen these and there are some great details, especially related to the statistical methods mentioned. I may follow up with some specific questions once I've done a more thorough review -- as you note, we haven't had (and probably won't have) time to deliver all of the stated goals, but I'm optimistic we can deliver something like this near-term that hopefully gives us:

  • Moderately reliable regular benchmark timings which can highlight a performance regression (though it will require human review), available by just going to the sightglass gh-pages.
  • A more friendly developer experience to encourage benchmarking experiments locally.

In part, I think even imperfect (so long as it isn't misleading) but more accessible data is probably a significant improvement that will drive interest and further improvements.

@fitzgen
Member

fitzgen commented May 1, 2025

I think it's still an open question of how many issues we'll have in comparing benchmark results for nightly runs across different days.

Yeah, I think that, without a dedicated, non-virtual server (and someone to sys admin it), we won't be able to use cycles or anything like that. We would have to use instruction counts for this.

Additionally, I think this is where we would need to start saving the raw data results of nightly runs somewhere (handwaves) rather than building 60 libwasmtime-bench-api.so files and running the benchmark suite in each one of them in order to get an HTML page showing the last 60 days of results, for example.

So clearly, this historical, nightly view is future work.

@fitzgen
Member

fitzgen commented May 1, 2025

Plotting doesn't have as nice a story but for the html case it should be easy enough to emit data and tie it together with one of the js charting libs.

I haven't used it before but there is also https://crates.io/crates/plotters, fwiw.

Whatever is easiest for you though (JS on client side vs Rust at report generation time) should be all good.

@fitzgen
Member

fitzgen commented May 1, 2025

In part, I think even imperfect (so long as it isn't misleading) but more accessible data is probably a significant improvement that will drive interest and further improvements.

Agreed. Big fan of incrementalism. Just want to be sure that we are actually incrementally moving towards our north star, and not moving to local maxima that we then have to throw away. I think the most important factor here is, as you say, not being misleading, and the best way to do that is to be intentional with which statistical analyses we employ.

Right now I'm thinking we do a nightly run that:

  1. Builds the engine off main and a couple previous releases.

  2. Compares these engine versions (maybe using current tagged release as baseline).

  3. Records the run data JSON for future use. Generation of the HTML for access and storage for raw JSON data would probably be done by pushing to a gh-pages branch (or whatever we want to name it).

This would be super sweet, would absolutely love to have this all wired up and published on gh-pages.

Couple tiny nitpicks/suggestions, but I don't feel super strongly about them if you have different intuitions:

  • I think comparing yesterday's nightly and today's nightly is the way to go for this milestone of benchmarking infrastructure (for lack of a better term). Comparing to the last tagged release doesn't seem as useful to me as comparing to the day before. I think that identifying longer-term trends is probably better done in the historical Are We Fast Yet-style report that we've discussed, and I fear that adding too many comparisons to this what-is-the-state-of-the-current-nightly-like report will be distracting rather than helpful.

  • I think storing the raw, historical data as CSV might be a little nicer (the sightglass CLI supports both) since it will likely have better/easier support in one-off stats tools (e.g. doing a one-off analysis in R is probably easier with the CSV format than JSON); a sketch of that kind of one-off pass follows below.
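
Something like this, for instance (the file layout and column names here are hypothetical, not an existing sightglass format):

    # Hypothetical layout: one CSV per nightly run under nightlies/, each with
    # "date", "benchmark", and "p25_cycles" columns.
    import glob
    import matplotlib.pyplot as plt
    import polars as pl

    frames = [pl.read_csv(path) for path in sorted(glob.glob("nightlies/*.csv"))]
    history = pl.concat(frames).filter(pl.col("benchmark") == "pulldown-cmark")

    plt.plot(history["date"].to_list(), history["p25_cycles"].to_list(), marker="o")
    plt.ylabel("p25 cycles")
    plt.title("pulldown-cmark nightly trend (hypothetical data)")
    plt.savefig("trend.png")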

@posborne
Collaborator Author

posborne commented May 1, 2025

Finally, another neat potential future improvement to this comparing-two-runs view: it would be super cool to have graphs of the normalized-to-the-baseline results, filtered for only statistically significant benchmarks, and ideally with confidence-interval bars overlaid on top. Something vaguely like this:

I did some quick hacking to see how the Rust ecosystem for data-science work is to use, with the idea of having something like a report command. It went alright: using polars and plotlars (which targets plotly.js), I was able to emit normalized data as you suggested without too much trouble. I haven't written off plotters, but the story for embedding into a single HTML page looks like it would be messy (probably embedding wasm module(s) in the HTML).


poc-polars-plotlars.html.zip

@abrown
Member

abrown commented May 1, 2025

@posborne, I read through this today as well as all the comments and just wanted to leave an encouraging note; @fitzgen has already covered the important points. I would just add that:

  1. Thank you for tackling this, as the UI part of this is quite complex, IMHO; there are quite a few dimensions that may not be apparent at first glance (engine, engine version, engine features, benchmark, architecture, measurement, date, etc.) and different groups of people care about different dimensions, aggregating them in different ways.
  2. Also, I'm all for an incremental approach here and I like the idea of controlling all of this infrastructure ourselves, but, if you find yourself thinking "who is going to maintain this when it breaks?" or "how will other users add new views?" it is not unreasonable to bring back other options: there's a whole spectrum of tools to do this kind of thing that range from zero control for us to lots of control but lots of maintenance. What I mean is: if this path turns out to be difficult, there are others!

@posborne
Collaborator Author

I've got a little bit more work to do to get it to a PR, but I now have a sightglass subcommand in Rust that reproduces what the Python script does and is working fine; I'm going to close this PR in favor of that future tool, which should ease use and maintenance.

