
Commit 38985bb

docs: Add suggestions for dealing with inconsistent test runs.
1 parent 0694319 commit 38985bb

File tree

- README.md
- doc/Inconclusive.md

2 files changed: +50 -0 lines changed

README.md

Lines changed: 3 additions & 0 deletions

@@ -101,6 +101,7 @@ See the [examples folder](./examples/) for more common usage examples.
 - [Baseline Comparisons](#baseline-comparisons)
 - [Statistical Significance Testing](#statistical-significance-testing-t-test)
 - [Direct API Usage](#direct-api-usage)
+- [Fixing Inconclusive Tests](#fixing-inconclusive-tests)
 - [Writing JavaScript Mistakes](#writing-javascript-mistakes)

 ## Sponsors
@@ -826,6 +827,8 @@ This helps identify when a benchmark shows a difference due to random variance v

 **Note**: Running the entire benchmark suite multiple times may still show variance in absolute numbers due to system-level factors (CPU frequency scaling, thermal throttling, background processes). The t-test helps determine if differences are statistically significant within each benchmark session, but results can vary between separate benchmark runs due to changing system conditions.

+See also: [Fixing Inconclusive Tests](doc/Inconclusive.md).
+
 ### Direct API Usage

 You can also use the t-test utilities directly for custom analysis:

doc/Inconclusive.md

Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
# Fixing Inconclusive Tests

A t-test looks at the distributions of the two sets of results and tries to determine whether they overlap in a way
that makes the difference in their averages meaningful, or whether it is just noise in the results. A run with a
bimodal distribution, for instance one caused by problems with the machine the tests are running on or by the Node.js
runtime doing work in the background, can make the comparison inconclusive. Here are a few common causes.

## Random Input Data

Variability in the inputs between runs can lead to big changes in the runtime of an algorithm. Particularly with code
that sorts, filters, or conditionally operates on its input, certain combinations of data will produce wildly different
run times from one loop to the next, or occasionally from one sample to the next. The Central Limit Theorem (that over
enough samples the results converge toward the mean) does not invalidate the Gambler's Fallacy (the hope that they will
do so before I go bankrupt).

It is better to do your fuzzing in fuzz tests and to pick representative data for your benchmarks, informed in part by
the results of those fuzz tests and by other bug reports.

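As a rough sketch of the difference (this assumes the `Suite` API shown in the main README; the dataset and benchmark
names are made up for illustration):

```js
const { Suite } = require('bench-node');

const suite = new Suite();

// Representative input, built once outside the timed code. In practice it
// would be shaped by your fuzz-test findings and real bug reports.
const representative = Array.from({ length: 1000 }, (_, i) => ({
  id: i,
  score: (i * 31) % 97, // deterministic, but not already sorted
}));

suite.add('sort representative data', function () {
  // Copy first so every iteration sorts the same unsorted input.
  const copy = representative.slice();
  copy.sort((a, b) => a.score - b.score);
});

// Anti-pattern for comparison: fresh random input on every iteration means
// each sample is effectively measuring a different problem.
suite.add('sort random data (noisy)', function () {
  const random = Array.from({ length: 1000 }, () => ({ score: Math.random() }));
  random.sort((a, b) => a.score - b.score);
});

suite.run();
```
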
## Underprovisioned VMs, Oversubscribed Hardware

For a clean micro-benchmark, we generally want to be the only user of the machine at the time. There are a number of
known issues with running benchmarks on machines that are thermally throttling, or on cheap VMs that allocate CPU time
to processes on a best-effort basis. In particular, Docker containers limited with `cpu-shares` are especially poor
targets for running benchmarks, because the quota might expire for a timeslice in the middle of one test or between
benchmarks in a single Suite. This creates an unfair advantage for the first test, and/or a lot of noise in the
results. We are currently investigating ways to detect this sort of noise, and whether the t-tests are sufficient to
do so.

## Epicycles in GC or JIT Compilation

If the warmup time is insufficient for V8 to optimize the code, optimization may kick in during the middle of a sample,
which will introduce a bimodal distribution of results (before and after optimization). There is currently no way to
adjust the warmup time of `bench-node`, but one should be added as a feature.

One of the nastiest performance issues to detect in garbage-collected code is the allocation epicycle. It happens when
the early parts of a calculation create lots of temporary data, but not enough to cross the incremental or full GC
threshold, so the next function in the call sequence is routinely the one that exceeds it. This is especially common in
code that generates a JSON or HTML response from a series of calculations: the response is the single biggest
allocation in the sequence, so it gets blamed in the performance report for the lion's share of the CPU time, even
though most of the garbage being collected was created earlier.

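A schematic example of the shape of the problem (purely illustrative plain Node.js, not a `bench-node` API):

```js
// `summarize` creates most of the short-lived garbage, but the heap often
// only crosses the GC threshold inside `render`, so `render` eats the pause
// and looks slow in the profile.
function summarize(records) {
  return records
    .map((r) => ({ ...r, total: r.a + r.b }))    // temporary objects
    .filter((r) => r.total > 0)                  // temporary array
    .map((r) => ({ id: r.id, total: r.total })); // more temporaries
}

function render(summary) {
  // The single biggest allocation in the sequence: the response string.
  return JSON.stringify({ items: summary });
}

const records = Array.from({ length: 10_000 }, (_, i) => ({ id: i, a: i, b: 5 - i }));
render(summarize(records));
```
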
If you change `minTime` up or down, that will alter the number of iterations per sample, which may smooth out the
results. You can also try increasing `minSamples` to collect more samples. But also take this as a hint that your code
may have a performance bug that is worth prioritizing.

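A sketch of turning those knobs, assuming `minTime` and `minSamples` can be passed as per-benchmark options to
`suite.add()` (check the main README for the exact option placement and units; the values below are illustrative, not
recommendations):

```js
const { Suite } = require('bench-node');

const suite = new Suite();

suite.add(
  'parse a representative payload',
  // Assumed option placement; illustrative values only.
  { minTime: 0.5, minSamples: 20 },
  function () {
    JSON.parse('{"user":{"id":1,"roles":["admin","editor"]}}');
  },
);

suite.run();
```
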
In production code, particularly where high-percentile (p90/p95/p99) values are used as a fitness test, it is sometimes
better to choose the algorithm with the more consistent runtime over the one with the supposedly better average
runtime. This is also true where DDoS scenarios are possible - the attacker will always choose the worst, most
asymmetric request to send to your machine, and mean response time will not matter one whit. If `bench-node` is
complaining, the problem may not be `bench-node`.
