
Commit 53e20aa

feat: add t-test mode for statistical significance testing (#133)
- Add `ttest` option to `Suite` that automatically sets `repeatSuite=30`
- Implement Welch's t-test for comparing benchmark results
- Display significance stars (`*`, `**`, `***`) based on p-values
- Add T-Test Mode indicator in reporter output
- Update TypeScript definitions with `ttest` and `ReporterOptions`
- Add comprehensive tests for t-test utilities
- Add statistical-significance example demonstrating the feature
- Update documentation with usage and interpretation guide

The t-test compares 30 independent runs of each benchmark to determine if performance differences are statistically significant, helping identify real improvements vs. random variance.

Signed-off-by: RafaelGSS <rafael.nunu@hotmail.com>
1 parent 18302e3 commit 53e20aa

File tree

12 files changed: +1185 −30 lines changed


README.md

Lines changed: 155 additions & 2 deletions
@@ -94,6 +94,10 @@ See the [examples folder](./examples/) for more common usage examples.
- [Benchmark Modes](#benchmark-modes)
- [Operations Mode](#operations-mode)
- [Time Mode](#time-mode)
- [Baseline Comparisons](#baseline-comparisons)
- [Statistical Significance Testing](#statistical-significance-testing)
- [Using with Reporters](#using-with-reporters)
- [Direct API Usage](#direct-api-usage)
- [Writing JavaScript Mistakes](#writing-javascript-mistakes)

## Sponsors
@@ -116,13 +120,17 @@ A `Suite` manages and executes benchmark functions. It provides two methods: `ad
* `opsSec` {string} Operations per second.
* `iterations` {number} Number of iterations.
* `histogram` {Histogram} Histogram instance.
* `ttest` {boolean} Enable Welch's t-test for statistical significance testing. Automatically sets `repeatSuite=30`. **Default:** `false`.
* `reporterOptions` {Object} Reporter-specific options.
  * `printHeader` {boolean} Whether to print the system information header. **Default:** `true`.
  * `labelWidth` {number} Width for benchmark labels in the output. **Default:** `45`.
  * `alpha` {number} Significance level for the t-test (e.g., `0.05` for 95% confidence). **Default:** `0.05`.
* `benchmarkMode` {string} Benchmark mode to use. Can be `'ops'` or `'time'`. **Default:** `'ops'`.
  * `'ops'` - Measures operations per second (traditional benchmarking).
  * `'time'` - Measures actual execution time for a single run.
* `useWorkers` {boolean} Whether to run benchmarks in worker threads. **Default:** `false`.
* `plugins` {Array} Array of plugin instances to use.
* `repeatSuite` {number} Number of times to repeat each benchmark. Automatically set to `30` when `ttest: true`. **Default:** `1`.
* `minSamples` {number} Minimum number of samples per round for all benchmarks in the suite. Can be overridden per benchmark. **Default:** `10` samples.

If no `reporter` is provided, results are printed to the console.
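
As a quick sketch, the options documented above can be combined in a single configuration object (values here are illustrative, not recommendations):

```javascript
// Sketch: a Suite options object covering the fields documented above.
// All values shown are illustrative examples, not defaults to copy.
const suiteOptions = {
  ttest: true,            // enables Welch's t-test and forces repeatSuite=30
  benchmarkMode: 'ops',   // 'ops' or 'time'
  useWorkers: false,      // run benchmarks in the main thread
  minSamples: 10,         // minimum samples per round
  reporterOptions: {
    printHeader: true,    // print the system information header
    labelWidth: 45,       // width of benchmark labels in output
    alpha: 0.05,          // significance level for the t-test
  },
};

// const { Suite } = require('bench-node');
// const suite = new Suite(suiteOptions);
```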
@@ -147,6 +155,7 @@ const suite = new Suite({ reporter: false });
* `maxTime` {number} Maximum duration for the benchmark to run. **Default:** `0.5` seconds.
* `repeatSuite` {number} Number of times to repeat the benchmark run. **Default:** `1`.
* `minSamples` {number} Minimum number of samples for each round. **Default:** `10` samples.
* `baseline` {boolean} Mark this benchmark as the baseline for comparison. Only one benchmark per suite can be the baseline. **Default:** `false`.
* `fn` {Function|AsyncFunction} The benchmark function. Can be synchronous or asynchronous.
* Returns: {Suite}
@@ -644,6 +653,150 @@ Quick Operation with 5 repeats x 0.0000s (5 samples) v8-never-optimize=true

See [examples/time-mode.js](./examples/time-mode.js) for a complete example.

## Baseline Comparisons

You can mark one benchmark as the baseline and compare all other benchmarks against it:

```js
const { Suite } = require('bench-node');

const suite = new Suite();

suite
  .add('baseline', { baseline: true }, () => {
    // baseline implementation
    const arr = [1, 2, 3];
    arr.includes(2);
  })
  .add('alternative', () => {
    // alternative implementation
    const arr = [1, 2, 3];
    arr.indexOf(2) !== -1;
  });

suite.run();
```

Example output with a baseline:

```
baseline x 52,832,865 ops/sec (10 runs sampled) min..max=(18.50ns...19.22ns)
alternative x 53,550,219 ops/sec (11 runs sampled) min..max=(18.26ns...18.89ns)

Summary (vs. baseline):
  baseline (baseline)
  alternative (1.01x faster)
```

## Statistical Significance Testing (T-Test)

> Stability: 1.0 (Experimental)

When comparing benchmarks, especially on machines with high variance (cloud VMs, shared environments), raw ops/sec differences may not be meaningful. `bench-node` provides **Welch's t-test** to determine whether performance differences are statistically significant.

Welch's t-test is preferred over Student's t-test because it does not assume equal variances between the two samples, an assumption that rarely holds in benchmark scenarios.

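To make the statistic concrete, here is a minimal self-contained sketch of Welch's t-statistic and the Welch–Satterthwaite degrees of freedom. This is an illustrative re-implementation of the formula, not bench-node's internal code:

```javascript
// Welch's t-statistic and Welch–Satterthwaite degrees of freedom
// (illustrative sketch of the formula, not the library's internals).
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  // Unbiased (n - 1) estimator
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

function welchT(a, b) {
  const va = sampleVariance(a) / a.length; // variance of the mean of a
  const vb = sampleVariance(b) / b.length; // variance of the mean of b
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch–Satterthwaite approximation for the degrees of freedom
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```

Converting `t` and `df` into a p-value additionally requires the Student's t cumulative distribution function, which the library handles internally.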
### Enabling T-Test Mode

Enable t-test mode with `ttest: true`. This automatically sets `repeatSuite=30` to collect enough independent samples for reliable statistical analysis (per the Central Limit Theorem):

```js
const { Suite } = require('bench-node');

const suite = new Suite({
  ttest: true, // Enables t-test and auto-sets repeatSuite=30
});

suite
  .add('baseline', { baseline: true }, () => {
    let sum = 0;
    for (let i = 0; i < 100; i++) sum += i;
  })
  .add('optimized', () => {
    let sum = (99 * 100) / 2; // Gauss formula
  });

suite.run();
```

Example output:

```
T-Test Mode: Enabled (repeatSuite=30)

baseline x 1,234,567 ops/sec (300 runs sampled) min..max=(810.05ns...812.45ns)
optimized x 9,876,543 ops/sec (305 runs sampled) min..max=(101.23ns...102.87ns)

Summary (vs. baseline):
  baseline (baseline)
  optimized (8.00x faster) ***

Significance: * p<0.05, ** p<0.01, *** p<0.001
```

The asterisks indicate the significance level:

- `***` = p < 0.001 (0.1% risk of a false positive)
- `**` = p < 0.01 (1% risk of a false positive)
- `*` = p < 0.05 (5% risk of a false positive)
- (no stars) = not statistically significant

This helps distinguish differences caused by random variance from real performance improvements.

**How it works**: With `ttest: true`, each benchmark runs 30 times independently (via `repeatSuite=30`). The t-test compares the 30 ops/sec values from the baseline against the 30 ops/sec values from each other benchmark, accounting for run-to-run variance within that benchmark session.

**Note**: Running the entire benchmark suite multiple times may still show variance in absolute numbers due to system-level factors (CPU frequency scaling, thermal throttling, background processes). The t-test determines whether differences are statistically significant within each benchmark session; results can still vary between separate runs as system conditions change.

### Direct API Usage

You can also use the t-test utilities directly for custom analysis:

```js
const { welchTTest, compareBenchmarks } = require('bench-node');

// Raw sample data from two benchmarks (e.g., timing samples in nanoseconds)
const baseline = [100, 102, 99, 101, 100, 98, 103, 99, 100, 101];
const optimized = [50, 51, 49, 52, 50, 48, 51, 49, 50, 51];

// High-level comparison
const result = compareBenchmarks(optimized, baseline, 0.05);
console.log(result);
// {
//   significant: true,
//   pValue: 0.00001,
//   confidence: '99.99%',
//   stars: '***',
//   difference: 'faster',
//   tStatistic: 45.2,
//   degreesOfFreedom: 17.8
// }

// Low-level Welch's t-test
const ttest = welchTTest(optimized, baseline);
console.log(ttest);
// {
//   tStatistic: 45.2,
//   degreesOfFreedom: 17.8,
//   pValue: 0.00001,
//   significant: true,
//   mean1: 50.1,
//   mean2: 100.3,
//   variance1: 1.43,
//   variance2: 2.23
// }
```

#### Interpreting Results

- **`significant: true`** - The performance difference is statistically significant at the given alpha level
- **`pValue`** - Probability that the observed difference occurred by chance (lower = more confident)
- **`confidence`** - Confidence level (e.g., `"99.95%"` means 99.95% confident the difference is real)
- **`stars`** - Visual indicator of significance: `'***'` (p<0.001), `'**'` (p<0.01), `'*'` (p<0.05), or `''` (not significant)
- **`difference`** - Whether the first sample is `'faster'`, `'slower'`, or the `'same'` as the second

A common threshold is `alpha = 0.05` (95% confidence). If `pValue < alpha`, the difference is significant.

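The star notation can be derived from a p-value with a tiny helper that mirrors the thresholds listed above (an illustrative sketch, not the library's reporter code):

```javascript
// Map a p-value to the significance stars shown in the reporter output.
// Thresholds mirror the documented levels: p<0.001, p<0.01, p<0.05.
function significanceStars(pValue) {
  if (pValue < 0.001) return '***';
  if (pValue < 0.01) return '**';
  if (pValue < 0.05) return '*';
  return ''; // not statistically significant
}
```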
## Writing JavaScript Mistakes
648801

649802
When working on JavaScript micro-benchmarks, it’s easy to forget that modern engines use
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Statistical Significance Testing (T-Test)

This example demonstrates how to use Welch's t-test to determine whether benchmark differences are statistically significant.

## The Problem

When running benchmarks on shared or cloud environments, results can vary due to:

- CPU throttling
- Background processes
- Memory pressure
- Cache effects

A benchmark might show one implementation as "1.05x faster", but is that a real improvement or just noise?

## The Solution

Enable t-test mode with `ttest: true`:

```js
const { Suite } = require('bench-node');

const suite = new Suite({
  ttest: true, // Automatically sets repeatSuite=30
});

suite.add('baseline', { baseline: true }, () => {
  // ...
});

suite.add('alternative', () => {
  // ...
});
```

When `ttest: true` is set, the suite automatically:

1. Sets `repeatSuite=30` for all benchmarks (can be overridden)
2. Runs Welch's t-test to compare results against the baseline
3. Displays significance stars in the output

## Understanding the Output

The output shows significance stars next to each comparison:

```
Summary (vs. baseline):
  baseline/for-loop (baseline)
  forEach (1.80x slower) ***
  for-of-loop (1.09x slower) ***
  reduce (1.06x faster) **

Significance: * p<0.05, ** p<0.01, *** p<0.001
```

- `***` = p < 0.001 - Very high confidence (99.9%) the difference is real
- `**` = p < 0.01 - High confidence (99%) the difference is real
- `*` = p < 0.05 - Moderate confidence (95%) the difference is real
- (no stars) = Not statistically significant - the difference may be noise

## When to Use

1. **Comparing similar implementations** - Is the "optimization" actually faster?
2. **CI/CD pipelines** - Detect real regressions vs. flaky results
3. **Cloud/shared environments** - High variance requires statistical validation
4. **Small differences** - 5% faster could be noise or real

## Run the Example

```bash
node --allow-natives-syntax node.js
```

## Sample Output

```
baseline/for-loop x 85,009,221 ops/sec (311 runs sampled)
reduce x 89,853,937 ops/sec (321 runs sampled)
for-of-loop x 78,268,434 ops/sec (302 runs sampled)
forEach x 47,249,597 ops/sec (334 runs sampled)

Summary (vs. baseline):
  baseline/for-loop (baseline)
  forEach (1.80x slower) ***
  for-of-loop (1.09x slower) ***
  reduce (1.06x faster) **

Significance: * p<0.05, ** p<0.01, *** p<0.001
```
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
1+
/**
2+
* Statistical Significance Example
3+
*
4+
* This example demonstrates how to use Welch's t-test to determine
5+
* if benchmark differences are statistically significant.
6+
*
7+
* When running benchmarks, especially on shared/cloud environments,
8+
* small performance differences may just be random noise. The t-test
9+
* helps identify when a difference is real vs. just variance.
10+
*
11+
* Run with: node --allow-natives-syntax node.js
12+
*/
13+
14+
const { Suite } = require('../../lib');
15+
16+
// Enable t-test mode - this automatically sets repeatSuite=30 for all benchmarks
17+
const suite = new Suite({
18+
ttest: true,
19+
});
20+
21+
// Baseline: Simple array sum using for loop
22+
suite.add('baseline/for-loop', { baseline: true }, () => {
23+
const arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
24+
let sum = 0;
25+
for (let i = 0; i < arr.length; i++) {
26+
sum += arr[i];
27+
}
28+
return sum;
29+
});
30+
31+
// Alternative 1: Using reduce (typically slower due to function call overhead)
32+
suite.add('reduce', () => {
33+
const arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
34+
return arr.reduce((acc, val) => acc + val, 0);
35+
});
36+
37+
// Alternative 2: for-of loop (similar performance to for loop)
38+
suite.add('for-of-loop', () => {
39+
const arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
40+
let sum = 0;
41+
for (const val of arr) {
42+
sum += val;
43+
}
44+
return sum;
45+
});
46+
47+
// Alternative 3: forEach (slower due to function call per element)
48+
suite.add('forEach', () => {
49+
const arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
50+
let sum = 0;
51+
arr.forEach((val) => {
52+
sum += val;
53+
});
54+
return sum;
55+
});
56+
57+
suite.run();
