feat: add t-test mode for statistical significance testing (#133)
- Add ttest option to Suite that automatically sets repeatSuite=30
- Implement Welch's t-test for comparing benchmark results
- Display significance stars (*, **, ***) based on p-values
- Add T-Test Mode indicator in reporter output
- Update TypeScript definitions with ttest and ReporterOptions
- Add comprehensive tests for t-test utilities
- Add statistical-significance example demonstrating the feature
- Update documentation with usage and interpretation guide
The t-test compares 30 independent runs of each benchmark to determine
if performance differences are statistically significant, helping identify
real improvements vs. random variance.
Signed-off-by: RafaelGSS <rafael.nunu@hotmail.com>
* `printHeader` {boolean} Whether to print the system information header. **Default:** `true`.
* `labelWidth` {number} Width for benchmark labels in output. **Default:** `45`.
* `alpha` {number} Significance level for the t-test (e.g., `0.05` for 95% confidence). **Default:** `0.05`.
* `benchmarkMode` {string} Benchmark mode to use. Can be `'ops'` or `'time'`. **Default:** `'ops'`.
  * `'ops'` - Measures operations per second (traditional benchmarking).
  * `'time'` - Measures actual execution time for a single run.
* `useWorkers` {boolean} Whether to run benchmarks in worker threads. **Default:** `false`.
* `plugins` {Array} Array of plugin instances to use.
* `repeatSuite` {number} Number of times to repeat each benchmark. Automatically set to `30` when `ttest: true`. **Default:** `1`.
* `minSamples` {number} Minimum number of samples per round for all benchmarks in the suite. Can be overridden per benchmark. **Default:** `10` samples.
If no `reporter` is provided, results are printed to the console.
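For illustration, the options above can be combined into a single configuration object to pass to `new Suite(...)`; the values below are hypothetical choices, not recommendations:

```js
// Hypothetical Suite configuration combining the documented options.
// Pass this object to `new Suite(options)`.
const options = {
  printHeader: true,    // print the system information header
  labelWidth: 45,       // width reserved for benchmark labels
  alpha: 0.05,          // significance level used by the t-test
  benchmarkMode: 'ops', // 'ops' (ops/sec) or 'time' (single-run time)
  useWorkers: false,    // run each benchmark in a worker thread
  repeatSuite: 1,       // forced to 30 when `ttest: true`
  minSamples: 10,       // minimum samples per round
};
```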
* `maxTime` {number} Maximum duration for the benchmark to run. **Default:** `0.5` seconds.
* `repeatSuite` {number} Number of times to repeat the benchmark run. **Default:** `1` time.
* `minSamples` {number} Minimum number of samples for each round. **Default:** `10` samples.
* `baseline` {boolean} Mark this benchmark as the baseline for comparison. Only one benchmark per suite can be the baseline. **Default:** `false`.
* `fn` {Function|AsyncFunction} The benchmark function. Can be synchronous or asynchronous.
* Returns: {Suite}
See [examples/time-mode.js](./examples/time-mode.js) for a complete example.

## Baseline Comparisons

You can mark one benchmark as a baseline to compare all other benchmarks against it:

```js
const { Suite } = require('bench-node');

const suite = new Suite();

suite
  .add('baseline', { baseline: true }, () => {
    // baseline implementation
    const arr = [1, 2, 3];
    arr.includes(2);
  })
  .add('alternative', () => {
    // alternative implementation
    const arr = [1, 2, 3];
    arr.indexOf(2) !== -1;
  });

suite.run();
```

Example output with baseline:

```
baseline x 52,832,865 ops/sec (10 runs sampled) min..max=(18.50ns...19.22ns)
alternative x 53,550,219 ops/sec (11 runs sampled) min..max=(18.26ns...18.89ns)

Summary (vs. baseline):
baseline (baseline)
alternative (1.01x faster)
```

## Statistical Significance Testing (T-Test)

> Stability: 1.0 (Experimental)

When comparing benchmarks, especially on machines with high variance (cloud VMs, shared environments), raw ops/sec differences may not be meaningful. `bench-node` provides **Welch's t-test** to determine whether performance differences are statistically significant.

Welch's t-test is preferred over Student's t-test because it doesn't assume equal variances between the two samples, which is common in benchmark scenarios.
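As a rough illustration of the statistic involved (a standalone sketch, not bench-node's internal implementation), Welch's test computes a t-value and the Welch–Satterthwaite degrees of freedom from the two sample arrays:

```js
// Illustrative sketch of Welch's t-test statistic (not bench-node's API).
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function variance(xs) {
  const m = mean(xs);
  // Sample variance (n - 1 denominator)
  return xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
}

function welch(a, b) {
  const va = variance(a) / a.length;
  const vb = variance(b) / b.length;
  // t-statistic: difference of means over combined standard error
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch–Satterthwaite approximation of the degrees of freedom
  const df =
    (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```

The resulting `t` and `df` are then looked up against the t-distribution to obtain a p-value; in practice you would use bench-node's built-in mode or a statistics library for that step.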

### Enabling T-Test Mode

Enable t-test mode with `ttest: true`. This automatically sets `repeatSuite=30` to collect enough independent samples for reliable statistical analysis (per the Central Limit Theorem):

```js
const { Suite } = require('bench-node');

const suite = new Suite({
  ttest: true, // Enables t-test and auto-sets repeatSuite=30
});

suite
  .add('baseline', { baseline: true }, () => {
    let sum = 0;
    for (let i = 0; i < 100; i++) sum += i;
  })
  .add('optimized', () => {
    let sum = (99 * 100) / 2; // Gauss formula
  });

suite.run();
```

Example output:

```
T-Test Mode: Enabled (repeatSuite=30)

baseline x 1,234,567 ops/sec (300 runs sampled) min..max=(810.05ns...812.45ns)
optimized x 9,876,543 ops/sec (305 runs sampled) min..max=(101.23ns...102.87ns)

Summary (vs. baseline):
baseline (baseline)
optimized (8.00x faster) ***

Significance: * p<0.05, ** p<0.01, *** p<0.001
```

The asterisks indicate the significance level:

- `***` = p < 0.001 (0.1% risk of a false positive)
- `**` = p < 0.01 (1% risk of a false positive)
- `*` = p < 0.05 (5% risk of a false positive)
- (no stars) = not statistically significant
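The star assignment follows directly from these thresholds; a minimal sketch (a hypothetical helper, not part of bench-node's public API):

```js
// Hypothetical helper mapping a p-value to significance stars
// (thresholds match the list above; not part of bench-node's API).
function significanceStars(p) {
  if (p < 0.001) return '***';
  if (p < 0.01) return '**';
  if (p < 0.05) return '*';
  return ''; // not statistically significant
}

console.log(significanceStars(0.0004)); // '***'
console.log(significanceStars(0.03));   // '*'
```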

This helps identify when a benchmark shows a difference due to random variance vs. a real performance improvement.

**How it works**: With `ttest: true`, each benchmark runs 30 times independently (via `repeatSuite=30`). The t-test compares the 30 ops/sec values from the baseline against the 30 ops/sec values from each test benchmark. This accounts for run-to-run variance within that benchmark session.

**Note**: Running the entire benchmark suite multiple times may still show variance in absolute numbers due to system-level factors (CPU frequency scaling, thermal throttling, background processes). The t-test determines whether differences are statistically significant within each benchmark session, but results can vary between separate benchmark runs as system conditions change.

### Direct API Usage

You can also use the t-test utilities directly for custom analysis:
0 commit comments