# Fixing Inconclusive Tests

t-tests look at the distributions of both sets of results and try to determine whether they overlap in a way that
makes the difference in averages significant, or whether it is just noise. A run can produce a bimodal distribution,
for instance, caused by problems with the machine the tests are running on or by the NodeJS runtime doing work in the
background. Here are a few common causes.

## Random Input Data

Variability in the inputs between runs can lead to big changes in the runtime of an algorithm. Particularly with code
that sorts, filters, or conditionally operates on input data, feeding it certain combinations of data will produce
wildly different run times from one loop to the next, or occasionally from one sample to the next. The Law of Large
Numbers (that over enough trials the results revert to the mean) does not invalidate the Gambler's Ruin (that they may
revert to the mean only after I am bankrupt).

It is better to do your fuzzing in fuzz tests and to pick representative data for your benchmarks, informed partly by
the results of those fuzz tests and partly by bug reports.
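
For example, here is a sketch of a benchmark that feeds a fixed, representative dataset to the measured function. The
`Suite` API is from `bench-node`; the dataset and names are made up for illustration:

```js
const { Suite } = require('bench-node');

// Hypothetical fixture: captured from a real bug report or fuzz-test
// failure, frozen so every run and every sample sees identical input.
const representativeInput = [5, 3, 99, 3, 42, 7, 0, -1, 13, 8];

const suite = new Suite();

suite.add('sort representative data', () => {
  // Copy first so the benchmark never measures sorting already-sorted data.
  const copy = representativeInput.slice();
  copy.sort((a, b) => a - b);
});

suite.run();
```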

## Underprovisioned VMs, Oversubscribed Hardware

For a clean micro-benchmark, we generally want to be the only user of the machine at the time. There are a number of
known issues with running benchmarks on machines that are thermally throttling, or on cheap VMs that allocate CPU time
to running processes on a best-effort basis. In particular, Docker containers with CPU limits (such as `cpu-shares` or
a CFS quota) are especially poor targets for running benchmarks, because the quota can run out for a timeslice in the
middle of one test or between benchmarks in a single Suite. This creates an unfair advantage for the first test,
and/or lots of noise in the results. We are currently investigating ways to detect this sort of noise, and analyzing
whether the t-tests are sufficient to do so.
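
One way to check for this on Linux is to read the cgroup CPU statistics before and after a run. A rough sketch,
assuming cgroup v2 with `cpu.stat` at the usual mount point (the path differs under cgroup v1, and this does not
apply outside Linux):

```js
const { readFileSync } = require('node:fs');

// Count how many times the kernel has throttled this cgroup's CPU quota.
function readThrottleCount() {
  try {
    const stat = readFileSync('/sys/fs/cgroup/cpu.stat', 'utf8');
    const line = stat.split('\n').find((l) => l.startsWith('nr_throttled'));
    return line ? Number(line.split(' ')[1]) : 0;
  } catch {
    return 0; // No cgroup stats available; nothing to report.
  }
}

const before = readThrottleCount();
// ... run your benchmark Suite here ...
const after = readThrottleCount();
if (after > before) {
  console.warn(`CPU quota expired ${after - before} times during the run; treat results as noisy.`);
}
```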

## Epicycles in GC or JIT Compilation

If the warmup time is insufficient for V8 to optimize the code, optimization may kick in during the middle of a
sample, which will introduce a bimodal distribution of results (before, and after). There is currently no way to
adjust the warmup time of `bench-node`, but one should be added as a feature.
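
Until such an option exists, a crude workaround is to call the target function many times yourself before handing it
to the Suite, so that V8 has a chance to tier it up first. A sketch, with an arbitrary iteration count and a
hypothetical function under test:

```js
// Crude manual warmup, run before constructing the Suite. The iteration
// count is a guess: enough for V8's optimizing tiers to kick in on most
// small functions, but not a guarantee.
function warmup(fn, iterations = 10_000) {
  for (let i = 0; i < iterations; i++) fn();
}

// Hypothetical function under test.
function buildResponse() {
  return JSON.stringify({ status: 'ok', items: [1, 2, 3] });
}

warmup(buildResponse);
// ...now add buildResponse to the Suite and run it as usual.
```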

One of the nastiest performance issues to detect in garbage-collected code is the allocation epicycle. This happens
when early parts of a calculation create lots of temporary data, but not enough to cross the incremental or full GC
threshold, so the next function in the call sequence routinely exceeds the threshold and pays for the collection. This
is especially common in code that generates a JSON or HTML response from a series of calculations: the response is the
single biggest allocation in the sequence, so it gets blamed in the performance report for the lion's share of the CPU
time.
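
The pattern, reduced to a sketch (the function names are invented for illustration):

```js
// computeStage creates many short-lived temporaries but stays under the GC
// threshold; renderStage then makes the single biggest allocation, tips the
// heap over the threshold, and the collection pause lands in its timing.
function computeStage(input) {
  return input.map((n) => ({ value: n, squared: n * n })); // lots of small garbage
}

function renderStage(rows) {
  return JSON.stringify(rows); // biggest single allocation; often blamed for the GC pause
}

const input = Array.from({ length: 10_000 }, (_, i) => i);
renderStage(computeStage(input));
```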

Changing `minTime` up or down will alter the number of iterations per sample, which may smooth out the results. You
can also try increasing `minSamples` to get more samples. But also take this as a hint that your code may have a
performance bug that is worth prioritizing.
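
If your version of `bench-node` accepts these as per-benchmark options in the `add(name, options, fn)` shape (worth
checking against its README), adjusting both knobs might look like this sketch; the numbers are arbitrary:

```js
const { Suite } = require('bench-node');

const suite = new Suite();

// A longer minTime means more iterations per sample; a higher minSamples
// means more samples overall. Both can smooth a noisy distribution at the
// cost of a longer run.
suite.add('suspect benchmark', { minTime: 0.5, minSamples: 20 }, () => {
  JSON.stringify({ hello: 'world' });
});

suite.run();
```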

In production code, particularly where p99 values are used as a fitness test, it is sometimes better to choose the
algorithm with the more consistent runtime over the one with the supposedly better average runtime. This can also be
true where DDoS scenarios are possible: the attacker will always choose the worst, most asymmetric request to send to
your machine, and mean response time will not matter one whit. If `bench-node` is complaining, the problem may not be
`bench-node`.