
Commit d304a01

test(ab): Add noise threshold
The idea here is the following: Intuitively, different parameterizations of the same metric (e.g. the metric "snapshot restore latency" across different microvm configurations) are not completely independent. Therefore, we would generally expect a real impact to be reflected across multiple tested configurations (and in particular, if different configurations show mean changes in different directions, e.g. 1vcpu gives +5ms latency but 2vcpu gives -5ms latency, this is a further hint that the result is noise). To filter out such noise, we consider the average of the reported mean regressions, and only report failures for a metric whose average regression is at least 5%.

Mathematically, this has the following justification: By the central limit theorem, the means of samples are (approximately) normally distributed. Denote by A and B the distributions of the mean of samples from the 'A' and 'B' tests respectively. Under our null hypothesis, the distributions of the 'A' and 'B' samples are identical (although we don't know what the exact distributions are), meaning so are A and B, say A ~ B ~ N(mu, sigma^2). The difference of two normal distributions is also normally distributed, with the means being subtracted and the variances being added. Therefore, A - B ~ N(0, 2sigma^2).

However, each parameterization (e.g. 1vcpu, 2vcpu, and so on) will potentially have a different variance sigma^2. Here, we make the following assumption: For all parameterizations of the same metric, we have sigma^2/mu^2 = const. I have no mathematical justification for this (as it is a property of our tests), but empirically it works out. Dividing A - B by mu scales its variance to 2sigma^2/mu^2, so (A-B)/mu ~ N(0, c), with c being a constant that is identical across all parameterizations of a metric. This means that we can combine the relative means across different parameterizations and get a distribution whose expected value is 0, provided our null hypothesis was true.

This is exactly what the code added in this commit verifies: the empirical average of our samples from this distribution should deviate only negligibly from 0. Only if it deviates significantly (here, by more than 0.05) do we actually allow failures in that metric.

For all tests but the snapshot restore test on x86_64/{5.10,6.1}, this allows us to completely drop the per-test noise threshold to 0.

Signed-off-by: Patrick Roy <[email protected]>
1 parent 7bb409d commit d304a01
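To illustrate the argument above, here is a small self-contained Python sketch (not part of the commit; the helper name, sample counts, and numbers are all made up). It simulates several parameterizations of one metric with sigma^2/mu^2 held constant: under the null hypothesis the mu-normalized differences of sample means should average out to roughly 0, while a consistent regression should push the average past a threshold like 0.05.

import random
import statistics

random.seed(42)


def relative_mean_change(mu, sigma, n_samples, regression=0.0):
    """Return (mean(B) - mean(A)) / mean(A) for one simulated parameterization."""
    a = [random.gauss(mu, sigma) for _ in range(n_samples)]
    b = [random.gauss(mu * (1 + regression), sigma) for _ in range(n_samples)]
    return (statistics.mean(b) - statistics.mean(a)) / statistics.mean(a)


# Different parameterizations of the same metric: different mu, but sigma^2/mu^2 constant.
parameterizations = [(20.0, 2.0), (35.0, 3.5), (60.0, 6.0), (110.0, 11.0)]

noise_only = [relative_mean_change(mu, sigma, 50) for mu, sigma in parameterizations]
regressed = [relative_mean_change(mu, sigma, 50, regression=0.08) for mu, sigma in parameterizations]

# The first average should hover near 0 (well below 0.05), the second near +0.08.
print(f"average relative change under the null hypothesis: {statistics.mean(noise_only):+.3f}")
print(f"average relative change with a consistent 8% regression: {statistics.mean(regressed):+.3f}")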


tools/ab_test.py

Lines changed: 56 additions & 6 deletions
@@ -215,7 +215,9 @@ def analyze_data(processed_emf_a, processed_emf_b, *, n_resamples: int = 9999):
     return results
 
 
-def ab_performance_test(a_revision, b_revision, test, p_thresh, strength_thresh):
+def ab_performance_test(
+    a_revision, b_revision, test, p_thresh, strength_thresh, noise_threshold
+):
     """Does an A/B-test of the specified test across the given revisions"""
     _, commit_list, _ = utils.run_cmd(
         f"git --no-pager log --oneline {a_revision}..{b_revision}"
@@ -232,16 +234,58 @@ def ab_performance_test(a_revision, b_revision, test, p_thresh, strength_thresh)
         b_revision=b_revision,
     )
 
+    # We sort our A/B-Testing results keyed by metric here. The resulting lists of values
+    # will be approximately normal distributed, and we will use this property as a means of error correction.
+    # The idea behind this is that testing the same metric (say, restore_latency) across different scenarios (e.g.
+    # different vcpu counts) will be related in some unknown way (meaning most scenarios will show a change in the same
+    # direction). In particular, if one scenario yields a slight improvement and the next yields a
+    # slight degradation, we take this as evidence towards both being mere noise that cancels out.
+    #
+    # Empirical evidence for this assumption is that
+    # 1. Historically, a true performance change has never shown up in just a single test, it always showed up
+    #    across most (if not all) tests for a specific metric.
+    # 2. Analyzing data collected from historical runs shows that across different parameterizations of the same
+    #    metric, the collected samples approximately follow mean / variance = const, with the constant independent
+    #    of the parameterization.
+    #
+    # Mathematically, this has the following justification: By the central
+    # limit theorem, the means of samples are (approximately) normal distributed. Denote by A
+    # and B the distributions of the mean of samples from the 'A' and 'B'
+    # tests respectively. Under our null hypothesis, the distributions of the
+    # 'A' and 'B' samples are identical (although we dont know what the exact
+    # distributions are), meaning so are A and B, say A ~ B ~ N(mu, sigma^2).
+    # The difference of two normal distributions is also normal distributed,
+    # with the means being subtracted and the variances being added.
+    # Therefore, A - B ~ N(0, 2sigma^2). If we now normalize this distribution by mu (which
+    # corresponds to considering the distribution of relative regressions instead), we get (A-B)/mu ~ N(0, c), with c
+    # being the constant from point 2. above. This means that we can combine the relative means across
+    # different parameterizations, and get a distributions whose expected
+    # value is 0, provided our null hypothesis was true. It is exactly this distribution
+    # for which we collect samples in the dictionary below. Therefore, a sanity check
+    # on the average of the average of the performance changes for a single metric
+    # is a good candidates for a sanity check against false-positives.
+    #
+    # Note that with this approach, for performance changes to "cancel out", we would need essentially a perfect split
+    # between scenarios that improve performance and scenarios that degrade performance, something we have not
+    # ever observed to actually happen.
+    relative_changes_by_metric = {}
+
     failures = []
     for (dimension_set, metric), (result, unit) in results.items():
         if is_ignored(dict(dimension_set)):
             continue
 
         values_a = processed_emf_a[dimension_set][metric][0]
+        baseline_mean = statistics.mean(values_a)
+
+        if metric not in relative_changes_by_metric:
+            relative_changes_by_metric[metric] = []
+        relative_changes_by_metric[metric].append(result.statistic / baseline_mean)
 
-        if result.pvalue < p_thresh and abs(result.statistic) > abs(
-            statistics.mean(values_a)
-        ) * (strength_thresh):
+        if (
+            result.pvalue < p_thresh
+            and abs(result.statistic) > baseline_mean * strength_thresh
+        ):
             failures.append((dimension_set, metric, result, unit))
 
     failure_report = "\n".join(
@@ -254,8 +298,10 @@ def ab_performance_test(a_revision, b_revision, test, p_thresh, strength_thresh)
         f"characteristics did not change across the tested commits, has a probability of {result.pvalue:.2%}. "
         f"Tested Dimensions:\n{json.dumps(dict(dimension_set), indent=2)}"
         for (dimension_set, metric, result, unit) in failures
+        # Sanity check as described above
+        if abs(statistics.mean(relative_changes_by_metric[metric])) > noise_threshold
     )
-    assert not failures, "\n" + failure_report
+    assert not failure_report, "\n" + failure_report
     print("No regressions detected!")
 
 
@@ -280,13 +326,16 @@ def canonicalize_revision(revision):
     parser.add_argument(
         "--significance",
         help="The p-value threshold that needs to be crossed for a test result to be considered significant",
+        type=float,
         default=0.01,
     )
     parser.add_argument(
         "--relative-strength",
         help="The minimal delta required before a regression will be considered valid",
-        default=0.2,
+        type=float,
+        default=0.0,
     )
+    parser.add_argument("--noise-threshold", type=float, default=0.05)
     args = parser.parse_args()
 
     ab_performance_test(
@@ -296,4 +345,5 @@ def canonicalize_revision(revision):
         args.test,
         args.significance,
         args.relative_strength,
+        args.noise_threshold,
     )
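For clarity, here is a stripped-down sketch of the metric-level filter the diff introduces, fed with invented relative changes instead of real A/B results (the metric names and numbers are purely illustrative, not taken from the commit).

import statistics

NOISE_THRESHOLD = 0.05  # mirrors the --noise-threshold default added above

# Per-parameterization relative mean changes, i.e. result.statistic / baseline_mean.
relative_changes_by_metric = {
    # Wobbles in both directions average out to ~0, so individually
    # "significant" results for this metric are suppressed as noise.
    "snapshot_restore_latency": [+0.04, -0.03, +0.02, -0.05],
    # A consistent degradation clears the threshold, so its failures are reported.
    "boot_time": [+0.09, +0.07, +0.11, +0.08],
}

for metric, changes in relative_changes_by_metric.items():
    average = statistics.mean(changes)
    verdict = "report failures" if abs(average) > NOISE_THRESHOLD else "treat as noise"
    print(f"{metric}: average relative change {average:+.3f} -> {verdict}")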
