
Conversation

ShadowCurse
Contributor

Changes

Update the ab_test.py script to use the metrics.json files emitted by tests, instead of test-report.json, to obtain test metrics.
Also do a minor cleanup of the script, removing the hack used to import our test framework into it.
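
For illustration, a minimal sketch of what gathering metrics from per-test metrics.json files could look like, assuming each test leaves a metrics.json file somewhere under its results directory; the file layout and entry schema shown here are illustrative assumptions, not the test framework's exact format:

import json
from collections import defaultdict
from pathlib import Path


def load_metrics(results_dir: str) -> dict:
    """Collect metrics emitted by all tests under `results_dir`.

    Returns a mapping of (metric name, dimensions) -> list of measured values.
    """
    collected = defaultdict(list)
    # Every test is assumed to drop a metrics.json file in its own results
    # subdirectory, so a recursive glob picks up all of them.
    for path in Path(results_dir).glob("**/metrics.json"):
        for entry in json.loads(path.read_text()):
            # Each entry is assumed to carry the metric name, its dimensions
            # and the measured values.
            key = (entry["name"], frozenset(entry["dimensions"].items()))
            collected[key].extend(entry["values"])
    return dict(collected)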

Reason

Simplification.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following the Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Oct 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.73%. Comparing base (837c2e7) to head (2a5d9e5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5481      +/-   ##
==========================================
- Coverage   82.74%   82.73%   -0.01%     
==========================================
  Files         269      269              
  Lines       27798    27798              
==========================================
- Hits        23001    23000       -1     
- Misses       4797     4798       +1     
Flag Coverage Δ
5.10-m5n.metal 82.90% <ø> (ø)
5.10-m6a.metal 82.16% <ø> (+<0.01%) ⬆️
5.10-m6g.metal 79.56% <ø> (-0.01%) ⬇️
5.10-m6i.metal 82.90% <ø> (ø)
5.10-m7a.metal-48xl 82.15% <ø> (-0.01%) ⬇️
5.10-m7g.metal 79.56% <ø> (ø)
5.10-m7i.metal-24xl 82.86% <ø> (-0.02%) ⬇️
5.10-m7i.metal-48xl 82.86% <ø> (-0.01%) ⬇️
5.10-m8g.metal-24xl 79.56% <ø> (ø)
5.10-m8g.metal-48xl 79.56% <ø> (-0.01%) ⬇️
6.1-m5n.metal 82.92% <ø> (-0.01%) ⬇️
6.1-m6a.metal 82.19% <ø> (-0.01%) ⬇️
6.1-m6g.metal 79.56% <ø> (ø)
6.1-m6i.metal 82.92% <ø> (ø)
6.1-m7a.metal-48xl 82.18% <ø> (ø)
6.1-m7g.metal 79.56% <ø> (-0.01%) ⬇️
6.1-m7i.metal-24xl 82.93% <ø> (-0.01%) ⬇️
6.1-m7i.metal-48xl 82.93% <ø> (ø)
6.1-m8g.metal-24xl 79.55% <ø> (-0.01%) ⬇️
6.1-m8g.metal-48xl 79.56% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@ShadowCurse ShadowCurse marked this pull request as ready for review October 16, 2025 14:10
@ShadowCurse ShadowCurse self-assigned this Oct 16, 2025
@ShadowCurse ShadowCurse added the Status: Awaiting review and Type: Enhancement labels Oct 16, 2025
Since we store metrics for each test directly, we don't
need to parse EMF logs from test-report.json. Replace
the metric-gathering logic in ab_test.py to look at all
`metrics.json` files instead.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was only used in ab_test.py.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was only used in ab_test.py.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was only used in ab_test.py.

Signed-off-by: Egor Lazarchuk <[email protected]>
There is no reason to emit metrics when running the
A/B script.
This is also the last piece imported from the test
framework, so remove the hack used to be able to import
things from it.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was not used anymore.

Signed-off-by: Egor Lazarchuk <[email protected]>
Comment on lines -285 to -290
metrics_logger.set_dimensions({"metric": metric, **dict(dimension_set)})
metrics_logger.put_metric("p_value", float(result.pvalue), "None")
metrics_logger.put_metric("mean_difference", float(result.statistic), unit)
metrics_logger.set_property("data_a", values_a)
metrics_logger.set_property("data_b", metrics_b[metric][0])
metrics_logger.flush()
Contributor

I don't think I ever looked at these metrics, but what will be the way to check them now? Should we print a report of the A/B run to stdout, or to a file that we can explore in Buildkite?

Contributor Author

I can dump the output like this: https://github.com/firecracker-microvm/firecracker/pull/4923/files#diff-edc53a8d8d2432bf93a2590fdb5aac94c515586dbd8f916cb9fda4fc78166e17R295 and it will be included in the uploaded archive. But I don't see a big reason to do this, because we can just rerun the A/B script on the downloaded data locally to debug it.

Contributor

Sure, you can always download everything manually and run it locally, but that takes a non-negligible amount of time. Maybe nobody will ever look at it, as with these metrics, but dumping everything to a report (maybe a JSON one) seems reasonable to me as a first step when investigating A/B results.

Contributor

@Manciukic Manciukic left a comment

LGTM. While we're changing this, maybe we can add a small report of the A/B run. I found out here that we have some metrics I never looked into, but something like that could be useful for debugging the A/B runs. For example, right now we only get a report if the test fails; we could print everything else to a file so that we can check it when debugging a regression (maybe a change wasn't significant enough to fail the test, but the A/B comparison found it nonetheless).
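
For illustration, a minimal sketch of such a report dump, assuming a flat list of per-metric results; the function name, report path and field names are hypothetical, not part of ab_test.py:

import json


def dump_ab_report(results: list[dict], path: str = "ab_report.json") -> None:
    """Write every A/B comparison (passing or failing) to a JSON report.

    Each entry in `results` is assumed to be a dict holding the metric name,
    its dimensions, the p-value and the mean difference of the comparison.
    """
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(results, fp, indent=2, sort_keys=True)

The report file could then be uploaded alongside the other test artifacts so that it is browsable from Buildkite without rerunning the script on downloaded data.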
