
Conversation

ShadowCurse
Contributor

Changes

Update the ab_test.py script to use the metrics.json files emitted by tests, instead of test-report.json, to obtain test metrics.
Also do a minor cleanup of the script, removing the hack used to import our test framework into it.
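
For illustration, a minimal sketch of what gathering metrics from per-test metrics.json files could look like, assuming each test leaves a metrics.json file somewhere under its results directory; the file layout and entry schema shown here are illustrative assumptions, not the test framework's exact format:

import json
from collections import defaultdict
from pathlib import Path


def load_metrics(results_dir: str) -> dict:
    """Collect metrics emitted by all tests under `results_dir`.

    Returns a mapping of (metric name, dimensions) -> list of measured values.
    """
    collected = defaultdict(list)
    # Every test is assumed to drop a metrics.json file in its own results
    # subdirectory, so a recursive glob picks up all of them.
    for path in Path(results_dir).glob("**/metrics.json"):
        for entry in json.loads(path.read_text()):
            # Each entry is assumed to carry the metric name, its dimensions
            # and the measured values.
            key = (entry["name"], frozenset(entry["dimensions"].items()))
            collected[key].extend(entry["values"])
    return dict(collected)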

Reason

Simplification.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following the Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Oct 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.73%. Comparing base (837c2e7) to head (2a5d9e5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5481      +/-   ##
==========================================
- Coverage   82.74%   82.73%   -0.01%     
==========================================
  Files         269      269              
  Lines       27798    27798              
==========================================
- Hits        23001    23000       -1     
- Misses       4797     4798       +1     
Flag Coverage Δ
5.10-m5n.metal 82.90% <ø> (ø)
5.10-m6a.metal 82.16% <ø> (+<0.01%) ⬆️
5.10-m6g.metal 79.56% <ø> (-0.01%) ⬇️
5.10-m6i.metal 82.90% <ø> (ø)
5.10-m7a.metal-48xl 82.15% <ø> (-0.01%) ⬇️
5.10-m7g.metal 79.56% <ø> (ø)
5.10-m7i.metal-24xl 82.86% <ø> (-0.02%) ⬇️
5.10-m7i.metal-48xl 82.86% <ø> (-0.01%) ⬇️
5.10-m8g.metal-24xl 79.56% <ø> (ø)
5.10-m8g.metal-48xl 79.56% <ø> (-0.01%) ⬇️
6.1-m5n.metal 82.92% <ø> (-0.01%) ⬇️
6.1-m6a.metal 82.19% <ø> (-0.01%) ⬇️
6.1-m6g.metal 79.56% <ø> (ø)
6.1-m6i.metal 82.92% <ø> (ø)
6.1-m7a.metal-48xl 82.18% <ø> (ø)
6.1-m7g.metal 79.56% <ø> (-0.01%) ⬇️
6.1-m7i.metal-24xl 82.93% <ø> (-0.01%) ⬇️
6.1-m7i.metal-48xl 82.93% <ø> (ø)
6.1-m8g.metal-24xl 79.55% <ø> (-0.01%) ⬇️
6.1-m8g.metal-48xl 79.56% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@ShadowCurse ShadowCurse marked this pull request as ready for review October 16, 2025 14:10
@ShadowCurse ShadowCurse self-assigned this Oct 16, 2025
@ShadowCurse ShadowCurse added the Status: Awaiting review and Type: Enhancement labels Oct 16, 2025
Since we store metrics for each test directly, we don't
need to parse EMF logs from test-report.json. Replace
the metric-gathering logic in ab_test.py to look at all
`metrics.json` files instead.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was only used in ab_test.py.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was only used in ab_test.py.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was only used in ab_test.py.

Signed-off-by: Egor Lazarchuk <[email protected]>
There is no reason to emit metrics when running the
A/B script.
This is also the last piece imported from the test
framework, so remove the hack used to be able to import
things from it.

Signed-off-by: Egor Lazarchuk <[email protected]>
This function was not used anymore.

Signed-off-by: Egor Lazarchuk <[email protected]>
Comment on lines -285 to -290
metrics_logger.set_dimensions({"metric": metric, **dict(dimension_set)})
metrics_logger.put_metric("p_value", float(result.pvalue), "None")
metrics_logger.put_metric("mean_difference", float(result.statistic), unit)
metrics_logger.set_property("data_a", values_a)
metrics_logger.set_property("data_b", metrics_b[metric][0])
metrics_logger.flush()
Contributor

I don't think I ever looked at these metrics, but what will be the way to check them now? Should we print a report of the A/B run to stdout, or to a file that we can explore in Buildkite?

Contributor Author

I can dump the output like this: https://github.com/firecracker-microvm/firecracker/pull/4923/files#diff-edc53a8d8d2432bf93a2590fdb5aac94c515586dbd8f916cb9fda4fc78166e17R295 and it will be included in the uploaded archive. But I don't see a big reason to do this, because we can just rerun the A/B script on the downloaded data locally to debug it.

Contributor

Sure, you can always download everything manually and run it locally, but that takes a non-negligible amount of time. Maybe nobody will ever look at it, as with these metrics, but dumping everything to a report (maybe a JSON one) seems reasonable to me as a first step when investigating A/B results.

Contributor

@Manciukic Manciukic left a comment

LGTM. While we're changing this, maybe we can add a small report of the A/B run. I found out here that we have some metrics I never looked into, but something like that could be useful for debugging the A/B runs. For example, right now we only get a report if the test fails; we could print everything else to a file so that we can check it when debugging a regression (maybe a change wasn't significant enough to fail the test, but the A/B comparison found it nonetheless).
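
For illustration, a minimal sketch of such a report dump, assuming a flat list of per-metric results; the function name, report path and field names are hypothetical, not part of ab_test.py:

import json


def dump_ab_report(results: list[dict], path: str = "ab_report.json") -> None:
    """Write every A/B comparison (passing or failing) to a JSON report.

    Each entry in `results` is assumed to be a dict holding the metric name,
    its dimensions, the p-value and the mean difference of the comparison.
    """
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(results, fp, indent=2, sort_keys=True)

The report file could then be uploaded alongside the other test artifacts so that it is browsable from Buildkite without rerunning the script on downloaded data.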
