
Conversation


@roypat roypat commented Apr 23, 2025

fio emits latency metrics regarding how much time was spent inside the
guest operating system (submission latency, slat) or how much time was
spent in the device (clat). For firecracker, the latter could be
relevant, so emit them from the block performance tests.
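
As an illustration only (not the code added in this PR): fio's per-I/O latency logs are CSV-like, so the mean completion latency per direction could be pulled out of a clat log roughly like this, assuming fio's documented column layout (time, value, direction, block size, offset) and nanosecond latency values.

import csv
from statistics import mean

def mean_clat_us(log_path: str) -> dict[str, float]:
    # Rows in a *_clat.*.log file: time_ms, latency value (ns in recent fio),
    # direction (0=read, 1=write, 2=trim), block size, offset.
    lat = {"read": [], "write": []}
    with open(log_path) as f:
        for row in csv.reader(f):
            value, direction = int(row[1]), int(row[2])
            if direction in (0, 1):
                lat["read" if direction == 0 else "write"].append(value / 1000)
    return {d: mean(v) for d, v in lat.items() if v}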

Signed-off-by: Patrick Roy [email protected]

Changes

...

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.


roypat added 3 commits April 23, 2025 15:27
This option only affects the output to stdout, but we are ignoring fio's
stdout (we only work with the log files, which are separate). So drop
this parameter.

Signed-off-by: Patrick Roy <[email protected]>
This test just boots a VM, which a ton of other tests also do, so if
memory overhead really does change, we'll catch it in other tests. On
the other hand, having this test just crash if memory overhead goes
above 5MB is not very useful, because it prevents this test from being
run as an A/B-test in scenarios where memory overhead is indeed
increasing.

Signed-off-by: Patrick Roy <[email protected]>
Use the -ww option to ensure that `ps(1)` does not truncate the command,
which might otherwise result in the grep failing (if the jailer_id gets
truncated). While we're at it, also use -o cmd so that ps only prints the
command names and nothing else (as we're not using anything else from
this output).

Funnily enough, this causes false positives instead of false negatives,
because we're using check_output: if the grep doesn't find anything, we
fail the command (in the "everything works" scenario, firecracker is dead
but grep still matches the "ps | grep" process itself).
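
A rough sketch of the resulting check (hypothetical helper, not the actual test code):

import subprocess

def matching_processes(jailer_id: str) -> list[str]:
    # -e: all processes, -ww: never truncate the command line, -o cmd: print
    # only the command. check_output raises CalledProcessError if grep exits
    # non-zero, i.e. if nothing matched. Note that the grep in a "ps | grep"
    # pipeline matches itself, which is the false positive described above.
    out = subprocess.check_output(
        f"ps -eww -o cmd | grep {jailer_id}", shell=True, encoding="utf-8"
    )
    return out.splitlines()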

Signed-off-by: Patrick Roy <[email protected]>

codecov bot commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.06%. Comparing base (52919c4) to head (f15271a).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5166      +/-   ##
==========================================
+ Coverage   83.01%   83.06%   +0.05%     
==========================================
  Files         250      250              
  Lines       26897    26897              
==========================================
+ Hits        22328    22342      +14     
+ Misses       4569     4555      -14     
Flag Coverage Δ
5.10-c5n.metal 83.56% <ø> (ø)
5.10-m5n.metal 83.56% <ø> (ø)
5.10-m6a.metal 82.79% <ø> (+<0.01%) ⬆️
5.10-m6g.metal 79.34% <ø> (ø)
5.10-m6i.metal 83.55% <ø> (ø)
5.10-m7a.metal-48xl 82.77% <ø> (?)
5.10-m7g.metal 79.34% <ø> (ø)
5.10-m7i.metal-24xl 83.52% <ø> (?)
5.10-m7i.metal-48xl 83.52% <ø> (?)
5.10-m8g.metal-24xl 79.34% <ø> (?)
5.10-m8g.metal-48xl 79.34% <ø> (?)
6.1-c5n.metal 83.61% <ø> (+<0.01%) ⬆️
6.1-m5n.metal 83.61% <ø> (+<0.01%) ⬆️
6.1-m6a.metal 82.83% <ø> (-0.01%) ⬇️
6.1-m6g.metal 79.34% <ø> (ø)
6.1-m6i.metal 83.59% <ø> (-0.01%) ⬇️
6.1-m7a.metal-48xl 82.82% <ø> (?)
6.1-m7g.metal 79.33% <ø> (-0.01%) ⬇️
6.1-m7i.metal-24xl 83.62% <ø> (?)
6.1-m7i.metal-48xl 83.62% <ø> (?)
6.1-m8g.metal-24xl 79.34% <ø> (?)
6.1-m8g.metal-48xl 79.34% <ø> (?)

Flags with carried forward coverage won't be shown.


@roypat roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Apr 23, 2025
kalyazin previously approved these changes Apr 23, 2025
@roypat roypat marked this pull request as draft April 23, 2025 15:30
@roypat roypat force-pushed the block-latency-test branch 2 times, most recently from fa0b2dd to 391331a on April 23, 2025 17:17
@roypat roypat marked this pull request as ready for review April 24, 2025 06:08
@roypat roypat force-pushed the block-latency-test branch from 391331a to 2b5b1f6 on April 24, 2025 06:14
Manciukic previously approved these changes Apr 24, 2025
roypat added 2 commits April 24, 2025 12:02
Currently, if something matches the A/B-testing ignore list, then all
metrics emitted from a test whose dimension set is a superset of an
ignored one are ignored. Refine this to allow ignoring only specific
metrics.

Realize this by synthesizing a fake dimension called 'metric' that
stores the metric name.

This will later be used when we introduce block latency tests, as we
will want to A/B-test throughput but ignore latency in scenarios where
fio's async workload generator is used.
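
A rough sketch of the idea, with hypothetical names (the real logic lives in the A/B-testing tooling):

def is_ignored(dimensions: dict, metric: str, ignore_list: list[dict]) -> bool:
    # Synthesize a fake "metric" dimension, so an ignore-list entry can name a
    # specific metric (e.g. {"io_engine": "Async", "metric": "clat_read"})
    # instead of suppressing every metric with a matching dimension set.
    dimensions = {**dimensions, "metric": metric}
    return any(entry.items() <= dimensions.items() for entry in ignore_list)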

Signed-off-by: Patrick Roy <[email protected]>
When an A/B-Test fails, it prints all dimensions associated with the
metric that changed. However, if some dimension is the same across
literally all metrics emitted (for example, instance name and host
kernel version will never change in the middle of a test run), then
that's arguably just noise, and makes it hard to parse potentially
interesting dimensions. So avoid printing all dimensions that are
literally the same across all metrics.

Note that this does _not_ mean that, in this example, the "read vs
write" dimension won't be printed anymore if cpu_utilization only
changes for reads. We only drop dimensions if they are the same across
_all_ metrics, regardless of whether they had a statistically
significant change. In this scenario, the "mode: write" metric still
exists, it simply didn't change, and so the "mode: read" line won't be
dropped from the output.

Before:

[Firecracker A/B-Test Runner] A/B-testing shows a change of -2.07μs, or
-4.70%, (from 44.04μs to 41.98μs) for metric clat_read with p=0.0002.
This means that observing a change of this magnitude or worse, assuming
that performance characteristics did not change across the tested
commits, has a probability of 0.02%. Tested Dimensions:
{
  "cpu_model": "AMD EPYC 7R13 48-Core Processor",
  "fio_block_size": "4096",
  "fio_mode": "randrw",
  "guest_kernel": "linux-6.1",
  "guest_memory": "1024.0MB",
  "host_kernel": "linux-6.8",
  "instance": "m6a.metal",
  "io_engine": "Sync",
  "performance_test": "test_block_latency",
  "rootfs": "ubuntu-24.04.squashfs",
  "vcpus": "2"
}

After:

[Firecracker A/B-Test Runner] A/B-testing shows a change of -2.07μs, or
-4.70%, (from 44.04μs to 41.98μs) for metric clat_read with p=0.0002.
This means that observing a change of this magnitude or worse, assuming
that performance characteristics did not change across the tested
commits, has a probability of 0.02%. Tested Dimensions:
{
  "guest_kernel": "linux-6.1",
  "io_engine": "Sync",
  "vcpus": "2"
}
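
A minimal sketch of the filtering step, with hypothetical names:

def varying_dimensions(dimension_sets: list[dict]) -> set[str]:
    # Keep only dimensions whose value differs between at least two emitted
    # metrics; dimensions that are identical everywhere (instance, host
    # kernel, ...) carry no information and are dropped from the report.
    first, *rest = dimension_sets
    return {
        key for key in first if any(other.get(key) != first[key] for other in rest)
    }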

Signed-off-by: Patrick Roy <[email protected]>
roypat added 3 commits April 24, 2025 13:02
Allow passing arbitrary pytest options through to the ab-testing script,
so that things like `-k` can be used for test selection.
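
A sketch of the forwarding mechanism (hypothetical code, not the actual script):

import argparse
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument("--test-path", default="integration_tests/performance")
args, pytest_args = parser.parse_known_args()
# Anything argparse does not recognize (e.g. "-k test_block_latency") is
# forwarded to pytest unchanged, so it can be used for test selection.
subprocess.run(["pytest", args.test_path, *pytest_args], check=True)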

Signed-off-by: Patrick Roy <[email protected]>
This has two reasons:
- When adding block latency tests (e.g. duplicating all existing test
  cases to also run with fio's sync workload generator), the runtime
  will exceed 1 hour, which is the buildkite pipeline timeout.
- Having the sync and async cases in the same buildkite step means that
  the A/B-testing framework will try to cross-correct between the sync
  and async engine, but comparing results from these makes no sense
  because they are completely disjoint code paths in firecracker and
  the host kernel, so there is no reason to believe that their
  regressions should be correlated.

Signed-off-by: Patrick Roy <[email protected]>
fio emits latency metrics regarding how much time was spent inside the
guest operating system (submission latency, slat) or how much time was
spent in the device (clat). For firecracker, the latter could be
relevant, so emit these from our perf tests.

To get non-volatile latency numbers, we need to use a synchronous fio
worker. However, for throughput tests, the use of the async engine in
the guest is required to get maximum throughput.
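
A sketch of the kind of fio invocation this implies, using fio's own option names (the exact values and the /dev/vdb device are assumptions, not taken from the tests):

def fio_cmd(engine: str, mode: str, block_size: int) -> str:
    # engine would be "sync" for the latency tests (stable clat numbers) and
    # an async engine such as "libaio" for the throughput tests.
    # --write_lat_log makes fio produce separate slat/clat/lat log files.
    return (
        f"fio --name=block-perf --ioengine={engine} --rw={mode} "
        f"--bs={block_size} --filename=/dev/vdb --direct=1 --time_based "
        "--runtime=30 --write_lat_log=block-perf --log_avg_msec=1000"
    )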

Signed-off-by: Patrick Roy <[email protected]>
@roypat roypat force-pushed the block-latency-test branch from e6b81de to 9bb205f on April 24, 2025 12:02
@roypat roypat enabled auto-merge (rebase) April 24, 2025 14:29
@roypat roypat merged commit ae078ee into firecracker-microvm:main Apr 25, 2025
6 of 7 checks passed
@roypat roypat deleted the block-latency-test branch April 25, 2025 10:57
