
Conversation

roypat (Contributor) commented Mar 4, 2025

Add tests that measure post-snapshot-restore latency, mainly around page faults. We have two flavors of tests:

  1. Measuring how long it takes to fault in a 128MB region from inside the guest after a snapshot restore, and
  2. Measuring how long it takes to populate 128MB of guest memory from the UFFD handler

While doing this, optimize our example UFFD handlers to reduce page fault latency by ~98.75%, as otherwise the bookkeeping overhead inside the UFFD handler would dominate all other interesting effects (and the tests would report around 40s, which is less than useless).
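
To make the first flavor concrete, here is a minimal sketch of the kind of in-guest measurement involved: touch one byte per page of a 128 MiB anonymous mapping and time it. This is not the actual fast_page_fault_helper, just an illustration; it assumes the libc crate and a 4 KiB page size.

```rust
use std::time::Instant;

const REGION_SIZE: usize = 128 * 1024 * 1024; // 128 MiB
const PAGE_SIZE: usize = 4096; // assumed page size

fn main() {
    // Map an anonymous, private region; no pages are faulted in yet.
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            REGION_SIZE,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_ANONYMOUS | libc::MAP_PRIVATE,
            -1,
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED);

    let start = Instant::now();
    // Touching one byte per page is enough to fault in every page once.
    for offset in (0..REGION_SIZE).step_by(PAGE_SIZE) {
        unsafe { std::ptr::write_volatile((ptr as *mut u8).add(offset), 1u8) };
    }
    println!("faulted in {} bytes in {:?}", REGION_SIZE, start.elapsed());
}
```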

Note that these won't work until CI artifacts have been rebuilt from this branch, as the fast_page_fault_helper tool needs to be recompiled into the rootfs.

I've validated these by playing around with MAP_SHARED vs MAP_PRIVATE anon mappings; the population latency test, for example, very clearly exhibited the added latency of using shared memory.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Mar 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.18%. Comparing base (cc5d6b4) to head (3e37ae6).
Report is 23 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5061   +/-   ##
=======================================
  Coverage   83.18%   83.18%           
=======================================
  Files         247      247           
  Lines       26816    26816           
=======================================
  Hits        22306    22306           
  Misses       4510     4510           
Flag             Coverage Δ
5.10-c5n.metal   83.62% <ø> (ø)
5.10-m5n.metal   83.61% <ø> (ø)
5.10-m6a.metal   82.81% <ø> (ø)
5.10-m6g.metal   79.60% <ø> (ø)
5.10-m6i.metal   83.61% <ø> (ø)
5.10-m7g.metal   79.60% <ø> (ø)
6.1-c5n.metal    83.62% <ø> (+<0.01%) ⬆️
6.1-m5n.metal    83.61% <ø> (ø)
6.1-m6a.metal    82.81% <ø> (ø)
6.1-m6g.metal    79.60% <ø> (?)
6.1-m6i.metal    83.60% <ø> (ø)
6.1-m7g.metal    79.60% <ø> (ø)

Flags with carried forward coverage won't be shown.


roypat force-pushed the snapshot-latency-test branch from 45bc481 to 9462885 on March 4, 2025 11:06
roypat force-pushed the snapshot-latency-test branch 3 times, most recently from 8b33710 to 83da66e on March 4, 2025 13:10
kalyazin previously approved these changes Mar 4, 2025
roypat force-pushed the snapshot-latency-test branch 2 times, most recently from 91a1f09 to d48693e on March 4, 2025 14:15
roypat added the "Status: Awaiting review" label Mar 4, 2025
Manciukic (Contributor) commented:

Do you have a preview of the metrics? I'm curious what the current latency is.

roypat force-pushed the snapshot-latency-test branch from 5e5ac74 to fcc6c62 on March 6, 2025 12:51
roypat requested review from pb8o and xmarcalx as code owners March 6, 2025 12:51
roypat force-pushed the snapshot-latency-test branch from 6fb8c5b to 896101f on March 6, 2025 14:25
roypat force-pushed the snapshot-latency-test branch from e3fd139 to 648347f on March 6, 2025 15:03
roypat force-pushed the snapshot-latency-test branch from c3f22b4 to 83ebb6c on March 6, 2025 17:02
kalyazin previously approved these changes Mar 6, 2025
roypat added 5 commits March 7, 2025 14:55
Instead use the property of the `UffdHandler` object that
`spawn_uffd_handler` returns.

Signed-off-by: Patrick Roy <[email protected]>
The memset causes page faults in the guest which will be handled by
UFFD. By timing the memset from inside the guest, we can measure the
latency of these page faults as observed by an application running
inside the guest.

Signed-off-by: Patrick Roy <[email protected]>
Make the example UFFD handler use a HashSet to only store address ranges
that were actually removed due to an expanding memory balloon. This
eliminates the HashMap that previously held entries for every single
page in guest memory, which was causing the example uffd handler to slow
to a crawl (taking about 40s to fault in 128MB of memory).

This is possible because the UFFD handler doesn't actually care about
the 4 states a page can supposedly be in, it only needs to know whether
a page needs to be served from the memory snapshot file, or zeroed out.
Since zeroing out is the rarer of the two (this only happens if a balloon
device caused some previously faulted in memory to be madvise'd away),
we only store information about pages that we got remove messages about,
and simply assume every other page needs to be served from the snapshot
file. This means that in the happy path of "no balloon device", we never
need to store any information about the memory at all.

This one simple optimization brings the fault time down to 500ms (from
40s) for the 128MB scenario, which makes the example handler usable for
performance tests for page fault latencies.

Signed-off-by: Patrick Roy <[email protected]>
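
The gist of that optimization can be sketched like this (the names are made up for illustration and are not the actual example handler's API): a fault is served from the snapshot file unless the faulting address falls into a range we previously received a remove event for.

```rust
use std::collections::HashSet;

/// Illustrative bookkeeping: instead of tracking a state for every guest page,
/// remember only the (start, end) ranges that a balloon inflation removed.
#[derive(Default)]
struct RemovedRanges {
    ranges: HashSet<(u64, u64)>,
}

impl RemovedRanges {
    /// Called when the handler receives a UFFD remove event for [start, end).
    fn mark_removed(&mut self, start: u64, end: u64) {
        self.ranges.insert((start, end));
    }

    /// A faulting page needs a zero page only if it was removed; every other
    /// page is served from the memory snapshot file.
    fn needs_zero_page(&self, addr: u64) -> bool {
        self.ranges
            .iter()
            .any(|&(start, end)| addr >= start && addr < end)
    }
}
```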
Just store the binary dir inside the microvm class instead of separate
paths to firecracker and jailer binaries. This centralizes the existence
assertions, but means we can't theoretically use jailer and firecracker
binaries from different locations anymore (however, nothing in the test
framework was doing this, as MicroVMFactory already did not support this
and hardcoded the binary names relative to the --binary-dir option).

Signed-off-by: Patrick Roy <[email protected]>
Instead of get_binary assuming it always operates on a cargo build
directory, have it directly operate on a "binary directory" in the sense
of the --binary-dir parameter. Nothing was using the workspace_dir arg,
so no functional changes.

Signed-off-by: Patrick Roy <[email protected]>
roypat added 18 commits March 7, 2025 14:55
To allow specifying a different directory in which to look for the uffd
example binaries, so that they can be used in A/B-tests.

Signed-off-by: Patrick Roy <[email protected]>
Also copy the example binaries into the target directory when building
some arbitrary revision (for example for A/B-tests).

Signed-off-by: Patrick Roy <[email protected]>
When building a microvm from a snapshot, there is no need to specify a
guest kernel. Remove the parameter from
SnapshotRestoreTest.sample_latency.

Signed-off-by: Patrick Roy <[email protected]>
This was the result of a find-and-replace gone awry, it seems.

Signed-off-by: Patrick Roy <[email protected]>
So that we can later refer to it in tests, for example to parse its
logs.

Signed-off-by: Patrick Roy <[email protected]>
Add a utility function that allows building multiple microvms from the
same snapshot sequentially.

Signed-off-by: Patrick Roy <[email protected]>
Replace snapshot creation loops in test_5_snapshots, test_vmgenid and
test_restore_latency with calls to build_n_from_snapshot.

Signed-off-by: Patrick Roy <[email protected]>
Move the initialization of the metrics logger dimensions into
`configure_vm`, as later on there will be multiple tests having to do
exactly the same sequence of configure + setup metrics.

Signed-off-by: Patrick Roy <[email protected]>
No need to defer the starting. Also rename it to `boot_vm` to more
accurately reflect what it does now.

Signed-off-by: Patrick Roy <[email protected]>
This function only has one possible call-site, and that's
test_restore_latency. Inlining it into its call site makes things a bit
simpler (no need for a temporary array).

Signed-off-by: Patrick Roy <[email protected]>
Add a snapshot latency test that outputs metrics on how long it takes to
memset a 128MB memory region post-snapshot restore.

We only collect these latencies for a single configuration (1024MB,
2vCPU), since they will be the same no matter the size of the memory.

Run with an on-demand uffd handler and compare to mmap-ing the snapshot
file as a baseline. Also use the fault_all handler, which will allow us
to measure latencies due to "fast" page faults, e.g. those that are just
for setting up stage-2 page table entries but where otherwise memory is
already faulted into userspace.

Signed-off-by: Patrick Roy <[email protected]>
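
For reference, the mmap baseline essentially amounts to mapping the snapshot's memory file directly, so page faults are resolved by the kernel from the page cache with no userfaultfd handler in the path. A rough sketch (the file name and error handling are made up):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn map_snapshot_memory(path: &str) -> *mut u8 {
    // Map the memory snapshot file directly; faults are then resolved by the
    // kernel from the page cache, with no userfaultfd handler involved.
    let file = File::open(path).expect("open snapshot memory file");
    let len = file.metadata().expect("stat snapshot memory file").len() as usize;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE, // copy-on-write, so guest writes never touch the file
            file.as_raw_fd(),
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED);
    ptr as *mut u8
}
```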
Run the new test_post_restore_latency test also for hugepages-backed
VMs.

Signed-off-by: Patrick Roy <[email protected]>
While the post_restore_latency test measures the time on-demand faulting
takes from the guest's perspective, this test measures the actual time
it takes to fault in guest memory. We cannot collect this data by simply
using the fault_all handler in the post-restore test, because the
handler will get triggered way before the fast_page_fault_helper script
continues running (because the first page fault will be triggered by
sshd or the kernel, so by the time our helper runs, the uffd handler
will already be done and we won't notice any latency). Therefore, have
the fault_all handler print the time it took to populate all of guest
memory to its log file, and parse this number.

Signed-off-by: Patrick Roy <[email protected]>
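
Conceptually, the handler-side measurement is just a timestamp around the population loop, printed to the handler's log so the test can parse it afterwards. A hypothetical sketch (not the actual handler code):

```rust
use std::time::Instant;

// Hypothetical shape of a "fault all" population loop: on the first fault,
// copy every page from the snapshot file into guest memory and log how long
// that took. The integration test then greps this line out of the log file.
fn populate_all(mut copy_page: impl FnMut(u64), guest_size: u64, page_size: u64) {
    let start = Instant::now();
    let mut addr = 0;
    while addr < guest_size {
        copy_page(addr); // e.g. one UFFDIO_COPY of `page_size` bytes
        addr += page_size;
    }
    println!(
        "Populated {} bytes of guest memory in {:?}",
        guest_size,
        start.elapsed()
    );
}
```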
We were calling a nonexistent function on an error path. This is why I
prefer compiled languages smh.

Signed-off-by: Patrick Roy <[email protected]>
When using metrics.put_dimensions from Python code, the resulting
EMF messages will have a list of lists in their "Dimensions" key.
Currently, ab_test.py assumes that this list of lists always has length
one, meaning it ignores additional dimensions added via put_dimensions.
Have it instead flatten the list of lists so that it always considers
all dimensions.

Signed-off-by: Patrick Roy <[email protected]>
This better describes what the handler actually does (technically,
fault_all_handler is also "valid"), and now that some tests emit the
name of the uffd handler as a metric, the names should be descriptive
from the get-go to avoid confusion and future renamings.

Signed-off-by: Patrick Roy <[email protected]>
To bring down the overhead of the testing script, only touch a single
byte on each page (as that's enough to fault in the entire page),
instead of memsetting everything.

Signed-off-by: Patrick Roy <[email protected]>
On m6a/6.1, post_populate with fault_all handler and 2M pages fails
because of the UFFD handler panicking during faulting, returning
`PartiallyCopied(1022MB)`. This is because we are depleting the huge
pages pool. Thus, allocate some more.

Admittedly, I do not understand why this only happens on m6a/6.1.

Signed-off-by: Patrick Roy <[email protected]>
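
For context, the host's 2M huge page pool is controlled by vm.nr_hugepages, so growing it boils down to a single write; the value below is an arbitrary example, not what the test actually reserves.

```rust
fn main() {
    // Reserve more 2M huge pages on the host (requires root); 1024 here is an
    // arbitrary example value.
    std::fs::write("/proc/sys/vm/nr_hugepages", "1024")
        .expect("failed to grow the huge page pool");
}
```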
roypat force-pushed the snapshot-latency-test branch from 229a1a2 to a2fb944 on March 7, 2025 14:55
roypat merged commit 333bc2d into firecracker-microvm:main on Mar 10, 2025
6 of 7 checks passed