
Conversation

roypat (Contributor) commented Mar 4, 2025

Add tests that measure post-snapshot-restore latency, mainly around page faults. We have two flavors of tests:

  1. Measuring how long it takes to fault in a 128MB region from inside the guest after a snapshot restore, and
  2. Measuring how long it takes to populate 128MB of guest memory from the UFFD handler

While doing this, optimize our example UFFD handlers to reduce page fault latency by ~98.75%, as otherwise the bookkeeping overhead inside the UFFD handler would dominate all other interesting effects (and the tests would report around 40s, which is less than useless).
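
To make the first flavor concrete, here is a minimal sketch of the kind of in-guest measurement involved: touch one byte per page of a 128 MiB anonymous mapping and time it. This is not the actual fast_page_fault_helper, just an illustration; it assumes the libc crate and a 4 KiB page size.

```rust
use std::time::Instant;

const REGION_SIZE: usize = 128 * 1024 * 1024; // 128 MiB
const PAGE_SIZE: usize = 4096; // assumed page size

fn main() {
    // Map an anonymous, private region; no pages are faulted in yet.
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            REGION_SIZE,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_ANONYMOUS | libc::MAP_PRIVATE,
            -1,
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED);

    let start = Instant::now();
    // Touching one byte per page is enough to fault in every page once.
    for offset in (0..REGION_SIZE).step_by(PAGE_SIZE) {
        unsafe { std::ptr::write_volatile((ptr as *mut u8).add(offset), 1u8) };
    }
    println!("faulted in {} bytes in {:?}", REGION_SIZE, start.elapsed());
}
```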

Note that these won't work until CI artifacts have been rebuilt from this branch, as the fast_page_fault_helper tool needs to be recompiled into the rootfs.

I've validated these by playing around with MAP_SHARED vs MAP_PRIVATE anon mappings; the population latency test, for example, very clearly exhibited the added latency of using shared memory.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Mar 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.18%. Comparing base (cc5d6b4) to head (3e37ae6).
Report is 23 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5061   +/-   ##
=======================================
  Coverage   83.18%   83.18%           
=======================================
  Files         247      247           
  Lines       26816    26816           
=======================================
  Hits        22306    22306           
  Misses       4510     4510           
Flag             Coverage Δ
5.10-c5n.metal   83.62% <ø> (ø)
5.10-m5n.metal   83.61% <ø> (ø)
5.10-m6a.metal   82.81% <ø> (ø)
5.10-m6g.metal   79.60% <ø> (ø)
5.10-m6i.metal   83.61% <ø> (ø)
5.10-m7g.metal   79.60% <ø> (ø)
6.1-c5n.metal    83.62% <ø> (+<0.01%) ⬆️
6.1-m5n.metal    83.61% <ø> (ø)
6.1-m6a.metal    82.81% <ø> (ø)
6.1-m6g.metal    79.60% <ø> (?)
6.1-m6i.metal    83.60% <ø> (ø)
6.1-m7g.metal    79.60% <ø> (ø)

Flags with carried forward coverage won't be shown.


roypat force-pushed the snapshot-latency-test branch from 45bc481 to 9462885 on March 4, 2025 11:06
roypat force-pushed the snapshot-latency-test branch 3 times, most recently from 8b33710 to 83da66e on March 4, 2025 13:10
kalyazin previously approved these changes Mar 4, 2025
roypat force-pushed the snapshot-latency-test branch 2 times, most recently from 91a1f09 to d48693e on March 4, 2025 14:15
roypat added the "Status: Awaiting review" label Mar 4, 2025
Manciukic (Contributor) commented:

Do you have a preview of the metrics? I'm curious what the current latency is.

roypat force-pushed the snapshot-latency-test branch from 5e5ac74 to fcc6c62 on March 6, 2025 12:51
roypat requested review from pb8o and xmarcalx as code owners March 6, 2025 12:51
roypat force-pushed the snapshot-latency-test branch from 6fb8c5b to 896101f on March 6, 2025 14:25
roypat force-pushed the snapshot-latency-test branch from e3fd139 to 648347f on March 6, 2025 15:03
roypat force-pushed the snapshot-latency-test branch from c3f22b4 to 83ebb6c on March 6, 2025 17:02
kalyazin previously approved these changes Mar 6, 2025
roypat added 5 commits March 7, 2025 14:55
Instead use the property of the `UffdHandler` object that
`spawn_uffd_handler` returns.

Signed-off-by: Patrick Roy <[email protected]>
The memset causes page faults in the guest which will be handled by
UFFD. By timing the memset from inside the guest, we can measure the
latency of these page faults as observed by an application running
inside the guest.

Signed-off-by: Patrick Roy <[email protected]>
Make the example UFFD handler use a HashSet to only store address ranges
that were actually removed due to an expanding memory balloon. This
eliminates the HashMap that previously held entries for every single
page in guest memory, which was causing the example uffd handler to slow
to a crawl (taking about 40s to fault in 128MB of memory).

This is possible because the UFFD handler doesn't actually care about
the 4 states a page can supposedly be in, it only needs to know whether
a page needs to be served from the memory snapshot file, or zeroed out.
Since zeroing out is the rarer of the two (this only happens if a balloon
device caused some previously faulted in memory to be madvise'd away),
we only store information about pages that we got remove messages about,
and simply assume every other page needs to be served from the snapshot
file. This means that in the happy path of "no balloon device", we never
need to store any information about the memory at all.

This one simple optimization brings the fault time down to 500ms (from
40s) for the 128MB scenario, which makes the example handler usable for
performance tests for page fault latencies.

Signed-off-by: Patrick Roy <[email protected]>
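
The gist of that optimization can be sketched like this (the names are made up for illustration and are not the actual example handler's API): a fault is served from the snapshot file unless the faulting address falls into a range we previously received a remove event for.

```rust
use std::collections::HashSet;

/// Illustrative bookkeeping: instead of tracking a state for every guest page,
/// remember only the (start, end) ranges that a balloon inflation removed.
#[derive(Default)]
struct RemovedRanges {
    ranges: HashSet<(u64, u64)>,
}

impl RemovedRanges {
    /// Called when the handler receives a UFFD remove event for [start, end).
    fn mark_removed(&mut self, start: u64, end: u64) {
        self.ranges.insert((start, end));
    }

    /// A faulting page needs a zero page only if it was removed; every other
    /// page is served from the memory snapshot file.
    fn needs_zero_page(&self, addr: u64) -> bool {
        self.ranges
            .iter()
            .any(|&(start, end)| addr >= start && addr < end)
    }
}
```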
Just store the binary dir inside the microvm class instead of separate
paths to firecracker and jailer binaries. This centralizes the existence
assertions, but means we can't theoretically use jailer and firecracker
binaries from different locations anymore (however, nothing in the test
framework was doing this, as MicroVMFactory already did not support this
and hardcoded the binary names relative to the --binary-dir option).

Signed-off-by: Patrick Roy <[email protected]>
Instead of get_binary assuming it always operates on a cargo build
directory, have it directly operate on a "binary directory" in the sense
of the --binary-dir parameter. Nothing was using the workspace_dir arg,
so no functional changes.

Signed-off-by: Patrick Roy <[email protected]>
roypat added 18 commits March 7, 2025 14:55
To allow specifying a different directory in which to look for the uffd
example binaries, so that they can be used in A/B-tests.

Signed-off-by: Patrick Roy <[email protected]>
Also copy the example binaries into the target directory when building
some arbitrary revision (for example for A/B-tests).

Signed-off-by: Patrick Roy <[email protected]>
When building a microvm from a snapshot, there is no need to specify a
guest kernel. Remove the parameter from
SnapshotRestoreTest.sample_latency.

Signed-off-by: Patrick Roy <[email protected]>
This was the result of a find-and-replace gone awry, it seems.

Signed-off-by: Patrick Roy <[email protected]>
So that we can later refer to it in tests, for example to parse its
logs.

Signed-off-by: Patrick Roy <[email protected]>
Add a utility function that allows building multiple microvms from the
same snapshot sequentially.

Signed-off-by: Patrick Roy <[email protected]>
Replace snapshot creation loops in test_5_snapshots, test_vmgenid and
test_restore_latency with calls to build_n_from_snapshot.

Signed-off-by: Patrick Roy <[email protected]>
Move the initialization of the metrics logger dimensions into
`configure_vm`, as later on there will be multiple tests having to do
exactly the same sequence of configure + setup metrics.

Signed-off-by: Patrick Roy <[email protected]>
No need to defer the starting. Also rename it to `boot_vm` to more
accurately reflect what it does now.

Signed-off-by: Patrick Roy <[email protected]>
This function only has one possible call-site, and that's
test_restore_latency. Inlining it into its call site makes things a bit
simpler (no need for a temporary array).

Signed-off-by: Patrick Roy <[email protected]>
Add a snapshot latency test that outputs metrics on how long it takes to
memset a 128MB memory region post-snapshot restore.

We only collect these latencies for a single configuration (1024MB,
2vCPU), since they will be the same no matter the size of the memory.

Run with an on-demand uffd handler and compare to mmap-ing the snapshot
file as a baseline. Also use the fault_all handler, which will allow us
to measure latencies due to "fast" page faults, e.g. those that are just
for setting up stage-2 page table entries but where otherwise memory is
already faulted into userspace.

Signed-off-by: Patrick Roy <[email protected]>
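
For reference, the mmap baseline essentially amounts to mapping the snapshot's memory file directly, so page faults are resolved by the kernel from the page cache with no userfaultfd handler in the path. A rough sketch (the file name and error handling are made up):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn map_snapshot_memory(path: &str) -> *mut u8 {
    // Map the memory snapshot file directly; faults are then resolved by the
    // kernel from the page cache, with no userfaultfd handler involved.
    let file = File::open(path).expect("open snapshot memory file");
    let len = file.metadata().expect("stat snapshot memory file").len() as usize;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE, // copy-on-write, so guest writes never touch the file
            file.as_raw_fd(),
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED);
    ptr as *mut u8
}
```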
Run the new test_post_restore_latency test also for hugepages-backed
VMs.

Signed-off-by: Patrick Roy <[email protected]>
While the post_restore_latency test measures the time on-demand faulting
takes from the guest's perspective, this test measures the actual time
it takes to fault in guest memory. We cannot collect this data by simply
using the fault_all handler in the post-restore test, because the
handler will get triggered way before the fast_page_fault_helper script
continues running (because the first page fault will be triggered by
sshd or the kernel, so by the time our helper runs, the uffd handler
will already be done and we won't notice any latency). Therefore, have
the fault_all handler print the time it took to populate all of guest
memory to its log file, and parse this number.

Signed-off-by: Patrick Roy <[email protected]>
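
Conceptually, the handler-side measurement is just a timestamp around the population loop, printed to the handler's log so the test can parse it afterwards. A hypothetical sketch (not the actual handler code):

```rust
use std::time::Instant;

// Hypothetical shape of a "fault all" population loop: on the first fault,
// copy every page from the snapshot file into guest memory and log how long
// that took. The integration test then greps this line out of the log file.
fn populate_all(mut copy_page: impl FnMut(u64), guest_size: u64, page_size: u64) {
    let start = Instant::now();
    let mut addr = 0;
    while addr < guest_size {
        copy_page(addr); // e.g. one UFFDIO_COPY of `page_size` bytes
        addr += page_size;
    }
    println!(
        "Populated {} bytes of guest memory in {:?}",
        guest_size,
        start.elapsed()
    );
}
```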
We were calling a nonexistent function on an error path. This is why I
prefer compiled languages smh.

Signed-off-by: Patrick Roy <[email protected]>
When using metrics.put_dimensions from Python code, the resulting
EMF messages will have a list of lists in their "Dimensions" key.
Currently, ab_test.py assumes that this list of lists always has length
one, meaning it ignores additional dimensions added via put_dimensions.
Have it instead flatten the list of lists so that it always considers
all dimensions.

Signed-off-by: Patrick Roy <[email protected]>
This better describes what the handler actually does (technically,
fault_all_handler is also "valid"), and now that some tests emit the
name of the uffd handler as a metric, the names should be descriptive
from the get-go to avoid confusion and future renamings.

Signed-off-by: Patrick Roy <[email protected]>
To bring down the overhead of the testing script, only touch a single
byte on each page (as that's enough to fault in the entire page),
instead of memsetting everything.

Signed-off-by: Patrick Roy <[email protected]>
On m6a/6.1, post_populate with fault_all handler and 2M pages fails
because of the UFFD handler panicking during faulting, returning
`PartiallyCopied(1022MB)`. This is because we are depleting the huge
pages pool. Thus, allocate some more.

Admittedly, I do not understand why this only happens on m6a/6.1.

Signed-off-by: Patrick Roy <[email protected]>
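
For context, the host's 2M huge page pool is controlled by vm.nr_hugepages, so growing it boils down to a single write; the value below is an arbitrary example, not what the test actually reserves.

```rust
fn main() {
    // Reserve more 2M huge pages on the host (requires root); 1024 here is an
    // arbitrary example value.
    std::fs::write("/proc/sys/vm/nr_hugepages", "1024")
        .expect("failed to grow the huge page pool");
}
```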
roypat force-pushed the snapshot-latency-test branch from 229a1a2 to a2fb944 on March 7, 2025 14:55
roypat merged commit 333bc2d into firecracker-microvm:main on Mar 10, 2025
6 of 7 checks passed