Improved Snapshot Latency Testing #5061
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@           Coverage Diff           @@
##             main    #5061   +/-  ##
=======================================
  Coverage   83.18%   83.18%
=======================================
  Files         247      247
  Lines       26816    26816
=======================================
  Hits        22306    22306
  Misses       4510     4510
Do you have a preview of the metrics? I'm curious what the current latency is.
Instead use the property of the `UffdHandler` object that `spawn_uffd_handler` returns. Signed-off-by: Patrick Roy <[email protected]>
The memset causes page faults in the guest which will be handled by UFFD. By timing the memset from inside the guest, we can measure the latency of these page faults as observed by an application running inside the guest. Signed-off-by: Patrick Roy <[email protected]>
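A rough sketch of the measurement idea, in Python for brevity (the actual helper is a standalone binary baked into the guest rootfs; the region size is the one the tests use):

```python
import ctypes
import mmap
import time

SIZE = 128 * 1024 * 1024  # the tests touch a 128MB region

def timed_memset() -> float:
    # A fresh anonymous mapping has no pages faulted in yet, so the memset
    # below takes a page fault on every page; post-snapshot-restore, those
    # are exactly the faults served by the UFFD handler on the host.
    buf = mmap.mmap(-1, SIZE)
    start = time.monotonic()
    ctypes.memset(ctypes.addressof(ctypes.c_char.from_buffer(buf)), 1, SIZE)
    return time.monotonic() - start
```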
Make the example UFFD handler use a HashSet to only store address ranges that were actually removed due to an expanding memory balloon. This eliminates the HashMap that previously held entries for every single page in guest memory, which was causing the example UFFD handler to slow to a crawl (taking about 40s to fault in 128MB of memory). This is possible because the UFFD handler doesn't actually care about the 4 states a page can supposedly be in; it only needs to know whether a page needs to be served from the memory snapshot file, or zeroed out. Since zeroing out is the rarer of the two (it only happens if a balloon device caused some previously faulted-in memory to be madvise'd away), we only store information about pages that we got remove messages about, and simply assume every other page needs to be served from the snapshot file. This means that in the happy path of "no balloon device", we never need to store any information about the memory at all. This one simple optimization brings the fault time down to 500ms (from 40s) for the 128MB scenario, which makes the example handler usable for performance tests for page fault latencies. Signed-off-by: Patrick Roy <[email protected]>
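The gist of the new bookkeeping, sketched in Python for illustration (the example handler itself is written in Rust; `zero_page` and `copy_from_snapshot` are hypothetical stand-ins for the actual UFFD operations):

```python
removed_ranges = set()  # only the ranges a balloon device madvise'd away

def zero_page(addr, page_size):
    ...  # hypothetical: zero-fill [addr, addr + page_size)

def copy_from_snapshot(addr, page_size):
    ...  # hypothetical: copy the page in from the memory snapshot file

def handle_remove(start, end):
    # A UFFD "remove" event: these pages must be zeroed on their next fault.
    removed_ranges.add((start, end))

def handle_fault(addr, page_size):
    # Default: serve the page from the memory snapshot file. Only pages in a
    # recorded remove range get zeroed, so with no balloon device we never
    # store any per-page state at all.
    if any(start <= addr < end for start, end in removed_ranges):
        zero_page(addr, page_size)
    else:
        copy_from_snapshot(addr, page_size)
```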
Just store the binary dir inside the microvm class instead of separate paths to firecracker and jailer binaries. This centralizes the existence assertions, but means we can't theoretically use jailer and firecracker binaries from different locations anymore (however, nothing in the test framework was doing this, as MicroVMFactory already did not support this and hardcoded the binary names relative to the --binary-dir option). Signed-off-by: Patrick Roy <[email protected]>
Instead of get_binary assuming it always operates on a cargo build directory, have it directly operate on a "binary directory" in the sense of the --binary-dir parameter. Nothing was using the workspace_dir arg, so no functional changes. Signed-off-by: Patrick Roy <[email protected]>
To allow specifying a different directory in which to look for the uffd example binaries, so that they can be used in A/B-tests. Signed-off-by: Patrick Roy <[email protected]>
Also copy the example binaries into the target directory when building some arbitrary revision (for example for A/B-tests). Signed-off-by: Patrick Roy <[email protected]>
When building a microvm from a snapshot, there is no need to specify a guest kernel. Remove the parameter from SnapshotRestoreTest.sample_latency. Signed-off-by: Patrick Roy <[email protected]>
This was the result of a find and replace gone awry it seems. Signed-off-by: Patrick Roy <[email protected]>
So that we can later refer to it in tests, for example to parse its logs. Signed-off-by: Patrick Roy <[email protected]>
Add a utility function that allows building multiple microvms from the same snapshot sequentially. Signed-off-by: Patrick Roy <[email protected]>
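A minimal sketch of what such a helper might look like, assuming the test framework's usual microvm API (the real signature and snapshot handling may differ):

```python
def build_n_from_snapshot(microvm_factory, snapshot, n):
    """Sequentially restore `n` microVMs from the same snapshot."""
    for _ in range(n):
        vm = microvm_factory.build()
        vm.spawn()
        vm.restore_from_snapshot(snapshot, resume=True)
        yield vm  # let the test run its assertions against this microVM
        vm.kill()
```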
Replace snapshot creation loops in test_5_snapshots, test_vmgenid and test_restore_latency with calls to build_n_from_snapshot. Signed-off-by: Patrick Roy <[email protected]>
Move the initialization of the metrics logger dimensions into `configure_vm`, as later on there will be multiple tests having to do exactly the same sequence of configure + setup metrics. Signed-off-by: Patrick Roy <[email protected]>
No need to defer the starting. Also rename it to `boot_vm` to more accurately reflect what it does now. Signed-off-by: Patrick Roy <[email protected]>
This function only has one possible call-site, and that's test_restore_latency. Inlining it into its call site makes things a bit simpler (no need for a temporary array). Signed-off-by: Patrick Roy <[email protected]>
Add a snapshot latency test that outputs metrics on how long it takes to memset a 128MB memory region post-snapshot restore. We only collect these latencies for a single configuration (1024MB, 2vCPU), since they will be the same no matter the size of the memory. Run with an on-demand uffd handler and compare to mmap-ing the snapshot file as a baseline. Also use the fault_all handler, which will allow us to measure latencies due to "fast" page faults, e.g. those that are just for setting up stage-2 page table entries but where otherwise memory is already faulted into userspace. Signed-off-by: Patrick Roy <[email protected]>
Run the new test_post_restore_latency test also for hugepages-backed VMs. Signed-off-by: Patrick Roy <[email protected]>
While the post_restore_latency test measures the time on-demand faulting takes from the guest's perspective, this test measures the actual time it takes to fault in guest memory. We cannot collect this data by simply using the fault_all handler in the post-restore test, because the handler will get triggered way before the fast_page_fault_helper script continues running (because the first page fault will be triggered by sshd or the kernel, so by the time our helper runs, the uffd handler will already be done and we won't notice any latency). Therefore, have the fault_all handler print the time it took to populate all of guest memory to its log file, and parse this number. Signed-off-by: Patrick Roy <[email protected]>
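Parsing the handler's log could look roughly like this (the log line format here is hypothetical; only the idea of scraping the population time from the log file is from the commit):

```python
import re

def parse_populate_time_ms(handler_log: str) -> int:
    # Assumes the fault_all handler logs something like
    # "Populated guest memory in 512 ms" once it has served all pages.
    match = re.search(r"Populated guest memory in (\d+) ms", handler_log)
    assert match is not None, "handler did not log a population time"
    return int(match.group(1))
```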
We were calling a non-existing function on an error path. This is why I prefer compiled languages smh. Signed-off-by: Patrick Roy <[email protected]>
When using metrics.put_dimensions from Python code, the resulting EMF messages will have a list of lists in their "Dimensions" key. Currently, ab_test.py assumes that this list of lists always has length one, meaning it ignores additional dimensions added via put_dimensions. Have it instead flatten the list of lists so that it always considers all dimensions. Signed-off-by: Patrick Roy <[email protected]>
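In EMF, the "Dimensions" key holds dimension *sets* (a list of lists of dimension names), so the fix amounts to flattening before comparing. Roughly (the exact key path ab_test.py traverses is an assumption):

```python
import itertools

def all_dimensions(emf_message: dict) -> set:
    # EMF stores dimension sets, e.g. [["instance", "kernel"], ["test"]].
    # Flatten them so every dimension is considered, not just the first set.
    dimension_sets = emf_message["_aws"]["CloudWatchMetrics"][0]["Dimensions"]
    return set(itertools.chain.from_iterable(dimension_sets))
```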
This better describes what the handler actually does (technically, fault_all_handler is also "valid"), and now that some tests emit the name of the uffd handler as a metric, the names should be descriptive from the get-go to avoid confusion and future renamings. Signed-off-by: Patrick Roy <[email protected]>
To bring down the overhead of the testing script, only touch a single byte on each page (as that's enough to fault in the entire page), instead of memsetting everything. Signed-off-by: Patrick Roy <[email protected]>
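The change boils down to the following access pattern (Python sketch; the helper itself is not Python, and the 4K page size is an assumption that does not hold for hugepages-backed guests):

```python
import mmap

PAGE_SIZE = 4096  # assumed base page size

def touch_pages(buf: mmap.mmap):
    # Writing one byte per page is enough to fault in the whole page, so we
    # skip memset's remaining 4095 writes per page.
    for offset in range(0, len(buf), PAGE_SIZE):
        buf[offset] = 1
```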
On m6a/6.1, post_populate with the fault_all handler and 2M pages fails because of the UFFD handler panicking during faulting, returning `PartiallyCopied(1022MB)`. This is because we are depleting the huge pages pool. Thus, allocate some more. Admittedly, I do not understand why this only happens on m6a/6.1. Signed-off-by: Patrick Roy <[email protected]>
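Growing the pool is a one-line sysctl; a sketch of the idea (the exact count the test reserves is not from the commit):

```python
from pathlib import Path

def reserve_huge_pages(count: int):
    # vm.nr_hugepages counts pages (2MiB each on x86_64 by default), so
    # reserving 1024 pages sets aside 2GiB. Requires root.
    Path("/proc/sys/vm/nr_hugepages").write_text(str(count))
```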
Add tests that measure post-snapshot-restore latency, mainly around page faults. We have two flavors of tests:
- tests that measure how long on-demand page faults take from the guest's perspective, by timing how long it takes to touch a 128MB memory region after restore, and
- tests that measure how long the UFFD handler takes to actually populate all of guest memory, as reported in its log file.
While doing this, apply some optimizations to our example UFFD handlers to reduce page fault latency by ~98.75%, as otherwise the overhead of bookkeeping inside the UFFD handler would dominate all other interesting effects (and the tests would report taking around 40s, which is less than useless).
Note that these won't work until CI artifacts have been rebuilt from this branch, as the fast_page_fault_helper tool needs to be recompiled into the rootfs.
I've validated these by playing around with MAP_SHARED vs MAP_PRIVATE anon mappings; the population latency test, for example, very clearly exhibited the added latency of using shared memory.
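For reference, the two mapping types differ in a single flag; a minimal illustration (Python mmap shown for brevity; the actual toggle was in how guest memory gets mapped):

```python
import mmap

SIZE = 128 * 1024 * 1024

# Identical anonymous mappings except for the sharing flag; faulting in the
# MAP_SHARED one showed up as clearly higher latency in the population test.
private = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
shared = mmap.mmap(-1, SIZE, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
```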
License Acceptance
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.

PR Checklist
- I have run tools/devtool checkstyle to verify that the PR passes the automated style checks.
- I have described what is done in these changes and how they are solving the problem in a clear and encompassing way.
- I have included relevant documentation in the PR.
- Any user-facing changes are mentioned in CHANGELOG.md.
- API changes follow the Runbook for Firecracker API changes.
- New and changed functionality is covered by unit and/or integration tests.
- Any newly added TODO links to an issue.
- Where applicable, the functionality was considered for inclusion in rust-vmm.