- 
                Notifications
    You must be signed in to change notification settings 
- Fork 151
Use page tracking for snapshot and restore #683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use page tracking for snapshot and restore #683
Conversation
7e85120    to
    014eab3      
    Compare
  
    014eab3    to
    56fd983      
    Compare
  
    04995b0    to
    894f5fe      
    Compare
  
    0808c97    to
    ed8a9d2      
    Compare
  
    456d533    to
    58418bb      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't finished reviewing it yet, but these are my comments so far
| if base_pfn == u64::MAX { | ||
| base_pfn = mshv_region.guest_pfn; | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is effectively obtaining the guest_pfn of the first region.
Is there a guarantee on the order of the regions?
Why does this correspond to the guest page number?
Should this maybe be the smallest guest_pfn? (although I don't understand why 😅)
base_pfn = base_pfn.min(mshv_region.guest_pfn);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there is no guarantee of the order currently, but it happens to be true that they are in order and are contiguous because of the way we build them. It's probably something we want to enforce at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we use a constant else where to specify the base address that we map to? In which case can we not compute the PFN based on the constant value (I think it may actually be zero) but the point is that if we base it on the constant and that ever changes or gets removed this code that depends on that will be impacted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first guest region's guest address is 0 (BASE_ADDRESS). The page frame number is obtained by right-shifting the guest address by 12. Note 1 << 12 == page size. I think it is better to calculate it directly like this, rather than assume that it is derived from BASE_ADDRESS, which may or may not be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Thanks for working on this!
I left some more comments, but in general it's great.
| } | ||
| } | ||
| }; | ||
| for page_idx in bit_index_iterator(&bitmap) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to use a dependency like bitvec.
Then we can do something like this:
let mut sandbox_dirty_pages = BitVec::new();
for (i, mem_region) = self.mem_regions.iter().take(n_contiguos).enumerate() {
    let num_pages = mem_region.guest_region.len() / PAGE_SIZE_USIZE;
    let bitmap = ...;
    let mut bitmap = BitVec::from_vec(bitmap);
    bitmap.truncate(num_pages);
    dirty.extend_from_bitslice(bitmap.as_bitslice());
}
Ok(sandbox_dirty_pages)| current_dirty_pages: &[u64], | ||
| most_recent_snapshot: &Option<Arc<SharedMemorySnapshot>>, | ||
| ) -> Result<u64> { | ||
| if self.sandbox_size() != shared_mem.mem_size() { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think we need to check more than just the sandbox size. It has to have the same layout.
What happens if we have a sandbox with 1M heap, 1k io, and 4k stack, and restore on it a sandbox with 1M stack, 4k io, and 1M heap? the main memory size would be the same, but then we would use the wrong Layout, right?
I think this was instroduced by the new snapshot API, we should create an issue to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think the fact that we don't support restoring out of branch is the key that avoid this issue for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually you're right, this case is possible if we're restoring a brand new sandbox, with no previous snapshot, to a given snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be an option not supporting that case, just like we do with out of branch snapshots?
Or would things fundamentally break? like... not being able to restore the first ever snapshot
If it's an option, I'm happy for us to create an issue and defer fixing it.
28b4d42    to
    6a5458a      
    Compare
  
    | // Regions part of the original sandbox | ||
| mem_regions: Vec<MemoryRegion>, | ||
| n_initial_regions: usize, | ||
| // Size of the (contioous) sandbox mem_regions | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Size of the (contioous) sandbox mem_regions | |
| // Size of the (contiguous) sandbox mem_regions | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be contiguous not continuous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the comment :-)
6f17e8b    to
    06e3edb      
    Compare
  
    | group.finish(); | ||
| } | ||
|  | ||
| // Sandbox creation with different heap sizes | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should reorientate the benchmarks, we should have a set of tests that we perform on sandboxes of differing sizes , for each sized sandbox we should:
measure the create time
measure the drop time
have guest calls with a varying number of parameters that does not reset the sandbox
have guest calls with a varying number of parameters that does reset the sandbox.
repeat the above where there are 1-n host function calls made by the guest function
measure the cost of snapshotting  when modifying varying proportions of the sandbox memory
measure the cost of restoring when modifying varying proportions of the sandbox memory.
measure the cost of calling with the largest parameters that sized sandbox can support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #722 for this
| // Regions part of the original sandbox | ||
| mem_regions: Vec<MemoryRegion>, | ||
| n_initial_regions: usize, | ||
| // Size of the (contioous) sandbox mem_regions | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be contiguous not continuous?
| if base_pfn == u64::MAX { | ||
| base_pfn = mshv_region.guest_pfn; | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we use a constant else where to specify the base address that we map to? In which case can we not compute the PFN based on the constant value (I think it may actually be zero) but the point is that if we base it on the constant and that ever changes or gets removed this code that depends on that will be impacted.
| vm_fd.get_dirty_log(base_pfn, total_size, CLEAR_DIRTY_BIT_FLAG)?; | ||
| #[cfg(mshv3)] | ||
| vm_fd.get_dirty_log(base_pfn, total_size, MSHV_GPAP_ACCESS_OP_CLEAR as u8)?; | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having looked at the perf here I wonder if we should at least measure what it would look like to track dirty pages in the guest, I think this should be reasonably doable now that the foundation has been done in the guest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simongdavies I am already working on (a superset of)) this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the meantime, I wonder if a useful perf optimization is a handle to disable dirty page tracking because we expect that this sandbox will only ever be restored to a totally different one. Not sure if that would be useful ATM, but depending on how exactly we do snapshotting for AFD it could make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simongdavies I am already working on (a superset of)) this
To clarify: in order to achieve our security properties (i.e. an untrusted guest cannot corrupt the execution of or learn information from another guest, even one restored into the same sandbox), we can't trust the guest for dirty page tracking unless the guest is also managing copy-on-write with a hypervisor-enforced readonly underlying store. Therefore, I am looking at making the latter happen.
| self.interrupt_handle.clone() | ||
| } | ||
|  | ||
| // TODO: Implement getting additional host-mapped dirty pages. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this TODO mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this here so we won't forget about it once additionally mmaped memory is not read-only, and we potentially want to be able to reset it as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest this comment be made a little bit more explicit about e.g. "TODO: When regions of memory other than the main guest memory can be writable, ensure we don't miss any dirty pages in them" or something like that?
| self.interrupt_handle.clone() | ||
| } | ||
|  | ||
| // TODO: Implement getting additional host-mapped dirty pages. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this TODO mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this here so we won't forget about it once additionally mmaped memory is not read-only, and we potentially want to be able to reset it as well
| /// Get dirty pages as a bitmap (Vec<u64>). | ||
| /// Each bit in a u64 represents a page. | ||
| /// This also clears the bitflags, marking the pages as non-dirty. | ||
| /// TODO: Implement getting additional host-mapped dirty pages. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is about the additonal mmaped regions then I think at the moment these are read only so what is needed is metadata to recreate them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this here so we won't forget about it once additionally mmaped memory is not read-only, and we potentially want to be able to reset it as well
| /// | ||
| /// Additionally, writes to the returned slice will not mark pages as dirty. | ||
| /// User must call `mark_pages_dirty` manually to mark pages as dirty. | ||
| fn as_mut_slice(&mut self) -> &mut [u8] { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to get rid of this? This is one of the reasons that I preferred the abstraction that signal handling gave, I think that this has the potential for creating difficult to find bugs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, most other methods on SharedMemory use this to write memory. At least it's private now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mark it as unsafe since there is an internal invariant that it enforces, to make callers read the docs, etc? Not sure if it is strictly unsafe in the Rust sense, but I suppose if you wrote other code that depended on the snapshot behaviour having what it ought to you could perhaps construct something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, replace it with a version that takes offset/len and does the dirty marking itself... but that's probably not super ideal? Idk.
| /// # Safety | ||
| /// This function is unsafe because it does not mark the pages as dirty. | ||
| /// Only use this if you are certain that the pages do not need to be marked as dirty. | ||
| #[instrument(err(Debug), skip_all, parent = Span::current(), level= "Trace")] | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What circumstances is this used under?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need these two for restoring a snapshot. Restoring a sandbox to a snapshotted state should not itself dirty the pages
f04c42e    to
    abdb96d      
    Compare
  
    …parameters Signed-off-by: Ludvig Liljenberg <[email protected]>
…ctions. Signed-off-by: Ludvig Liljenberg <[email protected]>
…ges in the host Signed-off-by: Ludvig Liljenberg <[email protected]>
Signed-off-by: Ludvig Liljenberg <[email protected]>
Signed-off-by: Ludvig Liljenberg <[email protected]>
Signed-off-by: Ludvig Liljenberg <[email protected]>
abdb96d    to
    afa652e      
    Compare
  
    | We may still do more or less this, but it will be implemented differently. | 
TODO: mark restore_snapshot
unsafe, needs caller to verify sandbox_id atches.This PR makes it so sandbox Snapshots no longer contain a full copy of memory at the time of snapshot. Instead, only pages that have been dirtied since the last snapshot are included. This improves performance due to less memory copying when restoring a snapshot, and also (usually) reduces memory consumption due to not having to keep full copies around.
The PR first introduces tracking of dirty pages. It's done on two levels:
The union is taken of these two sets to determine which memory has been dirtied since last snapshot.
Since snapshots no longer contain a full copy of memory, restoring memory from a snapshot is a bit more complicated.
The algorithm this PR uses goes something like this:
MSHV3 microbenchmarks vs main branch (on a VM)
KVM microbenchmarks vs main branch (on a wsl)
OLD PR Description
This pull request introduces snapshotting and restoring of sandbox state using dirty page tracking rather than copying the entire memory state each time, in many/most scenarios where sandboxes have larger than default amounts of memory this results in better perforamnce and reduced memory usage.However, due to inefficiencies in the way dirty page tracking works on mshv and the fact that the implementation of tracking is not done for Windows this is not universally true.
Included below are screenshots showing the difference in time taken for some benchmarks with and without dirty page tracking enabled, along with some explanations of the differences seen.
The changes comprise:
Host Shared Memory Dirty Page tracking
SIGSEGVto support dirty page tracking for host memory mapped into a VM. Updated documentation to reflect this change and added debugging instructions for handlingSIGSEGVin GDB and LLDB.VM Dirty page tracking
Snapshot Management
Benchmarking Enhancements
guest_call_benchmark_large_paramfunction to support benchmarks for multiple parameter sizes and added a newsandbox_heap_size_benchmarkfunction to measure sandbox creation performance with varying heap sizes.guest_call_heap_size_benchmarkto evaluate guest function call performance with different heap sizes.Performance Changes
KVM Before
KVM After
KVM shows massive positive changes in performance when creating large sandboxes and calling functions in sandboxes with large memory configurations , although not measured the amount of memory consumed should have reduced considerably. In the scenario where large parameters are passed to the sandbox there is a degradation of performance, this has not been investigated yet, it may be due to the page based mechanism for saving/restoring data being more expensive than copying and restoring all the data. Other work to make large parameter passing more efficient will likely have a large positive impact here as well.
There is some regression in performance for small/default sandbox sizes which is more than likely caused by the overhead of tracking and building/restoring page based snapshots for small memory vs. copying and restoring the entire memory.
mshv2 Before
mshv2 After
mshv3 Before
mshv3 After
Both mshv2 and mshv3 show similar patterns to KVM in that the larger sandbox sizes show large improvements, the default/small sandboxes show regressions and the large parameter sizes show regressions, possibly for the same reason that the KVM one does , again no investigation done here yet.
The biggest difference between KVM and mshv is for two reasons, first when enabling dirty page tracking on mshv the first call to get dirty pages results in a returned bitmap showing all pages dirty, since this would cause us to snapshot all memory as a baseline after enabling dirty page tracking this PR gets the dirty pages immediately and then discards the result. This approach means that we have to make 2 calls to get dirty pages when we really only need one. The impact of this is quite large, especially with larger memory configurations since the call to get dirty pages seems to be O(n) where n is the number of pages in the memory configuration (regardless of if the pages are dirty or not), for a 950mb VM I observed ~1.4ms response for this call, however, this approach with larger sandboxes is still much quicker than copying all the memory. Fixing these issues in mshv will probably bring the performance much closer to KVM.
Windows 2025 Before
Windows 2025 After
Windows is largely either the same performance or has regressed, this is because the Windows implementation has not been done yet, at the moment each time dirt pages are requested Windows reports that all pages are dirty and the snapshots/restores are done on that basis, there is some overhead of this approach especially when restoring