

@ShadowCurse (Contributor) commented Sep 18, 2024

Changes

This fixes a bug in the fix introduced in #4796.
The original issue was the vsock device hanging after snapshot restoration because the guest was not notified about the termination packet. But there was a bug in that fix: we saved the vsock state before the notification was sent, thus discarding all modifications made to the queue in order to send the notification.

The original fix appeared to work only because we tested with a single snapshot/restore iteration. Even though the device lost synchronization with the guest on the event queue state, this worked once; with more iterations, vsock hangs as before.

This PR fixes the issue by saving the vsock state after the notification is sent and modifies the vsock test to run multiple iterations of snap/restore.
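The ordering problem can be illustrated with a toy model. This is not Firecracker code: `DeviceQueueState`, `GuestMemory`, `Guest`, `notify` and `run` are hypothetical names, and the model assumes, as a simplification, that guest memory is dumped after the device state, so the used-ring update produced by the reset notification always ends up in the snapshot. It counts how many reset events the guest observes across repeated snapshot/restore cycles for both orderings.

```rust
// Toy model of one virtio used ring shared by device and guest.
// All types and functions here are hypothetical, not Firecracker's.

#[derive(Clone, Copy, Debug)]
struct DeviceQueueState {
    /// Device-side copy of the used-ring index it last published.
    next_used: u16,
}

#[derive(Clone, Copy, Debug)]
struct GuestMemory {
    /// Used-ring index as visible to the guest in shared memory.
    used_idx: u16,
}

struct Guest {
    /// Last used index the guest driver has consumed.
    last_seen: u16,
    /// Number of reset events the guest actually observed.
    resets_seen: u32,
}

impl Guest {
    /// The guest driver consumes every new used-ring entry after resume.
    fn poll(&mut self, mem: &GuestMemory) {
        while self.last_seen != mem.used_idx {
            self.last_seen = self.last_seen.wrapping_add(1);
            self.resets_seen += 1;
        }
    }
}

/// The device publishes one event (the transport reset) to the used ring.
fn notify(dev: &mut DeviceQueueState, mem: &mut GuestMemory) {
    dev.next_used = dev.next_used.wrapping_add(1);
    mem.used_idx = dev.next_used;
}

/// Runs `iterations` snapshot/restore cycles and returns how many reset
/// events the guest saw. `save_after_notify` selects the ordering.
fn run(iterations: u32, save_after_notify: bool) -> u32 {
    let mut dev = DeviceQueueState { next_used: 0 };
    let mut mem = GuestMemory { used_idx: 0 };
    let mut guest = Guest { last_seen: 0, resets_seen: 0 };

    for _ in 0..iterations {
        let saved_dev = if save_after_notify {
            // Fixed ordering: the saved state includes the notification.
            notify(&mut dev, &mut mem);
            dev
        } else {
            // Buggy ordering: state is captured before the notification.
            let saved = dev;
            notify(&mut dev, &mut mem);
            saved
        };
        // Guest memory is assumed to be dumped after the device state, so it
        // keeps the used-ring update; restoring it is a no-op in this model.
        dev = saved_dev;
        guest.poll(&mem);
    }
    guest.resets_seen
}

fn main() {
    println!("buggy ordering: {} resets seen over 3 cycles", run(3, false));
    println!("fixed ordering: {} resets seen over 3 cycles", run(3, true));
}
```

With one cycle both orderings report a single reset, which is why a one-iteration test passed. With three cycles the buggy ordering still reports only one: the restored device keeps republishing a used index the guest has already consumed, so the guest never sees a new reset event. The fixed ordering reports three.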

Reason

Fixes: #4811

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet contribution quality standards.
  • This functionality cannot be added in rust-vmm.

@ShadowCurse self-assigned this Sep 18, 2024

codecov bot commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@5d762a8).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4812   +/-   ##
=======================================
  Coverage        ?   84.32%           
=======================================
  Files           ?      249           
  Lines           ?    27512           
  Branches        ?        0           
=======================================
  Hits            ?    23200           
  Misses          ?     4312           
  Partials        ?        0           
Flag Coverage Δ
5.10-c5n.metal 84.54% <100.00%> (?)
5.10-m5n.metal 84.53% <100.00%> (?)
5.10-m6a.metal 83.82% <100.00%> (?)
5.10-m6g.metal 80.90% <100.00%> (?)
5.10-m6i.metal 84.53% <100.00%> (?)
5.10-m7g.metal 80.90% <100.00%> (?)
6.1-c5n.metal 84.54% <100.00%> (?)
6.1-m5n.metal 84.53% <100.00%> (?)
6.1-m6a.metal 83.82% <100.00%> (?)
6.1-m6g.metal 80.90% <100.00%> (?)
6.1-m6i.metal 84.52% <100.00%> (?)
6.1-m7g.metal 80.90% <100.00%> (?)

Flags with carried forward coverage won't be shown.

@ShadowCurse marked this pull request as ready for review September 18, 2024 13:20
@ShadowCurse added Status: Awaiting review (Indicates that a pull request is ready to be reviewed) and Type: Fix (Indicates a fix to existing code) labels Sep 18, 2024
@pb8o previously approved these changes Sep 18, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was a bug in the fix, namely
we saved the vsock state before the notification was sent,
thus discarding all modifications made to send the notification.

The reason the original fix worked is that we were only testing
with 1 iteration of snap/restore. This way, even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
@ShadowCurse merged commit 3acf37d into firecracker-microvm:main Sep 18, 2024
9 checks passed
@ShadowCurse deleted the vsock_fix_v2 branch September 18, 2024 16:27
CompuIves added a commit to codesandbox/firecracker that referenced this pull request Dec 3, 2024

Development

Successfully merging this pull request may close these issues.

[Bug] Ongoing read Syscalls on Vsocks Don't Get Interrupted after the Second Snapshot Resume/Restore
