@bchalios
Contributor

Changes

Add logic to our UFFD handlers to retry the negotiation with Firecracker up to 5 times before giving up. This makes them (slightly) more robust. We also add logging to the receive logic so that we can inspect failures post-mortem.
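For illustration only, the retry behaviour described above could look roughly like the following Python sketch. This is not the handler code from this PR: `negotiate`, `receive_once`, and the logger name are hypothetical, and the only detail taken from the description is the limit of 5 attempts plus logging on each failed attempt.

```python
import logging
from typing import Callable, Optional, Tuple

log = logging.getLogger("uffd_handler_sketch")  # hypothetical logger name

MAX_ATTEMPTS = 5  # retry limit named in the PR description

# Hypothetical receiver: returns (message body, file descriptor or None).
Receiver = Callable[[], Tuple[bytes, Optional[int]]]


def negotiate(receive_once: Receiver, max_attempts: int = MAX_ATTEMPTS) -> Tuple[bytes, int]:
    """Call receive_once until it yields a file descriptor, or give up."""
    for attempt in range(1, max_attempts + 1):
        body, fd = receive_once()
        if fd is not None:
            return body, fd
        # Record what was received so a failure can be inspected later.
        log.warning(
            "attempt %d/%d: no fd received (body length %d)",
            attempt, max_attempts, len(body),
        )
    raise RuntimeError(f"no file descriptor after {max_attempts} attempts")
```

A caller would pass in whatever function performs a single receive on the socket; the sketch deliberately leaves out any delay between attempts, since the description does not mention one.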

Reason

According to our UFFD protocol, UFFD handlers negotiate with Firecracker during initialization: they wait for Firecracker to send, over a Unix domain socket (UDS), the UFFD file descriptor along with the memory mappings that are handled through the UFFD.

During this handshake, our (testing-only, not production-grade) UFFD handlers issue what is essentially a recvmsg that should return the UFFD fd and the mappings. Sometimes the recvmsg wrapper returns None instead of the file descriptor. When this happens, the UFFD handler crashes and the Firecracker process hangs.
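To make the failure mode concrete, here is a minimal Python sketch of a receive that may or may not carry a descriptor. It assumes Python 3.9+ (for `socket.send_fds` / `socket.recv_fds`) and a stream socket pair; the helper names, the placeholder payloads, and the pipe standing in for the UFFD fd are all made up for the example and are not taken from the handlers.

```python
import os
import socket
from typing import Optional, Tuple


def recv_body_and_fd(sock: socket.socket, bufsize: int = 4096) -> Tuple[bytes, Optional[int]]:
    """Receive one message plus at most one fd sent as SCM_RIGHTS ancillary data.

    The fd is None when the message carried no descriptor.
    """
    data, fds, _flags, _addr = socket.recv_fds(sock, bufsize, 1)
    return data, (fds[0] if fds else None)


def demo() -> Tuple[Tuple[bytes, bool], Tuple[bytes, Optional[int]]]:
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # stand-in for the UFFD fd
    try:
        # Message 1: placeholder body with a descriptor attached.
        socket.send_fds(sender, [b"mappings-placeholder"], [read_end])
        body, fd = recv_body_and_fd(receiver)
        # Check that the received fd really refers to the same pipe.
        os.write(write_end, b"x")
        fd_works = fd is not None and os.read(fd, 1) == b"x"
        if fd is not None:
            os.close(fd)

        # Message 2: body only, so no descriptor comes back.
        sender.sendall(b"body-only")
        second = recv_body_and_fd(receiver)
        return (body, fd_works), second
    finally:
        for f in (read_end, write_end):
            os.close(f)
        sender.close()
        receiver.close()
```

The second receive is the interesting case: the call succeeds, but the descriptor slot is `None`, and code that unconditionally unwraps it would fail.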

According to man recv(2):

Datagram sockets in various domains (e.g., the UNIX and Internet
domains) permit zero-length datagrams.  When such a datagram is
received, the return value is 0.

which means it is possible to receive a zero-length message (we are communicating with Firecracker over a UDS).
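The quoted behaviour can be reproduced with a small Python sketch. It assumes a Unix datagram socket pair, since the man page passage is specifically about datagram sockets; `zero_length_roundtrip` is a hypothetical name and the buffer sizes are arbitrary.

```python
import socket


def zero_length_roundtrip():
    """Send an empty datagram over a Unix socket pair and receive it with recvmsg."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        b.settimeout(1.0)  # avoid blocking forever if nothing arrives
        a.send(b"")        # zero-length datagram
        data, ancdata, _flags, _addr = b.recvmsg(16, socket.CMSG_SPACE(4))
        return data, ancdata
    finally:
        a.close()
        b.close()
```

The receive returns successfully with an empty body and no ancillary data, so a receiver that expects a descriptor in every message has to handle this case explicitly.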

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@bchalios bchalios self-assigned this Mar 19, 2025
@bchalios bchalios added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Mar 19, 2025
@codecov

codecov bot commented Mar 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.14%. Comparing base (07c07bd) to head (4c5359a).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5097   +/-   ##
=======================================
  Coverage   83.14%   83.14%           
=======================================
  Files         248      248           
  Lines       26925    26925           
=======================================
  Hits        22388    22388           
  Misses       4537     4537           
Flag Coverage Δ
5.10-c5n.metal 83.53% <ø> (ø)
5.10-m5n.metal 83.52% <ø> (+<0.01%) ⬆️
5.10-m6a.metal 82.71% <ø> (ø)
5.10-m6g.metal 79.57% <ø> (ø)
5.10-m6i.metal 83.52% <ø> (+<0.01%) ⬆️
5.10-m7a.metal-48xl 82.71% <ø> (?)
5.10-m7g.metal 79.57% <ø> (ø)
6.1-c5n.metal 83.58% <ø> (ø)
6.1-m5n.metal 83.57% <ø> (ø)
6.1-m6a.metal 82.75% <ø> (ø)
6.1-m6g.metal 79.56% <ø> (-0.01%) ⬇️
6.1-m6i.metal 83.55% <ø> (-0.01%) ⬇️
6.1-m7a.metal-48xl 82.76% <ø> (?)
6.1-m7g.metal 79.57% <ø> (+<0.01%) ⬆️


ShadowCurse previously approved these changes Mar 19, 2025
According to our UFFD protocol, UFFD handlers negotiate with Firecracker
during initialization and wait for it to send over a UDS the UFFD file
descriptor along with the memory mappings that are being handled over
the UFFD.

During this handshake, our (testing-only, not production-grade) UFFD
handlers issue what is essentially a `recvmsg` that should return the
UFFD fd and the mappings. Sometimes the `recvmsg` wrapper returns `None`
instead of the file descriptor. When this happens, the UFFD handler
crashes and the Firecracker process hangs.

According to `man recv(2)`:

```
Datagram sockets in various domains (e.g., the UNIX and Internet
domains) permit zero-length datagrams.  When such a datagram is
received, the return value is 0.
```

which means it is possible to receive a zero-length message (we are
communicating with Firecracker over a UDS).

Add logic to our UFFD handlers to retry the negotiation with Firecracker
up to 5 times before giving up. This makes them (slightly) more robust.
We also add logging to the receive logic so that we can inspect failures
post-mortem.

Signed-off-by: Babis Chalios <[email protected]>
Contributor

@roypat roypat left a comment

There's no chance that the body and the fds simply arrive in two distinct packets?

@bchalios
Contributor Author

There's no chance that the body and the fds simply arrive in two distinct packets?

I would assume that recvmsg would return a complete message regardless of how the underlying packets arrive. In any case, if something like this happens, we would now see it in the logs.

@bchalios bchalios enabled auto-merge (rebase) March 19, 2025 16:32
@bchalios bchalios merged commit fc85170 into firecracker-microvm:main Mar 19, 2025
6 of 7 checks passed
@bchalios bchalios deleted the more_robust_uffd_handlers branch March 19, 2025 16:56