Contributor

@bchalios bchalios commented Oct 11, 2024

Changes

Replace the psutil.pid_exists polling loop with a call to epoll() on the pidfd of the process. This should block deterministically until the process exits. If something else is wrong, we will hit the pytest timeout.

Reason

Currently, we use psutil.pid_exists in a loop with a timeout of 10 seconds. This is racy, and we do sometimes hit the timeout in our CI.
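For context, a deadline-based polling loop of the kind described here might look roughly like the sketch below. This is not the original test utility: `pid_exists` is a standard-library stand-in for `psutil.pid_exists`, and `wait_with_deadline` and its parameters are hypothetical names.

```python
import os
import time


def pid_exists(pid: int) -> bool:
    """Stand-in for psutil.pid_exists: probe the PID with signal 0."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def wait_with_deadline(pid: int, timeout: float = 10.0, interval: float = 0.1) -> None:
    """Poll until the PID disappears, or raise TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout
    while pid_exists(pid):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"process {pid} still exists after {timeout}s")
        time.sleep(interval)
```

The weakness is the fixed deadline: a process that is merely slow to exit on a loaded CI host is indistinguishable from one that is stuck.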

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@bchalios bchalios added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Oct 11, 2024

codecov bot commented Oct 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.96%. Comparing base (c00d5ed) to head (3eeb00c).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4847   +/-   ##
=======================================
  Coverage   83.96%   83.96%           
=======================================
  Files         250      250           
  Lines       27756    27756           
=======================================
  Hits        23304    23304           
  Misses       4452     4452           
Flag             Coverage     Δ
5.10-c5n.metal   84.58% <ø>   (ø)
5.10-m5n.metal   84.57% <ø>   (ø)
5.10-m6a.metal   83.85% <ø>   (-0.01%) ⬇️
5.10-m6g.metal   80.51% <ø>   (ø)
5.10-m6i.metal   84.56% <ø>   (-0.01%) ⬇️
5.10-m7g.metal   80.51% <ø>   (ø)
6.1-c5n.metal    84.58% <ø>   (ø)
6.1-m5n.metal    84.56% <ø>   (-0.01%) ⬇️
6.1-m6a.metal    83.85% <ø>   (-0.01%) ⬇️
6.1-m6g.metal    80.51% <ø>   (ø)
6.1-m6i.metal    84.55% <ø>   (-0.01%) ⬇️
6.1-m7g.metal    80.51% <ø>   (ø)

@bchalios bchalios force-pushed the ci_use_pidfd_to_wait_for_firecracker_exit branch 3 times, most recently from e2f619f to b9452d1 Compare October 11, 2024 13:37
@bchalios bchalios force-pushed the ci_use_pidfd_to_wait_for_firecracker_exit branch 2 times, most recently from 3c6811d to bcf1564 Compare October 14, 2024 07:42
ShadowCurse previously approved these changes Oct 14, 2024
Currently, we use psutil.pid_exists in a loop with a timeout of 10
seconds. This is racy, and we do sometimes hit the timeout in our CI.

Substitute this mechanism with calling epoll() on the pidfd of the
process instead. This should deterministically block until the process
exits. If there's something else wrong, we will hit the pytest timeout.

Signed-off-by: Babis Chalios <[email protected]>
@bchalios bchalios force-pushed the ci_use_pidfd_to_wait_for_firecracker_exit branch from bcf1564 to 3eeb00c Compare October 14, 2024 08:43
@bchalios
Contributor Author

@ShadowCurse I had to revert back to using epoll(). select() hits the FD_SETSIZE issue (1024 descriptors only).

@ShadowCurse
Contributor

@ShadowCurse I had to revert back to using epoll(). select() hits the FD_SETSIZE issue (1024 descriptors only).

How can it hit the limit if you are waiting for 1 fd?

@bchalios
Contributor Author

How can it hit the limit if you are waiting for 1 fd?

It's not about the number of file descriptors you're waiting on. It's about the highest file descriptor number select() can handle. Quoting from man select(2):

   select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) and epoll(7) do not have this limitation.  See BUGS.
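To make the limit concrete, here is a small sketch (not part of the PR) that duplicates a pipe onto a descriptor number at or above FD_SETSIZE and then tries both interfaces from Python. It assumes Linux and an FD_SETSIZE of 1024; the helper names are hypothetical, and the demo returns None if the process's open-file limit cannot be raised far enough.

```python
import fcntl
import os
import resource
import select

FD_SETSIZE = 1024  # assumed value; typical for Linux/glibc


def dup_above_fd_setsize(fd: int):
    """Duplicate fd onto a number >= FD_SETSIZE, or return None if limits forbid it."""
    needed = FD_SETSIZE + 1
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft != unlimited and soft < needed:
        if hard != unlimited and hard < needed:
            return None
        # Raising the soft limit up to the hard limit needs no privileges.
        resource.setrlimit(resource.RLIMIT_NOFILE, (needed, hard))
    # F_DUPFD picks the lowest free descriptor number >= FD_SETSIZE.
    return fcntl.fcntl(fd, fcntl.F_DUPFD, FD_SETSIZE)


def compare_select_and_poll():
    """Return (select_ok, poll_ok) for a single high-numbered descriptor."""
    read_end, write_end = os.pipe()
    high = dup_above_fd_setsize(read_end)
    try:
        if high is None:
            return None
        os.write(write_end, b"x")  # make the read side readable
        try:
            select.select([high], [], [], 0)
            select_ok = True
        except ValueError:
            # CPython rejects descriptor numbers that do not fit in an fd_set.
            select_ok = False
        poller = select.poll()
        poller.register(high, select.POLLIN)
        poll_ok = bool(poller.poll(0))
        return select_ok, poll_ok
    finally:
        for fd in (read_end, write_end, high):
            if fd is not None:
                os.close(fd)
```

Only one descriptor is being watched, yet select() rejects it because of its number, which is the situation a test process with many open files can run into.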

@roypat
Contributor

roypat commented Oct 14, 2024

Since we only use utils.wait_for_termination in two locations, and both of those are in Microvm, can we move this function into Microvm and create the pidfd already when we spawn the Firecracker process? Otherwise, how does this behave if the Firecracker process dies unexpectedly and the pid gets freed significantly before we get to the wait_for_termination call? 🤔

@bchalios
Contributor Author

Otherwise, how does this behave if the Firecracker process dies unexpectedly and the pid gets freed significantly before we get to the wait_for_termination call? 🤔

If the process with that PID no longer exists when we create the pidfd, the system call fails, Python raises a ProcessLookupError exception, and get_process_pidfd returns None, so wait_for_termination returns immediately.

can we move this function into Microvm and create the pidfd already when we spawn the Firecracker process?

I thought of doing that, but I think it makes the error handling much more complicated.

@ShadowCurse
Contributor

OK, but to avoid the issue of needing two calls (register and epoll), maybe we should use a simple poll?

@bchalios
Contributor Author

select.poll() also returns a "polling" object with which you register file descriptors, so it needs the same two calls.

@bchalios bchalios merged commit d7fbf9b into firecracker-microvm:main Oct 14, 2024
7 checks passed
@bchalios bchalios deleted the ci_use_pidfd_to_wait_for_firecracker_exit branch October 14, 2024 10:07