fix(ci): epoll on pidfd to wait for Firecracker exit #4847

bchalios · 2024-10-11T12:58:45Z

Changes

Replace the psutil.pid_exists polling loop with a call to epoll() on the pidfd of the process. This should block deterministically until the process exits. If something else is wrong, we will hit the pytest timeout.

Reason

Currently, we use psutil.pid_exists in a loop with a timeout of 10 seconds. This is racy, and we do sometimes hit the race in our CI.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet contribution quality standards.

This functionality cannot be added in rust-vmm.

codecov · 2024-10-11T13:03:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.96%. Comparing base (c00d5ed) to head (3eeb00c).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4847   +/-   ##
=======================================
  Coverage   83.96%   83.96%           
=======================================
  Files         250      250           
  Lines       27756    27756           
=======================================
  Hits        23304    23304           
  Misses       4452     4452

Flag	Coverage Δ
5.10-c5n.metal	`84.58% <ø> (ø)`
5.10-m5n.metal	`84.57% <ø> (ø)`
5.10-m6a.metal	`83.85% <ø> (-0.01%)`	⬇️
5.10-m6g.metal	`80.51% <ø> (ø)`
5.10-m6i.metal	`84.56% <ø> (-0.01%)`	⬇️
5.10-m7g.metal	`80.51% <ø> (ø)`
6.1-c5n.metal	`84.58% <ø> (ø)`
6.1-m5n.metal	`84.56% <ø> (-0.01%)`	⬇️
6.1-m6a.metal	`83.85% <ø> (-0.01%)`	⬇️
6.1-m6g.metal	`80.51% <ø> (ø)`
6.1-m6i.metal	`84.55% <ø> (-0.01%)`	⬇️
6.1-m7g.metal	`80.51% <ø> (ø)`

tests/framework/utils.py

Currently, we use psutil.pid_exists in a loop with a timeout of 10 seconds. This is racy and indeed sometimes we hit it in our CI.

Substitute this mechanism with calling epoll() on the pidfd of the process instead. This should deterministically block until the process exits. If there's something else wrong, we will hit the pytest timeout.

Signed-off-by: Babis Chalios <[email protected]>

bchalios · 2024-10-14T08:44:49Z

@ShadowCurse I had to revert back to using epoll(). select() hits the FD_SETSIZE issue (1024 descriptors only).

ShadowCurse · 2024-10-14T08:55:57Z

> @ShadowCurse I had to revert back to using epoll(). select() hits the FD_SETSIZE issue (1024 descriptors only).

How can it hit the limit if you are waiting for 1 fd?

bchalios · 2024-10-14T08:57:26Z

> How can it hit the limit if you are waiting for 1 fd?

It's not about the number of file descriptors you're waiting on. It's about the maximum file descriptor value select() can handle. Quoting man select(2):

   select() can monitor only file descriptors numbers that are less than FD_SETSIZE; poll(2) and epoll(7) do not have this limitation.  See BUGS.
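The limitation quoted above can be observed from Python without opening a thousand files. CPython's `select.select()` validates each descriptor number before making the system call and, on Linux, rejects numbers at or above FD_SETSIZE (1024 with glibc) with a `ValueError`. The sketch below assumes that platform; the helper name `select_accepts` is hypothetical.

```python
import select


def select_accepts(fd_number):
    """Return True if select.select() accepts this descriptor number."""
    try:
        select.select([fd_number], [], [], 0)
    except ValueError:
        # Raised before the system call when the number is negative
        # or not below FD_SETSIZE.
        return False
    except OSError:
        # The number is in range but is not an open descriptor (EBADF).
        return True
    return True
```

So even a single descriptor fails with `select()` if its number happens to be 1024 or higher, which is why the count of descriptors being waited on does not matter.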

roypat · 2024-10-14T09:01:58Z

Since we only use utils.wait_for_termination in two locations, and both of those are in Microvm, can we move this function into Microvm and create the pidfd already when we spawn the Firecracker process? Otherwise, how does this behave if the Firecracker process dies unexpectedly and the pid gets freed significantly before we get to the wait_for_termination call? 🤔

bchalios · 2024-10-14T09:05:53Z

> Otherwise, how does this behave if the Firecracker process dies unexpectedly and the pid gets freed significantly before we get to the wait_for_termination call? 🤔

If the process with pid PID is already dead when creating the pidfd, then the system call will raise a ProcessLookupError exception and get_process_pidfd will return None, so wait_for_termination will return immediately.

> can we move this function into Microvm and create the pidfd already when we spawn the Firecracker process?

I thought of doing that, but I think it makes the error handling much more complicated.

ShadowCurse · 2024-10-14T09:07:05Z

OK, but to avoid the issue with two calls (register, then poll on the epoll object), maybe we should just use plain poll?

bchalios · 2024-10-14T09:20:09Z

select.poll() also returns a "polling" object on which you register file descriptors, so it needs the same two calls.

bchalios added the Status: Awaiting review label Oct 11, 2024

bchalios force-pushed the ci_use_pidfd_to_wait_for_firecracker_exit branch 3 times, most recently from e2f619f to b9452d1, October 11, 2024 13:37

kalyazin reviewed Oct 11, 2024

tests/framework/utils.py (resolved)

bchalios force-pushed the ci_use_pidfd_to_wait_for_firecracker_exit branch 2 times, most recently from 3c6811d to bcf1564, October 14, 2024 07:42

ShadowCurse previously approved these changes Oct 14, 2024

bchalios dismissed ShadowCurse’s stale review via 3eeb00c October 14, 2024 08:43

bchalios force-pushed the ci_use_pidfd_to_wait_for_firecracker_exit branch from bcf1564 to 3eeb00c, October 14, 2024 08:43

roypat approved these changes Oct 14, 2024

ShadowCurse approved these changes Oct 14, 2024

kalyazin approved these changes Oct 14, 2024

bchalios merged commit d7fbf9b into firecracker-microvm:main Oct 14, 2024
5 checks passed

bchalios deleted the ci_use_pidfd_to_wait_for_firecracker_exit branch October 14, 2024 10:07
