-
Notifications
You must be signed in to change notification settings - Fork 63
setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts #149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -631,6 +631,18 @@ register_networkd_reloader() { | |
| local -r lockfile="${lockdir}/${iface}" | ||
| local old_opts=$- | ||
|
|
||
| # If the existing lock owner is no longer alive, remove the stale lockfile | ||
| # so subsequent invocations don't spin for up to 1000 seconds waiting on a | ||
| # process that will never release it. | ||
| if [ -f "${lockfile}" ]; then | ||
| local existing_pid | ||
| existing_pid=$(cat "${lockfile}" 2>/dev/null) | ||
| if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then | ||
| debug "Removing stale lock from dead process $existing_pid for ${iface}" | ||
| rm -f "${lockfile}" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could there be a race condtion when two PIDs clash and a lock file is removed by accident?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, good catch. If the PID is reassigned to a another setup-policy-routes process which acquires the lock after the I dont think an atomic operation is possible purely with shell code. What if we also add a check on the lockfile age? We can use the same value that we set for the sysfs wait timeout (original is 300s) as a stale threshold? Only if the lockfile is older than that timeout, can we consider it stale. Something like: Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like this approach but I am also okay with not adding more complication since the PID space is quite large. |
||
| fi | ||
| fi | ||
|
|
||
| # Disable -o errexit in the following block so we can capture | ||
| # nonzero exit codes from a redirect without considering them | ||
| # fatal errors | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.