setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts#149
Conversation
| existing_pid=$(cat "${lockfile}" 2>/dev/null) | ||
| if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then | ||
| debug "Removing stale lock from dead process $existing_pid for ${iface}" | ||
| rm -f "${lockfile}" |
There was a problem hiding this comment.
Could there be a race condtion when two PIDs clash and a lock file is removed by accident?
There was a problem hiding this comment.
Yes, good catch. If the PID is reassigned to a another setup-policy-routes process which acquires the lock after the ! kill -0 "$existing_pid" check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.
I dont think an atomic operation is possible purely with shell code.
What if we also add a check on the lockfile age? We can use the same value that we set for the sysfs wait timeout (original is 300s) as a stale threshold? Only if the lockfile is older than that timeout, can we consider it stale.
Something like:
local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
if [ "$lock_age" -gt 300 ]; then
debug "Removing stale lock from dead process $existing_pid for ${iface}"
rm -f "${lockfile}"
fi
Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout
There was a problem hiding this comment.
I like this approach but I am also okay with not adding more complication since the PID space is quite large.
| # nonzero exit codes from a redirect without considering them | ||
| # fatal errors | ||
| set +e | ||
| while [ $cnt -lt $max ]; do |
There was a problem hiding this comment.
Can we tune this max to a lower number such that it doesn't get stuck in 1000s loop if we get into this block?
There was a problem hiding this comment.
Yes, we should also lower this value to match the max_wait time in setup-policy-routes.sh. It wouldn't make sense to spin longer than the lock can be held. All three value should be synced:
max_wait=3000i.e. 300s due tosleep 0.1(setup-policy-routes.sh)max=3000i.e. 300s due tosleep 0.1(lib.sh)"$lock_age" -gt 300(lib.sh)
There was a problem hiding this comment.
yeah, I agree. IMO lock_age is not needed.
There was a problem hiding this comment.
Sounds good. I pushed the update to max value in register_networkd_reloader(), and the test results I just posted included this change.
|
I ran this new script on my a host yesterday. |
Fix Validation: amazon-ec2-net-utils sysfs wait timeout and stale lock detectionHost:
Setup# Fix unresolved build-time placeholder in 2.7.1
sed -i 's|AMAZON_EC2_NET_UTILS_LIBDIR|/usr/share/amazon-ec2-net-utils|' /usr/bin/setup-policy-routes
# Lower timeouts from 300s to 1s for testing (restore after)
sed -i 's/max_wait=3000/max_wait=10/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=3000/local -i max=10/' /usr/share/amazon-ec2-net-utils/lib.sh
FAKE_IFACE="ecse00TEST1"
LOCKDIR="/run/amazon-ec2-net-utils/setup-policy-routes"Test 1: Sysfs wait timeoutPurpose: /usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"Output: Journal: Test 2: Stale lock detectionPurpose: A lockfile owned by a dead PID is detected and removed. The new invocation acquires the lock and proceeds rather than spinning for up to 300s. mkdir -p "$LOCKDIR"
echo "99999" | tee "$LOCKDIR/$FAKE_IFACE"
/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"Output: Journal: The process got past Test 3: Full race (start + concurrent refresh)Purpose: /usr/bin/setup-policy-routes "$FAKE_IFACE" start &
START_PID=$!
sleep 0.5
/usr/bin/setup-policy-routes "$FAKE_IFACE" refresh &
wait
echo "both done"Output: Journal:
Restoresed -i 's/max_wait=10/max_wait=3000/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=10/local -i max=3000/' /usr/share/amazon-ec2-net-utils/lib.sh
rm -f "$LOCKDIR/$FAKE_IFACE" |
|
I re-read your issue again: #148 I am wondering if there is a way to call |
|
I am still seeing orphaned process even after exit 1 is being executed. setup-policy-routes is coming back up due to |
|
The reason I said "udev remove event does not fire" is because I observed the accumulation of With the current PR, the |
|
I guess we can just add the same if ((counter >= max_wait)); then
error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
/usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
exit 1
fi |
Issue #, if available:
#148
Description of changes:
Fixes infinite process accumulation on ECS hosts caused by setup-policy-routes start looping forever when an ENI is detached before its sysfs node appears (can repeatedly occur during rapid ENI attach/detach cycles, i.e. ECS task churn)
Two changes:
See #148 for full root cause analysis, reproduction steps, and evidence from affected hosts.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.