
setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts#149

Open
ddermendzhiev wants to merge 3 commits into amazonlinux:main from ddermendzhiev:fix/setup-policy-routes-sysfs-timeout

Conversation

@ddermendzhiev

Issue #, if available:

#148

Description of changes:

Fixes infinite process accumulation on ECS hosts caused by setup-policy-routes start looping forever when an ENI is detached before its sysfs node appears. This can occur repeatedly during rapid ENI attach/detach cycles, i.e. ECS task churn.

Two changes:

  • bin/setup-policy-routes.sh: add a 5-minute timeout to the sysfs wait loop in the start action so stuck processes eventually exit instead of holding the per-ENI lockfile indefinitely
  • lib/lib.sh: add a stale lock check in register_networkd_reloader(). If the lock owner PID is no longer alive, remove the lockfile before spinning
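The first change — bounding the sysfs wait — can be sketched roughly as follows. This is an illustrative sketch, not the actual diff: the function name `wait_for_sysfs` and the demo interface name are made up for demonstration.

```shell
#!/bin/sh
# Hypothetical sketch of the bounded wait (not the actual patch; the function
# and interface names are illustrative). Polls sysfs every 0.1s and gives up
# after $2 iterations instead of spinning forever.
wait_for_sysfs() {
    iface=$1
    max=$2
    i=0
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if [ "$i" -ge "$max" ]; then
            echo "Timed out waiting for sysfs node for ${iface}" >&2
            return 1
        fi
        sleep 0.1
        i=$((i + 1))
    done
}

# A bogus interface now fails after a bounded wait (5 * 0.1s) instead of looping forever.
result=$(wait_for_sysfs "ecseBOGUS0" 5 2>/dev/null || echo "timed out")
echo "$result"
```

With the real values (max_wait=3000, sleep 0.1) the wait is bounded at 5 minutes.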

See #148 for full root cause analysis, reproduction steps, and evidence from affected hosts.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@ericsu66888 left a comment


Thanks for the PR

existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
debug "Removing stale lock from dead process $existing_pid for ${iface}"
rm -f "${lockfile}"
Contributor


Could there be a race condition where two PIDs clash and a lock file is removed by accident?

Author


Yes, good catch. If the PID is reassigned to another setup-policy-routes process which acquires the lock after the ! kill -0 "$existing_pid" check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.

I don't think an atomic operation is possible purely with shell code.
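As an aside (illustrative, not part of this PR): pure shell does have an atomic create primitive via `set -C` (noclobber), though it does not make the "owner dead, so remove" check atomic, which is the race being discussed here.

```shell
#!/bin/sh
# Aside: `set -C` (noclobber) makes lock *creation* atomic in pure shell —
# the redirection fails if the file already exists. It does not solve the
# check-then-remove race discussed above.
lockfile=$(mktemp -u)   # a path that does not exist yet

if ( set -C; echo "$$" > "$lockfile" ) 2>/dev/null; then
    first="acquired"
else
    first="failed"
fi

# A second attempt on the same path fails atomically:
if ( set -C; echo "$$" > "$lockfile" ) 2>/dev/null; then
    second="acquired"
else
    second="already held"
fi

echo "$first / $second"
rm -f "$lockfile"
```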

What if we also add a check on the lockfile age? We could reuse the value we set for the sysfs wait timeout (300s in the original) as the stale threshold: only if the lockfile is older than that timeout do we consider it stale.

Something like:

        local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
        if [ "$lock_age" -gt 300 ]; then
            debug "Removing stale lock from dead process $existing_pid for ${iface}"
            rm -f "${lockfile}"
        fi

Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout
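The age check above can be demonstrated on a backdated temp file (illustrative demo only; it uses the same 300s threshold proposed here and assumes GNU stat/touch, as on AL2023):

```shell
#!/bin/sh
# Demo of the age-based staleness check on a backdated temp file.
lockfile=$(mktemp)
touch -d '10 minutes ago' "$lockfile"    # simulate a lockfile left behind long ago

lock_age=$(( $(date +%s) - $(stat -c %Y "$lockfile" 2>/dev/null || echo 0) ))
if [ "$lock_age" -gt 300 ]; then
    verdict="stale"
else
    verdict="fresh"
fi
echo "$verdict"
rm -f "$lockfile"
```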

Contributor


I like this approach, but I am also okay with not adding more complexity, since the PID space is quite large.

# nonzero exit codes from a redirect without considering them
# fatal errors
set +e
while [ $cnt -lt $max ]; do
Contributor


Can we tune this max to a lower number so that it doesn't get stuck looping for thousands of seconds if we get into this block?

Author


Yes, we should also lower this value to match the max_wait time in setup-policy-routes.sh. It wouldn't make sense to spin longer than the lock can be held. All three values should stay in sync:

  • max_wait=3000 i.e. 300s due to sleep 0.1 (setup-policy-routes.sh)
  • max=3000 i.e. 300s due to sleep 0.1 (lib.sh)
  • "$lock_age" -gt 300 (lib.sh)

Contributor


yeah, I agree. IMO lock_age is not needed.

Author


Sounds good. I pushed the update to the max value in register_networkd_reloader(), and the test results I just posted include this change.

@joeysk2012
Contributor

I ran this new script on a host yesterday.
For the most part I feel good about it.
Running some more tests to see if there are any other issues.
I would like to get this merged and deployed into AL23 soon.
Please post any test results or logs if you have them.

@ddermendzhiev
Author

Fix Validation: amazon-ec2-net-utils sysfs wait timeout and stale lock detection

Host: host-instance
Package: amazon-ec2-net-utils 2.7.1-1.amzn2023.0.1
Patched files:

  • /usr/bin/setup-policy-routes
  • /usr/share/amazon-ec2-net-utils/lib.sh

Setup

# Fix unresolved build-time placeholder in 2.7.1
sed -i 's|AMAZON_EC2_NET_UTILS_LIBDIR|/usr/share/amazon-ec2-net-utils|' /usr/bin/setup-policy-routes

# Lower timeouts from 300s to 1s for testing (restore after)
sed -i 's/max_wait=3000/max_wait=10/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=3000/local -i max=10/' /usr/share/amazon-ec2-net-utils/lib.sh

FAKE_IFACE="ecse00TEST1"
LOCKDIR="/run/amazon-ec2-net-utils/setup-policy-routes"

Test 1: Sysfs wait timeout

Purpose: start exits after max_wait instead of looping forever when the sysfs node never appears.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

exit code: 1

Journal:

Apr 02 18:17:14 host-instance ec2net[111890]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:17:15 host-instance ec2net[111890]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

Test 2: Stale lock detection

Purpose: A lockfile owned by a dead PID is detected and removed. The new invocation acquires the lock and proceeds rather than spinning for up to 300s.

mkdir -p "$LOCKDIR"
echo "99999" | tee "$LOCKDIR/$FAKE_IFACE"
/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

99999
exit code: 1

Journal:

Apr 02 18:19:37 host-instance ec2net[112210]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:19:38 host-instance ec2net[112210]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

The process got past register_networkd_reloader and entered the sysfs wait loop — proving the stale lock was removed. It then timed out and exited cleanly.


Test 3: Full race (start + concurrent refresh)

Purpose: start acquires the lock and enters the sysfs wait loop. refresh arrives concurrently. With the fix, start times out and exits, refresh acquires the lock, finds the ENI missing from sysfs, and exits — both within ~1 second instead of spinning for 300s.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start &
START_PID=$!
sleep 0.5
/usr/bin/setup-policy-routes "$FAKE_IFACE" refresh &
wait
echo "both done"

Output:

[1] 126139
[2] 126149
[1]-  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" start
[2]+  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" refresh
both done

Journal:

Apr 02 18:26:29 host-instance ec2net[126139]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:26:30 host-instance ec2net[126139]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

start timed out and exited. refresh acquired the lock, hit [ -e "/sys/class/net/${iface}" ] || exit 0, and exited immediately — no journal output expected for that path.


Restore

sed -i 's/max_wait=10/max_wait=3000/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=10/local -i max=3000/' /usr/share/amazon-ec2-net-utils/lib.sh
rm -f "$LOCKDIR/$FAKE_IFACE"

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

I re-read your issue again: #148
Please help me understand the scenario better.
It says that the udev remove event fails to trigger.
Which means the refresh-policy-routes@$name.timer unit will be leaked and will continue to run every 60s.
So we can still end up with potentially hundreds of non-working timers.
This PR will fix the infinite loop issue and the spinning for the lockfile, but it does not seem to address this issue?
That means we will still consume more CPU than needed, though not as much as without the PR:

Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy52.timer  refresh-policy-routes@dummy52.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:42 UTC 40s ago refresh-policy-routes@dummy63.timer  refresh-policy-routes@dummy63.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy90.timer  refresh-policy-routes@dummy90.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy20.timer  refresh-policy-routes@dummy20.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy24.timer  refresh-policy-routes@dummy24.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy37.timer  refresh-policy-routes@dummy37.service
...

I am wondering if there is a way to call /usr/bin/systemctl disable --now refresh-policy-routes@$name.timer policy-routes@$name.service if we get into this state outside of udev rules.
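One possible shape for such a sweep, sketched hypothetically (`find_leaked` is illustrative, not an existing helper; the sysfs root is parameterized here so the logic is demonstrable, whereas a real sweep would consult /sys/class/net and `systemctl list-units`, then run `systemctl disable --now` on each match):

```shell
#!/bin/sh
# Hypothetical cleanup sketch: flag refresh-policy-routes@<iface>.timer units
# whose interface is gone from sysfs. The sysfs root is a parameter for
# demonstration; a real version would pass /sys/class/net and feed matches to
# `systemctl disable --now "$unit" "policy-routes@${iface}.service"`.
find_leaked() {
    sysroot=$1; shift
    for unit in "$@"; do
        iface=${unit#refresh-policy-routes@}
        iface=${iface%.timer}
        [ -e "${sysroot}/${iface}" ] || echo "would disable: ${unit}"
    done
}

tmp=$(mktemp -d)
mkdir "${tmp}/eth0"   # simulate an interface that still exists
leaked=$(find_leaked "$tmp" 'refresh-policy-routes@eth0.timer' 'refresh-policy-routes@dummy52.timer')
echo "$leaked"
rm -rf "$tmp"
```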

@joeysk2012
Contributor

I am still seeing orphaned processes even after exit 1 is executed. setup-policy-routes is coming back up due to
Restart=on-failure

@ddermendzhiev
Author

The reason I said "udev remove event does not fire" is because I observed the accumulation of refresh-policy-routes@$name.timer and policy-routes@$name.service units. As ECS task churn continued and attached and detached new ENIs, the leaked units accumulated, each with a stuck setup-policy-routes %i start process and a setup-policy-routes %i refresh process spinning to acquire the lock, exiting, then being respawned by the timer.

With the current PR, the start process would time out, but you are correct that refresh would continue to be respawned because the systemd unit is still active. Good point about Restart=on-failure on the start rule as well.

@ddermendzhiev
Author

I guess we can just add the same remove-rule command inside the timeout block. Is this what you were implying:

if ((counter >= max_wait)); then
    error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
    /usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
    exit 1
fi
