
setup-policy-routes start: infinite sysfs wait loop causes unbounded process accumulation on ECS hosts#149

Open
ddermendzhiev wants to merge 3 commits into amazonlinux:main from ddermendzhiev:fix/setup-policy-routes-sysfs-timeout

Conversation

@ddermendzhiev

Issue #, if available:

#148

Description of changes:

Fixes infinite process accumulation on ECS hosts caused by setup-policy-routes start looping forever when an ENI is detached before its sysfs node appears. This can occur repeatedly during rapid ENI attach/detach cycles, i.e. ECS task churn.

Two changes:

  • bin/setup-policy-routes.sh: add a 5-minute timeout to the sysfs wait loop in the start action so stuck processes eventually exit instead of holding the per-ENI lockfile indefinitely
  • lib/lib.sh: add a stale lock check in register_networkd_reloader(). If the lock owner PID is no longer alive, remove the lockfile before spinning
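The first change — bounding the sysfs wait — can be sketched roughly as follows. This is an illustrative sketch, not the actual diff: the function name `wait_for_sysfs` and the demo interface name are made up for demonstration.

```shell
#!/bin/sh
# Hypothetical sketch of the bounded wait (not the actual patch; the function
# and interface names are illustrative). Polls sysfs every 0.1s and gives up
# after $2 iterations instead of spinning forever.
wait_for_sysfs() {
    iface=$1
    max=$2
    i=0
    while [ ! -e "/sys/class/net/${iface}" ]; do
        if [ "$i" -ge "$max" ]; then
            echo "Timed out waiting for sysfs node for ${iface}" >&2
            return 1
        fi
        sleep 0.1
        i=$((i + 1))
    done
}

# A bogus interface now fails after a bounded wait (5 * 0.1s) instead of looping forever.
result=$(wait_for_sysfs "ecseBOGUS0" 5 2>/dev/null || echo "timed out")
echo "$result"
```

With the real values (max_wait=3000, sleep 0.1) the wait is bounded at 5 minutes.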

See #148 for full root cause analysis, reproduction steps, and evidence from affected hosts.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@ericsu66888 left a comment


Thanks for the PR

existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
debug "Removing stale lock from dead process $existing_pid for ${iface}"
rm -f "${lockfile}"
Contributor


Could there be a race condition where two PIDs clash and a lock file is removed by accident?

Author


Yes, good catch. If the PID is reassigned to another setup-policy-routes process which acquires the lock after the ! kill -0 "$existing_pid" check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.

I don't think an atomic operation is possible purely with shell code.
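As an aside (illustrative, not part of this PR): pure shell does have an atomic create primitive via `set -C` (noclobber), though it does not make the "owner dead, so remove" check atomic, which is the race being discussed here.

```shell
#!/bin/sh
# Aside: `set -C` (noclobber) makes lock *creation* atomic in pure shell —
# the redirection fails if the file already exists. It does not solve the
# check-then-remove race discussed above.
lockfile=$(mktemp -u)   # a path that does not exist yet

if ( set -C; echo "$$" > "$lockfile" ) 2>/dev/null; then
    first="acquired"
else
    first="failed"
fi

# A second attempt on the same path fails atomically:
if ( set -C; echo "$$" > "$lockfile" ) 2>/dev/null; then
    second="acquired"
else
    second="already held"
fi

echo "$first / $second"
rm -f "$lockfile"
```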

What if we also add a check on the lockfile age? We could reuse the value we set for the sysfs wait timeout (300s in the original) as the stale threshold: only if the lockfile is older than that timeout do we consider it stale.

Something like:

        local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
        if [ "$lock_age" -gt 300 ]; then
            debug "Removing stale lock from dead process $existing_pid for ${iface}"
            rm -f "${lockfile}"
        fi

Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout
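The age check above can be demonstrated on a backdated temp file (illustrative demo only; it uses the same 300s threshold proposed here and assumes GNU stat/touch, as on AL2023):

```shell
#!/bin/sh
# Demo of the age-based staleness check on a backdated temp file.
lockfile=$(mktemp)
touch -d '10 minutes ago' "$lockfile"    # simulate a lockfile left behind long ago

lock_age=$(( $(date +%s) - $(stat -c %Y "$lockfile" 2>/dev/null || echo 0) ))
if [ "$lock_age" -gt 300 ]; then
    verdict="stale"
else
    verdict="fresh"
fi
echo "$verdict"
rm -f "$lockfile"
```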

Contributor


I like this approach, but I am also okay with not adding more complexity, since the PID space is quite large.

# nonzero exit codes from a redirect without considering them
# fatal errors
set +e
while [ $cnt -lt $max ]; do
Contributor


Can we tune this max to a lower number so that it doesn't get stuck looping for thousands of seconds if we get into this block?

Author


Yes, we should also lower this value to match the max_wait time in setup-policy-routes.sh. It wouldn't make sense to spin longer than the lock can be held. All three values should stay in sync:

  • max_wait=3000 i.e. 300s due to sleep 0.1 (setup-policy-routes.sh)
  • max=3000 i.e. 300s due to sleep 0.1 (lib.sh)
  • "$lock_age" -gt 300 (lib.sh)

Contributor


yeah, I agree. IMO lock_age is not needed.

Author


Sounds good. I pushed the update to the max value in register_networkd_reloader(), and the test results I just posted include this change.

@joeysk2012
Contributor

I ran this new script on a host yesterday.
For the most part I feel good about it.
Running some more tests to see if there are any other issues.
I would like to get this merged and deployed into AL23 soon.
Please post any test results or logs if you have them.

@ddermendzhiev
Author

Fix Validation: amazon-ec2-net-utils sysfs wait timeout and stale lock detection

Host: host-instance
Package: amazon-ec2-net-utils 2.7.1-1.amzn2023.0.1
Patched files:

  • /usr/bin/setup-policy-routes
  • /usr/share/amazon-ec2-net-utils/lib.sh

Setup

# Fix unresolved build-time placeholder in 2.7.1
sed -i 's|AMAZON_EC2_NET_UTILS_LIBDIR|/usr/share/amazon-ec2-net-utils|' /usr/bin/setup-policy-routes

# Lower timeouts from 300s to 1s for testing (restore after)
sed -i 's/max_wait=3000/max_wait=10/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=3000/local -i max=10/' /usr/share/amazon-ec2-net-utils/lib.sh

FAKE_IFACE="ecse00TEST1"
LOCKDIR="/run/amazon-ec2-net-utils/setup-policy-routes"

Test 1: Sysfs wait timeout

Purpose: start exits after max_wait instead of looping forever when the sysfs node never appears.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

exit code: 1

Journal:

Apr 02 18:17:14 host-instance ec2net[111890]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:17:15 host-instance ec2net[111890]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

Test 2: Stale lock detection

Purpose: A lockfile owned by a dead PID is detected and removed. The new invocation acquires the lock and proceeds rather than spinning for up to 300s.

mkdir -p "$LOCKDIR"
echo "99999" | tee "$LOCKDIR/$FAKE_IFACE"
/usr/bin/setup-policy-routes "$FAKE_IFACE" start
echo "exit code: $?"

Output:

99999
exit code: 1

Journal:

Apr 02 18:19:37 host-instance ec2net[112210]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:19:38 host-instance ec2net[112210]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

The process got past register_networkd_reloader and entered the sysfs wait loop — proving the stale lock was removed. It then timed out and exited cleanly.


Test 3: Full race (start + concurrent refresh)

Purpose: start acquires the lock and enters the sysfs wait loop. refresh arrives concurrently. With the fix, start times out and exits, refresh acquires the lock, finds the ENI missing from sysfs, and exits — both within ~1 second instead of spinning for 300s.

/usr/bin/setup-policy-routes "$FAKE_IFACE" start &
START_PID=$!
sleep 0.5
/usr/bin/setup-policy-routes "$FAKE_IFACE" refresh &
wait
echo "both done"

Output:

[1] 126139
[2] 126149
[1]-  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" start
[2]+  Exit 1                  /usr/bin/setup-policy-routes "$FAKE_IFACE" refresh
both done

Journal:

Apr 02 18:26:29 host-instance ec2net[126139]: Waiting for sysfs node to exist for ecse00TEST1 (iteration 0)
Apr 02 18:26:30 host-instance ec2net[126139]: Timed out waiting for sysfs node for ecse00TEST1 after 1 seconds

start timed out and exited. refresh acquired the lock, hit [ -e "/sys/class/net/${iface}" ] || exit 0, and exited immediately — no journal output expected for that path.


Restore

sed -i 's/max_wait=10/max_wait=3000/' /usr/bin/setup-policy-routes
sed -i 's/local -i max=10/local -i max=3000/' /usr/share/amazon-ec2-net-utils/lib.sh
rm -f "$LOCKDIR/$FAKE_IFACE"

@joeysk2012
Contributor

joeysk2012 commented Apr 2, 2026

I re-read your issue again: #148
Please help me understand the scenario better.
It says that the udev remove event fails to trigger.
Which means the refresh-policy-routes@$name.timer unit will be leaked and will continue to run every 60s.
So we can still end up with potentially hundreds of non-working timers.
This PR will fix the infinite loop issue and the spinning for the lockfile, but it does not seem to address this issue?
That means we will still consume more CPU than needed, though not as much as without the PR:

Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy52.timer  refresh-policy-routes@dummy52.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:42 UTC 40s ago refresh-policy-routes@dummy63.timer  refresh-policy-routes@dummy63.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy90.timer  refresh-policy-routes@dummy90.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy20.timer  refresh-policy-routes@dummy20.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy24.timer  refresh-policy-routes@dummy24.service
Thu 2026-04-02 19:43:43 UTC 19s left Thu 2026-04-02 19:42:31 UTC 52s ago refresh-policy-routes@dummy37.timer  refresh-policy-routes@dummy37.service
...

I am wondering if there is a way to call /usr/bin/systemctl disable --now refresh-policy-routes@$name.timer policy-routes@$name.service if we get into this state outside of udev rules.
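One possible shape for such a sweep, sketched hypothetically (`find_leaked` is illustrative, not an existing helper; the sysfs root is parameterized here so the logic is demonstrable, whereas a real sweep would consult /sys/class/net and `systemctl list-units`, then run `systemctl disable --now` on each match):

```shell
#!/bin/sh
# Hypothetical cleanup sketch: flag refresh-policy-routes@<iface>.timer units
# whose interface is gone from sysfs. The sysfs root is a parameter for
# demonstration; a real version would pass /sys/class/net and feed matches to
# `systemctl disable --now "$unit" "policy-routes@${iface}.service"`.
find_leaked() {
    sysroot=$1; shift
    for unit in "$@"; do
        iface=${unit#refresh-policy-routes@}
        iface=${iface%.timer}
        [ -e "${sysroot}/${iface}" ] || echo "would disable: ${unit}"
    done
}

tmp=$(mktemp -d)
mkdir "${tmp}/eth0"   # simulate an interface that still exists
leaked=$(find_leaked "$tmp" 'refresh-policy-routes@eth0.timer' 'refresh-policy-routes@dummy52.timer')
echo "$leaked"
rm -rf "$tmp"
```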

@joeysk2012
Contributor

I am still seeing orphaned processes even after exit 1 is executed. setup-policy-routes is coming back up due to
Restart=on-failure

@ddermendzhiev
Author

The reason I said "udev remove event does not fire" is because I observed the accumulation of refresh-policy-routes@$name.timer and policy-routes@$name.service units. As ECS task churn continued and attached and detached new ENIs, the leaked units accumulated, each with a stuck setup-policy-routes %i start process and a setup-policy-routes %i refresh process spinning to acquire the lock, exiting, then being respawned by the timer.

With the current PR, the start process would time out, but you are correct that refresh would continue to be respawned because the systemd unit is still active. Good point about Restart=on-failure on the start rule as well.

@ddermendzhiev
Author

I guess we can just add the same remove-rule command inside the timeout block. Is this what you were implying:

if ((counter >= max_wait)); then
    error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
    /usr/bin/systemctl disable --now "refresh-policy-routes@${iface}.timer" "policy-routes@${iface}.service" 2>/dev/null || true
    exit 1
fi
