Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions bin/setup-policy-routes.sh
Original file line number Diff line number Diff line change
Expand Up @@ -50,12 +50,17 @@ refresh)
start)
register_networkd_reloader
counter=0
max_wait=3000 # 5 minute timeout to avoid infinite loop if sysfs node never appears
while [ ! -e "/sys/class/net/${iface}" ]; do
if ((counter % 1000 == 0)); then
debug "Waiting for sysfs node to exist for ${iface} (iteration $counter)"
fi
sleep 0.1
((counter++))
if ((counter >= max_wait)); then
error "Timed out waiting for sysfs node for ${iface} after $((counter / 10)) seconds"
exit 1
fi
done
debug "Starting configuration for $iface"
debug /lib/systemd/systemd-networkd-wait-online -i "$iface"
Expand Down
12 changes: 12 additions & 0 deletions lib/lib.sh
Original file line number Diff line number Diff line change
Expand Up @@ -631,6 +631,18 @@ register_networkd_reloader() {
local -r lockfile="${lockdir}/${iface}"
local old_opts=$-

# If the existing lock owner is no longer alive, remove the stale lockfile
# so subsequent invocations don't spin for up to 1000 seconds waiting on a
# process that will never release it.
if [ -f "${lockfile}" ]; then
local existing_pid
existing_pid=$(cat "${lockfile}" 2>/dev/null)
if [ -n "$existing_pid" ] && ! kill -0 "$existing_pid" 2>/dev/null; then
debug "Removing stale lock from dead process $existing_pid for ${iface}"
rm -f "${lockfile}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be a race condtion when two PIDs clash and a lock file is removed by accident?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch. If the PID is reassigned to a another setup-policy-routes process which acquires the lock after the ! kill -0 "$existing_pid" check, this code would then delete a valid lockfile. This is very unlikely, but let's consider it.

I dont think an atomic operation is possible purely with shell code.

What if we also add a check on the lockfile age? We can use the same value that we set for the sysfs wait timeout (original is 300s) as a stale threshold? Only if the lockfile is older than that timeout, can we consider it stale.

Something like:

        local lock_age=$(( $(date +%s) - $(stat -c %Y "${lockfile}" 2>/dev/null || echo 0) ))
        if [ "$lock_age" -gt 300 ]; then
            debug "Removing stale lock from dead process $existing_pid for ${iface}"
            rm -f "${lockfile}"
        fi

Note: the threshold should stay in sync with max_wait * 0.1 from the sysfs wait timeout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach but I am also okay with not adding more complication since the PID space is quite large.

fi
fi

# Disable -o errexit in the following block so we can capture
# nonzero exit codes from a redirect without considering them
# fatal errors
Expand Down