
fix(localdns): wait for resolv.conf update after networkctl reload to prevent race condition #7749

Merged
cameronmeissner merged 14 commits into main from
sakwa/fix/localdns-resolv-conf-race
Feb 3, 2026

Conversation

@saewoni saewoni commented Jan 28, 2026

What this PR does / why we need it:
Immediately after networkctl reload, DNS settings may not have propagated from systemd-networkd (via DHCP) to systemd-resolved yet. As a result, /run/systemd/resolve/resolv.conf can still reflect the previous upstream DNS servers when replace_azurednsip_in_corefile runs.

This happens because networkctl reload only triggers a reload request over D-Bus; it does not wait for systemd-networkd to finish reprocessing configuration, re-acquire DHCP leases, or update systemd-resolved.
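Because the reload returns before systemd-resolved has rewritten its file, one way to close the gap is to poll the file until the localdns address is gone. The sketch below is only an illustration of that idea, not the code from this PR: the function name wait_until_ip_absent, its argument order, and the default attempt count and delay are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the PR's implementation): poll a resolv.conf-style
# file until the given IP is no longer listed as a nameserver.
# Args: <file> <ip> [max_attempts, default 10] [delay_seconds, default 1]
wait_until_ip_absent() {
    local resolv_file="$1" ip="$2" max_attempts="${3:-10}" delay="${4:-1}"
    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        # awk exits 0 only if some line is exactly "nameserver <ip>" by field,
        # so 169.254.10.1 is not confused with 169.254.10.10.
        if [ -r "$resolv_file" ] && ! awk -v ip="$ip" \
            '$1 == "nameserver" && $2 == ip { found = 1 } END { exit !found }' \
            "$resolv_file"; then
            return 0
        fi
        sleep "$delay"
    done
    # The IP was still present (or the file unreadable) after all attempts.
    return 1
}
```

A caller would run this right after networkctl reload, for example wait_until_ip_absent /run/systemd/resolve/resolv.conf 169.254.10.10, and treat a non-zero return as a failure rather than reading possibly stale upstream servers.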

Which issue(s) this PR fixes:

Fixes #
to test: shellspec --shell bash --format d spec/parts/linux/cloud-init/artifacts/localdns_spec.sh

I copied the script to a localdns-enabled node, replacing systemd-notify WATCHDOG=1 with echo systemd-notify WATCHDOG=1 so that the watchdog is never pinged and systemd restarts localdns (simulating what the customer hit before their race condition problem). The node has 2 custom VNET DNS servers configured.
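The watchdog stub used for this test can be produced with a one-line sed substitution. This is a minimal sketch under assumptions: the helper name stub_watchdog_ping and the idea of writing to a separate copy are mine, and it assumes the script contains the literal text systemd-notify WATCHDOG=1.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a copy of the script in which the watchdog ping
# is only echoed, so systemd never receives WATCHDOG=1 and the watchdog
# timeout fires. The source file is left untouched.
# Args: <source_script> <destination_script>
stub_watchdog_ping() {
    local src="$1" dst="$2"
    sed 's/systemd-notify WATCHDOG=1/echo systemd-notify WATCHDOG=1/' "$src" > "$dst"
}
```

With the stub in place, the journal shows the echoed text instead of a real ping, which matches the final log line below.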

Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 systemd[1]: localdns.service: Watchdog timeout (limit 1min)!
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 systemd[1]: localdns.service: Killing process 443992 (localdns.sh) with signal SIGABRT.
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Error occurred. Cleaning up...
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Cleaning up any existing localdns iptables rules...
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Found existing localdns iptables rules, removing them...
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from OUTPUT chain (rule 4).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from OUTPUT chain (rule 3).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from OUTPUT chain (rule 2).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from OUTPUT chain (rule 1).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from PREROUTING chain (rule 4).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from PREROUTING chain (rule 3).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from PREROUTING chain (rule 2).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed existing localdns iptables rule from PREROUTING chain (rule 1).
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Removing network drop-in file /run/systemd/network/10-netplan-eth0.network.d/70-localdns.conf.
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed network drop-in file.
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Attempt to reload network configuration.
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Reloading network configuration succeeded.
Feb 02 18:26:57 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Sleeping 5 seconds to allow connections to terminate.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Sending SIGINT to localdns and waiting for it to terminate.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully sent SIGINT to localdns.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[444051]: [INFO] SIGINT: Shutting down
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Localdns terminated successfully.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Removing localdns dummy interface.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed localdns dummy interface.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully cleanup localdns related configurations.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Executing cleanup function.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Cleaning up any existing localdns iptables rules...
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: No existing localdns iptables rules found.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Removing network drop-in file /run/systemd/network/10-netplan-eth0.network.d/70-localdns.conf.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully removed network drop-in file.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Attempt to reload network configuration.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Reloading network configuration succeeded.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[443992]: Successfully cleanup localdns related configurations.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 systemd[1]: localdns.service: Main process exited, code=exited, status=216/GROUP
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 systemd[1]: localdns.service: Failed with result 'watchdog'.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 systemd[1]: localdns.service: Scheduled restart job, restart counter is at 1.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 systemd[1]: Stopped Localdns service.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 systemd[1]: Starting Localdns service...
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Cleaning up any existing localdns iptables rules...
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: No existing localdns iptables rules found.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Removing network drop-in file /run/systemd/network/10-netplan-eth0.network.d/70-localdns.conf.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Successfully removed network drop-in file.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Attempt to reload network configuration.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Reloading network configuration succeeded.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Waiting for localdns (169.254.10.10) to be removed from resolv.conf...
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: DNS configuration refreshed successfully. Current DNS: 8.8.8.8 8.8.4.4
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Found upstream VNET DNS servers: 8.8.8.8 8.8.4.4
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Replacing Azure DNS IP 168.63.129.16 with upstream VNET DNS servers 8.8.8.8 8.8.4.4 in corefile /opt/azure/containers/localdns/updated.localdns.corefile
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Successfully updated /opt/azure/containers/localdns/updated.localdns.corefile
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Setting up localdns dummy interface with IPs 169.254.10.10 and 169.254.10.11.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Adding iptables rules to skip conntrack for queries to localdns.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Starting localdns: systemd-cat --identifier=localdns-coredns --stderr-priority=3 -- /opt/azure/containers/localdns/binary/coredns -conf /opt/azure/containers/localdns/updated.localdns.corefile -pidfile /run/localdns.pid.
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: .:53 on 169.254.10.10
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: cluster.local.:53 on 169.254.10.10
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: health-check.localdns.local.:53 on 169.254.10.10
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: .:53 on 169.254.10.11
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: cluster.local.:53 on 169.254.10.11
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: health-check.localdns.local.:53 on 169.254.10.11
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: CoreDNS-1.13.1
Feb 02 18:27:02 aks-agentpool-33772812-vmss000001 localdns-coredns[445204]: linux/amd64, go1.25.3, 1db4568df6aaacda6ebbce87717156bd855f8103
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Localdns PID is 445204.
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Waiting for localdns to start and be able to serve traffic.
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Localdns is online and ready to serve traffic.
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Updating network DNS configuration to point to localdns via /run/systemd/network/10-netplan-eth0.network.d/70-localdns.conf.
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Startup complete - serving node and pod DNS traffic.
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 systemd[1]: Started Localdns service.
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: Starting watchdog loop at 12 second intervals.
Feb 02 18:27:03 aks-agentpool-33772812-vmss000001 localdns.sh[445145]: systemd-notify WATCHDOG=1

After the restart, updated.localdns.corefile has the correct custom VNET DNS servers.

root@aks-agentpool-33772812-vmss000001:/opt/azure/containers/localdns# cat updated.localdns.corefile 
# ***********************************************************************************
# WARNING: Changes to this file will be overwritten and not persisted.
# ***********************************************************************************
# whoami (used for health check of DNS)
health-check.localdns.local:53 {
    bind 169.254.10.10 169.254.10.11
    whoami
}
# VnetDNS overrides apply to DNS traffic from pods with dnsPolicy:default or kubelet (referred to as VnetDNS traffic).
.:53 {
    errors
    bind 169.254.10.10
    forward . 8.8.8.8 8.8.4.4 {
        policy sequential
        max_concurrent 1000
    }
    ready 169.254.10.10:8181
    cache 3600 {
        success 9984
        denial 9984
        serve_stale 3600s immediate
        servfail 0
    }
    loop
    nsid localdns
    prometheus :9253
    template ANY ANY internal.cloudapp.net {
        match "^(?:[^.]+\.){4,}internal\.cloudapp\.net\.$"
        rcode NXDOMAIN
        fallthrough
    }
    template ANY ANY reddog.microsoft.com {
        rcode NXDOMAIN
    }
}
cluster.local:53 {
    errors
    bind 169.254.10.10
    forward . 10.0.0.10 {
        force_tcp
        policy sequential
        max_concurrent 1000
    }
    ready 169.254.10.10:8181
    cache 3600 {
        success 9984
        denial 9984
        serve_stale 3600s immediate
        servfail 0
    }
    loop
    nsid localdns
    prometheus :9253
}
# KubeDNS overrides apply to DNS traffic from pods with dnsPolicy:ClusterFirst (referred to as KubeDNS traffic).
.:53 {
    errors
    bind 169.254.10.11
    forward . 10.0.0.10 {
        policy sequential
        max_concurrent 1000
    }
    ready 169.254.10.11:8181
    cache 3600 {
        success 9984
        denial 9984
        serve_stale 3600s immediate
        servfail 0
    }
    loop
    nsid localdns-pod
    prometheus :9253
    template ANY ANY internal.cloudapp.net {
        match "^(?:[^.]+\.){4,}internal\.cloudapp\.net\.$"
        rcode NXDOMAIN
        fallthrough
    }
    template ANY ANY reddog.microsoft.com {
        rcode NXDOMAIN
    }
}
cluster.local:53 {
    errors
    bind 169.254.10.11
    forward . 10.0.0.10 {
        force_tcp
        policy sequential
        max_concurrent 1000
    }
    ready 169.254.10.11:8181
    cache 3600 {
        success 9984
        denial 9984
        serve_stale 3600s immediate
        servfail 0
    }
    loop
    nsid localdns-pod
    prometheus :9253
}
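For readers following the log lines "Found upstream VNET DNS servers" and "Replacing Azure DNS IP 168.63.129.16 with upstream VNET DNS servers", the substitution that produces the forward line above can be sketched roughly as follows. This is not the body of replace_azurednsip_in_corefile; the helper names list_upstream_nameservers and render_corefile and their arguments are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the nameserver IPs from a resolv.conf-style file
# on one line, separated by spaces. Prints nothing if there are none.
list_upstream_nameservers() {
    local resolv_file="$1"
    awk '$1 == "nameserver" { printf "%s%s", sep, $2; sep = " " }
         END { if (sep) print "" }' "$resolv_file"
}

# Hypothetical helper: copy a corefile template to <output>, replacing every
# occurrence of the Azure DNS IP with the upstream server list.
# Args: <template> <output> <azure_dns_ip> <upstream_list>
render_corefile() {
    local template="$1" output="$2" azure_dns_ip="$3" upstream="$4"
    # Escape the dots so sed matches them literally rather than as "any char".
    local escaped_ip="${azure_dns_ip//./\\.}"
    sed "s/${escaped_ip}/${upstream}/g" "$template" > "$output"
}
```

If the upstream list is read while resolv.conf still contains stale entries, the rendered corefile forwards to the wrong servers, which is why the read has to happen only after the refresh is confirmed.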

Copilot AI review requested due to automatic review settings January 28, 2026 23:09
Copilot AI left a comment

Pull request overview

This PR fixes a race condition that occurs after calling networkctl reload to update DNS configuration. Previously, the code would proceed immediately after the reload command without waiting for systemd-resolved to actually update the /run/systemd/resolve/resolv.conf file, potentially causing subsequent operations to work with stale DNS information.

Changes:

  • Added wait_for_dns_config_applied() function that polls resolv.conf to verify DNS configuration changes have been applied
  • Integrated the wait function after both networkctl reload calls to ensure DNS changes are complete before proceeding
  • Added comprehensive test coverage for the new function

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/localdns.sh | Implements the new wait_for_dns_config_applied() function and integrates it after networkctl reload calls in disable_dhcp_use_clusterlistener and cleanup_iptables_and_dns |
| spec/parts/linux/cloud-init/artifacts/localdns_spec.sh | Adds comprehensive test coverage for wait_for_dns_config_applied with tests for success cases, timeout cases, edge cases, and partial IP matching |
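The partial IP matching case matters because a plain substring search for 169.254.10.1 also matches a line containing 169.254.10.10. A minimal sketch of an exact check is shown here; the helper name has_exact_nameserver is hypothetical, and it assumes each entry is written as "nameserver <ip>" with a single space and no trailing whitespace.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if <ip> appears as a complete nameserver
# entry in a resolv.conf-style file.
# Args: <file> <ip>
has_exact_nameserver() {
    local resolv_file="$1" ip="$2"
    # -F: treat the pattern as a fixed string, so dots are literal.
    # -x: the whole line must match, so 169.254.10.1 does not match
    #     a line that reads "nameserver 169.254.10.10".
    # -q: no output, only the exit status.
    grep -Fxq "nameserver ${ip}" "$resolv_file"
}
```

By contrast, grep -q "169.254.10.1" on the same file would report a match for 169.254.10.10 and could make a wait loop spin until it times out.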

@saewoni saewoni marked this pull request as ready for review January 29, 2026 00:46
@saewoni saewoni changed the title fix(localdns): wait for resolv.conf update after networkctl reload to… fix(localdns): wait for resolv.conf update after networkctl reload to prevent race condition Jan 29, 2026
Update log messages to use Error: prefix when wait_for_dns_config_applied fails, since these are failure conditions (return 1), not warnings. Updated corresponding test assertion.
Copilot AI review requested due to automatic review settings January 29, 2026 21:23
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings January 29, 2026 22:11
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings January 30, 2026 20:07
Copilot AI review requested due to automatic review settings January 30, 2026 22:30
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Copilot AI review requested due to automatic review settings February 2, 2026 17:43
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@cameronmeissner cameronmeissner merged commit de6908f into main Feb 3, 2026
17 of 27 checks passed
@cameronmeissner cameronmeissner deleted the sakwa/fix/localdns-resolv-conf-race branch February 3, 2026 00:31
Devinwong pushed a commit that referenced this pull request Feb 3, 2026
… prevent race condition (#7749)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Devin Wong <wongsiosun@outlook.com>
mxj220 pushed a commit that referenced this pull request Feb 5, 2026
… prevent race condition (#7749)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
mxj220 added a commit that referenced this pull request Feb 6, 2026
…eload to prevent race condition (#7749)"

This reverts commit bcdecfa.
5 participants