Open
42 commits
3bf2f16
initial changes
saewoni Jan 8, 2026
c3519a7
update vhdbuilder/packer/
saewoni Jan 8, 2026
cdafe40
testdata files edited bc the gzipped content have changed
saewoni Jan 9, 2026
6e84c80
file updates
saewoni Jan 9, 2026
6f6b9ae
feat: add MCR DNS caching to LocalDNS for improved reliability
saewoni Jan 9, 2026
389d6a6
Fix: Restrict MCR DNS hosts caching to root domain only
saewoni Jan 12, 2026
eb8f0dc
feat: expand hosts caching to all critical AKS FQDNs
saewoni Feb 2, 2026
5d2601f
fix: refactor aks-hosts-setup tests to source actual script
saewoni Feb 2, 2026
310025d
feat: run aks-hosts-setup before kubelet and use /etc/localdns/hosts
saewoni Feb 2, 2026
4b0ab1c
fixes the Describe block name from shouldEnableMCRHostsSetup to shoul…
saewoni Feb 2, 2026
4eceb3d
remove whitespace
saewoni Feb 2, 2026
aa9903d
use nslookup instead of dig
saewoni Feb 2, 2026
094aa2e
refactor: rename shouldEnableAKSHostsSetup to enableAKSHostsSetup for…
saewoni Feb 2, 2026
ddaf3cd
fix: restore enableLocalDNSForScriptless function name to match main
saewoni Feb 2, 2026
4edf5d6
test: add IPv6 filtering test for aks-hosts-setup.sh
saewoni Feb 2, 2026
df80d7e
Apply suggestion from @Copilot
saewoni Feb 2, 2026
b83a004
fix: use POSIX-compatible [ ] syntax in aks-hosts-setup.sh
saewoni Feb 2, 2026
30673b7
remove get-docker.sh
saewoni Feb 3, 2026
5657d33
add design doc
saewoni Feb 4, 2026
2f0398f
update design doc
saewoni Feb 4, 2026
006eb78
update code and md file with hardcoded ips
saewoni Feb 4, 2026
473ad7a
add that we have hardcoded ips
saewoni Feb 4, 2026
339332a
feat(localdns): extend hosts plugin to KubeDNS listener
saewoni Feb 4, 2026
30adc57
feat(localdns): support AKS-RP provided critical hosts entries
saewoni Feb 4, 2026
555f754
chore: regenerate protos and test snapshots for localdns hosts plugin
saewoni Feb 4, 2026
a23a560
chore: regenerate protos and test snapshots for localdns hosts plugin
saewoni Feb 4, 2026
eb6a0cc
feat: add localdns hosts entries support for non-scriptless path
saewoni Feb 5, 2026
965d35d
Merge main into sakwa/localdns_poc
kwaksaewon Feb 6, 2026
a6011a2
Regenerate testdata after merge with main
kwaksaewon Feb 6, 2026
493993b
add e2e tests
kwaksaewon Feb 6, 2026
51b1594
Merge origin/main, accepted theirs for Flatcar CustomData
saewoni Feb 10, 2026
4a538c8
make generate customdata for flatcar
saewoni Feb 10, 2026
e25e91e
add e2e test for hosts plugin
saewoni Feb 12, 2026
fb4029d
Merge remote-tracking branch 'origin/main' into sakwa/localdns_poc
saewoni Feb 14, 2026
0bcc1dd
Merge origin/main, accept theirs for testdata and regenerate
saewoni Feb 14, 2026
e586737
remove criticalhostsentry
saewoni Feb 14, 2026
740014f
delete docs
saewoni Feb 14, 2026
848a672
remove pb.go files
saewoni Feb 14, 2026
aeebf07
remove criticalhostsentries from e2e
saewoni Feb 14, 2026
7960676
update LocalDnsProfile to accept EnableHostsPlugin
saewoni Feb 14, 2026
012f876
make grep -q to -qF to match fixed string
saewoni Feb 14, 2026
9cf8457
update packer_source.sh
saewoni Feb 14, 2026
@@ -19,6 +19,9 @@ message LocalDnsProfile {

// KubeDns overrides apply to DNS traffic from pods with dnsPolicy:ClusterFirst (referred to as KubeDns traffic).
map<string, LocalDnsOverrides> kube_dns_overrides = 5;

// Field 6 was critical_hosts_entries, removed.
reserved 6;
Comment on lines +22 to +24
Copilot AI Feb 14, 2026

The proto source file marks field 6 (critical_hosts_entries) as reserved, indicating it was removed. However, the generated protobuf code (localdns_config.pb.go) still contains the CriticalHostsEntries field and related methods. This mismatch indicates the protobuf code was not regenerated after modifying the .proto file.

You need to run make generate or the proto code generation command to regenerate the .pb.go files from the updated .proto files.

Suggested change
// Field 6 was critical_hosts_entries, removed.
reserved 6;

}

// Represents DNS override settings for both VnetDNS and KubeDNS traffic.
12 changes: 11 additions & 1 deletion e2e/validation.go
@@ -69,10 +69,20 @@ func ValidateCommonLinux(ctx context.Context, s *Scenario) {
ValidateKubeletNodeIP(ctx, s)
}

// localdns is not supported on scriptless, privatekube and VHDUbuntu2204Gen2ContainerdAirgappedK8sNotCached.
// localdns is not supported on FIPS VHDs, older VHDs (privatekube, airgapped), and AzureLinux OSGuard.
if !s.VHD.UnsupportedLocalDns {
ValidateLocalDNSService(ctx, s, "enabled")
ValidateLocalDNSResolution(ctx, s, "169.254.10.10")
// Validate aks-hosts-setup service ran successfully and timer is active
ValidateAKSHostsSetupService(ctx, s)
// Validate hosts file contains resolved IPs for critical FQDNs (IPs resolved dynamically)
ValidateLocalDNSHostsFile(ctx, s, []string{
"mcr.microsoft.com",
"login.microsoftonline.com",
"acs-mirror.azureedge.net",
})
// Validate localdns resolves fake FQDN from hosts file (proves hosts plugin bypass)
ValidateLocalDNSHostsPluginBypass(ctx, s)
}

ValidateInspektorGadget(ctx, s)
166 changes: 166 additions & 0 deletions e2e/validators.go
@@ -1384,6 +1384,172 @@ func ValidateLocalDNSResolution(ctx context.Context, s *Scenario, server string)
assert.Contains(s.T, execResult.stdout, fmt.Sprintf("SERVER: %s", server))
}

// ValidateLocalDNSHostsFile checks that /etc/localdns/hosts contains entries for critical FQDNs.
// It dynamically resolves IPs on the VM and verifies they match what's in the hosts file.
// This avoids hardcoding IPs that can change over time.
func ValidateLocalDNSHostsFile(ctx context.Context, s *Scenario, fqdns []string) {
s.T.Helper()

// Build script that resolves each FQDN and checks it exists in hosts file
script := fmt.Sprintf(`set -euo pipefail
hosts_file="/etc/localdns/hosts"
fqdns=(%s)

echo "=== Validating /etc/localdns/hosts contains resolved IPs for critical FQDNs ==="
echo ""
echo "Current hosts file contents:"
cat "$hosts_file"
echo ""

errors=0
for fqdn in "${fqdns[@]}"; do
echo "Checking FQDN: $fqdn"

# Resolve IPv4 addresses using the Azure DNS (168.63.129.16)
ipv4_addrs=$(nslookup -type=A "$fqdn" 168.63.129.16 2>/dev/null | awk '/^Address: / && !/^Address: .*#/ {print $2}' | grep -E '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' || true)

if [ -z "$ipv4_addrs" ]; then
echo " WARNING: Could not resolve IPv4 for $fqdn, skipping IP validation"
# At minimum, check the FQDN exists in the file
if ! grep -qF "$fqdn" "$hosts_file"; then
echo " ERROR: FQDN $fqdn not found in hosts file at all"
errors=$((errors + 1))
fi
continue
fi

# Check each resolved IP exists in the hosts file for this FQDN
for ip in $ipv4_addrs; do
expected_entry="$ip $fqdn"
if grep -qF "$expected_entry" "$hosts_file"; then
echo " OK: Found '$expected_entry' in hosts file"
Comment on lines 1422 to 1425
Copilot AI Feb 14, 2026

These greps treat the expected entry as a regex, so dots in FQDNs can match any character and cause false positives (e.g., "mcr.microsoft.com" matches "mcrXmicrosoftYcom"). Use fixed-string matching (e.g., grep -F / -qF) and consider anchoring/whitespace handling to avoid passing when the exact "IP FQDN" line isn’t present.
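
A minimal sketch of the exact-match check suggested here, assuming the hosts file holds one `IP FQDN` pair per line, separated by a single space with no trailing whitespace (the layout aks-hosts-setup.sh writes). The helper name `hosts_file_has_entry` is hypothetical and not part of the PR.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): succeed only if the hosts file
# contains a line that is exactly "<ip> <fqdn>".
#   -F treats the pattern as a fixed string, so dots are literal.
#   -x requires the whole line to match, so "11.2.3.4 mcr.microsoft.com"
#      or "1.2.3.4 mcr.microsoft.com.example" no longer count as hits.
hosts_file_has_entry() {
    local hosts_file="$1" ip="$2" fqdn="$3"
    grep -qxF -- "${ip} ${fqdn}" "${hosts_file}"
}
```

The trade-off is that whole-line matching is strict about whitespace: if the writer ever emits tabs or trailing spaces, this check would need to normalize lines first.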

else
echo " ERROR: Expected entry '$expected_entry' not found in hosts file"
errors=$((errors + 1))
fi
done
Comment on lines 1421 to 1430
Copilot AI Feb 14, 2026

This validator compares the current Azure DNS A-record set to the contents of /etc/localdns/hosts and expects every currently-resolved IP to be present. Since the hosts file is a periodic snapshot (timer runs every 15 minutes), DNS answers can legitimately change between the last refresh and the test execution (or return additional IPs due to load balancing), making this check flaky. Consider validating that (a) the FQDN exists in the hosts file and (b) at least one resolved IP (or a minimum count) is present, or rerun aks-hosts-setup immediately before validating to make the snapshot consistent.

Suggested change
# Check each resolved IP exists in the hosts file for this FQDN
for ip in $ipv4_addrs; do
expected_entry="$ip $fqdn"
if grep -q "$expected_entry" "$hosts_file"; then
echo " OK: Found '$expected_entry' in hosts file"
else
echo " ERROR: Expected entry '$expected_entry' not found in hosts file"
errors=$((errors + 1))
fi
done
# Check that at least one resolved IP exists in the hosts file for this FQDN
found_match=0
for ip in $ipv4_addrs; do
expected_entry="$ip $fqdn"
if grep -q "$expected_entry" "$hosts_file"; then
echo " OK: Found '$expected_entry' in hosts file"
found_match=1
break
fi
done
if [ "$found_match" -eq 0 ]; then
echo " ERROR: None of the resolved IPs for $fqdn were found in hosts file"
errors=$((errors + 1))
fi

done

echo ""
if [ $errors -gt 0 ]; then
echo "FAILED: $errors entries missing from hosts file"
exit 1
else
echo "SUCCESS: All critical FQDNs have correct entries in hosts file"
exit 0
fi
`, quoteFQDNsForBash(fqdns))

execScriptOnVMForScenarioValidateExitCode(ctx, s, script, 0,
"hosts file should contain resolved IPs for critical FQDNs")
}

// quoteFQDNsForBash converts a slice of FQDNs to a bash array string
func quoteFQDNsForBash(fqdns []string) string {
quoted := make([]string, len(fqdns))
for i, fqdn := range fqdns {
quoted[i] = fmt.Sprintf("%q", fqdn)
}
return strings.Join(quoted, " ")
}

// ValidateAKSHostsSetupService checks that aks-hosts-setup.service ran successfully
// and the aks-hosts-setup.timer is active to ensure periodic refresh of /etc/localdns/hosts.
func ValidateAKSHostsSetupService(ctx context.Context, s *Scenario) {
s.T.Helper()

// Check that aks-hosts-setup.service completed successfully (oneshot service)
serviceScript := `set -euo pipefail
svc="aks-hosts-setup.service"
# For oneshot services, check if it ran successfully (exit code 0)
result=$(systemctl show -p Result "$svc" --value 2>/dev/null || echo "unknown")
echo "aks-hosts-setup.service result: $result"
if [ "$result" != "success" ]; then
echo "ERROR: aks-hosts-setup.service did not complete successfully"
systemctl status "$svc" --no-pager || true
journalctl -u "$svc" --no-pager -n 50 || true
exit 1
fi
`
execScriptOnVMForScenarioValidateExitCode(ctx, s, serviceScript, 0,
"aks-hosts-setup.service should have completed successfully")

// Check that aks-hosts-setup.timer is active for periodic refresh
timerScript := `set -euo pipefail
timer="aks-hosts-setup.timer"
active=$(systemctl is-active "$timer" 2>/dev/null || true)
echo "aks-hosts-setup.timer: active=$active"
if [ "$active" != "active" ]; then
echo "ERROR: aks-hosts-setup.timer is not active"
systemctl status "$timer" --no-pager || true
exit 1
fi
`
execScriptOnVMForScenarioValidateExitCode(ctx, s, timerScript, 0,
"aks-hosts-setup.timer should be active for periodic hosts file refresh")
}

// ValidateLocalDNSHostsPluginBypass verifies that localdns resolves FQDNs from /etc/localdns/hosts
// without querying the upstream DNS server. This confirms the hosts plugin is working correctly.
// It injects a fake FQDN (that doesn't exist in public DNS) into the hosts file and verifies
// localdns can resolve it - proving the hosts plugin is functioning.
func ValidateLocalDNSHostsPluginBypass(ctx context.Context, s *Scenario) {
s.T.Helper()

// Use a fake FQDN that doesn't exist in public DNS and a TEST-NET-3 IP (RFC 5737)
// If this resolves, it MUST be coming from the hosts file
fakeFQDN := "fake-mcr.microsoft.com"
fakeIP := "203.0.113.42"

script := fmt.Sprintf(`set -euo pipefail
fake_fqdn=%q
fake_ip=%q
hosts_file="/etc/localdns/hosts"

echo "=== Testing localdns hosts plugin bypass ==="
echo "Injecting fake entry: $fake_ip $fake_fqdn"

# Add fake entry to hosts file
echo "$fake_ip $fake_fqdn" | sudo tee -a "$hosts_file" > /dev/null

echo "Current hosts file contents:"
sudo cat "$hosts_file"

# Give localdns time to pick up the change (the CoreDNS hosts plugin re-reads the file every 5s by default)
sleep 6

echo ""
echo "Querying localdns for fake FQDN: $fake_fqdn"
echo "If this resolves to $fake_ip, it proves the hosts plugin is working"

# Query localdns at the cluster listener IP
result=$(dig "$fake_fqdn" @169.254.10.10 +short +timeout=5 +tries=2 2>/dev/null || true)
echo "DNS response: $result"

# Check if the fake IP is in the response
if echo "$result" | grep -qxF "$fake_ip"; then
echo ""
echo "SUCCESS: localdns resolved fake FQDN $fake_fqdn to $fake_ip from hosts file!"
echo "This proves the hosts plugin is correctly bypassing upstream DNS."
exit 0
else
echo ""
echo "ERROR: Expected fake IP $fake_ip not found in response for $fake_fqdn"
echo "The hosts plugin may not be working correctly."
echo ""
echo "Full dig output:"
dig "$fake_fqdn" @169.254.10.10 +timeout=5 +tries=2 || true
echo ""
echo "localdns service status:"
systemctl status localdns --no-pager -n 10 || true
exit 1
fi
`, fakeFQDN, fakeIP)

execScriptOnVMForScenarioValidateExitCode(ctx, s, script, 0,
"localdns should resolve fake FQDN from hosts file (proving hosts plugin bypass)")
}

// ValidateJournalctlOutput checks if specific content exists in the systemd service logs
func ValidateJournalctlOutput(ctx context.Context, s *Scenario, serviceName string, expectedContent string) {
s.T.Helper()
Binary file added localdns-hosts-plugin-design.docx
Binary file not shown.
12 changes: 12 additions & 0 deletions parts/linux/cloud-init/artifacts/aks-hosts-setup.service
@@ -0,0 +1,12 @@
[Unit]
Description=Populate /etc/localdns/hosts with critical AKS FQDN addresses
After=network-online.target
Wants=network-online.target
Before=kubelet.service localdns.service

[Service]
Type=oneshot
ExecStart=/opt/azure/containers/aks-hosts-setup.sh
Comment on lines +7 to +9
Copilot AI Feb 6, 2026

This oneshot service has no explicit timeout, so a hung DNS resolution (e.g., nslookup blocking) can delay/fail boot-time setup. Consider setting a TimeoutStartSec= on the unit and/or using timeout in the script invocation to bound runtime.
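
One way to bound the lookup from inside the script, sketched under the assumption that the coreutils `timeout` command is available on the image. The function name `run_bounded` and the 5-second budget in the usage comment are illustrative and not from the PR.

```shell
#!/bin/bash
# Hypothetical wrapper (not in the PR): run a command with an upper bound on
# its runtime. `timeout` exits with status 124 when the limit is hit, which
# lets the caller tell a hang apart from an ordinary lookup failure.
run_bounded() {
    local limit_seconds="$1"
    shift
    timeout "${limit_seconds}" "$@"
}

# Illustrative use inside a resolver function (5 seconds is an assumed budget):
#   output=$(run_bounded 5 nslookup -type=A "${domain}" 2>/dev/null) || return 0
```

A per-lookup bound keeps one slow FQDN from consuming the whole unit's start budget; a `TimeoutStartSec=` on the unit would still be useful as an outer limit.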


[Install]
WantedBy=multi-user.target kubelet.service
98 changes: 98 additions & 0 deletions parts/linux/cloud-init/artifacts/aks-hosts-setup.sh
@@ -0,0 +1,98 @@
#!/bin/bash
set -uo pipefail

# aks-hosts-setup.sh
# Resolves A and AAAA records for critical AKS FQDNs and populates /etc/localdns/hosts

HOSTS_FILE="/etc/localdns/hosts"

# Ensure the directory exists
mkdir -p "$(dirname "$HOSTS_FILE")"

# Critical AKS FQDNs that should be cached for DNS reliability
CRITICAL_FQDNS=(
"acs-mirror.azureedge.net"
"eastus.data.mcr.microsoft.com"
"login.microsoftonline.com"
"management.azure.com"
"mcr.microsoft.com"
"packages.aks.azure.com"
"packages.microsoft.com"
)

# Function to resolve IPv4 addresses for a domain
# Filters output to only include valid IPv4 addresses (rejects NXDOMAIN, SERVFAIL, hostnames, etc.)
resolve_ipv4() {
local domain="$1"
local output
output=$(nslookup -type=A "${domain}" 2>/dev/null) || return 0
# Parse Address lines (skip server address with #), validate IPv4 format (4 octets of 1-3 digits)
echo "${output}" | awk '/^Address: / && !/^Address: .*#/ {print $2}' | grep -E '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$' || return 0
}

# Function to resolve IPv6 addresses for a domain
# Filters output to only include valid IPv6 addresses (rejects NXDOMAIN, SERVFAIL, hostnames, etc.)
resolve_ipv6() {
local domain="$1"
local output
output=$(nslookup -type=AAAA "${domain}" 2>/dev/null) || return 0
# Parse Address lines (skip server address with #), validate IPv6 format (must contain : and only hex/colons, min 3 chars)
echo "${output}" | awk '/^Address: / && !/^Address: .*#/ {print $2}' | grep -E '^[0-9a-fA-F:]{3,}$' | grep ':' || return 0
}

echo "Starting AKS critical FQDN hosts resolution at $(date)"

# Track if we resolved at least one address
RESOLVED_ANY=false

# Start building the hosts file content
HOSTS_CONTENT="# AKS critical FQDN addresses resolved at $(date)
# This file is automatically generated by aks-hosts-setup.service
"

# Resolve each FQDN
for DOMAIN in "${CRITICAL_FQDNS[@]}"; do
echo "Resolving addresses for ${DOMAIN}..."

# Get IPv4 and IPv6 addresses using helper functions
IPV4_ADDRS=$(resolve_ipv4 "${DOMAIN}")
IPV6_ADDRS=$(resolve_ipv6 "${DOMAIN}")

# Check if we got any results for this domain
if [ -z "${IPV4_ADDRS}" ] && [ -z "${IPV6_ADDRS}" ]; then
echo " WARNING: No IP addresses resolved for ${DOMAIN}"
continue
fi

RESOLVED_ANY=true
HOSTS_CONTENT+="
# ${DOMAIN}"

if [ -n "${IPV4_ADDRS}" ]; then
for addr in ${IPV4_ADDRS}; do
HOSTS_CONTENT+="
${addr} ${DOMAIN}"
done
fi

if [ -n "${IPV6_ADDRS}" ]; then
for addr in ${IPV6_ADDRS}; do
HOSTS_CONTENT+="
${addr} ${DOMAIN}"
done
fi
done

# Check if we resolved at least one domain
if [ "${RESOLVED_ANY}" != "true" ]; then
echo "WARNING: No IP addresses resolved for any domain at $(date)"
echo "This is likely a temporary DNS issue. Timer will retry later."
# Keep existing hosts file intact and exit successfully so systemd doesn't mark unit as failed
exit 0
fi

# Write the hosts file
echo "Writing addresses to ${HOSTS_FILE}..."
echo "${HOSTS_CONTENT}" > "${HOSTS_FILE}"
Comment on lines +94 to +96
Copilot AI Feb 4, 2026

The hosts file is written in-place with a single echo ... > "$HOSTS_FILE", which can briefly leave /etc/localdns/hosts empty or partially written while CoreDNS is reloading or reading it. This can cause transient DNS resolution failures for the cached FQDNs; to make updates safer, consider writing to a temporary file in the same directory and then atomically renaming it over the existing hosts file (and optionally setting explicit file permissions) so readers never observe a truncated file.
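
The write-then-rename approach described above could look roughly like this. It is a sketch, not the PR's code: the helper name `write_file_atomically`, the `.hosts.XXXXXX` temp-file pattern, and the 0644 mode are all assumptions.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): replace a file atomically so readers
# see either the old content or the new content, never a truncated file.
write_file_atomically() {
    local target="$1" content="$2"
    local dir tmp
    dir="$(dirname "${target}")"
    # Create the temp file in the same directory so the final rename stays
    # on one filesystem, which is what makes it atomic.
    tmp="$(mktemp "${dir}/.hosts.XXXXXX")" || return 1
    if ! printf '%s\n' "${content}" > "${tmp}"; then
        rm -f "${tmp}"
        return 1
    fi
    # mktemp creates the file with mode 0600; 0644 is an assumed target mode.
    chmod 0644 "${tmp}"
    mv -f "${tmp}" "${target}" || { rm -f "${tmp}"; return 1; }
}
```

In the script this would replace the final redirect, for example `write_file_atomically "${HOSTS_FILE}" "${HOSTS_CONTENT}"`.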


echo "AKS critical FQDN hosts resolution completed at $(date)"
16 changes: 16 additions & 0 deletions parts/linux/cloud-init/artifacts/aks-hosts-setup.timer
@@ -0,0 +1,16 @@
[Unit]
Description=Run AKS hosts setup periodically
Before=localdns.service

[Timer]
# Run immediately on boot
OnBootSec=0
# Run 15 minutes after the last activation (AKS critical FQDN IPs don't change frequently)
OnUnitActiveSec=15min
# Timer accuracy (how much systemd can delay)
AccuracySec=1s
# Add randomization to avoid thundering herd if multiple nodes boot simultaneously
RandomizedDelaySec=60s

[Install]
WantedBy=timers.target
17 changes: 17 additions & 0 deletions parts/linux/cloud-init/artifacts/cse_config.sh
@@ -1194,6 +1194,23 @@ enableLocalDNS() {
echo "Enable localdns succeeded."
}

# This function enables and starts the aks-hosts-setup timer.
# The timer periodically resolves critical AKS FQDN DNS records and populates /etc/localdns/hosts.
enableAKSHostsSetup() {
Member

Do not make this fail. Just log the error and create an empty hosts file.
When the enabling parameter is passed in, fail if the hosts file is empty.

local hosts_file="/etc/localdns/hosts"

# Run the script once immediately to resolve live DNS before kubelet starts
echo "Running initial aks-hosts-setup to resolve DNS..."
mkdir -p "$(dirname "${hosts_file}")"
/opt/azure/containers/aks-hosts-setup.sh || echo "Warning: Initial hosts setup failed"

# Enable the timer for periodic refresh (every 15 minutes)
# This will update the hosts file with fresh IPs from live DNS
echo "Enabling aks-hosts-setup timer..."
systemctlEnableAndStart aks-hosts-setup.timer 30 || exit $ERR_SYSTEMCTL_START_FAIL
echo "aks-hosts-setup timer enabled successfully."
Comment on lines +1207 to +1211
Copilot AI Feb 6, 2026

enableAKSHostsSetup calls systemctlEnableAndStart ... || exit $ERR_SYSTEMCTL_START_FAIL inside a helper function. This will terminate the entire CSE script instead of returning an error to the caller (which already handles failures via || return / logs_to_events ... || exit). Use return (or propagate the systemctlEnableAndStart status) rather than exit inside this function.
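
The difference between `return` and `exit` inside a helper can be shown with a small stand-in. Everything here is illustrative: `enable_timer_sketch` is a hypothetical substitute for enableAKSHostsSetup, the start command is passed in as arguments so the sketch runs without systemd, and the numeric value of `ERR_SYSTEMCTL_START_FAIL` is assumed.

```shell
#!/bin/bash
# Assumed value for illustration only; the real constant is defined elsewhere
# in the CSE scripts.
ERR_SYSTEMCTL_START_FAIL=4

# Hypothetical stand-in for enableAKSHostsSetup. The arguments are the
# command that would start the timer.
enable_timer_sketch() {
    # `return` hands the status back to the caller; `exit` here would end
    # the whole script before the caller's own `|| exit ...` could run.
    "$@" || return "${ERR_SYSTEMCTL_START_FAIL}"
    echo "timer enabled"
}
```

With this shape, the caller keeps the decision: `enable_timer_sketch some_start_command || exit $ERR_SYSTEMCTL_START_FAIL` still aborts, while a caller that prefers to log and continue can do so.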

Comment on lines +1208 to +1211
Copilot AI Feb 6, 2026

enableAKSHostsSetup assumes the VHD already contains aks-hosts-setup.timer/.service. This will break provisioning when CSE runs on older VHDs (within the ~6 month support window) that don’t have these units yet, because systemctlEnableAndStart aks-hosts-setup.timer will fail and abort LocalDNS enablement. Add a presence check (e.g., unit file exists / systemctl cat aks-hosts-setup.timer) and no-op with a warning when the unit isn’t available.

Comment on lines +1202 to +1211
Copilot AI Feb 14, 2026

enableAKSHostsSetup assumes /opt/azure/containers/aks-hosts-setup.sh exists and that aks-hosts-setup.timer is installed. If those artifacts are missing on a VHD (or in a given build mode), systemctlEnableAndStart will fail and this will exit with ERR_SYSTEMCTL_START_FAIL, breaking provisioning. Consider adding a guard that checks for the script/unit files and skips with a clear log message when absent, or ensure VHD build installs these artifacts for every image where SHOULD_ENABLE_LOCALDNS can be true.
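
A guard along these lines might look as follows. This is a sketch under assumptions: the function name `aks_hosts_setup_available` is hypothetical, and the default unit directory `/etc/systemd/system` is a guess at where the VHD build installs the units, so pass the real location if it differs.

```shell
#!/bin/bash
# Hypothetical guard (not in the PR): report whether the aks-hosts-setup
# artifacts are present. The default unit directory is an assumption; pass
# the real install location as the second argument.
aks_hosts_setup_available() {
    local script="${1:-/opt/azure/containers/aks-hosts-setup.sh}"
    local unit_dir="${2:-/etc/systemd/system}"
    [ -x "${script}" ] &&
        [ -f "${unit_dir}/aks-hosts-setup.service" ] &&
        [ -f "${unit_dir}/aks-hosts-setup.timer" ]
}
```

The caller could then wrap the existing logic in `if aks_hosts_setup_available; then ... else echo "aks-hosts-setup artifacts not found, skipping"; fi`, so older VHDs skip the step instead of failing provisioning.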

}

configureManagedGPUExperience() {
if [ "${GPU_NODE}" != "true" ] || [ "${skip_nvidia_driver_install}" = "true" ]; then
return
5 changes: 5 additions & 0 deletions parts/linux/cloud-init/artifacts/cse_main.sh
@@ -297,6 +297,11 @@ EOF

# This is to enable localdns using scriptless.
if [ "${SHOULD_ENABLE_LOCALDNS}" = "true" ]; then
# Write hosts file BEFORE starting LocalDNS so it has entries to serve
# Enable aks-hosts-setup timer to periodically resolve and cache critical AKS FQDN DNS addresses
logs_to_events "AKS.CSE.enableAKSHostsSetup" enableAKSHostsSetup || exit $ERR_SYSTEMCTL_START_FAIL
Member

We should always enable the hosts systemd unit so that the hosts file is always available, and mount the hosts file in the Corefile if enableHostplugin == true is passed in.


# Start LocalDNS after hosts file is populated
Comment on lines +302 to +304
Copilot AI Feb 14, 2026

This call makes aks-hosts-setup a hard requirement whenever SHOULD_ENABLE_LOCALDNS is true. If the timer/service units aren’t installed on the VHD (or if systemd can’t start them), provisioning will exit with ERR_SYSTEMCTL_START_FAIL before localdns starts. To avoid hard failures in mixed-version scenarios, consider checking for the unit/script presence and skipping gracefully when missing, or ensure all supported VHDs that can hit this path ship the aks-hosts-setup artifacts.

Suggested change
logs_to_events "AKS.CSE.enableAKSHostsSetup" enableAKSHostsSetup || exit $ERR_SYSTEMCTL_START_FAIL
# Start LocalDNS after hosts file is populated
aks_hosts_setup_supported="false"
if command -v systemctl >/dev/null 2>&1; then
if systemctl list-unit-files 2>/dev/null | grep -q '^aks-hosts-setup.service'; then
if systemctl list-unit-files 2>/dev/null | grep -q '^aks-hosts-setup.timer'; then
aks_hosts_setup_supported="true"
fi
fi
fi
if [ "${aks_hosts_setup_supported}" = "true" ]; then
logs_to_events "AKS.CSE.enableAKSHostsSetup" enableAKSHostsSetup || exit $ERR_SYSTEMCTL_START_FAIL
else
echo "aks-hosts-setup systemd units not found or systemctl unavailable; skipping AKS hosts setup"
fi
# Start LocalDNS after hosts file is populated (or skipped gracefully)

logs_to_events "AKS.CSE.enableLocalDNS" enableLocalDNS || exit $ERR_LOCALDNS_FAIL
fi

12 changes: 12 additions & 0 deletions pkg/agent/baker.go
@@ -1888,6 +1888,12 @@ health-check.localdns.local:53 {
{{- end }}
bind {{$.NodeListenerIP}}
{{- if $isRootDomain}}
# Check /etc/localdns/hosts first for critical AKS FQDNs (mcr.microsoft.com, packages.aks.azure.com, etc.)
hosts /etc/localdns/hosts {
fallthrough
Comment on lines +1891 to +1893
Copilot AI Feb 4, 2026

The LocalDNS Corefile now references /etc/localdns/hosts via the hosts plugin, but there is no guarantee that this file exists before localdns.service is first started: in cse_main.sh we call enableLocalDNS (which starts the localdns unit) before enableAKSHostsSetup, and the hosts file is only created by aks-hosts-setup.sh when the timer fires. This means the first localdns start will use a Corefile pointing at a non-existent hosts file, which can lead to CoreDNS startup errors or at least noisy logs depending on the hosts plugin’s behavior. To make this robust, either ensure /etc/localdns/hosts is created (even as an empty file) before starting localdns.service, or change the ordering so aks-hosts-setup runs at least once before enableLocalDNS is invoked.
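
The first option, creating the file ahead of time, could be sketched like this. The helper name `ensure_hosts_file` and the 0644 mode are assumptions; the default path is the one used throughout the PR.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): make sure the hosts file exists before
# localdns starts, without touching content that is already there.
ensure_hosts_file() {
    local hosts_file="${1:-/etc/localdns/hosts}"
    mkdir -p "$(dirname "${hosts_file}")" || return 1
    if [ ! -e "${hosts_file}" ]; then
        # Create an empty file; ":" is the shell no-op, so only the
        # redirection has an effect.
        : > "${hosts_file}" || return 1
        chmod 0644 "${hosts_file}"
    fi
}
```

Because the helper leaves an existing file alone, it would be safe to call on every provisioning run, before the localdns unit is started.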

}
{{- end}}
{{- if $isRootDomain}}
Comment on lines +1895 to +1896
Copilot AI Feb 4, 2026

There are two consecutive if $isRootDomain blocks here: one wrapping the new hosts /etc/localdns/hosts plugin and another immediately before the forward directive. The second condition is redundant because $isRootDomain is already checked just above, which makes the template slightly harder to read and maintain. Consider collapsing these into a single if $isRootDomain block that contains both the hosts and forward configuration to avoid duplicated conditions.

Suggested change
{{- end}}
{{- if $isRootDomain}}

Comment on lines 1890 to +1896
Copilot AI Feb 14, 2026

This enables the CoreDNS hosts plugin for the root domain unconditionally, but the PR also introduces LocalDNSProfile.EnableHostsPlugin and a separate service that populates /etc/localdns/hosts. As written, the plugin will be enabled even when the hosts file/population service isn’t present or the feature is meant to be disabled, which risks localdns startup/runtime errors and makes the new EnableHostsPlugin flag ineffective. Suggest gating this block on EnableHostsPlugin (and/or ensuring an empty hosts file is always created before localdns starts).

forward . {{$.AzureDNSIP}} {
{{- else}}
{{- if $fwdToClusterCoreDNS}}
@@ -1948,6 +1954,12 @@ health-check.localdns.local:53 {
log
{{- end }}
bind {{$.ClusterListenerIP}}
{{- if $isRootDomain}}
# Check /etc/localdns/hosts first for critical AKS FQDNs (mcr.microsoft.com, packages.aks.azure.com, etc.)
hosts /etc/localdns/hosts {
fallthrough
}
{{- end}}
{{- if $fwdToClusterCoreDNS}}
forward . {{$.CoreDNSServiceIP}} {
{{- else}}