|
| 1 | +--- |
| 2 | +title: Basic troubleshooting of DNS resolution problems in AKS |
| 3 | +description: Learn how to create a troubleshooting workflow to fix DNS resolution problems in Azure Kubernetes Service (AKS). |
| 4 | +author: sturrent |
| 5 | +ms.author: seturren |
| 6 | +ms.date: 05/29/2025 |
| 7 | +ms.reviewer: v-rekhanain, v-leedennis, josebl, v-weizhu, qasimsarfraz |
| 8 | +editor: v-jsitser |
| 9 | +ms.service: azure-kubernetes-service |
| 10 | +ms.custom: sap:Connectivity |
| 11 | +ms.topic: troubleshooting-general |
| 12 | +#Customer intent: As an Azure Kubernetes user, I want to learn how to create a troubleshooting workflow so that I can fix DNS resolution problems in Azure Kubernetes Service (AKS). |
| 13 | +--- |
| 14 | +# Troubleshoot issues with LocalDNS on Azure Kubernetes Service (AKS) |
| 15 | +This article discusses how to create a troubleshooting workflow to fix Domain Name System (DNS) resolution problems in Microsoft Azure Kubernetes Service (AKS), specifically when using LocalDNS. To learn more about LocalDNS, you can read our overview in [DNS Resolution in Azure Kubernetes Service (AKS)](https://learn.microsoft.com/en-us/azure/aks/dns-concepts#localdns-in-azure-kubernetes-service-preview). |
| 16 | + |
| 17 | +## Prerequisites |
| 18 | + |
| 19 | +- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line tool |
| 20 | + |
| 21 | + **Note:** To install kubectl by using [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command. |
| 22 | + |
| 23 | +- The [systemctl](https://man7.org/linux/man-pages/man1/systemctl.1.html) command-line tool. |
| 24 | + |
| 25 | +- The [journalctl](https://www.man7.org/linux/man-pages/man1/journalctl.1.html) command-line tool. |
| 26 | + |
| 27 | +## Identifying patterns in DNS failures |
| 28 | +Before you begin diagnosing issues with LocalDNS, identify potential patterns in your DNS failures. Consider the following questions: |
| 29 | +1. Is DNS resolution failing always or only intermittently? |
| 30 | +2. Are you seeing the DNS issues from all nodes, a specific node pool, a subset of nodes, or just a single node? |
| 31 | +3. Are you seeing the DNS issues from nodes in a specific Azure availability zone, or from all zones? |
| 32 | +4. Which protocols are failing? Is it both TCP and UDP, or just one of them? |
| 33 | +5. Which zones are failing? Is it all zones, or only traffic for a specific zone? |
| 34 | + |
| 35 | + **Note:** In question 5, "zones" refers to DNS zones such as *cluster.local* and *"."* (root), not to physical zones in Azure. |
| 36 | + |
| 37 | +## Diagnose LocalDNS with a test dnsutils pod |
| 38 | + |
| 39 | +### Step 1: Deploy a test dnsutils pod |
| 40 | +Option 1 - Deploy a test pod to your cluster using the following command: |
| 41 | + ```bash |
| 42 | + kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml |
| 43 | + ``` |
| 44 | + |
| 45 | +Option 2 - If you are seeing DNS issues on specific nodes, you can control where the test pod is deployed by using a `nodeSelector`. Replace `<NODE>` with the name of the node: |
| 46 | + |
| 47 | + ```bash |
| 48 | + cat <<EOF | kubectl create -f - |
| 49 | + apiVersion: v1 |
| 50 | + kind: Pod |
| 51 | + metadata: |
| 52 | +   name: dnsutils2 |
| 53 | +   namespace: default |
| 54 | + spec: |
| 55 | +   nodeSelector: |
| 56 | +     kubernetes.io/hostname: <NODE> |
| 57 | +   containers: |
| 58 | +   - name: dnsutils |
| 59 | +     image: registry.k8s.io/e2e-test-images/agnhost:2.39 |
| 60 | +     command: |
| 61 | +       - sleep |
| 62 | +       - "infinity" |
| 63 | +     imagePullPolicy: IfNotPresent |
| 64 | +   restartPolicy: Always |
| 65 | + EOF |
| 66 | + ``` |
| 67 | + |
| 68 | +Option 3 - If you run both Linux and Windows nodes in your cluster, you can use a DaemonSet to deploy the test pod to all Linux nodes: |
| 69 | + |
| 70 | + ```bash |
| 71 | + cat <<EOF | kubectl create -f - |
| 72 | + apiVersion: apps/v1 |
| 73 | + kind: DaemonSet |
| 74 | + metadata: |
| 75 | +   name: dnsutils |
| 76 | +   namespace: default |
| 77 | + spec: |
| 78 | +   selector: |
| 79 | +     matchLabels: |
| 80 | +       app: dnsutils |
| 81 | +   template: |
| 82 | +     metadata: |
| 83 | +       labels: |
| 84 | +         app: dnsutils |
| 85 | +     spec: |
| 86 | +       nodeSelector: |
| 87 | +         kubernetes.io/os: linux |
| 88 | +       containers: |
| 89 | +       - name: dnsutils |
| 90 | +         image: registry.k8s.io/e2e-test-images/agnhost:2.39 |
| 91 | +         command: |
| 92 | +           - sleep |
| 93 | +           - "infinity" |
| 94 | +         imagePullPolicy: IfNotPresent |
| 95 | + EOF |
| 96 | + ``` |
| 97 | + |
| 98 | +### Enable query logging for LocalDNS |
| 99 | + |
| 100 | +Most use cases require query logging to be turned off in production because of its high memory usage and performance implications. However, for troubleshooting purposes, you should enable query logging in your LocalDNS configuration to identify the root cause of your errors. After the analysis is complete, you can turn query logging off again. |
| 101 | + |
| 102 | +Option 1 - Enable query logging on all nodes in a node pool |
| 103 | + |
| 104 | +Modify your LocalDNS configuration to set `queryLogging` to `Log` for one or more DNS zones. The following example enables query logging for all zones: |
| 105 | + |
| 106 | +```json |
| 107 | +{ |
| 108 | +  "mode": "Required", |
| 109 | +  "vnetDNSOverrides": { |
| 110 | +    ".": { |
| 111 | +      "queryLogging": "Log", |
| 112 | +      "protocol": "PreferUDP", |
| 113 | +      "forwardDestination": "VnetDNS", |
| 114 | +      "forwardPolicy": "Sequential", |
| 115 | +      "maxConcurrent": 1000, |
| 116 | +      "cacheDurationInSeconds": 3600, |
| 117 | +      "serveStaleDurationInSeconds": 3600, |
| 118 | +      "serveStale": "Immediate" |
| 119 | +    }, |
| 120 | +    "cluster.local": { |
| 121 | +      "queryLogging": "Log", |
| 122 | +      "protocol": "ForceTCP", |
| 123 | +      "forwardDestination": "ClusterCoreDNS", |
| 124 | +      "forwardPolicy": "Sequential", |
| 125 | +      "maxConcurrent": 1000, |
| 126 | +      "cacheDurationInSeconds": 3600, |
| 127 | +      "serveStaleDurationInSeconds": 3600, |
| 128 | +      "serveStale": "Immediate" |
| 129 | +    } |
| 130 | +  }, |
| 131 | +  "kubeDNSOverrides": { |
| 132 | +    ".": { |
| 133 | +      "queryLogging": "Log", |
| 134 | +      "protocol": "PreferUDP", |
| 135 | +      "forwardDestination": "ClusterCoreDNS", |
| 136 | +      "forwardPolicy": "Sequential", |
| 137 | +      "maxConcurrent": 1000, |
| 138 | +      "cacheDurationInSeconds": 3600, |
| 139 | +      "serveStaleDurationInSeconds": 3600, |
| 140 | +      "serveStale": "Immediate" |
| 141 | +    }, |
| 142 | +    "cluster.local": { |
| 143 | +      "queryLogging": "Log", |
| 144 | +      "protocol": "ForceTCP", |
| 145 | +      "forwardDestination": "ClusterCoreDNS", |
| 146 | +      "forwardPolicy": "Sequential", |
| 147 | +      "maxConcurrent": 1000, |
| 148 | +      "cacheDurationInSeconds": 3600, |
| 149 | +      "serveStaleDurationInSeconds": 3600, |
| 150 | +      "serveStale": "Immediate" |
| 151 | +    } |
| 152 | +  } |
| 153 | +} |
| 154 | +``` |
| 155 | + |
| 156 | +Apply the configuration to the node pool by using the Azure CLI: |
| 157 | + |
| 158 | +```bash |
| 159 | +az aks nodepool update --name mynodepool1 --cluster-name myAKSCluster --resource-group myResourceGroup --localdns-config ./localdnsconfig.json |
| 160 | +``` |
| 161 | + |
| 162 | +**Note:** Making changes to the LocalDNS configuration will trigger a reimage operation on the nodes in the given node pool. |
| 163 | + |
| 164 | +Option 2 - Enable query logging on a specific node |
| 165 | + |
| 166 | +To diagnose LocalDNS issues on a specific node, temporarily edit the LocalDNS configuration on that node. [Connect to the node](https://learn.microsoft.com/en-us/azure/aks/node-access#connect-using-kubectl-debug) manually, update the Corefile that LocalDNS uses, and then restart only the `localdns` service on that node. |
| 167 | + |
| 168 | +**Note:** Changes made this way are ephemeral. The file will be overwritten, so the changes don't persist after the troubleshooting is complete. |
| 169 | + |
| 170 | +```bash |
| 171 | +# Connect to the node before you run the following commands. |
| 172 | + |
| 173 | +# Open the LocalDNS configuration file. |
| 174 | +vi /opt/azure/containers/localdns/localdns.corefile |
| 175 | + |
| 176 | +# In the file, manually change "errors" to "log" for one zone or for all zones. Example file contents: |
| 177 | + |
| 178 | +# *********************************************************************************** |
| 179 | +# WARNING: Changes to this file will be overwritten and not persisted. |
| 180 | +# *********************************************************************************** |
| 181 | +# whoami (used for health check of DNS) |
| 182 | +health-check.localdns.local:53 { |
| 183 | +    bind 169.254.10.10 169.254.10.11 |
| 184 | +    whoami |
| 185 | +} |
| 186 | +# VnetDNS overrides apply to DNS traffic from pods with dnsPolicy:default or kubelet (referred to as VnetDNS traffic). |
| 187 | +.:53 { |
| 188 | +    errors |
| 189 | +    bind 169.254.10.10 |
| 190 | +    forward . 168.63.129.16 { |
| 191 | +        policy sequential |
| 192 | +        max_concurrent 1000 |
| 193 | +    } |
| 194 | +    ready 169.254.10.10:8181 |
| 195 | +    cache 3600s { |
| 196 | +        success 9984 |
| 197 | +        denial 9984 |
| 198 | +        serve_stale 3600s verify |
| 199 | +        servfail 0 |
| 200 | +    } |
| 201 | +    loop |
| 202 | +    nsid localdns |
| 203 | +    prometheus :9253 |
| 204 | +    template ANY ANY internal.cloudapp.net { |
| 205 | +        match "^(?:[^.]+\.){4,}internal\.cloudapp\.net\.$" |
| 206 | +        rcode NXDOMAIN |
| 207 | +        fallthrough |
| 208 | +    } |
| 209 | +    template ANY ANY reddog.microsoft.com { |
| 210 | +        rcode NXDOMAIN |
| 211 | +    } |
| 212 | +} |
| 213 | +cluster.local:53 { |
| 214 | +    errors |
| 215 | +    bind 169.254.10.10 |
| 216 | +    forward . 10.0.0.10 { |
| 217 | +        force_tcp |
| 218 | +        policy sequential |
| 219 | +        max_concurrent 1000 |
| 220 | +    } |
| 221 | +    ready 169.254.10.10:8181 |
| 222 | +    cache 3600s { |
| 223 | +        success 9984 |
| 224 | +        denial 9984 |
| 225 | +        serve_stale 3600s verify |
| 226 | +        servfail 0 |
| 227 | +    } |
| 228 | +    loop |
| 229 | +    nsid localdns |
| 230 | +    prometheus :9253 |
| 231 | +} |
| 232 | +... |
| 233 | + |
| 234 | +... |
| 235 | +# Save the changes. |
| 236 | + |
| 237 | +# Restart the localdns service. |
| 238 | +systemctl restart localdns |
| 239 | +``` |
| 240 | + |
| 241 | +After the restart, LocalDNS should begin logging all queries for the chosen zones. |
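
The manual edit in Option 2 can also be scripted. The following is a minimal sketch, not an official procedure. It assumes that each server block contains a standalone `errors` line, as in the sample Corefile above, and the function name `enable_query_logging` is hypothetical. It replaces every standalone `errors` directive with `log` and keeps a backup copy with a `.bak` suffix.

```shell
# Hypothetical helper: swap the "errors" directive for "log" in a Corefile.
# Assumption: "errors" appears alone on its line, as in the sample Corefile.
enable_query_logging() {
  corefile="$1"
  # -i.bak edits the file in place and keeps the original as <file>.bak
  sed -i.bak 's/^\([[:space:]]*\)errors[[:space:]]*$/\1log/' "$corefile"
}
```

On the node, you might run `enable_query_logging /opt/azure/containers/localdns/localdns.corefile` and then `systemctl restart localdns`. This sketch changes every zone; to target a single zone, edit the file by hand instead.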
| 242 | + |
| 243 | +### Generate traffic from the dnsutils pod |
| 244 | + |
| 245 | +Next, generate some DNS traffic through LocalDNS. LocalDNS listens on two IP addresses: KubeDNS traffic goes to the ClusterListenerIP (169.254.10.11), and VnetDNS traffic goes to the NodeListenerIP (169.254.10.10). Both listen on port 53. |
| 246 | + |
| 247 | +#### Test KubeDNS zone traffic |
| 248 | + |
| 249 | +```bash |
| 250 | +kubectl exec dnsutils -- dig bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6 |
| 251 | + |
| 252 | +; <<>> DiG 9.16.27 <<>> bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6 |
| 253 | +;; global options: +cmd |
| 254 | +;; Got answer: |
| 255 | +;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7452 |
| 256 | +;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0 |
| 257 | + |
| 258 | +;; QUESTION SECTION: |
| 259 | +;bing.com. IN A |
| 260 | + |
| 261 | +;; ANSWER SECTION: |
| 262 | +bing.com. 30 IN A 150.171.27.10 |
| 263 | +bing.com. 30 IN A 150.171.28.10 |
| 264 | + |
| 265 | +;; Query time: 3 msec |
| 266 | +;; SERVER: 169.254.10.11#53(169.254.10.11) |
| 267 | +;; WHEN: Thu Jul 03 16:57:42 UTC 2025 |
| 268 | +;; MSG SIZE rcvd: 74 |
| 269 | +``` |
| 270 | + |
| 271 | +#### Test VnetDNS zone traffic |
| 272 | + |
| 273 | +```bash |
| 274 | +kubectl exec dnsutils -- dig bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6 @169.254.10.10 |
| 275 | + |
| 276 | +; <<>> DiG 9.16.27 <<>> bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6 @169.254.10.10 |
| 277 | +;; global options: +cmd |
| 278 | +;; Got answer: |
| 279 | +;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3580 |
| 280 | +;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0 |
| 281 | + |
| 282 | +;; QUESTION SECTION: |
| 283 | +;bing.com. IN A |
| 284 | + |
| 285 | +;; ANSWER SECTION: |
| 286 | +bing.com. 1315 IN A 150.171.28.10 |
| 287 | +bing.com. 1315 IN A 150.171.27.10 |
| 288 | + |
| 289 | +;; Query time: 7 msec |
| 290 | +;; SERVER: 169.254.10.10#53(169.254.10.10) |
| 291 | +;; WHEN: Thu Jul 03 16:59:07 UTC 2025 |
| 292 | +;; MSG SIZE rcvd: 74 |
| 293 | +``` |
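
To check whether a failure is limited to one listener or one protocol, the two tests above can be combined into a loop. This is a sketch under stated assumptions: the function name `probe_localdns` and the `DIG` variable are hypothetical, the default command assumes the `dnsutils` pod from Option 1, and the check relies on `dig` returning a nonzero exit status when it gets no reply (and on `kubectl exec` passing that status through).

```shell
# Hypothetical helper: probe both LocalDNS listeners over UDP and TCP.
# DIG can be overridden; the default assumes the dnsutils pod from Option 1.
DIG="${DIG:-kubectl exec dnsutils -- dig}"

probe_localdns() {
  name="$1"
  # 169.254.10.10 = NodeListenerIP, 169.254.10.11 = ClusterListenerIP
  for server in 169.254.10.10 169.254.10.11; do
    for proto in udp tcp; do
      if [ "$proto" = "tcp" ]; then flag="+tcp"; else flag="+notcp"; fi
      # A nonzero exit status is treated as "no usable reply".
      if $DIG "$name" "@$server" "$flag" +time=5 +tries=1 +short >/dev/null 2>&1; then
        echo "$server $proto ok"
      else
        echo "$server $proto FAILED"
      fi
    done
  done
}
```

For example, `probe_localdns bing.com` prints one line per listener and protocol. A `FAILED` line for only TCP or only one listener helps narrow down the pattern before you read the logs.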
| 294 | + |
| 295 | +### View LocalDNS logs collected |
| 296 | + |
| 297 | +Lastly, view the logs from your LocalDNS instances. Connect to the specific node and run the following commands: |
| 298 | + |
| 299 | +```bash |
| 300 | +# View the logs for the localdns service |
| 301 | +journalctl -u localdns |
| 302 | + |
| 303 | +# To view logs in reverse chronological order (latest logs first) |
| 304 | +journalctl -u localdns --reverse |
| 305 | + |
| 306 | +# To continuously follow the logs. |
| 307 | +journalctl -u localdns -f |
| 308 | + |
| 309 | +# sample output using journalctl for the bing.com responses |
| 310 | +journalctl -u localdns | grep bing.com |
| 311 | +Jul 03 16:57:42 aks-userpool-24995383-vmss000000 localdns-coredns[2491520]: [INFO] 10.244.0.95:41796 - 7452 "A IN bing.com. udp 26 false 512" NOERROR qr,rd,ra 74 0.004490668s |
| 312 | +Jul 03 16:59:07 aks-userpool-24995383-vmss000000 localdns-coredns[2491520]: [INFO] 10.244.0.95:58454 - 3580 "A IN bing.com. udp 26 false 512" NOERROR qr,rd,ra 74 0.001570158s |
| 313 | +``` |
| 314 | + |
| 315 | +If the logs show entries for your queries, the pod successfully reached the LocalDNS service. |
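
When the log is long, a per-response-code count can make failures easier to spot. The following sketch assumes the query-log line format shown in the sample output above, where the response code is the first field after the quoted query. The function name `summarize_rcodes` is hypothetical.

```shell
# Hypothetical helper: count query-log lines per DNS response code.
# Reads log lines on stdin and prints "<count> <rcode>", sorted by rcode.
summarize_rcodes() {
  awk '
    /\[INFO\]/ {
      # Split on double quotes: parts[3] is the text after the quoted query.
      n = split($0, parts, "\"")
      if (n >= 3) {
        split(parts[3], rest, " ")
        counts[rest[1]]++
      }
    }
    END { for (r in counts) print counts[r], r }
  ' | sort -k2
}
```

On the node, you might run `journalctl -u localdns --no-pager | summarize_rcodes`. A large share of `SERVFAIL` or `NXDOMAIN` entries points to where to look next.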
| 316 | + |
| 317 | +## Next steps |
| 318 | +If these logs don't help you identify the root cause of the issue, you can enable [query logging for CoreDNS](https://learn.microsoft.com/en-us/azure/aks/coredns-custom#enable-dns-query-logging) to validate whether CoreDNS is working as intended. |
| 319 | + |
| 320 | +[!INCLUDE [Azure Help Support](../../../../includes/azure-help-support.md)] |
| 321 | + |
| 322 | +[!INCLUDE [Third-party disclaimer](../../../../includes/third-party-disclaimer.md)] |
| 323 | + |
| 324 | +[!INCLUDE [Third-party contact disclaimer](../../../../includes/third-party-contact-disclaimer.md)] |
| 325 | + |