
Commit 8367181

author
Vaibhav Arora
committed
tsg for localdns
1 parent 1516007 commit 8367181

1 file changed

+325
-0
lines changed

@@ -0,0 +1,325 @@
1+
---
2+
title: Basic troubleshooting of DNS resolution problems in AKS
3+
description: Learn how to create a troubleshooting workflow to fix DNS resolution problems in Azure Kubernetes Service (AKS).
4+
author: sturrent
5+
ms.author: seturren
6+
ms.date: 05/29/2025
7+
ms.reviewer: v-rekhanain, v-leedennis, josebl, v-weizhu, qasimsarfraz
8+
editor: v-jsitser
9+
ms.service: azure-kubernetes-service
10+
ms.custom: sap:Connectivity
11+
ms.topic: troubleshooting-general
12+
#Customer intent: As an Azure Kubernetes user, I want to learn how to create a troubleshooting workflow so that I can fix DNS resolution problems in Azure Kubernetes Service (AKS).
13+
---
14+
# Troubleshoot issues with LocalDNS on Azure Kubernetes Service (AKS)
15+
This article describes a troubleshooting workflow for Domain Name System (DNS) resolution problems in Microsoft Azure Kubernetes Service (AKS) clusters that use LocalDNS. For an overview of LocalDNS, see [DNS Resolution in Azure Kubernetes Service (AKS)](https://learn.microsoft.com/en-us/azure/aks/dns-concepts#localdns-in-azure-kubernetes-service-preview).
16+
17+
## Prerequisites
18+
19+
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line tool
20+
21+
**Note:** To install kubectl by using [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
22+
23+
- The [systemctl](https://man7.org/linux/man-pages/man1/systemctl.1.html) command-line tool.
24+
25+
- The [journalctl](https://www.man7.org/linux/man-pages/man1/journalctl.1.html) command-line tool.
26+
27+
## Identifying patterns in DNS failures
28+
Before you diagnose LocalDNS, look for patterns in the DNS failures:

1. Does DNS resolution fail always or only intermittently?
2. Do the failures occur on all nodes, on a specific node pool or subset of nodes, or on a single node?
3. Do the failures occur on nodes in a specific Azure availability zone, or in all zones?
4. Which protocols fail: both TCP and UDP, or only one of them?
5. Which DNS zones fail: all zones, or only traffic for a specific zone?

**Note:** In item 5, "zones" refers to DNS zones such as *cluster.local* and *"."* (root), not to physical zones in Azure.
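
To narrow down the protocol and DNS zone questions, it can help to send the same query over UDP and TCP for one in-cluster name and one external name. The following is a minimal sketch, not part of any LocalDNS tooling: `print_probe_matrix` is a hypothetical helper that only prints `kubectl exec ... dig` commands so that you can review them first. The pod name that you pass in and the two query names are assumptions to replace with your own.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one dig command per (DNS zone, protocol) pair.
# Nothing is executed here; review the output before you run it.
print_probe_matrix() {
  local pod="$1"
  local name proto
  # One in-cluster name (cluster.local zone) and one external name (root zone).
  for name in kubernetes.default.svc.cluster.local bing.com; do
    # +notcp keeps dig on UDP (its default); +tcp forces TCP.
    for proto in +notcp +tcp; do
      echo "kubectl exec ${pod} -- dig ${name} ${proto} +time=5 +tries=1"
    done
  done
}
```

For example, `print_probe_matrix dnsutils` prints four commands. If only the `+tcp` or only the `+notcp` variants fail, or only one of the two names fails, that points to a protocol-specific or zone-specific problem.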
36+
37+
## Diagnose LocalDNS with a test dnsutils pod
38+
39+
### Deploy a test dnsutils pod
40+
Option 1 - Deploy a test pod to your cluster using the following command:
41+
```bash
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
```
44+
45+
Option 2 - If you see DNS issues on specific nodes, pin the test pod to one of those nodes by using a `nodeSelector`. Replace `<NODE>` with the node name:

```bash
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils2
  namespace: default
spec:
  nodeSelector:
    kubernetes.io/hostname: <NODE>
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/agnhost:2.39
    command:
    - sleep
    - "infinity"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
EOF
```
67+
68+
Option 3 - If your cluster runs both Linux and Windows nodes, deploy the test pod as a DaemonSet that runs on all Linux nodes:

```bash
cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dnsutils
  namespace: default
spec:
  selector:
    matchLabels:
      app: dnsutils
  template:
    metadata:
      labels:
        app: dnsutils
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      containers:
      - name: dnsutils
        image: registry.k8s.io/e2e-test-images/agnhost:2.39
        command:
        - sleep
        - "infinity"
        imagePullPolicy: IfNotPresent
EOF
```
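
A DaemonSet creates one pod per node, and each pod gets a generated name. To run a test from a particular node, you first need the name of the pod on that node. The following sketch assumes the `app: dnsutils` label and the `default` namespace from the DaemonSet manifest; `dnsutils_pod_on_node` is a hypothetical helper name.

```bash
# Hypothetical helper: print the name of the dnsutils DaemonSet pod that
# runs on the given node (assumes the app=dnsutils label in "default").
dnsutils_pod_on_node() {
  local node="$1"
  kubectl get pods --namespace default \
    --selector app=dnsutils \
    --field-selector "spec.nodeName=${node}" \
    --output jsonpath='{.items[0].metadata.name}'
}
```

You could then run a query from that node with something like `kubectl exec "$(dnsutils_pod_on_node <NODE>)" -- dig bing.com`.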
97+
98+
### Enable query logging for LocalDNS

Query logging is usually turned off in production because of its high memory usage and performance cost. For troubleshooting, however, enable query logging in your LocalDNS configuration to find the root cause of the errors. After the analysis is complete, turn it off again.

Option 1 - Enable query logging on all nodes

Modify your LocalDNS configuration to set `"queryLogging": "Log"` for one or more DNS zones.
105+
106+
```json
{
  "mode": "Required",
  "vnetDNSOverrides": {
    ".": {
      "queryLogging": "Log",
      "protocol": "PreferUDP",
      "forwardDestination": "VnetDNS",
      "forwardPolicy": "Sequential",
      "maxConcurrent": 1000,
      "cacheDurationInSeconds": 3600,
      "serveStaleDurationInSeconds": 3600,
      "serveStale": "Immediate"
    },
    "cluster.local": {
      "queryLogging": "Log",
      "protocol": "ForceTCP",
      "forwardDestination": "ClusterCoreDNS",
      "forwardPolicy": "Sequential",
      "maxConcurrent": 1000,
      "cacheDurationInSeconds": 3600,
      "serveStaleDurationInSeconds": 3600,
      "serveStale": "Immediate"
    }
  },
  "kubeDNSOverrides": {
    ".": {
      "queryLogging": "Log",
      "protocol": "PreferUDP",
      "forwardDestination": "ClusterCoreDNS",
      "forwardPolicy": "Sequential",
      "maxConcurrent": 1000,
      "cacheDurationInSeconds": 3600,
      "serveStaleDurationInSeconds": 3600,
      "serveStale": "Immediate"
    },
    "cluster.local": {
      "queryLogging": "Log",
      "protocol": "ForceTCP",
      "forwardDestination": "ClusterCoreDNS",
      "forwardPolicy": "Sequential",
      "maxConcurrent": 1000,
      "cacheDurationInSeconds": 3600,
      "serveStaleDurationInSeconds": 3600,
      "serveStale": "Immediate"
    }
  }
}
```
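
Before you apply the file, a quick check that every zone you intended to change really has logging turned on can save a reimage cycle. The following sketch uses only `grep`; `count_query_logging` is a hypothetical helper, and it assumes that each `queryLogging` setting sits on its own line, as in the sample configuration.

```bash
# Hypothetical helper: count the zone entries in a LocalDNS config file
# that have "queryLogging" set to "Log" (one setting per line assumed).
# grep -c prints 0 and returns a nonzero status if nothing matches.
count_query_logging() {
  local file="$1"
  grep -c -E '"queryLogging"[[:space:]]*:[[:space:]]*"Log"' "$file"
}
```

For the sample configuration, which enables logging for two zones under each of `vnetDNSOverrides` and `kubeDNSOverrides`, `count_query_logging ./localdnsconfig.json` should report 4.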
155+
156+
Apply the configuration to a node pool by using the Azure CLI:

```bash
az aks nodepool update --name mynodepool1 --cluster-name myAKSCluster --resource-group myResourceGroup --localdns-config ./localdnsconfig.json
```
161+
162+
**Note:** Making changes to the LocalDNS configuration will trigger a reimage operation on the nodes in the given node pool.
163+
164+
Option 2 - Enable query logging on a specific node

To diagnose LocalDNS issues on a single node, temporarily edit the LocalDNS configuration on that node. [Connect to the node](https://learn.microsoft.com/en-us/azure/aks/node-access#connect-using-kubectl-debug), update the Corefile that LocalDNS uses, and restart only the localdns service on that node.

**Note:** Changes made this way are ephemeral. They are overwritten and don't persist after troubleshooting is complete.
169+
170+
```bash
# Connect to the node before you run the following commands.

# Open the LocalDNS configuration file.
vi /opt/azure/containers/localdns/localdns.corefile

# In the editor, change `errors` to `log` for one zone or for all zones.
# The file looks similar to the following (Corefile syntax, not shell commands):

# ***********************************************************************************
# WARNING: Changes to this file will be overwritten and not persisted.
# ***********************************************************************************
# whoami (used for health check of DNS)
health-check.localdns.local:53 {
    bind 169.254.10.10 169.254.10.11
    whoami
}
# VnetDNS overrides apply to DNS traffic from pods with dnsPolicy:default or kubelet (referred to as VnetDNS traffic).
.:53 {
    errors
    bind 169.254.10.10
    forward . 168.63.129.16 {
        policy sequential
        max_concurrent 1000
    }
    ready 169.254.10.10:8181
    cache 3600s {
        success 9984
        denial 9984
        serve_stale 3600s verify
        servfail 0
    }
    loop
    nsid localdns
    prometheus :9253
    template ANY ANY internal.cloudapp.net {
        match "^(?:[^.]+\.){4,}internal\.cloudapp\.net\.$"
        rcode NXDOMAIN
        fallthrough
    }
    template ANY ANY reddog.microsoft.com {
        rcode NXDOMAIN
    }
}
cluster.local:53 {
    errors
    bind 169.254.10.10
    forward . 10.0.0.10 {
        force_tcp
        policy sequential
        max_concurrent 1000
    }
    ready 169.254.10.10:8181
    cache 3600s {
        success 9984
        denial 9984
        serve_stale 3600s verify
        servfail 0
    }
    loop
    nsid localdns
    prometheus :9253
}
...

# Save the changes, and then restart the localdns service.
systemctl restart localdns
```
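
If you prefer not to edit the file by hand, the `errors` to `log` change for all zones can be sketched with `sed`. This is an illustration under assumptions, not a supported procedure: `corefile_with_query_log` is a hypothetical helper, and it assumes that `errors` appears on a line by itself, as in the sample file. It writes the modified text to standard output and leaves the input file untouched.

```bash
# Hypothetical helper: print a copy of a Corefile in which every standalone
# `errors` line is replaced by `log`, keeping the original indentation.
# The input file itself is not modified.
corefile_with_query_log() {
  local corefile="$1"
  sed -E 's/^([[:space:]]*)errors[[:space:]]*$/\1log/' "$corefile"
}

# Possible usage on the node (review the diff before you copy anything back):
#   corefile_with_query_log /opt/azure/containers/localdns/localdns.corefile > /tmp/localdns.corefile
#   diff /opt/azure/containers/localdns/localdns.corefile /tmp/localdns.corefile
```

Writing to a separate file first keeps the original available for comparison and makes it easy to limit the change to a single zone by editing the copy.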
240+
241+
After the restart, LocalDNS logs every query for the zones that you changed.

### Generate traffic from the dnsutils pod

Next, send some DNS traffic through LocalDNS. LocalDNS listens on two IP addresses: KubeDNS traffic goes to the cluster listener IP, 169.254.10.11, and VnetDNS traffic goes to the node listener IP, 169.254.10.10. Both listen on port 53. The following commands assume the pod from Option 1, which is named `dnsutils`. If you used Option 2 or Option 3, substitute the name of your test pod.
246+
247+
#### Test KubeDNS zone traffic
248+
249+
```bash
kubectl exec dnsutils -- dig bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6

; <<>> DiG 9.16.27 <<>> bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7452
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;bing.com. IN A

;; ANSWER SECTION:
bing.com. 30 IN A 150.171.27.10
bing.com. 30 IN A 150.171.28.10

;; Query time: 3 msec
;; SERVER: 169.254.10.11#53(169.254.10.11)
;; WHEN: Thu Jul 03 16:57:42 UTC 2025
;; MSG SIZE rcvd: 74
```
270+
271+
#### Test VnetDNS zone traffic
272+
273+
```bash
kubectl exec dnsutils -- dig bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6 @169.254.10.10

; <<>> DiG 9.16.27 <<>> bing.com +ignore +noedns +search +noshowsearch +time=10 +tries=6 @169.254.10.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3580
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;bing.com. IN A

;; ANSWER SECTION:
bing.com. 1315 IN A 150.171.28.10
bing.com. 1315 IN A 150.171.27.10

;; Query time: 7 msec
;; SERVER: 169.254.10.10#53(169.254.10.10)
;; WHEN: Thu Jul 03 16:59:07 UTC 2025
;; MSG SIZE rcvd: 74
```
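
When you repeat these queries across many pods or nodes, the two fields that matter most are the response status and the server that answered. The following sketch reduces dig output to those two values. `dig_summary` is a hypothetical helper, and it assumes the `status:` and `SERVER:` line layout shown in the sample output.

```bash
# Hypothetical helper: read dig output on stdin and print "<status> <server>".
# Assumes the ";; ->>HEADER<<- ... status: X," and ";; SERVER: ip#port(...)"
# lines that appear in the sample output.
dig_summary() {
  awk '
    /status:/ { sub(/.*status: /, ""); sub(/,.*/, ""); status = $0 }
    /SERVER:/ { sub(/.*SERVER: /, ""); sub(/#.*/, ""); server = $0 }
    END       { print status, server }
  '
}
```

For example, `kubectl exec dnsutils -- dig bing.com | dig_summary` condenses the first sample into the status `NOERROR` and the server `169.254.10.11`, which confirms that the cluster listener IP answered the query.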
294+
295+
### View the LocalDNS logs

Finally, view the logs from your LocalDNS instances. Connect to the specific node and run the following commands:

```bash
# View the logs for the localdns service.
journalctl -u localdns

# View the logs in reverse chronological order (latest first).
journalctl -u localdns --reverse

# Follow the logs continuously.
journalctl -u localdns -f

# Sample output for the bing.com queries.
journalctl -u localdns | grep bing.com
Jul 03 16:57:42 aks-userpool-24995383-vmss000000 localdns-coredns[2491520]: [INFO] 10.244.0.95:41796 - 7452 "A IN bing.com. udp 26 false 512" NOERROR qr,rd,ra 74 0.004490668s
Jul 03 16:59:07 aks-userpool-24995383-vmss000000 localdns-coredns[2491520]: [INFO] 10.244.0.95:58454 - 3580 "A IN bing.com. udp 26 false 512" NOERROR qr,rd,ra 74 0.001570158s
```
314+
315+
If you see log entries for your traffic, the pod reached the localdns service successfully.
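
On a busy node the query log grows quickly, so a per-protocol, per-response-code count is often easier to read than raw lines. The following sketch assumes the log line layout shown in the sample output, where the quoted part holds the query (including the transport) and the response code follows the closing quote. `summarize_query_log` is a hypothetical helper name.

```bash
# Hypothetical helper: read LocalDNS query-log lines on stdin and print
# "<protocol> <rcode> <count>" for each combination, sorted.
summarize_query_log() {
  awk -F'"' '
    /\[INFO\]/ && NF >= 3 {
      split($2, q, " ")   # q[4] is the transport, for example udp or tcp
      split($3, r, " ")   # r[1] is the response code, for example NOERROR
      count[q[4] " " r[1]]++
    }
    END { for (k in count) print k, count[k] }
  ' | sort
}
```

For example, `journalctl -u localdns | summarize_query_log` on the two sample lines yields `udp NOERROR 2`. A large share of `SERVFAIL` entries, or failures on only one transport, ties back to the failure patterns that you identified earlier.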
316+
317+
## Next steps
318+
If these logs don't help you find the root cause, enable [query logging for CoreDNS](https://learn.microsoft.com/en-us/azure/aks/coredns-custom#enable-dns-query-logging) to check whether CoreDNS works as intended.
319+
320+
[!INCLUDE [Azure Help Support](../../../../includes/azure-help-support.md)]
321+
322+
[!INCLUDE [Third-party disclaimer](../../../../includes/third-party-disclaimer.md)]
323+
324+
[!INCLUDE [Third-party contact disclaimer](../../../../includes/third-party-contact-disclaimer.md)]
325+
