    message: "Domain ldap.corp.internal: 8/10 queries succeeded (80.0% success rate)"
```

### Best Practices for DNS Monitoring

#### Identifying Critical DNS Dependencies

Before configuring DNS monitoring, identify the domains that are critical to your infrastructure:

| Dependency Type | Examples | Risk Level |
|-----------------|----------|------------|
| **Authentication** | LDAP/AD servers, OAuth providers | Critical - auth failures block users |
| **Databases** | PostgreSQL, MySQL, MongoDB hostnames | Critical - application failures |
| **Service Mesh** | Consul, Istio service discovery | High - service routing failures |
| **External APIs** | Payment gateways, third-party services | High - feature degradation |
| **Container Registries** | gcr.io, docker.io, custom registries | Medium - deployment failures |
| **Cluster Services** | kubernetes.default.svc.cluster.local | Critical - pod communication failures |

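A single monitor can cover several of these dependency classes at once. The sketch below is illustrative only: it reuses the configuration fields shown in the use-case examples later in this section, and the domain names are placeholders for your own dependencies.

```yaml
monitors:
  - name: critical-dns-dependencies
    type: network-dns-check
    interval: 30s
    config:
      clusterDomains:
        - kubernetes.default.svc.cluster.local           # cluster services
      externalDomains:
        - gcr.io                                         # container registry
        - payments.example.com                           # external API (placeholder)
      customQueries:
        - domain: "ldap.corp.internal"                   # authentication (placeholder)
          recordType: "A"
          testEachNameserver: true
        - domain: "postgres-primary.db.svc.cluster.local"  # database
          recordType: "A"
          consistencyCheck: true
      latencyThreshold: 1s
      failureCountThreshold: 2
```

Splitting these into separate monitors (as in the use-case examples below) keeps alerting and thresholds easier to reason about, but a combined monitor is a reasonable starting point.
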
#### Common DNS Issues in Kubernetes Clusters

| Issue | Symptoms | Detection Method |
|-------|----------|------------------|
| **Upstream DNS overload** | Intermittent timeouts in large clusters (100+ nodes) | Success rate tracking, consistency checking |
| **Custom TLD misconfiguration** | NXDOMAIN for .local, .internal, .test domains | Error type classification shows NXDOMAIN |
| **Split-horizon DNS** | Different results from different nameservers | Per-nameserver testing |
| **CoreDNS pod failures** | Cluster DNS fails while external resolution works | Compare clusterDomains vs externalDomains results |
| **DNS cache TTL issues** | Stale IPs after service migration | Consistency checking shows IP variation |
| **Network policy blocking** | Timeouts to specific nameservers | Per-nameserver testing with error classification |

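Most of these issues can be narrowed down with a single triage-style monitor that checks cluster and external domains separately (a failure in one but not the other points to CoreDNS versus upstream DNS) and tests each nameserver individually. A minimal sketch, using the same fields as the examples below; the custom domain is a placeholder:

```yaml
monitors:
  - name: dns-triage
    type: network-dns-check
    interval: 30s
    config:
      clusterDomains:
        - kubernetes.default.svc.cluster.local   # fails alone => likely a CoreDNS problem
      externalDomains:
        - google.com                             # fails alone => likely an upstream problem
      customQueries:
        - domain: "app.corp.internal"            # placeholder custom TLD
          recordType: "A"
          testEachNameserver: true               # isolates split-horizon or blocked nameservers
          consistencyCheck: true                 # surfaces intermittent and stale answers
      failureCountThreshold: 2
```
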
#### Configuration Examples by Use Case

**1. Basic External Connectivity Check:**
```yaml
monitors:
  - name: external-dns
    type: network-dns-check
    interval: 60s
    config:
      clusterDomains: []  # Skip cluster DNS
      externalDomains:
        - google.com
        - cloudflare.com
      latencyThreshold: 2s
      failureCountThreshold: 3
```

**2. Custom TLD Monitoring (.internal, .local):**
```yaml
monitors:
  - name: internal-dns
    type: network-dns-check
    interval: 30s
    config:
      clusterDomains: []
      externalDomains: []
      customQueries:
        - domain: "app.internal.corp"
          recordType: "A"
          testEachNameserver: true  # Find which DNS server fails
        - domain: "db.internal.corp"
          recordType: "A"
          testEachNameserver: true
      nameserverCheckEnabled: true
      failureCountThreshold: 2
```

**3. High-Availability DNS Validation:**
```yaml
monitors:
  - name: ha-dns-check
    type: network-dns-check
    interval: 15s  # More frequent checks
    config:
      customQueries:
        - domain: "api-gateway.prod.svc.cluster.local"
          consistencyCheck: true    # Detect intermittent failures
          testEachNameserver: true  # Check all DNS servers
      successRateTracking:
        enabled: true
        windowSize: 20
        failureRateThreshold: 5  # 5% threshold for critical services
        minSamplesRequired: 10
      consistencyChecking:
        enabled: true
        queriesPerCheck: 10
        intervalBetweenQueries: 50ms  # Aggressive testing
```

**4. Database Hostname Monitoring:**
```yaml
monitors:
  - name: database-dns
    type: network-dns-check
    interval: 30s
    config:
      customQueries:
        - domain: "postgres-primary.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "postgres-replica.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "redis-master.cache.svc.cluster.local"
          recordType: "A"
      latencyThreshold: 100ms   # Low latency required for DB connections
      failureCountThreshold: 1  # Alert immediately
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5
```

#### Remediation Guidance

When DNS issues are detected, consider these remediation steps:

| Condition | Cause | Remediation |
|-----------|-------|-------------|
| `ClusterDNSDown` | CoreDNS pods unhealthy | Check `kubectl -n kube-system get pods -l k8s-app=kube-dns` |
| `DNSResolutionDegraded` (partial nameserver failure) | One upstream DNS server failing | Update `/etc/resolv.conf` or the CoreDNS upstream servers |
| `DNSResolutionIntermittent` | Overloaded DNS servers | Increase CoreDNS replicas, enable DNS caching |
| NXDOMAIN errors | Missing DNS record or zone | Add the record to the DNS zone, check CoreDNS stub domains |
| High latency | Network congestion, distant DNS servers | Use a local caching DNS, reduce TTL for faster updates |
| `DNSResolutionInconsistent` (varying IPs) | DNS load balancing, stale cache | Verify the behavior is expected, check TTL settings |

**CoreDNS Stub Domain Configuration:**

If custom TLDs (.internal, .corp) are failing, configure CoreDNS stub domains:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    corp.internal:53 {
        errors
        cache 30
        forward . 10.0.0.53 10.0.0.54   # Internal DNS servers
    }
```

**Temporary `/etc/hosts` Workaround:**

For immediate mitigation while DNS is being fixed:

```yaml
apiVersion: v1
kind: Pod
spec:
  hostAliases:
    - ip: "10.0.1.100"
      hostnames:
        - "ldap.corp.internal"
    - ip: "10.0.1.101"
      hostnames:
        - "auth.corp.internal"
```

#### Integration with Kubernetes Events

DNS conditions appear as node conditions viewable via `kubectl`:

```bash
# View DNS-related node conditions
kubectl describe node <node-name> | grep -A5 "DNS"

# Example output:
#   DNSResolutionDegraded   True   PartialNameserverFailure
#   Message: Domain ldap.corp.internal: 2/3 nameservers responding
```

**Alerting with Prometheus:**

Node Doctor exports DNS metrics that can be used for alerting:

```yaml
# Example Prometheus alert rules (using kube-state-metrics node conditions)
groups:
  - name: dns-alerts
    rules:
      - alert: DNSResolutionIntermittent
        expr: |
          kube_node_status_condition{condition="DNSResolutionIntermittent", status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intermittent DNS resolution on {{ $labels.node }}"
          description: "DNS resolution is intermittent, indicating upstream DNS issues"

      - alert: DNSResolutionDegraded
        expr: |
          kube_node_status_condition{condition="DNSResolutionDegraded", status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Degraded DNS resolution on {{ $labels.node }}"
          description: "One or more DNS servers are failing for critical domains"

      - alert: ClusterDNSDown
        expr: |
          kube_node_status_condition{condition="ClusterDNSDown", status="true"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster DNS is down on {{ $labels.node }}"
          description: "Cluster DNS resolution has repeatedly failed"
```

**SLO/SLA Tracking:**

Use success rate tracking for DNS SLOs:

```yaml
# Configuration for a 99.9% DNS availability SLO
successRateTracking:
  enabled: true
  windowSize: 100            # Track the last 100 checks
  failureRateThreshold: 0.1  # 0.1% failures allowed = 99.9% availability
  minSamplesRequired: 50     # Need 50 samples before alerting
```

---

## Gateway Monitor