Commit c99f8d2

docs: Add DNS monitoring best practices guide

Comprehensive best practices documentation covering:

- Identifying critical DNS dependencies (auth, databases, service mesh, APIs)
- Common DNS issues in Kubernetes clusters with detection methods
- Configuration examples for 4 use cases:
  - Basic external connectivity check
  - Custom TLD monitoring (.internal, .local)
  - High-availability DNS validation
  - Database hostname monitoring
- Remediation guidance with:
  - Condition-to-cause-to-fix mapping table
  - CoreDNS stub domain configuration example
  - Temporary /etc/hosts workaround
- Integration with Kubernetes:
  - kubectl commands for viewing conditions
  - Prometheus alerting rules for DNS conditions
  - SLO/SLA tracking configuration

This completes Feature #390 (Enhanced DNS Resolution Monitoring).

1 parent b97d01b commit c99f8d2

docs/monitors.md (1 file changed: 237 additions, 0 deletions)

@@ -615,6 +615,243 @@ conditions:
      message: "Domain ldap.corp.internal: 8/10 queries succeeded (80.0% success rate)"
```

#### Best Practices for DNS Monitoring

##### Identifying Critical DNS Dependencies

Before configuring DNS monitoring, identify the domains that are critical to your infrastructure:

| Dependency Type | Examples | Risk Level |
|-----------------|----------|------------|
| **Authentication** | LDAP/AD servers, OAuth providers | Critical - auth failures block users |
| **Databases** | PostgreSQL, MySQL, MongoDB hostnames | Critical - application failures |
| **Service Mesh** | Consul, Istio service discovery | High - service routing failures |
| **External APIs** | Payment gateways, third-party services | High - feature degradation |
| **Container Registries** | gcr.io, docker.io, custom registries | Medium - deployment failures |
| **Cluster Services** | kubernetes.default.svc.cluster.local | Critical - pod communication |

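As a starting point, the sketch below combines several dependency classes from this table into a single monitor. It reuses the `network-dns-check` options shown in the configuration examples that follow; the domain names are placeholders for your own infrastructure.

```yaml
monitors:
  - name: critical-dependencies
    type: network-dns-check
    interval: 30s
    config:
      externalDomains:
        - gcr.io                                   # container registry
      customQueries:
        - domain: "ldap.corp.internal"             # authentication (placeholder)
          recordType: "A"
        - domain: "postgres-primary.db.svc.cluster.local"  # database (placeholder)
          recordType: "A"
      failureCountThreshold: 2
```
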
##### Common DNS Issues in Kubernetes Clusters

| Issue | Symptoms | Detection Method |
|-------|----------|------------------|
| **Upstream DNS overload** | Intermittent timeouts in large clusters (100+ nodes) | Success rate tracking, consistency checking |
| **Custom TLD misconfiguration** | NXDOMAIN for .local, .internal, .test domains | Error type classification shows NXDOMAIN |
| **Split-horizon DNS** | Different results from different nameservers | Per-nameserver testing |
| **CoreDNS pod failures** | Cluster DNS fails, external works | Compare clusterDomains vs externalDomains results |
| **DNS cache TTL issues** | Stale IPs after service migration | Consistency checking shows IP variation |
| **Network policy blocking** | Timeout to specific nameservers | Per-nameserver testing with error classification |

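To reproduce per-nameserver testing by hand (for example, when chasing split-horizon DNS or a network policy block), you can query each configured nameserver directly. A minimal shell sketch, assuming `dig` is available; the domain is a placeholder:

```bash
#!/bin/sh
# Query each nameserver from /etc/resolv.conf individually so that a
# failing or divergent server stands out. Empty output means no A record
# was returned (e.g. NXDOMAIN); a non-zero exit means the server did not reply.
DOMAIN="app.internal.corp"   # placeholder domain

for NS in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
  echo "--- ${NS} ---"
  dig @"${NS}" "${DOMAIN}" A +time=2 +tries=1 +short || echo "no reply from ${NS}"
done
```
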
##### Configuration Examples by Use Case

**1. Basic External Connectivity Check:**

```yaml
monitors:
  - name: external-dns
    type: network-dns-check
    interval: 60s
    config:
      clusterDomains: []  # Skip cluster DNS
      externalDomains:
        - google.com
        - cloudflare.com
      latencyThreshold: 2s
      failureCountThreshold: 3
```

**2. Custom TLD Monitoring (.internal, .local):**

```yaml
monitors:
  - name: internal-dns
    type: network-dns-check
    interval: 30s
    config:
      clusterDomains: []
      externalDomains: []
      customQueries:
        - domain: "app.internal.corp"
          recordType: "A"
          testEachNameserver: true  # Find which DNS server fails
        - domain: "db.internal.corp"
          recordType: "A"
          testEachNameserver: true
      nameserverCheckEnabled: true
      failureCountThreshold: 2
```

**3. High-Availability DNS Validation:**

```yaml
monitors:
  - name: ha-dns-check
    type: network-dns-check
    interval: 15s  # More frequent checks
    config:
      customQueries:
        - domain: "api-gateway.prod.svc.cluster.local"
          consistencyCheck: true    # Detect intermittent failures
          testEachNameserver: true  # Check all DNS servers
      successRateTracking:
        enabled: true
        windowSize: 20
        failureRateThreshold: 5  # 5% threshold for critical services
        minSamplesRequired: 10
      consistencyChecking:
        enabled: true
        queriesPerCheck: 10
        intervalBetweenQueries: 50ms  # Aggressive testing
```

**4. Database Hostname Monitoring:**

```yaml
monitors:
  - name: database-dns
    type: network-dns-check
    interval: 30s
    config:
      customQueries:
        - domain: "postgres-primary.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "postgres-replica.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "redis-master.cache.svc.cluster.local"
          recordType: "A"
      latencyThreshold: 100ms   # Low latency for DB connections
      failureCountThreshold: 1  # Alert immediately
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5
```

##### Remediation Guidance

When DNS issues are detected, consider these remediation steps (the most common commands are sketched after the table):

| Condition | Cause | Remediation |
|-----------|-------|-------------|
| `ClusterDNSDown` | CoreDNS pods unhealthy | Check `kubectl -n kube-system get pods -l k8s-app=kube-dns` |
| `DNSResolutionDegraded` (partial nameserver failure) | One upstream DNS server failing | Update `/etc/resolv.conf` or CoreDNS upstream servers |
| `DNSResolutionIntermittent` | Overloaded DNS servers | Increase CoreDNS replicas, enable DNS caching |
| NXDOMAIN errors | Missing DNS record or zone | Add record to DNS zone, check CoreDNS stub domains |
| High latency | Network congestion, distant DNS | Use local caching DNS, reduce TTL for faster updates |
| `DNSResolutionInconsistent` (varying IPs) | DNS load balancing, stale cache | Verify expected behavior, check TTL settings |

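A hedged sketch of the corresponding commands, assuming the standard CoreDNS deployment name `coredns` in `kube-system` (some distributions name it differently):

```bash
# ClusterDNSDown: check CoreDNS pod health
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Inspect CoreDNS logs for upstream timeouts or SERVFAIL responses
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# DNSResolutionIntermittent: scale out CoreDNS to relieve overload
kubectl -n kube-system scale deployment coredns --replicas=4
```
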
**CoreDNS Stub Domain Configuration:**

If custom TLDs (.internal, .corp) are failing, configure CoreDNS stub domains:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    corp.internal:53 {
        errors
        cache 30
        forward . 10.0.0.53 10.0.0.54   # Internal DNS servers
    }
```

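To roll the change out, apply the ConfigMap and verify resolution from inside the cluster. Because the Corefile above includes the `reload` plugin, CoreDNS picks up ConfigMap changes on its own; the rollout restart only forces an immediate reload. The filename and test image below are illustrative:

```bash
# Apply the updated ConfigMap (filename is illustrative)
kubectl apply -f coredns-configmap.yaml

# Optional: force an immediate reload instead of waiting for the
# reload plugin to notice the change
kubectl -n kube-system rollout restart deployment coredns

# Verify the stub domain resolves from inside the cluster
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup ldap.corp.internal
```
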
**Temporary `/etc/hosts` Workaround:**

For immediate mitigation while the underlying DNS issue is being fixed, pin the affected hostnames via `hostAliases`:

```yaml
apiVersion: v1
kind: Pod
spec:
  hostAliases:
    - ip: "10.0.1.100"
      hostnames:
        - "ldap.corp.internal"
    - ip: "10.0.1.101"
      hostnames:
        - "auth.corp.internal"
```

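To confirm the aliases took effect, check the entries kubelet appends to the pod's `/etc/hosts` (the pod name is a placeholder):

```bash
kubectl exec <pod-name> -- grep corp.internal /etc/hosts
# Expected:
#   10.0.1.100  ldap.corp.internal
#   10.0.1.101  auth.corp.internal
```
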
##### Integration with Kubernetes Events

DNS conditions appear as node conditions viewable via `kubectl`:

```bash
# View DNS-related node conditions
kubectl describe node <node-name> | grep -A5 "DNS"

# Example output:
#   DNSResolutionDegraded   True   PartialNameserverFailure
#   Message: Domain ldap.corp.internal: 2/3 nameservers responding
```

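To scan every node at once rather than describing them one by one, a JSONPath query works as well; the condition type queried here is the one set by this monitor:

```bash
# Status of the DNSResolutionDegraded condition on every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DNSResolutionDegraded")].status}{"\n"}{end}'
```
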
**Alerting with Prometheus:**

Node Doctor exports DNS metrics that can be used for alerting:

```yaml
# Example Prometheus alert rules (using kube-state-metrics node conditions)
groups:
  - name: dns-alerts
    rules:
      - alert: DNSResolutionIntermittent
        expr: |
          kube_node_status_condition{condition="DNSResolutionIntermittent", status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intermittent DNS resolution on {{ $labels.node }}"
          description: "DNS resolution is intermittent, indicating upstream DNS issues"

      - alert: DNSResolutionDegraded
        expr: |
          kube_node_status_condition{condition="DNSResolutionDegraded", status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Degraded DNS resolution on {{ $labels.node }}"
          description: "One or more DNS servers are failing for critical domains"

      - alert: ClusterDNSDown
        expr: |
          kube_node_status_condition{condition="ClusterDNSDown", status="true"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster DNS is down on {{ $labels.node }}"
          description: "Cluster DNS resolution has repeatedly failed"
```

**SLO/SLA Tracking:**

Use success rate tracking for DNS SLOs:

```yaml
# Configuration for a 99.9% DNS availability SLO
successRateTracking:
  enabled: true
  windowSize: 100            # Track the last 100 checks
  failureRateThreshold: 0.1  # 0.1% failure rate = 99.9% availability
  minSamplesRequired: 50     # Need 50 samples before alerting
```

Note that with a 100-check window, a single failed check (1%) already exceeds the 0.1% threshold, so this configuration effectively alerts on any failure once 50 samples have accumulated.

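To track the SLO over a longer horizon than the in-process window, the same node condition can be averaged in Prometheus. A sketch as a recording rule, using the kube-state-metrics series from the alert rules above and assuming your retention covers the window; the rule name is illustrative:

```yaml
# Recording rule: 30-day cluster DNS availability per node
groups:
  - name: dns-slo
    rules:
      - record: node:dns_cluster_availability:ratio_30d
        expr: |
          1 - avg_over_time(kube_node_status_condition{condition="ClusterDNSDown", status="true"}[30d])
```
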
---

### Gateway Monitor
