# EN_K8s_Troubleshooting
When CPU Limits are set, they are enforced via CFS (Completely Fair Scheduler) Quota. Throttling can occur even when actual CPU utilization is low.
```bash
# Check per-container CPU usage
kubectl top pods --containers
```

```promql
# PromQL: fraction of CFS periods in which the container was throttled
rate(container_cpu_cfs_throttled_periods_total[5m]) /
rate(container_cpu_cfs_periods_total[5m]) > 0.25
```

CFS enforces CPU Limits in 100 ms periods. A container with `cpu: "500m"` can use only 50 ms of CPU time per 100 ms period. Bursty workloads exhaust the quota quickly, causing throttling even at low average utilization.
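The quota arithmetic above can be sketched in Python. This is a toy model with hypothetical helper names; the 100 ms period is the CFS default.

```python
def cfs_budget_ms(cpu_limit_millicores, period_ms=100):
    """CPU time (ms) a container may use per CFS period."""
    return period_ms * cpu_limit_millicores / 1000

def throttled_ratio(throttled_periods, total_periods):
    """Fraction of periods in which the container hit its quota."""
    return throttled_periods / total_periods if total_periods else 0.0

# A 500m limit allows only 50ms of CPU per 100ms period:
print(cfs_budget_ms(500))        # 50.0
# A burst needing 80ms in one period gets paused for the remaining 30ms,
# adding latency even though average utilization stays low.
print(throttled_ratio(25, 100))  # 0.25, matching the alert threshold above
```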
1. Remove Limits or set them high enough:

```yaml
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"  # Set much higher than requests
```

2. Set only Requests (no Limits):
```yaml
resources:
  requests:
    cpu: "500m"
  # No limits — Burstable QoS class
```

3. Tune JVM for Java applications:
```yaml
env:
  - name: JAVA_OPTS
    value: "-XX:ActiveProcessorCount=2 -XX:+UseContainerSupport"
```

- Set `requests` to actual average usage
- Set `limits` to peak usage (spike tolerance)
- Use `Guaranteed` QoS (requests == limits) only for latency-sensitive workloads
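The sizing guidance above can be sketched as a toy calculator. This is a hedged example with a hypothetical helper name; using the average for `requests` and a p99-style peak for `limits` is one reasonable convention, not an official formula.

```python
def suggest_resources(cpu_samples_millicores):
    """Derive requests/limits from observed CPU samples (millicores)."""
    s = sorted(cpu_samples_millicores)
    avg = sum(s) / len(s)                          # requests: average usage
    p99 = s[min(len(s) - 1, int(len(s) * 0.99))]   # limits: near-peak usage
    return {"requests": f"{round(avg)}m", "limits": f"{round(p99)}m"}

samples = [200, 250, 300, 280, 260, 1200]  # bursty workload
print(suggest_resources(samples))  # {'requests': '415m', 'limits': '1200m'}
```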
The ndots:5 setting in a Pod's /etc/resolv.conf causes any name with fewer than five dots to be tried against every search domain before the bare name: with the default three-domain search list below, that is 4 queries per record type (8 with parallel A and AAAA lookups).
```
# /etc/resolv.conf in a Pod
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```
Querying example.com triggers this search order:

1. example.com.default.svc.cluster.local
2. example.com.svc.cluster.local
3. example.com.cluster.local
4. example.com (final — only this one resolves)
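The expansion rule above can be sketched as a small simulator. The helper name is hypothetical; the logic mirrors glibc resolver behavior.

```python
def expand(name, search, ndots=5):
    """Candidate queries the resolver tries, in order."""
    if name.endswith("."):                 # trailing dot: absolute name, no expansion
        return [name.rstrip(".")]
    if name.count(".") < ndots:            # fewer dots than ndots: search list first
        return [f"{name}.{d}" for d in search] + [name]
    return [name] + [f"{name}.{d}" for d in search]

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(expand("example.com", search))
# 4 candidate queries per record type (8 with parallel A and AAAA)
print(expand("api.example.com.", search))  # ['api.example.com'], FQDN: 1 query
```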
1. Use FQDN (add trailing dot):

```python
import requests

# The trailing dot marks the name absolute, skipping the search list
requests.get("http://api.example.com./v1/users")
```

2. Lower ndots per Pod:
```yaml
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
      - name: single-request-reopen  # Prevent parallel A/AAAA race condition
```

3. Enable CoreDNS autopath plugin:
```yaml
# CoreDNS ConfigMap
data:
  Corefile: |
    .:53 {
        autopath @kubernetes
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods verified   # autopath requires "pods verified"
          fallthrough in-addr.arpa ip6.arpa
        }
        cache 30
    }
```

4. Deploy NodeLocal DNSCache:
```bash
# Installs a DNS cache on each node; upstream queries to kube-dns use TCP
kubectl apply -f https://k8s.io/examples/admin/dns/nodelocaldns.yaml
```

```promql
# CoreDNS p99 response time
histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))

# DNS error rate
rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])
```
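What the `histogram_quantile` query above computes can be sketched in plain Python: a simplified model (hypothetical bucket data, no `+Inf` handling) of Prometheus's per-bucket linear interpolation.

```python
def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound_seconds, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly inside the bucket that crosses the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

buckets = [(0.001, 50), (0.01, 90), (0.1, 99), (1.0, 100)]
print(histogram_quantile(0.99, buckets))  # falls in the 0.01-0.1s bucket
```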
Q33. What is the difference between ImagePullBackOff and ErrImagePull, and how do you fix Private Registry authentication issues?
- ErrImagePull: the initial image pull failure (the detailed error appears in Pod events)
- ImagePullBackOff: kubelet is retrying the pull with exponential backoff after repeated failures
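The retry schedule behind ImagePullBackOff can be sketched from the kubelet's defaults (initial 10 s delay, doubling, capped at 5 minutes); the helper name is hypothetical.

```python
def backoff_delays(initial=10, cap=300, attempts=7):
    """Seconds between successive pull retries under exponential backoff."""
    delays, d = [], initial
    for _ in range(attempts):
        delays.append(min(d, cap))  # delay doubles until it hits the cap
        d *= 2
    return delays

print(backoff_delays())  # [10, 20, 40, 80, 160, 300, 300]
```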
1. Image does not exist:

```bash
kubectl describe pod <pod-name>
# Look for: Failed to pull image "my-image:v1.0": ... not found
```

2. Private Registry authentication failure:
```bash
# Create docker-registry Secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email>
```

```yaml
# Reference in Pod spec
spec:
  imagePullSecrets:
    - name: regcred
```

```yaml
# Add to the ServiceAccount for automatic use
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
imagePullSecrets:
  - name: regcred
```

3. Docker Hub rate limit:
```bash
# Check rate limit status
curl --head -H "Authorization: Bearer <token>" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest
```

- Solution: Authenticate with a Docker Hub account or set up a registry mirror
4. Network issue:

```bash
# Check DNS resolution inside a Pod
kubectl run debug --image=busybox --rm -it -- nslookup <registry-url>

# Check node network connectivity
kubectl debug node/<node-name> -it --image=ubuntu
```

5. Use local image (IfNotPresent):
```yaml
spec:
  containers:
    - name: app
      image: my-app:v1.0
      imagePullPolicy: IfNotPresent  # Use local cache if available
```

Common causes of a Node going NotReady:

- kubelet stopped/crashed
- Network partition
- Disk/memory resource exhaustion
- CNI plugin issue
- Certificate expiry
```bash
# Step 1: Check Node conditions
kubectl describe node <node-name>
# Look for: DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable

# Step 2: SSH to the node and check kubelet
systemctl status kubelet
journalctl -u kubelet -f --since "10 minutes ago"

# Step 3: Check resource usage
df -h    # Disk usage
free -m  # Memory usage
df -i    # Inode usage (often overlooked)

# Step 4: Check the CNI and container runtime
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium"
systemctl status containerd
```

kubelet cannot start:
```bash
# Certificate issue
ls -la /etc/kubernetes/pki/
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Fix: Rotate certificates (kubeadm clusters)
kubeadm certs renew all
systemctl restart kubelet
```

Disk pressure:
```bash
# Clean up container images
crictl rmi --prune

# Remove unused Docker images
docker system prune -af

# Clean up log files
find /var/log -name "*.log" -mtime +7 -delete
journalctl --vacuum-size=500M
```

CNI issue:
```bash
# Restart CNI pods
kubectl delete pod -n kube-system -l k8s-app=calico-node

# Reinitialize CNI (on the node)
rm -rf /etc/cni/net.d/*
systemctl restart kubelet
```

Recovery checklist:

- `systemctl restart kubelet` — attempt a restart
- Clean up disk space (logs, images)
- Restart CNI Pods
- If unrecoverable, replace the node
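Step 1 above (reading Node conditions) can also be done programmatically, e.g. from `kubectl get node -o json` output. A sketch with a hypothetical helper; the input mirrors the NodeStatus API shape, and the sample data is made up.

```python
def pressure_conditions(node):
    """Condition types (other than Ready) currently reported True."""
    return [c["type"] for c in node["status"]["conditions"]
            if c["type"] != "Ready" and c["status"] == "True"]

node = {"status": {"conditions": [
    {"type": "Ready", "status": "False"},
    {"type": "DiskPressure", "status": "True"},
    {"type": "MemoryPressure", "status": "False"},
]}}
print(pressure_conditions(node))  # ['DiskPressure']
```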
Pod rescheduling: After --pod-eviction-timeout (default 5 minutes), Pods are rescheduled to other nodes.
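Under taint-based eviction (the default in current Kubernetes), this timeout surfaces as the `tolerationSeconds` on the automatically added `node.kubernetes.io/not-ready` toleration (300 s by default) and can be tuned per Pod. A sketch with a hypothetical 60-second budget:

```yaml
# Hypothetical Pod spec: evict this Pod after only 60s on a NotReady node
spec:
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60
```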
A PV/PVC stuck in the Terminating state is usually being held by Finalizers, which ensure cleanup work completes before the resource is deleted.
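A minimal sketch of the finalizer protocol itself, assuming a hypothetical controller with finalizer `example.com/cleanup`: the API server keeps a deleted object in Terminating until its `metadata.finalizers` list is empty.

```python
FINALIZER = "example.com/cleanup"  # hypothetical finalizer name

def reconcile(obj, cleanup):
    """One controller pass; returns True once the object may be deleted."""
    if obj["metadata"].get("deletionTimestamp"):
        if FINALIZER in obj["metadata"]["finalizers"]:
            cleanup(obj)  # e.g. detach the cloud volume
            obj["metadata"]["finalizers"].remove(FINALIZER)
        return obj["metadata"]["finalizers"] == []
    if FINALIZER not in obj["metadata"]["finalizers"]:
        obj["metadata"]["finalizers"].append(FINALIZER)
    return False

pvc = {"metadata": {"finalizers": []}}
reconcile(pvc, cleanup=lambda o: None)         # adds the finalizer
pvc["metadata"]["deletionTimestamp"] = "2024-01-01T00:00:00Z"
print(reconcile(pvc, cleanup=lambda o: None))  # True: cleanup done, deletable
```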
```bash
# Check PVC finalizers
kubectl get pvc <pvc-name> -o jsonpath='{.metadata.finalizers}'
# Output: ["kubernetes.io/pvc-protection"]

# Check PV finalizers
kubectl get pv <pv-name> -o jsonpath='{.metadata.finalizers}'
# Output: ["kubernetes.io/pv-protection"]
```

```bash
# 1. Check if a Pod is using the PVC
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") | .metadata.name'

# 2. Delete the Pod first if it exists
kubectl delete pod <pod-name>

# 3. Remove the PVC Finalizer
kubectl patch pvc <pvc-name> -p '{"metadata":{"finalizers":null}}'

# 4. Remove the PV Finalizer
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'
```

```bash
# Check whether the volume is detached (AWS example)
aws ec2 describe-volumes --volume-ids <volume-id>

# If still attached, force detach
aws ec2 detach-volume --volume-id <volume-id> --force
```

| Policy | Behavior after PVC deletion |
|---|---|
| Delete | PV and underlying storage automatically deleted |
| Retain | PV remains in Released state — manual cleanup required |
| Recycle | (Deprecated) Cleans data and makes available again |
```bash
# Reuse a PV with Retain policy
kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'
```

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
volumeBindingMode: WaitForFirstConsumer  # PVC stays Pending until a Pod is scheduled
```

When using this mode, a PVC without a Pod remains in the Pending state — this is expected behavior.
Key Terms: CPU Throttling, CFS Quota, ndots, CoreDNS, ImagePullBackOff, Node NotReady, Finalizer — refer to the Troubleshooting glossary section for detailed explanations.