
EN_K8s_Troubleshooting

somaz edited this page Mar 31, 2026 · 1 revision

Kubernetes Troubleshooting in Practice

Troubleshooting (Q31-Q35)


Q31. What is CPU Throttling and how do you optimize CFS Quota?

When CPU Limits are set, they are enforced via CFS (Completely Fair Scheduler) Quota. Throttling can occur even when actual CPU utilization is low.

Detection:

# Check per-container CPU usage (throttling itself only shows up in cAdvisor metrics)
kubectl top pods --containers

# PromQL query
rate(container_cpu_cfs_throttled_seconds_total[5m]) /
rate(container_cpu_cfs_periods_total[5m]) > 0.25

Root Cause:

CFS enforces CPU Limits in 100ms periods. A container with cpu: "500m" can only use 50ms of CPU per 100ms period. Bursty workloads can exhaust the quota quickly, causing throttling even at low average utilization.
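The quota arithmetic can be sketched directly. The counter values below are illustrative examples, not read from a live node; on a real cgroup v2 host the quota appears in the container's cpu.max file and the throttle counters in cpu.stat:

```shell
# How a 500m limit maps to CFS quota: cgroup v2 exposes it as
# "<quota_us> <period_us>" in cpu.max (computed here, not read from a node).
limit_millicores=500
period_us=100000                                   # CFS period: 100ms
quota_us=$(( limit_millicores * period_us / 1000 ))
echo "cpu.max equivalent: ${quota_us} ${period_us}"

# Throttle ratio from cpu.stat-style counters (example numbers):
nr_periods=1000
nr_throttled=300
echo "throttled in $(( nr_throttled * 100 / nr_periods ))% of periods"
```

A ratio like the 30% above would trip the 0.25 PromQL alert shown earlier even if average CPU usage looked low.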

Solutions:

1. Remove Limits or set them high enough:

resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"   # Set much higher than requests

2. Set only Requests (no Limits):

resources:
  requests:
    cpu: "500m"
  # No limits — Burstable QoS class

3. Tune JVM for Java applications:

env:
- name: JAVA_OPTS
  value: "-XX:ActiveProcessorCount=2 -XX:+UseContainerSupport"

Recommendation:

  • Set requests to actual average usage
  • Set limits to peak usage (spike tolerance)
  • Use Guaranteed QoS (requests == limits) only for latency-sensitive workloads

Q32. How do you resolve DNS resolution delays caused by ndots=5?

The ndots:5 setting in a Pod's /etc/resolv.conf means any name with fewer than five dots is tried against every search domain before being queried as-is. With the typical three-domain search list that is four lookups per name, each issuing both A and AAAA queries.

How It Works:

# /etc/resolv.conf in a Pod
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Querying example.com triggers this search order:

  1. example.com.default.svc.cluster.local
  2. example.com.svc.cluster.local
  3. example.com.cluster.local
  4. example.com (final — only resolves here)
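The expansion above can be emulated in a few lines of shell; the search domains are hard-coded to the defaults from the resolv.conf shown earlier:

```shell
# Emulate the resolver's search-list expansion for a name whose dot
# count is below the ndots threshold.
name="example.com"
ndots=5
search="default.svc.cluster.local svc.cluster.local cluster.local"

dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
if [ "$dots" -lt "$ndots" ]; then
  for domain in $search; do
    echo "query: ${name}.${domain}"
  done
fi
echo "query: ${name}"   # the absolute name is only tried last
```

With ndots lowered to 2, the same name (one dot short of two... actually two dots short of five by default) would still expand; only a name with at least `ndots` dots, or a trailing dot, skips the search list entirely.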

Solutions:

1. Use FQDN (add trailing dot):

# Python example: the trailing dot marks the name as absolute, skipping the search list
requests.get("http://api.example.com./v1/users")

2. Lower ndots per Pod:

spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
    - name: single-request-reopen   # Prevent parallel A/AAAA race condition

3. Enable CoreDNS autopath plugin:

# CoreDNS ConfigMap
data:
  Corefile: |
    .:53 {
      autopath @kubernetes
      kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
      }
      cache 30
    }

4. Deploy NodeLocal DNSCache:

# Runs a caching DNS server on each node; Pods query it locally (avoiding
# conntrack races), and cache misses are forwarded to kube-dns over TCP
kubectl apply -f https://k8s.io/examples/admin/dns/nodelocaldns.yaml

Monitoring:

# CoreDNS response time
histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m]))

# DNS error rate
rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])

Q33. What is the difference between ImagePullBackOff and ErrImagePull, and how do you fix Private Registry authentication issues?

  • ErrImagePull: the error reported while an image pull attempt is actively failing
  • ImagePullBackOff: the kubelet is waiting between retries, backing off exponentially after repeated failures
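The retry cadence can be sketched as follows; the 10-second initial delay, doubling factor, and 5-minute cap reflect the kubelet's default image-pull backoff:

```shell
# Image-pull retry delays: double from 10s up to a 300s (5 min) cap.
delay=10
cap=300
schedule=""
while :; do
  schedule="${schedule}${delay}s "
  [ "$delay" -ge "$cap" ] && break
  delay=$(( delay * 2 ))
  [ "$delay" -gt "$cap" ] && delay=$cap
done
echo "retry schedule: ${schedule}"
```

This is why a Pod can sit in ImagePullBackOff for up to five minutes after you fix the underlying problem; deleting the Pod forces an immediate retry.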

Causes and Solutions:

1. Image does not exist:

kubectl describe pod <pod-name>
# Check: Failed to pull image "my-image:v1.0": ... not found

2. Private Registry authentication failure:

# Create docker-registry Secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email>
# Reference in Pod spec
spec:
  imagePullSecrets:
  - name: regcred
# Add to ServiceAccount for automatic use
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
imagePullSecrets:
- name: regcred

3. Docker Hub rate limit:

# Check rate limit status
curl --head -H "Authorization: Bearer <token>" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest

  • Solution: Authenticate with a Docker Hub account or set up a registry mirror

4. Network issue:

# Check DNS resolution inside Pod
kubectl run debug --image=busybox --rm -it -- nslookup <registry-url>

# Check node network connectivity
kubectl debug node/<node-name> -it --image=ubuntu

5. Use local image (IfNotPresent):

spec:
  containers:
  - name: app
    image: my-app:v1.0
    imagePullPolicy: IfNotPresent   # Use local cache if available

Q34. How do you diagnose the cause of a Node NotReady state and recover?

Common Causes:

  1. kubelet stopped/crashed
  2. Network partition
  3. Disk/memory resource exhaustion
  4. CNI plugin issue
  5. Certificate expiry

Diagnosis Steps:

# Step 1: Check Node conditions
kubectl describe node <node-name>
# Look for: DiskPressure, MemoryPressure, PIDPressure, NetworkUnavailable

# Step 2: SSH to node and check kubelet
systemctl status kubelet
journalctl -u kubelet -f --since "10 minutes ago"

# Step 3: Check resource usage
df -h        # Disk usage
free -m      # Memory usage
df -i        # Inode usage (often overlooked)

# Step 4: Check CNI
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium"
systemctl status containerd
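As a quick node-side heuristic for Step 3, root-filesystem usage can be compared against the kubelet's default hard-eviction threshold (DiskPressure is set when nodefs.available drops below 10%); the 90%-used cutoff below mirrors that default:

```shell
# Warn when root-filesystem usage reaches the level at which the kubelet's
# default nodefs eviction threshold (<10% available) would set DiskPressure.
used_pct=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
if [ "$used_pct" -ge 90 ]; then
  echo "likely DiskPressure: ${used_pct}% used on /"
else
  echo "disk OK: ${used_pct}% used on /"
fi
```

Remember to run the same check with `df -i` output: a node can report DiskPressure with plenty of free bytes if inodes are exhausted.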

Common Error Patterns and Fixes:

kubelet cannot start:

# Certificate issue
ls -la /etc/kubernetes/pki/
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Fix: Rotate certificates
kubeadm certs renew all
systemctl restart kubelet
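The openssl check above can be turned into an early-warning script with `-checkend`. The sketch below generates a throwaway 90-day certificate so it runs anywhere; on a real control-plane node you would point it at /etc/kubernetes/pki/apiserver.crt instead:

```shell
# Fail early when a certificate will expire within 30 days. A throwaway
# self-signed cert stands in for the real apiserver certificate.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 90 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null

# -checkend N exits 0 only if the cert is still valid N seconds from now
if openssl x509 -in "$tmp/cert.pem" -noout \
     -checkend $(( 30 * 24 * 3600 )) >/dev/null; then
  status="ok"
  echo "OK: certificate valid for at least 30 more days"
else
  status="expiring"
  echo "WARNING: certificate expires within 30 days; renew and restart kubelet"
fi
rm -rf "$tmp"
```

Running this from cron or a node health check catches the expiry before kubelet starts failing.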

Disk pressure:

# Clean up container images
crictl rmi --prune

# Remove unused Docker images
docker system prune -af

# Clean up log files
find /var/log -name "*.log" -mtime +7 -delete
journalctl --vacuum-size=500M

CNI issue:

# Restart CNI pods
kubectl delete pod -n kube-system -l k8s-app=calico-node

# Reinitialize CNI
rm -rf /etc/cni/net.d/*
systemctl restart kubelet

Recovery Procedure:

  1. systemctl restart kubelet — attempt restart
  2. Clean up disk space (logs, images)
  3. Restart CNI Pods
  4. If unrecoverable — replace the node

Pod rescheduling: Pods on an unreachable node are evicted and rescheduled to other nodes after the eviction period (5 minutes by default, via the 300s tolerationSeconds for the node.kubernetes.io/unreachable taint; formerly the --pod-eviction-timeout flag).


Q35. How do you resolve a Persistent Volume stuck in Terminating state?

PV/PVC stuck in Terminating state is caused by Finalizers. Finalizers ensure cleanup work before resource deletion.

Diagnosis:

# Check PVC finalizers
kubectl get pvc <pvc-name> -o jsonpath='{.metadata.finalizers}'
# Output: ["kubernetes.io/pvc-protection"]

# Check PV finalizers
kubectl get pv <pv-name> -o jsonpath='{.metadata.finalizers}'
# Output: ["kubernetes.io/pv-protection"]

Solution:

# 1. Check if Pod is using the PVC
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="<pvc-name>") | .metadata.name'

# 2. Delete the Pod first if it exists
kubectl delete pod <pod-name>

# 3. Remove PVC Finalizer
kubectl patch pvc <pvc-name> -p '{"metadata":{"finalizers":null}}'

# 4. Remove PV Finalizer
kubectl patch pv <pv-name> -p '{"metadata":{"finalizers":null}}'

AWS EBS-specific handling:

# Check if volume is detached in AWS Console
aws ec2 describe-volumes --volume-ids <volume-id>

# If still attached, force detach
aws ec2 detach-volume --volume-id <volume-id> --force

Reclaim Policy considerations:

  • Delete: the PV and the underlying storage are deleted automatically
  • Retain: the PV remains in Released state; manual cleanup is required
  • Recycle (deprecated): scrubs the data and makes the PV available again

# Reuse a PV with Retain policy
kubectl patch pv <pv-name> -p '{"spec":{"claimRef":null}}'

WaitForFirstConsumer:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
volumeBindingMode: WaitForFirstConsumer  # PVC stays Pending until Pod is scheduled

When using this mode, a PVC without a Pod will remain in Pending state — this is expected behavior.

Key Terms: CPU Throttling, CFS Quota, ndots, CoreDNS, ImagePullBackOff, Node NotReady, Finalizer — refer to the Troubleshooting glossary section for detailed explanations.


