
EN_K8s_Monitoring

somaz edited this page Mar 31, 2026 · 1 revision

Kubernetes Monitoring & Observability

Monitoring & Observability (Q41-Q45)


Q41. How does Prometheus Service Discovery work, and how do you use relabel_configs effectively?

Prometheus dynamically discovers targets through Kubernetes Service Discovery.

Key SD Mechanisms:

  • kubernetes_sd_configs: Automatically discovers Pods, Services, Endpoints, Nodes, and Ingresses.
  • relabel_configs: Transforms or filters labels before metrics collection.

Practical Patterns:

scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Only scrape Pods with the annotation prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Use custom metrics path from annotation
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  # Use custom port from annotation
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  # Add namespace label
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace

Best practices:

  • Drop unnecessary metrics with action: drop to reduce cardinality
  • Auto-scrape based on Pod annotations
  • Filter by namespace for multi-tenancy support
  • Use metric_relabel_configs to drop high-cardinality labels post-collection
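As an illustration of the last two points, unwanted series and labels can be dropped after the scrape with `metric_relabel_configs` (the metric and label names below are examples, not a prescription):

```yaml
# metric_relabel_configs run AFTER the scrape, on the collected samples
metric_relabel_configs:
# Drop an entire noisy metric family (example metric name)
- source_labels: [__name__]
  action: drop
  regex: 'go_gc_duration_seconds.*'
# Remove a high-cardinality label while keeping the metric itself
- action: labeldrop
  regex: 'pod_template_hash'
```

Note that dropping at this stage still costs scrape bandwidth; filtering targets earlier with `relabel_configs` is cheaper when possible.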

Q42. What is the role of OpenTelemetry Collector in Distributed Tracing, and how is it configured?

OpenTelemetry Collector is a vendor-neutral agent that collects, processes, and exports traces, metrics, and logs.

Core Components:

  • Receivers: Accept OTLP, Jaeger, Zipkin protocols
  • Processors: Batch processing, attribute enrichment, sampling
  • Exporters: Forward to Jaeger, Tempo, DataDog, Elastic APM

Configuration Example:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  attributes:
    actions:
    - key: environment
      value: production
      action: insert
  tail_sampling:
    decision_wait: 10s
    policies:
    - name: errors-policy
      type: status_code
      status_code: {status_codes: [ERROR]}
    - name: slow-traces-policy
      type: latency
      latency: {threshold_ms: 500}

exporters:
  # NOTE: the dedicated jaeger exporter was removed from recent Collector
  # releases; newer setups export OTLP directly to Jaeger's OTLP endpoint.
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheus]

Deployment Patterns:

  • Agent (DaemonSet): Deploy on each node — Application → Agent → Collector
  • Gateway (Deployment): Centralized deployment — Multiple Agents → Gateway → Backend
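In the two-tier pattern, each node-local agent typically forwards everything to the gateway over OTLP. A minimal agent-side exporter sketch (the gateway Service DNS name is an assumption):

```yaml
# Agent-side config fragment: forward to the central gateway over OTLP gRPC
exporters:
  otlp:
    # hypothetical in-cluster Service name for the gateway Deployment
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true
```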

Sampling Strategies:

  • Head-based sampling: Decide at trace start (reduces total traffic but may miss important traces)
  • Tail-based sampling: Decide after trace completes (captures errors and slow traces, higher memory usage)
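The tail-based decision can be sketched in a few lines: buffer every span of a trace, then after the decision window keep the trace if any span errored or it was slow. This is an illustrative sketch of the concept, not the Collector's actual implementation; the names and the 500 ms threshold mirror the policies in the config above.

```python
# Illustrative tail-based sampling decision (concept sketch, not Collector code)
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool

def tail_sample(spans: list[Span], latency_threshold_ms: float = 500) -> bool:
    """Decide keep/drop only after the whole trace has been buffered."""
    if any(s.is_error for s in spans):
        return True  # errors-policy: always keep failed traces
    # slow-traces-policy: keep if the slowest span exceeds the threshold
    return max(s.duration_ms for s in spans) >= latency_threshold_ms

fast_ok = [Span("t1", 20, False), Span("t1", 35, False)]
slow = [Span("t2", 800, False)]
errored = [Span("t3", 10, True)]
print(tail_sample(fast_ok), tail_sample(slow), tail_sample(errored))
# → False True True
```

The buffering is exactly why tail-based sampling costs more memory: every span must be held until the `decision_wait` window closes.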

Context Propagation: Use W3C Trace Context standard to link traces across services.
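Concretely, W3C Trace Context carries the trace through the `traceparent` HTTP header (example values taken from the spec):

```
# format: version-traceid-parentid-flags
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```

Each service extracts the trace ID and parent span ID from this header, creates its own spans under them, and re-injects the header into outbound requests.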


Q43. How do you retain and analyze Kubernetes Events long term?

Kubernetes Events are retained for only 1 hour by default, so a long-term retention strategy is required.

Option Comparison:

Event Exporter + Elasticsearch/Loki:

# Bitnami Event Exporter deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  selector:            # required by apps/v1 Deployments
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      containers:
      - name: event-exporter
        image: bitnami/kubernetes-event-exporter:latest
        args:
        - -conf=/data/config.yaml

  • Event Exporter watches Events and forwards them to a logging system
  • Index with Elasticsearch and visualize with Kibana
  • Or use Grafana Loki + LogQL for queries
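With Loki, exported Events can then be filtered with LogQL. A sketch, assuming the exporter's stream carries an `app="event-exporter"` label and JSON payloads (both label and field names depend on your exporter config):

```
# Warning events, reduced to reason and message (label/field names are assumptions)
{app="event-exporter"} | json | type="Warning" | line_format "{{.reason}} {{.message}}"
```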

Kubernetes Event Exporter → S3:

  • Store Events as JSON to S3 via Fluentd/Fluent Bit
  • Query with Athena — cost-effective
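A minimal Fluent Bit output sketch for the S3 path, using Fluent Bit's built-in `s3` output plugin (the bucket, region, and tag match pattern are placeholders):

```
# fluent-bit.conf fragment — ship Event logs to S3 as JSON objects
[OUTPUT]
    Name              s3
    Match             k8s-events.*
    bucket            my-k8s-events-bucket
    region            us-east-1
    total_file_size   50M
    use_put_object    On
```

Once the objects land in S3, an Athena table over the JSON files makes the events queryable with plain SQL.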

Convert to Prometheus Metrics:

  • An event exporter can expose Event counts as Prometheus metrics; kube-state-metrics also exposes related object state (e.g., container termination reasons, restart counts)
  • Alert on important events with Alertmanager
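For example, kube-state-metrics exposes each container's last termination reason, which can drive an alert on OOM kills without event storage at all (the rule name, duration, and severity label are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: event-alerts   # illustrative name
spec:
  groups:
  - name: pod-events
    rules:
    - alert: PodOOMKilled
      # kube-state-metrics: last terminated reason per container
      expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      for: 5m
      labels:
        severity: critical
```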

Practical Approach:

  • Collect only Warning/Error events separately
  • Analyze patterns: Pod OOMKilled, ImagePullBackOff, NodeNotReady
  • Set up alerts for critical events to trigger PagerDuty/Slack notifications

Useful Queries:

# Recent Warning events
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'

# Events for a specific Pod
kubectl get events --field-selector involvedObject.name=<pod-name>

# Monitor events in real-time
kubectl get events -w

Q44. What are the differences between Metrics Server, Prometheus, and Custom Metrics API?

Kubernetes has three metrics systems, each serving a different purpose.

Metrics Server:

| Feature   | Details                                 |
|-----------|-----------------------------------------|
| Purpose   | Lightweight resource metrics collection |
| Source    | kubelet (CPU/memory only)               |
| Commands  | `kubectl top nodes/pods`                |
| HPA use   | Resource Metrics source                 |
| Retention | No history (15s intervals)              |

# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces

Prometheus:

| Feature   | Details                                           |
|-----------|---------------------------------------------------|
| Purpose   | Full-stack monitoring                             |
| Source    | All metrics via scraping                          |
| Query     | PromQL for complex queries                        |
| Retention | Long-term storage possible                        |
| HPA use   | Custom Metrics source (via Prometheus Adapter)    |

# Prometheus Adapter configuration for HPA
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total"
    as: "${1}_per_second"
  metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'

Custom Metrics API:

  • Standard interface for HPA to use custom metrics
  • Implementations: Prometheus Adapter, DataDog Cluster Agent
  • Enables scaling based on application metrics (RPS, Queue Length)

# HPA with Custom Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa     # illustrative name
spec:
  scaleTargetRef:      # required: the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: my-app       # illustrative target
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

Recommendation: Metrics Server for basic monitoring, Prometheus for production, Custom Metrics API for business metric-based scaling.


Q45. What is the Dashboard as Code strategy with Grafana and Jsonnet?

Manage Grafana Dashboards as code for version control and automation.

Approach 1 — JSON Export/Import:

# Export existing dashboard
curl -H "Authorization: Bearer <token>" \
  http://grafana/api/dashboards/uid/<uid> | jq '.dashboard' > dashboard.json

# Auto-provision via ConfigMap (the grafana_dashboard label is picked up by the dashboard sidecar)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    { ... dashboard JSON ... }

Simple but repetitive for large numbers of dashboards.

Approach 2 — Jsonnet + Grafonnet:

// dashboard.jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Kubernetes Pod Metrics',
  tags=['kubernetes', 'pods'],
  refresh='30s',
)
.addPanel(
  graphPanel.new(
    'CPU Usage',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(container_cpu_usage_seconds_total{pod=~"$pod"}[5m])) by (pod)',
      legendFormat='{{pod}}',
    )
  ),
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)

# Compile Jsonnet to JSON
jsonnet -J vendor dashboard.jsonnet -o dashboard.json

  • Reuse templates across environments
  • Type-safe with the Grafonnet library
  • Easily manage even 100+ dashboards

Approach 3 — Terraform Grafana Provider:

resource "grafana_dashboard" "kubernetes" {
  config_json = file("${path.module}/dashboards/kubernetes.json")
  folder      = grafana_folder.kubernetes.id
}

resource "grafana_rule_group" "high_cpu" {
  name       = "High CPU Usage"
  folder_uid = grafana_folder.kubernetes.uid
  # ...
}

Manage dashboards, data sources, and alerts together with infrastructure.

Practical Pattern:

  1. Write template dashboards (Node, Pod, Ingress) in Jsonnet
  2. Inject environment-specific variables (dev/staging/prod)
  3. Auto-deploy via CI/CD pipeline
  4. Separate folders per team
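Step 3 might look like the following in CI, assuming a GitHub Actions pipeline, a `dashboards/` directory of Jsonnet sources, and cluster credentials already configured (all of these are assumptions about the repo layout):

```yaml
# .github/workflows/dashboards.yaml — compile Jsonnet and apply as ConfigMaps (sketch)
jobs:
  deploy-dashboards:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Compile dashboards
      run: |
        mkdir -p out
        for f in dashboards/*.jsonnet; do
          jsonnet -J vendor "$f" -o "out/$(basename "${f%.jsonnet}").json"
        done
    - name: Apply to cluster
      run: |
        kubectl create configmap grafana-dashboards \
          --from-file=out/ --dry-run=client -o yaml | kubectl apply -f -
```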

Key Terms: Liveness Probe, Readiness Probe, Startup Probe, Prometheus, Metrics Server, Custom Metrics API, OpenTelemetry, Distributed Tracing — refer to the Monitoring & Observability glossary section for detailed explanations.


