EN_K8s_Monitoring
Prometheus dynamically discovers targets through Kubernetes Service Discovery.
- kubernetes_sd_configs: Automatically discovers Pods, Services, Endpoints, Nodes, and Ingresses.
- relabel_configs: Transforms or filters labels before metrics collection.
```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape Pods with the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom metrics path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port from annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
```

Best practices:
- Drop unnecessary metrics with `action: drop` to reduce cardinality
- Auto-scrape based on Pod annotations
- Filter by namespace for multi-tenancy support
- Use `metric_relabel_configs` to drop high-cardinality labels post-collection
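As a sketch of that last practice, a `metric_relabel_configs` block that drops a noisy series and strips a high-cardinality label could look like this (the metric and label names are illustrative, not from the original setup):

```yaml
# Applied after scraping, before ingestion (metric/label names are illustrative)
metric_relabel_configs:
  # Drop a high-volume histogram series entirely
  - source_labels: [__name__]
    action: drop
    regex: apiserver_request_duration_seconds_bucket
  # Remove a high-cardinality label such as a request ID
  - action: labeldrop
    regex: request_id
```

Unlike `relabel_configs`, which runs before the scrape, these rules run on the scraped samples, so they can target metric names and label values.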
OpenTelemetry Collector is a vendor-neutral agent that collects, processes, and exports traces, metrics, and logs.
- Receivers: Accept OTLP, Jaeger, Zipkin protocols
- Processors: Batch processing, attribute enrichment, sampling
- Exporters: Forward to Jaeger, Tempo, DataDog, Elastic APM
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 500}

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheus]
```

Deployment patterns:
- Agent (DaemonSet): Deploy on each node — Application → Agent → Collector
- Gateway (Deployment): Centralized deployment — Multiple Agents → Gateway → Backend
- Head-based sampling: Decide at trace start (reduces total traffic but may miss important traces)
- Tail-based sampling: Decide after trace completes (captures errors and slow traces, higher memory usage)
Context Propagation: Use W3C Trace Context standard to link traces across services.
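In practice, context propagation is just an HTTP header passed between services. A minimal sketch that splits a W3C `traceparent` header into its four fields (the sample value is the illustrative one from the W3C spec):

```shell
# W3C traceparent format: version-trace_id-span_id-flags
tp='00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
version=$(echo "$tp" | cut -d- -f1)
trace_id=$(echo "$tp" | cut -d- -f2)
span_id=$(echo "$tp" | cut -d- -f3)
flags=$(echo "$tp" | cut -d- -f4)
echo "trace_id=$trace_id span_id=$span_id flags=$flags"
```

Each downstream service reuses `trace_id` and generates a new `span_id`, which is what links the spans of one request into a single trace.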
Kubernetes Events are retained for only 1 hour by default, so a long-term retention strategy is required.
Event Exporter + Elasticsearch/Loki:
```yaml
# Bitnami Event Exporter deployment
# (selector/labels added so the manifest validates; the original snippet omitted them)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  selector:
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      containers:
        - name: event-exporter
          image: bitnami/kubernetes-event-exporter:latest
          args:
            - -conf=/data/config.yaml
```

- Event Exporter watches Events and forwards to logging system
- Index with Elasticsearch and visualize with Kibana
- Or use Grafana Loki + LogQL for queries
Kubernetes Event Exporter → S3:
- Store Events as JSON to S3 via Fluentd/Fluent Bit
- Query with Athena — cost-effective
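A sketch of the Fluent Bit side, using its S3 output plugin in the YAML config format (the bucket name, region, and log path are assumptions, not from the original setup):

```yaml
# Fluent Bit pipeline: tail the event-exporter logs and ship them to S3 as JSON
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/event-exporter-*.log   # hypothetical path
  outputs:
    - name: s3
      match: '*'
      bucket: my-k8s-events        # hypothetical bucket
      region: us-east-1
      total_file_size: 50M
      use_put_object: On
```

Athena can then query the resulting JSON objects directly, which keeps long-term Event retention cheap.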
Convert to Prometheus Metrics:
- kube-state-metrics exposes Events as metrics
- Alert on important events with AlertManager
- Collect only Warning/Error events separately
- Analyze patterns: Pod `OOMKilled`, `ImagePullBackOff`, `NodeNotReady`
- Set up alerts for critical events to trigger PagerDuty/Slack notifications
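As a sketch of the metrics-based approach: kube-state-metrics exposes `kube_pod_container_status_last_terminated_reason`, which records why a container last died, so a Prometheus alerting rule for OOM kills could look like this (the rule and severity names are illustrative):

```yaml
groups:
  - name: kubernetes-events        # illustrative group name
    rules:
      - alert: PodOOMKilled
        # kube-state-metrics sets this series to 1 for the last termination reason
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.container }} in Pod {{ $labels.pod }} was OOMKilled"
```

AlertManager routing can then send this alert to PagerDuty or Slack.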
```shell
# Recent Warning events
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'
# Events for a specific Pod
kubectl get events --field-selector involvedObject.name=<pod-name>
# Monitor events in real-time
kubectl get events -w
```

Kubernetes has three metrics systems, each serving a different purpose.
| Feature | Details |
|---|---|
| Purpose | Lightweight resource metrics collection |
| Source | kubelet (CPU/memory only) |
| Commands | kubectl top nodes/pods |
| HPA use | Resource Metrics source |
| Retention | No history (15s intervals) |
```shell
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces
```

| Feature | Details |
|---|---|
| Purpose | Full-stack monitoring |
| Source | All metrics via scraping |
| Query | PromQL for complex queries |
| Retention | Long-term storage possible |
| HPA use | Custom Metrics source (via Prometheus Adapter) |
```yaml
# Prometheus Adapter configuration for HPA
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total"
      as: "${1}_per_second"
    metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
```

- Standard interface for HPA to use custom metrics
- Implementations: Prometheus Adapter, DataDog Cluster Agent
- Enables scaling based on application metrics (RPS, Queue Length)
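For illustration only, the rename that the adapter's `matches`/`as` rule performs is an ordinary regex substitution, reproduced here with `sed`:

```shell
# The adapter's name rule strips the _total suffix and appends _per_second,
# so the counter http_requests_total is exposed to HPA as a rate metric
renamed=$(echo 'http_requests_total' | sed -E 's/^(.*)_total/\1_per_second/')
echo "$renamed"
```

The renamed metric is what the HPA references, while `metricsQuery` supplies the actual `rate()` computation behind it.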
```yaml
# HPA with Custom Metrics
# (metadata and scaleTargetRef added so the manifest validates; names are illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```

Recommendation: Metrics Server for basic monitoring, Prometheus for production, Custom Metrics API for business metric-based scaling.
Manage Grafana Dashboards as code for version control and automation.
```shell
# Export existing dashboard
curl -H "Authorization: Bearer <token>" \
  http://grafana/api/dashboards/uid/<uid> | jq '.dashboard' > dashboard.json
```

Auto-provision via ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    { ... dashboard JSON ... }
```

Simple but repetitive for large numbers of dashboards.
```jsonnet
// dashboard.jsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Kubernetes Pod Metrics',
  tags=['kubernetes', 'pods'],
  refresh='30s',
)
.addPanel(
  graphPanel.new(
    'CPU Usage',
    datasource='Prometheus',
  )
  .addTarget(
    prometheus.target(
      'sum(rate(container_cpu_usage_seconds_total{pod=~"$pod"}[5m])) by (pod)',
      legendFormat='{{pod}}',
    )
  ),
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)
```

```shell
# Compile Jsonnet to JSON
jsonnet -J vendor dashboard.jsonnet -o dashboard.json
```

- Reuse templates across environments
- Type-safe with Grafonnet library
- Easily manage even 100+ dashboards
```hcl
resource "grafana_dashboard" "kubernetes" {
  config_json = file("${path.module}/dashboards/kubernetes.json")
  folder      = grafana_folder.kubernetes.id
}

resource "grafana_alert_rule" "high_cpu" {
  name       = "High CPU Usage"
  folder_uid = grafana_folder.kubernetes.uid
  # ...
}
```

Manage dashboards, data sources, and alerts together with infrastructure.
- Write template dashboards (Node, Pod, Ingress) in Jsonnet
- Inject environment-specific variables (dev/staging/prod)
- Auto-deploy via CI/CD pipeline
- Separate folders per team
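The steps above could be wired together as a CI workflow; here is a sketch as a GitHub Actions job (the job name, file paths, and the `env` ext-str variable are all assumptions, not part of the original setup):

```yaml
# Hypothetical CI job: compile per-environment dashboards and apply them as ConfigMaps
name: deploy-dashboards
on:
  push:
    paths: ['dashboards/**']
jobs:
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        env: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      # Inject the environment name into the Jsonnet template
      - run: jsonnet -J vendor --ext-str env=${{ matrix.env }} dashboards/pod.jsonnet -o out.json
      # Label grafana_dashboard: "1" would let a Grafana sidecar pick the ConfigMap up
      - run: |
          kubectl create configmap grafana-dashboards-${{ matrix.env }} \
            --from-file=out.json --dry-run=client -o yaml | kubectl apply -f -
```

Each environment then gets the same template with its own variables, which is the main payoff of the Jsonnet approach.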
Key Terms: Liveness Probe, Readiness Probe, Startup Probe, Prometheus, Metrics Server, Custom Metrics API, OpenTelemetry, Distributed Tracing — refer to the Monitoring & Observability glossary section for detailed explanations.