Commit eb16f91 — feat: Adding PodMonitor to OTEL and OBI (#662)

1 parent: 970437d

4 files changed: +260 −4 lines


charts/kvisor/reliability-stack-installation.md — 118 additions, 1 deletion
```diff
@@ -172,6 +172,19 @@ agent:
       OTEL_EBPF_SKIP_GO_SPECIFIC_TRACERS: "true" # Skip expensive Go uprobe attachment
       OTEL_EBPF_BPF_HIGH_REQUEST_VOLUME: "true" # Ring-buffer mode for high-throughput nodes
 
+    # OBI-specific settings
+    obi:
+      # Internal metrics — exposes OBI's own health via Prometheus endpoint
+      internalMetrics:
+        enabled: false
+        port: 6061                # HTTP port for internal metrics
+        path: "/internal/metrics" # Scrape path
+        podMonitor:
+          enabled: false
+          labels: {}
+          interval: 30s
+          scrapeTimeout: 10s
+
     # OTel Collector sidecar (agent)
     collector:
       enabled: true
```
```diff
@@ -187,6 +200,12 @@ agent:
       clickhouseExporter:
         enabled: true
         address: "tcp://castai-kvisor-clickhouse.castai-agent.svc.cluster.local:9000"
+      # PodMonitor for Prometheus Operator (scrapes collector self-metrics on port 8888)
+      podMonitor:
+        enabled: false
+        labels: {} # Extra labels for Prometheus Operator selector filtering
+        interval: 30s
+        scrapeTimeout: 10s
 
 controller:
   reliabilityMetrics:
```
```diff
@@ -200,6 +219,12 @@ controller:
       limits:
         memory: 512Mi
     prometheusPort: 9401
+    # PodMonitor for Prometheus Operator (scrapes collector self-metrics on port 8889)
+    podMonitor:
+      enabled: false
+      labels: {}
+      interval: 30s
+      scrapeTimeout: 10s
 
 # Subchart (reliability-metrics-ch-exporter)
 reliabilityMetrics:
```
```diff
@@ -252,13 +277,20 @@ reliabilityMetrics:
     grpcAddr: "" # Defaults to reliabilityMetrics.castai.grpcAddr if empty
   image:
     repository: ghcr.io/castai/kvisor/reliability-metrics-ch-exporter
-    tag: "v0.3.6"
+    tag: "v0.3.7"
   resources:
     requests:
       cpu: 50m
       memory: 64Mi
     limits:
       memory: 128Mi
+  # PodMonitor for Prometheus Operator (scrapes exporter metrics on port 8080)
+  podMonitor:
+    enabled: false
+    labels: {}
+    selectorLabels: {} # Override auto-detected pod selector
+    interval: 30s
+    scrapeTimeout: 10s
 
 # External ClickHouse (alternative to install.enabled)
 external:
```
````diff
@@ -418,6 +450,91 @@ Approximate per-component resource consumption:
 
 For clusters with 30+ nodes or high-cardinality workloads, consider increasing the agent OTel Collector memory limit above 256 MiB.
 
+## Monitoring with Prometheus Operator
+
+If your cluster runs [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), you can create PodMonitor resources to scrape the reliability metrics components automatically.
+
+### Available PodMonitors
+
+| Component | Values path | Metrics port | Key metrics |
+|-----------|-------------|--------------|-------------|
+| Agent OTel Collector | `agent.reliabilityMetrics.collector.podMonitor` | 8888 | `otelcol_receiver_accepted_metric_points`, `otelcol_exporter_sent_metric_points`, `otelcol_processor_dropped_metric_points`, queue sizes |
+| OBI (eBPF instrumenter) | `agent.reliabilityMetrics.obi.internalMetrics` | 6061 | Instrumented process count, eBPF map usage, Go runtime stats |
+| Controller OTel Collector | `controller.reliabilityMetrics.collector.podMonitor` | 8889 | Same as agent collector (k8s_cluster receiver pipeline) |
+| ch-exporter | `reliabilityMetrics.exporter.podMonitor` | 8080 | Export throughput, ClickHouse query latency, gRPC send errors |
+
+**Note:** OBI internal metrics require two enable flags: `agent.reliabilityMetrics.obi.internalMetrics.enabled` (exposes the `/internal/metrics` endpoint) and `agent.reliabilityMetrics.obi.internalMetrics.podMonitor.enabled` (creates the PodMonitor).
+
+### Enable All PodMonitors
+
+Add these to your values file to enable scraping of all components:
+
+```yaml
+agent:
+  reliabilityMetrics:
+    # OBI settings
+    obi:
+      internalMetrics:
+        enabled: true
+        podMonitor:
+          enabled: true
+          labels:
+            release: prometheus
+    # Agent OTel Collector
+    collector:
+      podMonitor:
+        enabled: true
+        labels:
+          release: prometheus # Match your Prometheus Operator's podMonitorSelector
+
+controller:
+  reliabilityMetrics:
+    collector:
+      podMonitor:
+        enabled: true
+        labels:
+          release: prometheus
+
+reliabilityMetrics:
+  exporter:
+    podMonitor:
+      enabled: true
+      labels:
+        release: prometheus
+```
+
+Or via `--set` flags:
+
+```bash
+helm upgrade castai-kvisor castai-helm/castai-kvisor \
+  -n castai-agent \
+  --reset-then-reuse-values \
+  --set agent.reliabilityMetrics.obi.internalMetrics.enabled=true \
+  --set agent.reliabilityMetrics.obi.internalMetrics.podMonitor.enabled=true \
+  --set agent.reliabilityMetrics.collector.podMonitor.enabled=true \
+  --set controller.reliabilityMetrics.collector.podMonitor.enabled=true \
+  --set reliabilityMetrics.exporter.podMonitor.enabled=true
+```
+
+### Prometheus Operator Label Matching
+
+Prometheus Operator uses label selectors to decide which PodMonitors to pick up. If your Prometheus is configured with a `podMonitorSelector` (e.g., `release: prometheus`), add matching labels:
+
+```yaml
+podMonitor:
+  enabled: true
+  labels:
+    release: prometheus
+```
+
+To check what selector your Prometheus uses:
+
+```bash
+kubectl get prometheus -A -o jsonpath='{.items[*].spec.podMonitorSelector}'
+```
+
+An empty `podMonitorSelector` means Prometheus picks up all PodMonitors in its namespace.
+
 ## Troubleshooting
 
 ### OBI: "data refused due to high memory usage"
````
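The matchLabels rule described in the new doc section above can be sketched in a few lines; this is an illustrative model of Prometheus Operator's selector semantics, not CAST AI or operator code:

```python
# Illustrative sketch: a selector's matchLabels is a subset test — every
# key/value pair in the selector must appear on the object's labels.

def selector_matches(match_labels: dict, object_labels: dict) -> bool:
    """matchLabels semantics: all selector pairs must be present and equal."""
    return all(object_labels.get(k) == v for k, v in match_labels.items())

# Prometheus configured with podMonitorSelector: {release: prometheus}
selector = {"release": "prometheus"}

# Picked up: the PodMonitor carries the matching release label.
selector_matches(selector, {"release": "prometheus",
                            "app.kubernetes.io/component": "otel-collector"})

# Ignored: no release label on the PodMonitor, so it never appears as a target.
selector_matches(selector, {"app.kubernetes.io/component": "otel-collector"})

# An empty selector matches everything (all() over an empty set is True),
# which is why an empty podMonitorSelector picks up all PodMonitors.
selector_matches({}, {"any": "labels"})
```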

charts/kvisor/templates/agent.yaml — 16 additions, 0 deletions
Original file line numberDiff line numberDiff line change
```diff
@@ -189,10 +189,26 @@ spec:
               valueFrom:
                 fieldRef:
                   fieldPath: spec.nodeName
+          {{- if (dig "reliabilityMetrics" "obi" "internalMetrics" "enabled" false .Values.agent) }}
+          - name: OTEL_EBPF_INTERNAL_METRICS_EXPORTER
+            value: "prometheus"
+          - name: OTEL_EBPF_INTERNAL_METRICS_PROMETHEUS_PORT
+            value: {{ .Values.agent.reliabilityMetrics.obi.internalMetrics.port | default 6061 | quote }}
+          {{- with (dig "reliabilityMetrics" "obi" "internalMetrics" "path" "" .Values.agent) }}
+          - name: OTEL_EBPF_INTERNAL_METRICS_PROMETHEUS_PATH
+            value: {{ . | quote }}
+          {{- end }}
+          {{- end }}
           {{- range $k, $v := .Values.agent.reliabilityMetrics.env }}
           - name: {{ $k }}
             value: "{{ $v }}"
           {{- end }}
+        {{- if (dig "reliabilityMetrics" "obi" "internalMetrics" "enabled" false .Values.agent) }}
+        ports:
+          - containerPort: {{ .Values.agent.reliabilityMetrics.obi.internalMetrics.port | default 6061 }}
+            name: obi-metrics
+            protocol: TCP
+        {{- end }}
        volumeMounts:
          - name: var-run-obi
            mountPath: /var/run/beyla
```
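The template above gates the OBI container's env vars on the `internalMetrics` values, emitting the path variable only when the path is non-empty (the `{{- with }}` guard). A Python sketch of that gating — the helper name is hypothetical and this mirrors the template logic, it is not chart code:

```python
# Illustrative sketch of the env-var gating in agent.yaml: disabled means
# no OTEL_EBPF_INTERNAL_METRICS_* vars at all; the PATH var is emitted only
# when a non-empty path is configured (matching the {{- with }} guard).

def obi_internal_metrics_env(agent_values: dict) -> dict:
    im = (agent_values.get("reliabilityMetrics", {})
                      .get("obi", {})
                      .get("internalMetrics", {}))
    if not im.get("enabled", False):
        return {}  # feature off: container env is untouched
    env = {
        "OTEL_EBPF_INTERNAL_METRICS_EXPORTER": "prometheus",
        "OTEL_EBPF_INTERNAL_METRICS_PROMETHEUS_PORT": str(im.get("port", 6061)),
    }
    if im.get("path"):  # empty path → var omitted, like {{- with ... }}
        env["OTEL_EBPF_INTERNAL_METRICS_PROMETHEUS_PATH"] = im["path"]
    return env

env = obi_internal_metrics_env({"reliabilityMetrics": {"obi": {"internalMetrics": {
    "enabled": True, "port": 6061, "path": "/internal/metrics"}}}})
```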
New file — PodMonitor templates (94 additions, 0 deletions)

```diff
@@ -0,0 +1,94 @@
+{{/*
+PodMonitors for the reliability metrics stack components.
+
+1. Agent OTel Collector — self-metrics (port 8888):
+   otelcol_receiver_accepted_metric_points, otelcol_exporter_sent_metric_points,
+   otelcol_processor_dropped_metric_points, queue sizes, Go runtime metrics
+2. OBI (eBPF instrumentation) — internal metrics (port 6061 default):
+   Instrumented process count, eBPF map usage, Go runtime stats
+3. Controller OTel Collector — self-metrics (port 8889):
+   Same as agent collector for the k8s_cluster receiver pipeline
+*/}}
+{{- if and (dig "reliabilityMetrics" "enabled" false .Values.agent) (dig "reliabilityMetrics" "collector" "enabled" false .Values.agent) (dig "reliabilityMetrics" "collector" "podMonitor" "enabled" false .Values.agent) }}
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: {{ include "kvisor.agent.fullname" . }}-otel-collector
+  namespace: {{ .Release.Namespace }}
+  labels:
+    {{- include "kvisor.labels" . | nindent 4 }}
+    app.kubernetes.io/component: otel-collector
+    {{- with .Values.agent.reliabilityMetrics.collector.podMonitor.labels }}
+    {{- toYaml . | nindent 4 }}
+    {{- end }}
+spec:
+  namespaceSelector:
+    matchNames:
+      - {{ .Release.Namespace }}
+  podMetricsEndpoints:
+    - port: otel-metrics
+      path: /metrics
+      scheme: http
+      honorLabels: true
+      interval: {{ .Values.agent.reliabilityMetrics.collector.podMonitor.interval | default "30s" }}
+      scrapeTimeout: {{ .Values.agent.reliabilityMetrics.collector.podMonitor.scrapeTimeout | default "10s" }}
+  selector:
+    matchLabels:
+      {{- include "kvisor.agent.selectorLabels" . | nindent 6 }}
+{{- end }}
+---
+{{- if and (dig "reliabilityMetrics" "enabled" false .Values.agent) (dig "reliabilityMetrics" "obi" "internalMetrics" "enabled" false .Values.agent) (dig "reliabilityMetrics" "obi" "internalMetrics" "podMonitor" "enabled" false .Values.agent) }}
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: {{ include "kvisor.agent.fullname" . }}-obi
+  namespace: {{ .Release.Namespace }}
+  labels:
+    {{- include "kvisor.labels" . | nindent 4 }}
+    app.kubernetes.io/component: obi
+    {{- with .Values.agent.reliabilityMetrics.obi.internalMetrics.podMonitor.labels }}
+    {{- toYaml . | nindent 4 }}
+    {{- end }}
+spec:
+  namespaceSelector:
+    matchNames:
+      - {{ .Release.Namespace }}
+  podMetricsEndpoints:
+    - port: obi-metrics
+      path: {{ .Values.agent.reliabilityMetrics.obi.internalMetrics.path | default "/internal/metrics" }}
+      scheme: http
+      honorLabels: true
+      interval: {{ .Values.agent.reliabilityMetrics.obi.internalMetrics.podMonitor.interval | default "30s" }}
+      scrapeTimeout: {{ .Values.agent.reliabilityMetrics.obi.internalMetrics.podMonitor.scrapeTimeout | default "10s" }}
+  selector:
+    matchLabels:
+      {{- include "kvisor.agent.selectorLabels" . | nindent 6 }}
+{{- end }}
+---
+{{- if and (dig "reliabilityMetrics" "enabled" false .Values.controller) (dig "reliabilityMetrics" "collector" "enabled" false .Values.controller) (dig "reliabilityMetrics" "collector" "podMonitor" "enabled" false .Values.controller) }}
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: {{ include "kvisor.controller.fullname" . }}-otel-collector
+  namespace: {{ .Release.Namespace }}
+  labels:
+    {{- include "kvisor.labels" . | nindent 4 }}
+    app.kubernetes.io/component: otel-collector
+    {{- with .Values.controller.reliabilityMetrics.collector.podMonitor.labels }}
+    {{- toYaml . | nindent 4 }}
+    {{- end }}
+spec:
+  namespaceSelector:
+    matchNames:
+      - {{ .Release.Namespace }}
+  podMetricsEndpoints:
+    - port: k8s-metrics
+      path: /metrics
+      scheme: http
+      honorLabels: true
+      interval: {{ .Values.controller.reliabilityMetrics.collector.podMonitor.interval | default "30s" }}
+      scrapeTimeout: {{ .Values.controller.reliabilityMetrics.collector.podMonitor.scrapeTimeout | default "10s" }}
+  selector:
+    matchLabels:
+      {{- include "kvisor.controller.selectorLabels" . | nindent 6 }}
+{{- end }}
```
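Every guard in this template reaches through several optional map levels with Sprig's `dig`, so a values file that omits the whole `podMonitor` (or `obi`) subtree renders cleanly instead of failing on a nil lookup. A rough Python analogue of that lookup (illustrative only; note Sprig's actual argument order places the default before the dict):

```python
# Illustrative sketch of safe nested lookup a la Sprig's `dig`: walk the
# keys, and return the default as soon as any intermediate level is
# missing or not a map — never raising.

def dig(values: dict, *keys, default):
    node = values
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

agent_values = {"reliabilityMetrics": {"obi": {"internalMetrics": {"enabled": True}}}}

# Full path present → the stored value.
dig(agent_values, "reliabilityMetrics", "obi", "internalMetrics", "enabled", default=False)

# Entire subtree absent → the default, not an error. This is why the
# PodMonitors above simply don't render for users on older values files.
dig({}, "reliabilityMetrics", "obi", "internalMetrics", "enabled", default=False)
```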

charts/kvisor/values.yaml — 32 additions, 3 deletions
```diff
@@ -226,6 +226,21 @@ agent:
       OTEL_EBPF_BPF_HIGH_REQUEST_VOLUME: "true"
     # Container security context override (if empty, uses unprivileged defaults with fine-grained capabilities).
     containerSecurityContext: {}
+    # --- OBI-specific settings ---
+    obi:
+      # Internal metrics — exposes OBI's own health via a Prometheus endpoint.
+      # Metrics include: instrumented process count, eBPF map usage, Go runtime stats.
+      # Disabled by default; enable with internalMetrics.enabled.
+      internalMetrics:
+        enabled: false
+        port: 6061
+        path: "/internal/metrics"
+        # PodMonitor for Prometheus Operator — scrapes OBI internal metrics.
+        podMonitor:
+          enabled: false
+          labels: {}
+          interval: 30s
+          scrapeTimeout: 10s
 
     # --- OTel Collector sidecar settings ---
     # Receives OTLP from OBI, applies golden signal filtering, cardinality control,
```
```diff
@@ -248,6 +263,15 @@ agent:
       clickhouseExporter:
         enabled: true
         address: "tcp://castai-kvisor-clickhouse.castai-agent.svc.cluster.local:9000"
+      # PodMonitor for Prometheus Operator — scrapes collector self-metrics (port 8888).
+      # Provides: otelcol_receiver_accepted_metric_points, otelcol_exporter_sent_metric_points,
+      # otelcol_processor_dropped_metric_points, queue sizes, Go runtime metrics.
+      podMonitor:
+        enabled: false
+        # Extra labels on the PodMonitor metadata (for Prometheus Operator selector filtering).
+        labels: {}
+        interval: 30s
+        scrapeTimeout: 10s
 
 controller:
   enabled: true
```
```diff
@@ -395,8 +419,13 @@ controller:
     address: "tcp://castai-kvisor-clickhouse.castai-agent.svc.cluster.local:9000"
     # Port for the collector's Prometheus exporter (different from agent collector's 9400).
     prometheusPort: 9401
-    # Labels to add to the PodMonitor (e.g., for Prometheus Operator selector filtering).
-    podMonitorLabels: {}
+    # PodMonitor for Prometheus Operator — scrapes collector self-metrics (port 8889).
+    podMonitor:
+      enabled: false
+      # Extra labels on the PodMonitor metadata (for Prometheus Operator selector filtering).
+      labels: {}
+      interval: 30s
+      scrapeTimeout: 10s
 
   eventGenerator:
     enabled: false
```
```diff
@@ -476,4 +505,4 @@ reliabilityMetrics:
   exporter:
     image:
       repository: us-docker.pkg.dev/castai-hub/library/reliability-metrics-ch-exporter
-      tag: "v0.3.6"
+      tag: "v0.3.7"
```
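The `podMonitor.labels` maps introduced across these values files are rendered after the chart's common labels and the component label in each PodMonitor's metadata. A small sketch of that precedence, assuming distinct keys (illustrative only; the chart's real common labels come from the `kvisor.labels` helper, and the names below are hypothetical):

```python
# Illustrative sketch: the metadata.labels block is built as
#   chart common labels → component label → user podMonitor.labels,
# modeled here as successive dict updates (later entries win).

def rendered_labels(chart_labels: dict, component: str, extra: dict) -> dict:
    labels = dict(chart_labels)                        # chart common labels first
    labels["app.kubernetes.io/component"] = component  # fixed component label
    labels.update(extra)                               # user podMonitor.labels last
    return labels

labels = rendered_labels(
    {"app.kubernetes.io/name": "castai-kvisor"},  # hypothetical common label
    "otel-collector",
    {"release": "prometheus"},  # what Prometheus's podMonitorSelector matches on
)
```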
