
Commit 600988a

csviri and Copilot committed
improve: health probes showcase & docs to operations (#3291)
Attila Mészáros <a_meszaros@apple.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent c5daaf5 commit 600988a

28 files changed

Lines changed: 550 additions & 229 deletions

File tree

.github/workflows/e2e-test.yml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ jobs:
2424
- "sample-operators/tomcat-operator"
2525
- "sample-operators/webpage"
2626
- "sample-operators/leader-election"
27-
- "sample-operators/metrics-processing"
27+
- "sample-operators/operations"
2828
runs-on: ubuntu-latest
2929
steps:
3030
- name: Checkout

docs/content/en/blog/releases/v5-3-release.md

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ A ready-to-use **Grafana dashboard** is included at
9797
[`observability/josdk-operator-metrics-dashboard.json`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/observability/josdk-operator-metrics-dashboard.json).
9898

9999
The
100-
[`metrics-processing` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
100+
[`operations` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/operations)
101101
provides a complete end-to-end setup with Prometheus, Grafana, and an OpenTelemetry Collector,
102102
installable via `observability/install-observability.sh`. This is a good starting point for
103103
verifying metrics in a real cluster.

docs/content/en/docs/documentation/operations/_index.md

Lines changed: 4 additions & 0 deletions
@@ -4,3 +4,7 @@ weight: 80
44
---
55

66
This section covers operations-related features for running and managing operators in production.
7+
8+
See the
9+
[`operations` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/operations)
10+
for a complete working example that demonstrates health probes, metrics, and Helm-based deployment.
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
1+
---
2+
title: Health Probes
3+
weight: 85
4+
---
5+
6+
Operators running in Kubernetes should expose health probe endpoints so that the kubelet can detect startup
7+
failures and runtime degradation. JOSDK provides the building blocks through its
8+
[`RuntimeInfo`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/RuntimeInfo.java)
9+
API.
10+
11+
## RuntimeInfo
12+
13+
`RuntimeInfo` is available via `operator.getRuntimeInfo()` and exposes:
14+
15+
| Method | Purpose |
16+
|---|---|
17+
| `isStarted()` | `true` once the operator and all its controllers have fully started |
18+
| `allEventSourcesAreHealthy()` | `true` when every registered event source (informers, polling sources, etc.) reports a healthy status |
19+
| `unhealthyEventSources()` | returns a map of controller name → unhealthy event sources, useful for diagnostics |
20+
| `unhealthyInformerWrappingEventSourceHealthIndicator()` | returns a map of controller name → unhealthy informer-wrapping event sources, each exposing per-informer details via `InformerHealthIndicator` (`hasSynced()`, `isWatching()`, `isRunning()`, `getTargetNamespace()`) |
21+
22+
In most cases a single readiness probe backed by `allEventSourcesAreHealthy()` is sufficient: before the
23+
operator has fully started the informers will not have synced yet, so the check naturally covers the startup
24+
case as well. Once running, it detects runtime degradation such as a lost watch connection.
25+
26+
### Fine-Grained Informer Diagnostics
27+
28+
For advanced use cases — such as exposing per-informer health in a diagnostic endpoint or logging which
29+
specific namespace lost its watch — `unhealthyInformerWrappingEventSourceHealthIndicator()` gives access to
30+
individual `InformerHealthIndicator` instances. Each indicator exposes `hasSynced()`, `isWatching()`,
31+
`isRunning()`, and `getTargetNamespace()`. This is typically not needed for a standard health probe but can
32+
be valuable for operational dashboards or troubleshooting.
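As a rough illustration of such a diagnostic view, the sketch below formats one line per unhealthy informer. It deliberately does not use the JOSDK types: `InformerHealth` is a hypothetical stand-in that mirrors only the four accessors named above, and the `Map<String, List<...>>` input shape is an assumption made to keep the example self-contained. In a real operator the data would come from `unhealthyInformerWrappingEventSourceHealthIndicator()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InformerDiagnosticsSketch {

  /** Hypothetical stand-in exposing only the four accessors mentioned in the text. */
  interface InformerHealth {
    boolean hasSynced();
    boolean isWatching();
    boolean isRunning();
    String getTargetNamespace();
  }

  /** Returns one line per unhealthy informer, ordered by controller name for stable output. */
  static List<String> describe(Map<String, List<InformerHealth>> unhealthyByController) {
    List<String> lines = new ArrayList<>();
    new TreeMap<>(unhealthyByController).forEach((controller, informers) -> {
      for (InformerHealth informer : informers) {
        lines.add(String.format(
            "controller=%s namespace=%s synced=%b watching=%b running=%b",
            controller,
            informer.getTargetNamespace(),
            informer.hasSynced(),
            informer.isWatching(),
            informer.isRunning()));
      }
    });
    return lines;
  }
}
```

The resulting lines can be logged or returned from a separate diagnostic endpoint, which keeps the probe response itself small.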
33+
34+
## Setting Up a Probe Endpoint
35+
36+
The example below uses [Jetty](https://eclipse.dev/jetty/) to expose a `/healthz` endpoint. Any HTTP
37+
server library works — the key is calling the `RuntimeInfo` methods to determine the response code.
38+
39+
```java
40+
import org.eclipse.jetty.server.Server;
41+
import org.eclipse.jetty.server.handler.ContextHandler;
42+
43+
Operator operator = new Operator();
44+
operator.register(new MyReconciler());
45+
46+
// start the health server before the operator so probes can be queried during startup
47+
var health = new ContextHandler(new HealthHandler(operator), "/healthz");
48+
Server server = new Server(8080);
49+
server.setHandler(health);
50+
server.start();
51+
52+
operator.start();
53+
```
54+
55+
Where `HealthHandler` extends `org.eclipse.jetty.server.Handler.Abstract` and checks
56+
`operator.getRuntimeInfo().allEventSourcesAreHealthy()`.
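Since any HTTP server works, the same idea can be sketched without Jetty. The example below is an assumption-laden illustration, not the handler from the sample operator: it uses the JDK's built-in `com.sun.net.httpserver.HttpServer`, and the class name `HealthEndpointSketch`, the 200/503 mapping, and the response bodies are choices made for this example. The health check is passed in as a `BooleanSupplier` so the snippet compiles without JOSDK on the classpath.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

final class HealthEndpointSketch {

  /** Starts a server that answers /healthz with 200 when healthy and 503 otherwise. */
  static HttpServer start(int port, BooleanSupplier healthy) throws IOException {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    server.createContext("/healthz", exchange -> {
      // Evaluate the check on every request so the probe reflects the current state.
      boolean ok = healthy.getAsBoolean();
      byte[] body = (ok ? "healthy" : "unhealthy").getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(ok ? 200 : 503, body.length);
      try (var out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start();
    return server;
  }
}
```

In an operator this would be wired up as `HealthEndpointSketch.start(8080, () -> operator.getRuntimeInfo().allEventSourcesAreHealthy())` before calling `operator.start()`, matching the ordering shown in the Jetty snippet.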
57+
58+
See the
59+
[`operations` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/operations)
60+
for a complete working example.
61+
62+
## Kubernetes Deployment Configuration
63+
64+
Once your operator exposes the probe endpoint, configure probes in your Deployment manifest. Both the
65+
startup and readiness probes can point to the same `/healthz` endpoint — the startup probe simply uses a
66+
higher `failureThreshold` to give the operator time to initialize:
67+
68+
```yaml
69+
containers:
70+
- name: operator
71+
ports:
72+
- name: probes
73+
containerPort: 8080
74+
startupProbe:
75+
httpGet:
76+
path: /healthz
77+
port: probes
78+
initialDelaySeconds: 1
79+
periodSeconds: 3
80+
failureThreshold: 20
81+
readinessProbe:
82+
httpGet:
83+
path: /healthz
84+
port: probes
85+
initialDelaySeconds: 5
86+
periodSeconds: 5
87+
failureThreshold: 3
88+
```
89+
90+
The startup probe gives the operator time to start (up to ~60 s with the settings above). Once the startup
91+
probe succeeds, the readiness probe takes over and will mark the pod as not-ready if any event source
92+
becomes unhealthy.
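The ~60 s figure follows from the probe settings. A minimal sketch of the arithmetic, assuming the budget is approximated as the initial delay plus one period per allowed failure (this ignores `timeoutSeconds` and kubelet scheduling jitter, and `ProbeBudgetSketch` is a name invented for this example):

```java
final class ProbeBudgetSketch {

  /** Approximate seconds before the startup probe is considered failed. */
  static int startupBudgetSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
    return initialDelaySeconds + periodSeconds * failureThreshold;
  }
}
```

With the manifest above, `startupBudgetSeconds(1, 3, 20)` gives 61. Raising `periodSeconds` to 10 with the same threshold gives 201, so the period and threshold should be tuned together.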
93+
94+
## Helm Chart Support
95+
96+
The [generic Helm chart](/docs/documentation/operations/helm-chart) supports health probes out of the box.
97+
Enable them in your `values.yaml`:
98+
99+
```yaml
100+
probes:
101+
port: 8080
102+
startup:
103+
enabled: true
104+
path: /healthz
105+
readiness:
106+
enabled: true
107+
path: /healthz
108+
```
109+
110+
All probe timing parameters (`initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `failureThreshold`) have
111+
defaults in the chart's `values.yaml` and can be overridden per probe.

docs/content/en/docs/documentation/operations/helm-chart.md

Lines changed: 5 additions & 5 deletions
@@ -11,7 +11,7 @@ patterns so you don't have to write a chart from scratch. The chart is maintaine
1111
Contributions are more than welcome.
1212

1313
The chart is used in the
14-
[`metrics-processing` sample operator E2E test](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java)
14+
[`operations` sample operator E2E test](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/operations/src/test/java/io/javaoperatorsdk/operator/sample/operations/OperationsE2E.java)
1515
to deploy the operator to a cluster via Helm.
1616

1717
## What the Chart Provides
@@ -80,16 +80,16 @@ for all available options.
8080

8181
## Usage Example
8282

83-
A working example of how to use the chart can be found in the metrics-processing sample operator's
84-
[`helm-values.yaml`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/resources/helm-values.yaml):
83+
A working example of how to use the chart can be found in the operations sample operator's
84+
[`helm-values.yaml`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/operations/src/test/resources/helm-values.yaml):
8585

8686
```yaml
8787
image:
88-
repository: metrics-processing-operator
88+
repository: operations-operator
8989
pullPolicy: Never
9090
tag: "latest"
9191
92-
nameOverride: "metrics-processing-operator"
92+
nameOverride: "operations-operator"
9393
9494
resources: {}
9595

docs/content/en/docs/documentation/operations/metrics.md

Lines changed: 2 additions & 2 deletions
@@ -103,9 +103,9 @@ observability sample (see below).
103103
#### Exploring metrics end-to-end
104104

105105
The
106-
[`metrics-processing` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
106+
[`operations` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/operations)
107107
includes a full end-to-end test,
108-
[`MetricsHandlingE2E`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java),
108+
[`OperationsE2E`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/operations/src/test/java/io/javaoperatorsdk/operator/sample/metrics/OperationsE2E.java),
109109
that:
110110

111111
1. Installs a local observability stack (Prometheus, Grafana, OpenTelemetry Collector) via

helm/generic-helm-chart/templates/deployment.yaml

Lines changed: 36 additions & 0 deletions
@@ -54,6 +54,42 @@ spec:
5454
{{- toYaml .Values.securityContext | nindent 12 }}
5555
image: "{{ required "A valid .Values.image.repository is required" .Values.image.repository }}:{{ include "generic-operator.imageTag" . }}"
5656
imagePullPolicy: {{ .Values.image.pullPolicy }}
57+
{{- if or .Values.probes.startup.enabled .Values.probes.readiness.enabled .Values.probes.liveness.enabled }}
58+
ports:
59+
- name: probes
60+
containerPort: {{ .Values.probes.port }}
61+
protocol: TCP
62+
{{- end }}
63+
{{- if .Values.probes.startup.enabled }}
64+
startupProbe:
65+
httpGet:
66+
path: {{ .Values.probes.startup.path }}
67+
port: probes
68+
initialDelaySeconds: {{ .Values.probes.startup.initialDelaySeconds }}
69+
periodSeconds: {{ .Values.probes.startup.periodSeconds }}
70+
timeoutSeconds: {{ .Values.probes.startup.timeoutSeconds }}
71+
failureThreshold: {{ .Values.probes.startup.failureThreshold }}
72+
{{- end }}
73+
{{- if .Values.probes.readiness.enabled }}
74+
readinessProbe:
75+
httpGet:
76+
path: {{ .Values.probes.readiness.path }}
77+
port: probes
78+
initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
79+
periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
80+
timeoutSeconds: {{ .Values.probes.readiness.timeoutSeconds }}
81+
failureThreshold: {{ .Values.probes.readiness.failureThreshold }}
82+
{{- end }}
83+
{{- if .Values.probes.liveness.enabled }}
84+
livenessProbe:
85+
httpGet:
86+
path: {{ .Values.probes.liveness.path }}
87+
port: probes
88+
initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
89+
periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
90+
timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }}
91+
failureThreshold: {{ .Values.probes.liveness.failureThreshold }}
92+
{{- end }}
5793
env:
5894
- name: OPERATOR_NAMESPACE
5995
valueFrom:

helm/generic-helm-chart/tests/deployment_test.yaml

Lines changed: 54 additions & 0 deletions
@@ -288,3 +288,57 @@ tests:
288288
- equal:
289289
path: spec.template.spec.serviceAccountName
290290
value: my-operator
291+
292+
- it: should not include probes by default
293+
asserts:
294+
- isNull:
295+
path: spec.template.spec.containers[0].startupProbe
296+
- isNull:
297+
path: spec.template.spec.containers[0].readinessProbe
298+
299+
- it: should add startup probe when enabled
300+
documentSelector:
301+
path: kind
302+
value: Deployment
303+
set:
304+
probes.startup.enabled: true
305+
asserts:
306+
- equal:
307+
path: spec.template.spec.containers[0].startupProbe.httpGet.path
308+
value: /health/startup
309+
- equal:
310+
path: spec.template.spec.containers[0].startupProbe.httpGet.port
311+
value: probes
312+
- contains:
313+
path: spec.template.spec.containers[0].ports
314+
content:
315+
name: probes
316+
containerPort: 8080
317+
protocol: TCP
318+
319+
- it: should add readiness probe when enabled
320+
documentSelector:
321+
path: kind
322+
value: Deployment
323+
set:
324+
probes.readiness.enabled: true
325+
asserts:
326+
- equal:
327+
path: spec.template.spec.containers[0].readinessProbe.httpGet.path
328+
value: /health/ready
329+
- equal:
330+
path: spec.template.spec.containers[0].readinessProbe.httpGet.port
331+
value: probes
332+
333+
- it: should add both probes when both enabled
334+
documentSelector:
335+
path: kind
336+
value: Deployment
337+
set:
338+
probes.startup.enabled: true
339+
probes.readiness.enabled: true
340+
asserts:
341+
- isNotNull:
342+
path: spec.template.spec.containers[0].startupProbe
343+
- isNotNull:
344+
path: spec.template.spec.containers[0].readinessProbe

helm/generic-helm-chart/values.yaml

Lines changed: 32 additions & 0 deletions
@@ -86,6 +86,9 @@ operatorConfig:
8686
</Console>
8787
</Appenders>
8888
<Loggers>
89+
<Logger name="io.micrometer.registry.otlp.OtlpMeterRegistry" level="ERROR" additivity="false">
90+
<AppenderRef ref="Console"/>
91+
</Logger>
8992
<Root level="INFO">
9093
<AppenderRef ref="Console"/>
9194
</Root>
@@ -128,3 +131,32 @@ extraVolumeMounts: []
128131
# RBAC configuration
129132
rbac:
130133
create: true
134+
135+
# Health probes configuration
136+
probes:
137+
port: 8080
138+
startup:
139+
enabled: false
140+
path: /health/startup
141+
initialDelaySeconds: 1
142+
periodSeconds: 10
143+
timeoutSeconds: 5
144+
failureThreshold: 20
145+
readiness:
146+
enabled: false
147+
path: /health/ready
148+
initialDelaySeconds: 5
149+
periodSeconds: 5
150+
timeoutSeconds: 5
151+
failureThreshold: 3
152+
# We provide an option to specify liveness probes.
153+
# However, the framework itself does not define any runtime
154+
# information about what such a probe should check. The only purpose here
155+
# is to cover your domain-specific use case.
156+
liveness:
157+
enabled: false
158+
path: /health/live
159+
initialDelaySeconds: 15
160+
periodSeconds: 10
161+
timeoutSeconds: 5
162+
failureThreshold: 3

observability/install-observability.sh

Lines changed: 1 addition & 1 deletion
@@ -237,7 +237,7 @@ kubectl wait --for=condition=ready pod --all -n cert-manager --timeout=300s 2>/d
237237

238238
# Wait for observability pods
239239
echo -e "${YELLOW}Checking observability pods...${NC}"
240-
kubectl wait --for=condition=ready pod --all -n observability --timeout=300s
240+
kubectl wait --for=condition=ready pod --all -n observability --timeout=480s
241241

242242
echo -e "${GREEN}✓ All pods are ready${NC}"
243243
