Commit fb6484f

deployment: add startupProbe for nfd-master
This patch mitigates inadvertent termination of nfd-master pods by the liveness probe on big clusters. With a recent change, nfd-master started to wait (block) for informer caches to sync before starting the main loop. As a consequence, the gRPC health endpoint does not respond until the caches have been synced. In big clusters, syncing the NodeFeature object cache takes a long time because the objects are big and there is (at least) one per node in the cluster. Thus, in big clusters, the liveness probe kicks in and kills the nfd-master pod before it is ready.
1 parent 6bc66ed commit fb6484f
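
As a rough illustration of the time budget this change gives nfd-master, the sketch below annotates the startup probe settings from this commit. It assumes `periodSeconds` is left unset so that the Kubernetes default of 10 seconds applies; the comments are explanatory and not part of the committed manifest.

```yaml
# Sketch only: the startup probe values from this commit, with the
# resulting startup budget spelled out in comments.
startupProbe:
  grpc:
    port: 8082
  # Up to 30 consecutive failed probes are tolerated during startup.
  failureThreshold: 30
  # periodSeconds is not set, so the Kubernetes default (10) is assumed.
  # Startup budget: 30 failures x 10 s = about 300 s (5 minutes).
# Liveness and readiness probes are held back until the startup probe
# has succeeded once, so a slow cache sync no longer trips the liveness probe.
livenessProbe:
  grpc:
    port: 8082
```

If the cache sync on a given cluster takes longer than that budget, the container would still be restarted, so the threshold may need to be raised for very large clusters.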

4 files changed: +32 -8 lines changed

deployment/base/master/master-deployment.yaml

Lines changed: 4 additions & 4 deletions
@@ -28,16 +28,16 @@ spec:
     requests:
       cpu: 100m
       memory: 128Mi
+startupProbe:
+  grpc:
+    port: 8082
+  failureThreshold: 30
 livenessProbe:
   grpc:
     port: 8082
-  initialDelaySeconds: 10
-  periodSeconds: 10
 readinessProbe:
   grpc:
     port: 8082
-  initialDelaySeconds: 5
-  periodSeconds: 10
   failureThreshold: 10
 command:
   - "nfd-master"

deployment/helm/node-feature-discovery/templates/master.yaml

Lines changed: 15 additions & 0 deletions
@@ -47,6 +47,21 @@ spec:
   {{- toYaml .Values.master.securityContext | nindent 12 }}
 image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
 imagePullPolicy: {{ .Values.image.pullPolicy }}
+startupProbe:
+  grpc:
+    port: {{ .Values.master.healthPort | default "8082" }}
+  {{- with .Values.master.startupProbe.initialDelaySeconds }}
+  initialDelaySeconds: {{ . }}
+  {{- end }}
+  {{- with .Values.master.startupProbe.failureThreshold }}
+  failureThreshold: {{ . }}
+  {{- end }}
+  {{- with .Values.master.startupProbe.periodSeconds }}
+  periodSeconds: {{ . }}
+  {{- end }}
+  {{- with .Values.master.startupProbe.timeoutSeconds }}
+  timeoutSeconds: {{ . }}
+  {{- end }}
 livenessProbe:
   grpc:
     port: {{ .Values.master.healthPort | default "8082" }}

deployment/helm/node-feature-discovery/values.yaml

Lines changed: 7 additions & 2 deletions
@@ -152,18 +152,23 @@ master:
                 operator: In
                 values: [""]
 
+  startupProbe:
+    grpc:
+      port: 8082
+    failureThreshold: 30
+    # periodSeconds: 10
   livenessProbe:
     grpc:
       port: 8082
-    initialDelaySeconds: 10
     # failureThreshold: 3
+    # initialDelaySeconds: 0
     # periodSeconds: 10
     # timeoutSeconds: 1
   readinessProbe:
     grpc:
       port: 8082
-    initialDelaySeconds: 5
     failureThreshold: 10
+    # initialDelaySeconds: 0
     # periodSeconds: 10
     # timeoutSeconds: 1
     # successThreshold: 1

docs/deployment/helm.md

Lines changed: 6 additions & 2 deletions
@@ -201,11 +201,15 @@ API's you need to install the prometheus operator in your cluster.
 | `master.extraArgs` | array | [] | Additional [command line arguments](../reference/master-commandline-reference.md) to pass to nfd-master |
 | `master.extraEnvs` | array | [] | Additional environment variables to pass to nfd-master |
 | `master.revisionHistoryLimit` | integer | | Specify how many old ReplicaSets for this Deployment you want to retain. [revisionHistoryLimit](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#revision-history-limit) |
-| `master.livenessProbe.initialDelaySeconds` | integer | 10 | Specifies the number of seconds after the container has started before liveness probes are initiated. |
+| `master.startupProbe.initialDelaySeconds` | integer | 0 (by Kubernetes) | Specifies the number of seconds after the container has started before startup probes are initiated. |
+| `master.startupProbe.failureThreshold` | integer | 30 | Specifies the number of consecutive failures of startup probes before considering the pod as not ready. |
+| `master.startupProbe.periodSeconds` | integer | 10 (by Kubernetes) | Specifies how often (in seconds) to perform the startup probe. |
+| `master.startupProbe.timeoutSeconds` | integer | 1 (by Kubernetes) | Specifies the number of seconds after which the probe times out. |
+| `master.livenessProbe.initialDelaySeconds` | integer | 0 (by Kubernetes) | Specifies the number of seconds after the container has started before liveness probes are initiated. |
 | `master.livenessProbe.failureThreshold` | integer | 3 (by Kubernetes) | Specifies the number of consecutive failures of liveness probes before considering the pod as not ready. |
 | `master.livenessProbe.periodSeconds` | integer | 10 (by Kubernetes) | Specifies how often (in seconds) to perform the liveness probe. |
 | `master.livenessProbe.timeoutSeconds` | integer | 1 (by Kubernetes) | Specifies the number of seconds after which the probe times out. |
-| `master.readinessProbe.initialDelaySeconds` | integer | 5 | Specifies the number of seconds after the container has started before readiness probes are initiated. |
+| `master.readinessProbe.initialDelaySeconds` | integer | 0 (by Kubernetes) | Specifies the number of seconds after the container has started before readiness probes are initiated. |
 | `master.readinessProbe.failureThreshold` | integer | 10 | Specifies the number of consecutive failures of readiness probes before considering the pod as not ready. |
 | `master.readinessProbe.periodSeconds` | integer | 10 (by Kubernetes) | Specifies how often (in seconds) to perform the readiness probe. |
 | `master.readinessProbe.timeoutSeconds` | integer | 1 (by Kubernetes) | Specifies the number of seconds after which the probe times out. |
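
The parameters documented above could be overridden in a custom Helm values file along the following lines. This is a minimal sketch, not a recommendation: the keys are the ones read by the chart template in this commit, but the value 60 is an assumed example for a cluster where cache sync takes longer than the default budget.

```yaml
# Hypothetical values override for a very large cluster (example numbers only).
master:
  startupProbe:
    # Assumed example: doubles the chart default of 30.
    failureThreshold: 60
    # Same as the Kubernetes default; set explicitly here for readability.
    periodSeconds: 10
    # Resulting startup budget: 60 x 10 s = about 600 s.
```

Because the template emits each field only when the corresponding value is set, any field left out of the override falls back to the Kubernetes default.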
