The `csi_operations_seconds` metric reports a latency histogram of kubelet-initiated CSI gRPC calls, broken down by gRPC status code. Filtering by `NodeStageVolume` and `NodePublishVolume` gives latency data for the gRPC calls that include FSGroup operations for drivers with the `VOLUME_MOUNT_GROUP` capability, but analyzing driver logs is necessary to further isolate a problem to this feature.
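
For example, the latency signal can be pulled out of the histogram with a Prometheus query along the following lines. This is a minimal sketch: it assumes the kubelet exposes `csi_operations_seconds` with `driver_name`, `method_name`, and `grpc_status_code` labels, and the driver name is purely illustrative.

```promql
# 99th-percentile latency, per method, of the CSI calls that carry
# FSGroup work for drivers with VOLUME_MOUNT_GROUP (driver name is
# a hypothetical placeholder).
histogram_quantile(0.99,
  sum by (le, method_name) (
    rate(csi_operations_seconds_bucket{
      driver_name="hypothetical.csi.example.com",
      method_name=~"NodeStageVolume|NodePublishVolume"
    }[5m])
  )
)
```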
An SLI isn't necessary for the kubelet logic, since the kubelet simply passes the FSGroup parameter through to the CSI driver.

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

For a particular CSI driver: per-day percentage of gRPC calls with `method_name=NodeStageVolume|NodePublishVolume` returning error status codes (as defined by the CSI spec) <= 1%.

A latency SLO would be specific to each driver.
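
Assuming the same label scheme as above, the error-rate SLO could be evaluated with a query like this sketch; the driver name is again a placeholder, and "error status codes" is approximated here as everything other than `OK`:

```promql
# Per-day percentage of NodeStageVolume/NodePublishVolume calls that
# returned a non-OK gRPC status; the SLO holds while this stays <= 1.
100 *
  sum(increase(csi_operations_seconds_count{
      driver_name="hypothetical.csi.example.com",
      method_name=~"NodeStageVolume|NodePublishVolume",
      grpc_status_code!="OK"
    }[1d]))
/
  sum(increase(csi_operations_seconds_count{
      driver_name="hypothetical.csi.example.com",
      method_name=~"NodeStageVolume|NodePublishVolume"
    }[1d]))
```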
* **Are there any missing metrics that would be useful to have to improve observability of this feature?**

https://github.com/kubernetes/kubernetes/issues/98667, as mentioned above; the aim is to implement it as part of beta.

### Dependencies
_For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field._

* **Will enabling / using this feature result in any new API calls?**

No.

* **Will enabling / using this feature result in introducing new API types?**

No.

* **Will enabling / using this feature result in any new calls to the cloud provider?**

No.

* **Will enabling / using this feature result in increasing size or count of the existing API objects?**

No.

* **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?**
  Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.

Depending on the driver's implementation of applying FSGroup, latency for the following SLI may increase:

"Startup latency of schedulable stateful pods, excluding time to pull images, run init containers, provision volumes (in delayed binding mode) and unmount/detach volumes (from previous pod if needed), measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes"

Compared to the existing recursive `chown` and `chmod` strategy, this operation will likely improve pod startup latency in the most common case.
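
One way to sanity-check that claim in a live cluster is to watch kubelet's volume mount latency across a rollout of the `VOLUME_MOUNT_GROUP`-capable driver version. The sketch below assumes kubelet's `storage_operation_duration_seconds` histogram with an `operation_name` label; label names can differ across Kubernetes versions.

```promql
# 99th-percentile volume mount latency as seen by kubelet; compare
# before and after the driver takes over FSGroup handling.
histogram_quantile(0.99,
  sum by (le) (
    rate(storage_operation_duration_seconds_bucket{
      operation_name="volume_mount"
    }[1h])
  )
)
```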
* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**
  Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this both in small and large cases, again with respect to the [supported limits].

Not in Kubernetes components. CSI drivers may vary in their implementation and may increase resource usage.

### Troubleshooting

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

This feature is part of the volume mount path in kubelet, and does not add extra communication with the API server, so it does not introduce new failure modes when the API server or etcd is unavailable.

* **What are other known failure modes?**
  For each of them, fill in the following information by copying the below template:
  - [Failure mode brief description]
    - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
    - Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
    - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
    - Testing: Are there any tests for failure mode? If not, describe why.

In addition to existing k8s volume and CSI failure modes:

- Driver fails to apply FSGroup (due to a driver error).
  - Detection: the SLI above, in conjunction with the metric proposed in https://github.com/kubernetes/kubernetes/issues/98667 to determine whether this feature is being used (see the query sketch after this list).
  - Mitigations: revert the CSI driver to a version without the issue, or avoid specifying an FSGroup in the pod's security context, if possible.
  - Diagnostics: depends on the driver. Generally, look for FSGroup-related messages in `NodeStageVolume` and `NodePublishVolume` logs.
  - Testing: an e2e test with a test driver (csi-driver-host-path) simulating an FSGroup failure will be added.
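
As referenced in the Detection item above, an alerting expression for this failure mode could look like the following sketch; it assumes the same `csi_operations_seconds` labels as earlier and treats any non-`OK` status on the two FSGroup-carrying methods as a signal:

```promql
# Fire when NodeStageVolume/NodePublishVolume errors show up for a
# driver; group by status code to make triage easier.
sum by (driver_name, grpc_status_code) (
  increase(csi_operations_seconds_count{
    method_name=~"NodeStageVolume|NodePublishVolume",
    grpc_status_code!="OK"
  }[10m])
) > 0
```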
* **What steps should be taken if SLOs are not being met to determine the problem?**

The CSI driver logs should be inspected for `NodeStageVolume` and/or `NodePublishVolume` errors.
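
Before diving into driver logs, a breakdown like the following sketch (same label assumptions as above) can narrow the search to the driver, method, and status code responsible:

```promql
# Top error sources over the last hour, by driver, method, and status.
topk(5,
  sum by (driver_name, method_name, grpc_status_code) (
    increase(csi_operations_seconds_count{
      method_name=~"NodeStageVolume|NodePublishVolume",
      grpc_status_code!="OK"
    }[1h])
  )
)
```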