Commit 14b80ca

KEP-3107: target csiNodeExpandSecret beta in 1.27
Signed-off-by: Humble Chirammal <[email protected]>
1 parent c8c1592 commit 14b80ca

3 files changed: +149 -23 lines changed

keps/prod-readiness/sig-storage/3107.yaml

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@
 kep-number: 3107
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-storage/3107-csi-nodeexpandsecret/README.md

Lines changed: 143 additions & 19 deletions
@@ -172,27 +172,47 @@ N/A
 [sidecar](https://github.com/kubernetes-csi/external-provisioner/). Once added this
 support to mentioned sidecar, the e2e tests will be added and validated using
 example CSI driver [tests](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/drivers/csi-test/driver/driver.go).
+- E2E test PR is available [here](https://github.com/kubernetes/kubernetes/pull/115451)
+and this test has been enabled in the [testgrid](https://k8s-testgrid.appspot.com/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-gce-cos-alpha-features)
 
 ### Graduation Criteria
 
 #### Alpha
 
 - Implemented the feature.
-- Wrote all the unit and E2E tests.
+- implementation of unit tests.
 
 #### Beta
 
 - Deployed the feature in production and went through at least minor k8s
 version.
+- Feedback from users.
+- Implementation of e2e tests.
 
 #### GA
 
 #### Deprecation
 
 ### Upgrade / Downgrade Strategy
 
+1. Upgrading a Kubernetes cluster with this feature flag enabled:
+- in this upgraded cluster, a CSI driver should receive secrets as
+part of NodeExpansion RPC call from CO side and should be able to
+make use of it while expanding volumes on node.
+
+2. Downgrading a Kubernetes cluster with feature disabled:
+- in this downgraded cluster, a CSI driver will not receive secrets
+as part of the NodeExpansion RPC call from CO side.
+
 ### Version Skew Strategy
 
+The proposal requires changes to kubelet and kube api server feature
+flag set. If any of the components are not upgraded to a version
+supporting this feature, then the feature will not work as expected.
+From an end user perspective, the existing behaviour will continue, i.e.,
+there will be no facility to get the secrets as part of the node expansion
+RPC call from CO side to the CSI driver.
+
 ## Production Readiness Review Questionnaire
 
 ### Feature Enablement and Rollback
@@ -220,53 +240,138 @@ N/A
 
 ### Rollout, Upgrade and Rollback Planning
 
-TBD
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-TBD
+A failed scenario of rollout or rollback does not have any impact on running workloads.
+The CSI drivers use the feature based on the availability of Secrets in NodeExpansion
+call which is controlled by the Kubernetes feature flag set.
+
+<!--
+Try to be as paranoid as possible - e.g., what if some components will restart
+mid-rollout?
+
+Be sure to consider highly-available clusters, where, for example,
+feature flags will be enabled on some API servers and not others during the
+rollout. Similarly, consider large clusters and how enablement/disablement
+will rollout across nodes.
+-->
 
 ###### What specific metrics should inform a rollback?
 
-TBD
+`csi_kubelet_operations_seconds` metric available
+[here](https://github.com/kubernetes/kubernetes/blob/6b55f097bb2140381a58312aeede37fc76a0762e/pkg/volume/util/metrics.go#L66)
+covers CSI NodeExpand operation which can be used for this purpose.
 
+<!--
+What signals should users be paying attention to when the feature is young
+that might indicate a serious problem?
+-->
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-TBD
+Manual testing will be performed on upgrade and rollback.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
-TBD
+No.
 
 ### Monitoring Requirements
 
-TBD
-
 ###### How can an operator determine if the feature is in use by workloads?
 
-TBD
+An operator can query for api server and kubelet flags in the cluster
+for `CSINodeExpandSecret` flag and if it exists then the feature is
+in use.
+
+
+<!--
+For instance, if this is a pod-related feature, it should be possible to determine if the feature is functioning properly
+for each individual pod.
+Pick one more of these and delete the rest.
+Please describe all items visible to end users below with sufficient detail so that they can verify correct enablement
+and operation of this feature.
+Recall that end users cannot usually observe component logs or access metrics.
+-->
+
+- [ ] Events
+- Event Reason:
+- [ ] API .status
+- Condition name:
+- Other field:
+- [x] Other (treat as last resort)
+- Details: to make use of this feature in a cluster a StorageClass instance has
+to carry below entries in the parameter list.
+
+```
+csi.storage.k8s.io/node-expand-secret-name
+csi.storage.k8s.io/node-expand-secret-namespace
+```
+
+The subjected CSI PV object should have `nodeExpandSecretRef` field filled with the
+details given in the StorageClass.
 
-###### How can someone using this feature know that it is working for their instance?
-
-TBD
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-TBD
+<!--
+This is your opportunity to define what "normal" quality of service looks like
+for a feature.
+
+It's impossible to provide comprehensive guidance, but at the very
+high level (needs more precise definitions) those may be things like:
+- per-day percentage of API calls finishing with 5XX errors <= 1%
+- 99% percentile over day of absolute value from (job creation time minus expected
+job creation time) for cron job <= 10%
+- 99.9% of /health requests per day finish with 200 code
+
+These goals will help you determine what you need to measure (SLIs) in the next
+question.
+-->
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
-TBD
+
+<!--
+Pick one more of these and delete the rest.
+-->
+
+- [ ] Metrics
+- Metric name: `csiOperationsLatencyMetric` can be used by an operator to determine
+the health of the service.
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-TBD
+<!--
+Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+implementation difficulties, etc.).
+-->
 
 ### Dependencies
 
-TBD
+This feature depends on the cluster having CSI drivers and sidecars that use CSI
+spec v1.5.0 at minimum.
 
 ###### Does this feature depend on any specific services running in the cluster?
 
-TBD
+<!--
+Think about both cluster-level services (e.g. metrics-server) as well
+as node-level agents (e.g. specific version of CRI). Focus on external or
+optional services that are needed. For example, if this feature depends on
+a cloud provider API, or upon an external software-defined storage or network
+control plane.
+
+For each of these, fill in the following—thinking about running existing user workloads
+and creating new ones, as well as about cluster-level services (e.g. DNS):
+- [Dependency name]
+- Usage description:
+- Impact of its outage on the feature:
+- Impact of its degraded performance or high-error rates on the feature:
+-->
+- [CSI drivers and sidecars]
+- Usage description:
+- Impact of its outage on the feature: Inability to perform CSI storage
+operations with NodeExpandVolume RPC call where the CSI driver requires
+credentials to complete this specific operation.
+- Impact of its degraded performance or high-error rates on the feature:
+Increase in latency performing CSI storage operations (due to repeated
+retries)
 
 ### Scalability
 
@@ -279,16 +384,30 @@ TBD
 provider?** no.
 
 - **Will enabling / using this feature result in increasing size or count of
-the existing API objects?** no.
+the existing API objects?**
+yes, this adds a new field to the API so it changes the size.
 
 - **Will enabling / using this feature result in increasing time taken by any
 operations covered by [existing SLIs/SLOs]?** no.
 
 - **Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?** no.
 
+- **Can enabling / using this feature result in resource exhaustion of some
+node resources (PIDs, sockets, inodes, etc.)?** no.
+
 ### Troubleshooting
 
+If the CSI driver does not receive the secrets as part of nodeExpansion
+request, below things have to be checked in a cluster.
+
+- make sure StorageClass has `csi.storage.k8s.io/node-expand-secret-name`
+and `csi.storage.k8s.io/node-expand-secret-namespace` parameters set
+with proper value.
+
+- make sure `CSINodeExpandSecret` feature gate has been enabled for
+`kubelet` and `kube-apiserver` configuration in the cluster.
+
 ## Implementation History
 
 - 18/01/2022: Implementation started
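
Editorial note on the hunk above: the README text names two StorageClass parameters and the `nodeExpandSecretRef` field on the PV. A minimal sketch of what such objects might look like follows. The object names, the driver name, the volume handle, and the Secret name are hypothetical placeholders, not values from this commit, and PV fields unrelated to the secret are omitted.

```yaml
# Hypothetical StorageClass; only the two node-expand-secret parameter keys
# come from the KEP text, every value here is a placeholder.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-expandable            # hypothetical name
provisioner: example.csi.vendor.io    # hypothetical CSI driver name
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/node-expand-secret-name: example-expand-secret   # hypothetical Secret
  csi.storage.k8s.io/node-expand-secret-namespace: default
---
# Excerpt of a PersistentVolume provisioned from that class.
# Other required PV fields (capacity, accessModes, ...) are omitted.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example                   # hypothetical name
spec:
  csi:
    driver: example.csi.vendor.io
    volumeHandle: example-volume-id   # hypothetical
    nodeExpandSecretRef:              # should mirror the StorageClass parameters
      name: example-expand-secret
      namespace: default
```

When troubleshooting, comparing the `nodeExpandSecretRef` on the PV against the StorageClass parameters is one way to confirm the reference was propagated at provisioning time.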
@@ -303,4 +422,9 @@ however this is really a hacky way and not the CSI driver authors want.
 
 ## Infrastructure Needed (Optional)
 
+<!--
+Use this section if you need things from the project/SIG. Examples include a
+new subproject, repos requested, or GitHub details. Listing these here allows a
+SIG to get the process for these resources started right away.
+-->
 ---

keps/sig-storage/3107-csi-nodeexpandsecret/kep.yaml

Lines changed: 4 additions & 4 deletions
@@ -16,18 +16,18 @@ see-also:
   - TBD
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.25"
+latest-milestone: "v1.27"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.25"
-  beta: "v1.26"
-  stable: "v1.27"
+  beta: "v1.27"
+  stable: "v1.28"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
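
Editorial note: the README changes in this commit name `CSINodeExpandSecret` as the gate to check on `kubelet` and `kube-apiserver`. As a rough sketch of setting it explicitly, assuming a kubelet configuration file and a kubeadm-style static pod manifest for the API server (both layouts are assumptions, not part of this commit):

```yaml
# Kubelet configuration file (assumed to be the file passed to kubelet via --config).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CSINodeExpandSecret: true
---
# Excerpt of a kube-apiserver static pod manifest (assumed kubeadm-style layout;
# image and all other flags are omitted).
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --feature-gates=CSINodeExpandSecret=true
```

Beta feature gates are usually enabled by default, so an explicit setting like this probably matters mainly on clusters where the gate was previously turned off.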
