@@ -89,6 +89,7 @@ that can be used to request a particular duration.
89
89
### Non-Goals
90
90
91
91
- Requiring signers to honor the requested duration
92
+ - Configuring existing in-tree consumers of the CSR API (i.e. kubelets) to make use of this feature
92
93
93
94
## Proposal
94
95
@@ -133,8 +134,10 @@ structs found in the `k8s.io/api/certificates/v1` and `k8s.io/api/certificates/v
133
134
packages.
134
135
135
136
A new optional ` ExpirationSeconds ` field will be added to this struct. The go
136
- doc comment describes the behavior of this field. A ` *uint32 ` is used because
137
- the field is optional, must be positive and must not overflow JSON parsers.
137
+ doc comment describes the behavior of this field. A ` *int32 ` is used because
138
+ the field is optional and must not overflow JSON parsers (an unsigned type is
139
+ avoided as we want to provide a detailed validation error on a negative input
140
+ instead of a difficult to understand decoding error).
138
141
139
142
``` go
140
143
// CertificateSigningRequestSpec contains the certificate request.
@@ -164,7 +167,7 @@ type CertificateSigningRequestSpec struct {
164
167
// The minimum valid value for expirationSeconds is 600, i.e. 10 minutes.
165
168
//
166
169
// +optional
167
- ExpirationSeconds *uint32 ` json:"expirationSeconds,omitempty" protobuf:"varint,8,opt,name=expirationSeconds"`
170
+ ExpirationSeconds *int32 ` json:"expirationSeconds,omitempty" protobuf:"varint,8,opt,name=expirationSeconds"`
168
171
169
172
// go doc omitted for brevity
170
173
Usages []KeyUsage ` json:"usages,omitempty" protobuf:"bytes,5,opt,name=usages"`
@@ -183,12 +186,17 @@ and signing is asynchronous which necessitates a buffer), `10` minutes seems lik
183
186
appropriate minimum to prevent accidental DOS against the CSR API. Furthermore,
184
187
` 10 ` minutes is a short enough lifetime that revocation is not of concern.
185
188
189
+ Metrics will be included to show if requested expirations are being extended or
190
+ truncated (i.e. is the requested duration being honored by the signer).
191
+
186
192
### Test Plan
187
193
188
194
Unit tests covering:
189
195
190
196
1 . Validation logic for minimum duration
191
197
2 . ` pkg/controller/certificates/authority.PermissiveSigningPolicy ` updates to handle ` expirationSeconds `
198
+ 3 . Metrics
199
+ 4 . CSR REST storage ignores ` spec.expirationSeconds ` when the feature gate is disabled
192
200
193
201
Integration test covering:
194
202
@@ -200,8 +208,7 @@ Integration test covering:
200
208
201
209
#### Alpha
202
210
203
- - This design will start at the beta phase and the functionality will always be enabled
204
- - There will be no feature gate associated with this design (discussed below)
211
+ - This design will start at the beta phase and the functionality will be enabled by default
205
212
206
213
This design represents a small, optional change to an existing GA API. Thus it
207
214
prioritizes rollout speed to allow clients to start using this functionality
@@ -217,7 +224,9 @@ during version skews (discussed below).
217
224
218
225
#### GA
219
226
220
- - Confirm with cert-manager that the new functionality addresses their use case
227
+ - Confirm with [ cert-manager] ( https://github.com/jetstack/cert-manager/pull/3646 ) that the new functionality addresses their use case
228
+ - Confirm with [ pinniped] ( https://pinniped.dev ) that the new functionality addresses their use case
229
+ - Confirm that no other metrics are necessary
221
230
- Wait one release after beta to allow bugs to be reported
222
231
223
232
The existing conformance tests for the certificates API (` test/e2e/auth/certificates.go ` )
@@ -237,7 +246,7 @@ could request a duration of a month but the signer could truncate the duration t
237
246
its internal maximum of two weeks). Thus this design emphasizes feature rollout
238
247
speed to aid in ecosystem adoption instead of data durability. Combined with the
239
248
simplicity of implementation and low risk nature of the proposal, the alpha stage
240
- and related feature gate have been omitted from this design.
249
+ has been omitted from this design.
241
250
242
251
Clients that do not set the ` spec.expirationSeconds ` field will observe no change
243
252
in behavior, even during upgrades and downgrades.
@@ -277,7 +286,7 @@ represents the status quo.
277
286
278
287
In this scenario, the requested ` spec.expirationSeconds ` may be ignored because
279
288
old API servers will silently drop the field on update. This is harmless
280
- represents and the status quo.
289
+ and represents the status quo.
281
290
282
291
The CSR API is resilient to split brain scenarios as unknown fields are silently
283
292
dropped and the ` spec ` fields are immutable after creation [ 1] [ 2] [ 3] .
@@ -292,15 +301,11 @@ dropped and the `spec` fields are immutable after creation [1] [2] [3].
292
301
293
302
###### How can this feature be enabled / disabled in a live cluster?
294
303
295
- - Other
296
- - Describe the mechanism:
297
- As written above, this functionality will be enabled by default with no configuration options.
298
- - Will enabling / disabling the feature require downtime of the control
299
- plane?
300
- N/A
301
- - Will enabling / disabling the feature require downtime or reprovisioning
302
- of a node? (Do not assume ` Dynamic Kubelet Config ` feature is enabled).
303
- N/A
304
+ - Feature gate
305
+ - Feature gate name: ` CSRDuration ` (enabled by default)
306
+ - Components depending on the feature gate:
307
+ - kube-apiserver
308
+ - kube-controller-manager
304
309
305
310
###### Does enabling the feature change any default behavior?
306
311
@@ -309,14 +314,22 @@ would observe no difference in behavior.
309
314
310
315
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
311
316
312
- N/A
317
+ Yes, via the the ` CSRDuration ` feature gate. Disabling this gate will cause the
318
+ API server to remove the ` spec.expirationSeconds ` field on ` create ` and thus all
319
+ clients would have their requested duration ignored. This is a safe to do as the
320
+ field is optional and represents the status quo.
313
321
314
322
###### What happens if we reenable the feature if it was previously rolled back?
315
323
316
- N/A
324
+ Clients could set ` spec.expirationSeconds ` on newly created CSRs and signers may
325
+ choose to honor them. There are no specific issues caused by repeatedly enabling
326
+ and disabling the feature.
317
327
318
328
###### Are there any tests for feature enablement/disablement?
319
329
330
+ Unit tests will confirm that ` spec.expirationSeconds ` is ignored when the feature
331
+ gate is disabled.
332
+
320
333
Unit and integration tests will cover cases where ` spec.expirationSeconds ` is
321
334
specified and cases where it is left unspecified.
322
335
@@ -326,14 +339,22 @@ specified and cases where it is left unspecified.
326
339
327
340
Since it is optional for signers to honor ` spec.expirationSeconds ` , this design
328
341
is fully tolerant of API server and controller manager rollouts/rollbacks that
329
- fail or get wedged in a partial state. The worse case scenario is that the
330
- ` spec.expirationSeconds ` field is ignored, which mimics the status quo. Clients
331
- must always check the duration of the issued certificate to determine if the
332
- requested ` spec.expirationSeconds ` was honored.
342
+ fail or get wedged in a partial state. The ` spec.expirationSeconds ` field being
343
+ ignored just mimics the status quo. Clients must always check the duration of the
344
+ issued certificate to determine if the requested ` spec.expirationSeconds ` was honored.
345
+
346
+ The worst case scenario is that the Kubernetes controller manager or a critical
347
+ signer encounters a fatal error if this field is set (i.e. a nil pointer exception
348
+ that causes the process to crash). This would cause CSRs to stop being approved
349
+ which would eventually lead to kubelet credentials expiring. Kubelets would no
350
+ longer be able to update pod status and the node controller would mark these
351
+ kubelets as dead. To mitigate the impact of any such scenario, the feature gate
352
+ is included to allow this functionality to be disabled.
333
353
334
354
###### What specific metrics should inform a rollback?
335
355
336
- N/A
356
+ The ` csr_duration_honored ` metric can be used to determine if signers and/or clients
357
+ should be upgraded to handle the ` spec.expirationSeconds ` field.
337
358
338
359
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
339
360
@@ -355,12 +376,20 @@ N/A
355
376
Use ` kubectl ` to determine if CSRs with ` spec.expirationSeconds ` set are being
356
377
created. Audit logging could also be used to determine this.
357
378
379
+ Once a target CSR has been located, check that the issued certificate in
380
+ ` .status.certificate ` has the correct duration. Audit logging could also be
381
+ used to determine this.
382
+
383
+ The ` csr_duration_honored ` can be used to determine if signers are honoring
384
+ durations when explicitly requested by clients.
385
+
358
386
###### How can someone using this feature know that it is working for their instance?
359
387
360
388
- API ` .status `
361
389
- Condition name: ` Approved ` ` = ` ` true `
362
390
- Other field:
363
- Check that the issued certificate in ` .status.certificate ` has the correct duration
391
+ - Check that CSR ` spec.expirationSeconds ` has the correct requested duration
392
+ - Check that the issued certificate in ` .status.certificate ` has the correct duration
364
393
365
394
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
366
395
@@ -369,6 +398,11 @@ API.
369
398
370
399
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
371
400
401
+ - Metrics
402
+ - Metric name: ` csr_duration_honored `
403
+ - Components exposing the metric:
404
+ - kube-apiserver
405
+
372
406
- Details: Check the Kubernetes audit log from CRUD operations on CSRs.
373
407
374
408
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -385,17 +419,20 @@ N/A
385
419
- Impact of its degraded performance or high-error rates on the feature:
386
420
+ Signers may have difficulty issuing certificates
387
421
+ Clients may have to wait a long time for certificates to be issued
422
+ and their credentials could expire which could cause an outage
388
423
- etcd
389
424
- Usage description: stores data for the CSR API
390
425
- Impact of its outage on the feature: CSR API will be unavailable
391
426
- Impact of its degraded performance or high-error rates on the feature:
392
427
+ Signers may have difficulty issuing certificates
393
428
+ Clients may have to wait a long time for certificates to be issued
429
+ and their credentials could expire which could cause an outage
394
430
- Kubernetes controller manager
395
431
- Usage description: hosts the in-tree signer controllers
396
432
- Impact of its outage on the feature: in-tree signers will be unavailable
397
433
- Impact of its degraded performance or high-error rates on the feature:
398
434
+ Clients may have to wait a long time for certificates to be issued
435
+ and their credentials could expire which could cause an outage
399
436
400
437
### Scalability
401
438
@@ -444,7 +481,14 @@ The semantics of the Kubernetes CSR API do not change in this regard.
444
481
445
482
###### What are other known failure modes?
446
483
447
- N/A
484
+ - Signer does not honor requested duration
485
+ - Detection: See ` csr_duration_honored ` metric and ` kubectl ` discussion above.
486
+ - Mitigations: Upgrade signers to honor the new field. Clients could also be
487
+ configured to set a requested duration that is within the signer's policy.
488
+ - Diagnostics: Audit logs could be used to capture full request and response
489
+ data in case the metrics are not sufficient.
490
+ - Testing: This scenario is fully covered by unit and integration tests as
491
+ honoring this field is optional.
448
492
449
493
###### What steps should be taken if SLOs are not being met to determine the problem?
450
494
453
497
## Implementation History
454
498
455
499
- 1.22: 2021-06-17: KEP written
500
+ - 1.22: 2021-06-21: KEP review comments addressed
456
501
457
502
## Drawbacks
458
503
0 commit comments