Skip to content

Commit 2b9bdca

Browse files
committed
csr duration kep: address review comments
Signed-off-by: Monis Khan <[email protected]>
1 parent 5a9b24b commit 2b9bdca

File tree

2 files changed

+79
-29
lines changed

2 files changed

+79
-29
lines changed

keps/sig-auth/2784-csr-duration/README.md

Lines changed: 71 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -89,6 +89,7 @@ that can be used to request a particular duration.
8989
### Non-Goals
9090

9191
- Requiring signers to honor the requested duration
92+
- Configuring existing in-tree consumers of the CSR API (i.e. kubelets) to make use of this feature
9293

9394
## Proposal
9495

@@ -133,8 +134,10 @@ structs found in the `k8s.io/api/certificates/v1` and `k8s.io/api/certificates/v
133134
packages.
134135

135136
A new optional `ExpirationSeconds` field will be added to this struct. The go
136-
doc comment describes the behavior of this field. A `*uint32` is used because
137-
the field is optional, must be positive and must not overflow JSON parsers.
137+
doc comment describes the behavior of this field. A `*int32` is used because
138+
the field is optional and must not overflow JSON parsers (an unsigned type is
139+
avoided as we want to provide a detailed validation error on a negative input
140+
instead of a difficult to understand decoding error).
138141

139142
```go
140143
// CertificateSigningRequestSpec contains the certificate request.
@@ -164,7 +167,7 @@ type CertificateSigningRequestSpec struct {
164167
// The minimum valid value for expirationSeconds is 600, i.e. 10 minutes.
165168
//
166169
// +optional
167-
ExpirationSeconds *uint32 `json:"expirationSeconds,omitempty" protobuf:"varint,8,opt,name=expirationSeconds"`
170+
ExpirationSeconds *int32 `json:"expirationSeconds,omitempty" protobuf:"varint,8,opt,name=expirationSeconds"`
168171

169172
// go doc omitted for brevity
170173
Usages []KeyUsage `json:"usages,omitempty" protobuf:"bytes,5,opt,name=usages"`
@@ -183,12 +186,17 @@ and signing is asynchronous which necessitates a buffer), `10` minutes seems lik
183186
appropriate minimum to prevent accidental DOS against the CSR API. Furthermore,
184187
`10` minutes is a short enough lifetime that revocation is not of concern.
185188

189+
Metrics will be included to show if requested expirations are being extended or
190+
truncated (i.e. is the requested duration being honored by the signer).
191+
186192
### Test Plan
187193

188194
Unit tests covering:
189195

190196
1. Validation logic for minimum duration
191197
2. `pkg/controller/certificates/authority.PermissiveSigningPolicy` updates to handle `expirationSeconds`
198+
3. Metrics
199+
4. CSR REST storage ignores `spec.expirationSeconds` when the feature gate is disabled
192200

193201
Integration test covering:
194202

@@ -200,8 +208,7 @@ Integration test covering:
200208

201209
#### Alpha
202210

203-
- This design will start at the beta phase and the functionality will always be enabled
204-
- There will be no feature gate associated with this design (discussed below)
211+
- This design will start at the beta phase and the functionality will be enabled by default
205212

206213
This design represents a small, optional change to an existing GA API. Thus it
207214
prioritizes rollout speed to allow clients to start using this functionality
@@ -217,7 +224,9 @@ during version skews (discussed below).
217224

218225
#### GA
219226

220-
- Confirm with cert-manager that the new functionality addresses their use case
227+
- Confirm with [cert-manager](https://github.com/jetstack/cert-manager/pull/3646) that the new functionality addresses their use case
228+
- Confirm with [pinniped](https://pinniped.dev) that the new functionality addresses their use case
229+
- Confirm that no other metrics are necessary
221230
- Wait one release after beta to allow bugs to be reported
222231

223232
The existing conformance tests for the certificates API (`test/e2e/auth/certificates.go`)
@@ -237,7 +246,7 @@ could request a duration of a month but the signer could truncate the duration t
237246
its internal maximum of two weeks). Thus this design emphasizes feature rollout
238247
speed to aid in ecosystem adoption instead of data durability. Combined with the
239248
simplicity of implementation and low risk nature of the proposal, the alpha stage
240-
and related feature gate have been omitted from this design.
249+
has been omitted from this design.
241250

242251
Clients that do not set the `spec.expirationSeconds` field will observe no change
243252
in behavior, even during upgrades and downgrades.
@@ -277,7 +286,7 @@ represents the status quo.
277286

278287
In this scenario, the requested `spec.expirationSeconds` may be ignored because
279288
old API servers will silently drop the field on update. This is harmless
280-
represents and the status quo.
289+
and represents the status quo.
281290

282291
The CSR API is resilient to split brain scenarios as unknown fields are silently
283292
dropped and the `spec` fields are immutable after creation [1] [2] [3].
@@ -292,15 +301,11 @@ dropped and the `spec` fields are immutable after creation [1] [2] [3].
292301

293302
###### How can this feature be enabled / disabled in a live cluster?
294303

295-
- Other
296-
- Describe the mechanism:
297-
As written above, this functionality will be enabled by default with no configuration options.
298-
- Will enabling / disabling the feature require downtime of the control
299-
plane?
300-
N/A
301-
- Will enabling / disabling the feature require downtime or reprovisioning
302-
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
303-
N/A
304+
- Feature gate
305+
- Feature gate name: `CSRDuration` (enabled by default)
306+
- Components depending on the feature gate:
307+
- kube-apiserver
308+
- kube-controller-manager
304309

305310
###### Does enabling the feature change any default behavior?
306311

@@ -309,14 +314,22 @@ would observe no difference in behavior.
309314

310315
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
311316

312-
N/A
317+
Yes, via the the `CSRDuration` feature gate. Disabling this gate will cause the
318+
API server to remove the `spec.expirationSeconds` field on `create` and thus all
319+
clients would have their requested duration ignored. This is a safe to do as the
320+
field is optional and represents the status quo.
313321

314322
###### What happens if we reenable the feature if it was previously rolled back?
315323

316-
N/A
324+
Clients could set `spec.expirationSeconds` on newly created CSRs and signers may
325+
choose to honor them. There are no specific issues caused by repeatedly enabling
326+
and disabling the feature.
317327

318328
###### Are there any tests for feature enablement/disablement?
319329

330+
Unit tests will confirm that `spec.expirationSeconds` is ignored when the feature
331+
gate is disabled.
332+
320333
Unit and integration tests will cover cases where `spec.expirationSeconds` is
321334
specified and cases where it is left unspecified.
322335

@@ -326,14 +339,22 @@ specified and cases where it is left unspecified.
326339

327340
Since it is optional for signers to honor `spec.expirationSeconds`, this design
328341
is fully tolerant of API server and controller manager rollouts/rollbacks that
329-
fail or get wedged in a partial state. The worse case scenario is that the
330-
`spec.expirationSeconds` field is ignored, which mimics the status quo. Clients
331-
must always check the duration of the issued certificate to determine if the
332-
requested `spec.expirationSeconds` was honored.
342+
fail or get wedged in a partial state. The `spec.expirationSeconds` field being
343+
ignored just mimics the status quo. Clients must always check the duration of the
344+
issued certificate to determine if the requested `spec.expirationSeconds` was honored.
345+
346+
The worst case scenario is that the Kubernetes controller manager or a critical
347+
signer encounters a fatal error if this field is set (i.e. a nil pointer exception
348+
that causes the process to crash). This would cause CSRs to stop being approved
349+
which would eventually lead to kubelet credentials expiring. Kubelets would no
350+
longer be able to update pod status and the node controller would mark these
351+
kubelets as dead. To mitigate the impact of any such scenario, the feature gate
352+
is included to allow this functionality to be disabled.
333353

334354
###### What specific metrics should inform a rollback?
335355

336-
N/A
356+
The `csr_duration_honored` metric can be used to determine if signers and/or clients
357+
should be upgraded to handle the `spec.expirationSeconds` field.
337358

338359
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
339360

@@ -355,12 +376,20 @@ N/A
355376
Use `kubectl` to determine if CSRs with `spec.expirationSeconds` set are being
356377
created. Audit logging could also be used to determine this.
357378

379+
Once a target CSR has been located, check that the issued certificate in
380+
`.status.certificate` has the correct duration. Audit logging could also be
381+
used to determine this.
382+
383+
The `csr_duration_honored` can be used to determine if signers are honoring
384+
durations when explicitly requested by clients.
385+
358386
###### How can someone using this feature know that it is working for their instance?
359387

360388
- API `.status`
361389
- Condition name: `Approved` `=` `true`
362390
- Other field:
363-
Check that the issued certificate in `.status.certificate` has the correct duration
391+
- Check that CSR `spec.expirationSeconds` has the correct requested duration
392+
- Check that the issued certificate in `.status.certificate` has the correct duration
364393

365394
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
366395

@@ -369,6 +398,11 @@ API.
369398

370399
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
371400

401+
- Metrics
402+
- Metric name: `csr_duration_honored`
403+
- Components exposing the metric:
404+
- kube-apiserver
405+
372406
- Details: Check the Kubernetes audit log from CRUD operations on CSRs.
373407

374408
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -385,17 +419,20 @@ N/A
385419
- Impact of its degraded performance or high-error rates on the feature:
386420
+ Signers may have difficulty issuing certificates
387421
+ Clients may have to wait a long time for certificates to be issued
422+
and their credentials could expire which could cause an outage
388423
- etcd
389424
- Usage description: stores data for the CSR API
390425
- Impact of its outage on the feature: CSR API will be unavailable
391426
- Impact of its degraded performance or high-error rates on the feature:
392427
+ Signers may have difficulty issuing certificates
393428
+ Clients may have to wait a long time for certificates to be issued
429+
and their credentials could expire which could cause an outage
394430
- Kubernetes controller manager
395431
- Usage description: hosts the in-tree signer controllers
396432
- Impact of its outage on the feature: in-tree signers will be unavailable
397433
- Impact of its degraded performance or high-error rates on the feature:
398434
+ Clients may have to wait a long time for certificates to be issued
435+
and their credentials could expire which could cause an outage
399436

400437
### Scalability
401438

@@ -444,7 +481,14 @@ The semantics of the Kubernetes CSR API do not change in this regard.
444481

445482
###### What are other known failure modes?
446483

447-
N/A
484+
- Signer does not honor requested duration
485+
- Detection: See `csr_duration_honored` metric and `kubectl` discussion above.
486+
- Mitigations: Upgrade signers to honor the new field. Clients could also be
487+
configured to set a requested duration that is within the signer's policy.
488+
- Diagnostics: Audit logs could be used to capture full request and response
489+
data in case the metrics are not sufficient.
490+
- Testing: This scenario is fully covered by unit and integration tests as
491+
honoring this field is optional.
448492

449493
###### What steps should be taken if SLOs are not being met to determine the problem?
450494

@@ -453,6 +497,7 @@ N/A
453497
## Implementation History
454498

455499
- 1.22: 2021-06-17: KEP written
500+
- 1.22: 2021-06-21: KEP review comments addressed
456501

457502
## Drawbacks
458503

keps/sig-auth/2784-csr-duration/kep.yaml

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,12 @@ milestone:
2727
stable: "v1.23"
2828

2929

30-
feature-gates: []
31-
disable-supported: false
30+
feature-gates:
31+
- name: CSRDuration
32+
components:
33+
- kube-apiserver
34+
- kube-controller-manager
35+
disable-supported: true
3236

33-
metrics: []
37+
metrics:
38+
- csr_duration_honored

0 commit comments

Comments
 (0)