You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@@ -64,7 +64,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
64
64
## Summary
65
65
66
66
This KEP proposes the new v2alpha1 `KeyManagementService` service contract to:
67
-
- enable fully automated key rotation for the latest key
67
+
- enable partially automated key rotation for the latest key without API server restarts
68
68
- improve KMS plugin health check reliability
69
69
- improve observability of envelop operations between kube-apiserver, KMS plugins and KMS
70
70
@@ -113,22 +113,15 @@ Performance, Health Check, Observability and Rotation:
113
113
- Watch on the `EncryptionConfiguration`
114
114
- When changes are detected, process the `EncryptionConfiguration` resource, and add new transformers and update existing ones atomically.
115
115
- If there is an issue with creating or updating any of the transformers, retain the current configuration in the kube-apiserver and generate an error in logs.
116
-
- Enable fully automated rotation for `latest` key in KMS:
116
+
- Enable partially automated rotation for `latest` key in KMS:
117
117
> NOTE: Prerequisite: `EncryptionConfiguration` is set up to always use the `latest` key version in KMS and the values can be interpreted dynamically at runtime by the KMS plugin to hot reload the current write key. Rotation process sequence:
118
-
- record initial key ID across all API servers (this could be recorded in the `StorageVersionStatus` as a new field)
118
+
- record initial key ID across all API servers
119
119
- cause key rotation in KMS (user action in the remote KMS)
120
-
- observe the change across the stack (wait for convergence of `StorageVersionStatus`)
120
+
- observe the change across the stack using metrics from API server
121
121
- storage migration (run storage migrator)
122
122
123
123
## Design Details
124
124
125
-
<!--
126
-
This section should contain enough information that the specifics of your
127
-
change are understandable. This may include API specs (though not always
128
-
required) or even code snippets. If there's any ambiguity about HOW your
129
-
proposal will be implemented, this is the place to discuss them.
130
-
-->
131
-
132
125
### v2 API
133
126
134
127
`EncryptionConfiguration` will be expanded to support the new v2 API:
@@ -143,7 +136,7 @@ index d7d68d2584d..84c1fa6546f 100644
143
136
+ APIVersion string `json:"apiVersion"`
144
137
```
145
138
146
-
Support key hierarchy in KMS plugin that generates local KEK and add v2alpha1`KeyManagementService` proto service contract in Kubernetes to include `key_id`, `annotations`, and `status`.
139
+
Support key hierarchy in KMS plugin that generates local KEK and add v2`KeyManagementService` proto service contract in Kubernetes to include `key_id`, `annotations`, and `status`.
147
140
148
141
### Key Hierarchy
149
142
@@ -242,34 +235,7 @@ message StatusResponse {
242
235
}
243
236
```
244
237
245
-
The `key_id` may be funneled into the storage version status as another field that API servers can attempt to gain consensus on:
> NOTE: Since the storage version API is still alpha, this KEP will simply aim to make it possible to have automated rotation when that API is enabled and has been updated to include the new fields. The rotation feature will first be scoped to a single API server and will not be part of the graduation criteria for this KEP.
238
+
The `key_id` will be funneled into API server metrics and audit annotations. *TODO: define metrics.*
273
239
274
240
### Observability
275
241
@@ -337,57 +303,96 @@ sequenceDiagram
337
303
338
304
### Test Plan
339
305
340
-
[] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
306
+
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
341
307
342
308
##### Prerequisite testing updates
343
309
344
-
This section is incomplete and will be updated before the beta milestone.
345
-
346
310
##### Unit tests
347
311
348
-
This section is incomplete and will be updated before the beta milestone.
312
+
- Validate key hierarchy behavior in reference implementation
313
+
- Local KEK is generated for encryption
314
+
- Local KEK is rotated after a period of time/number of operations
315
+
- Remote KEK is only called when
316
+
- Cache is empty, so remote KEK is needed to decrypt local KEK
317
+
- Local KEK is generated or rotated
318
+
- When key hierarchy is not used, remote KEK is called for every encryption/decryption
319
+
- Ensure the logs and metrics are generated as expected
320
+
- At least 75% code coverage
321
+
- Staleness check based on keyID in the `StatusResponse`
322
+
- Unit test for gRPC request/response validation
323
+
- Serialize/Deserialize `EncryptedObject`
324
+
- Validate keyID in `StatusResponse` and `EncryptResponse`
325
+
- Validate annotations in `EncryptResponse`
326
+
- Validate logs are generated with the correct `UID` and additional metadata
349
327
350
328
##### Integration tests
351
329
352
-
This section is incomplete and will be updated before the beta milestone.
330
+
- Integration tests to validate
331
+
- Encryption of custom resources and custom resource definitions
332
+
- No-op writes cause rewrite of stale data (data that has correct schema but was encrypted with keyID that is not the latest)
333
+
- Health checks
334
+
- single health check for v2 at `/kms-providers`
335
+
- individual health checks for v1 and v2 with `/kms-provider-0` and `/kms-provider-1`
336
+
- Integration tests with base64 plugin to validate the encryption and decryption of data
337
+
- Integration tests with reference implementation
338
+
- Validate the encryption and decryption of data
339
+
- Validate the functionality of reference implementation
340
+
- Integration tests to check rotation is possible without restarting API server
341
+
- Integration tests that exercise the feature enablement/disablement flow
353
342
354
343
##### e2e tests
355
344
356
-
This section is incomplete and will be updated before the beta milestone.
345
+
With this e2e test suite, we want to do the following:
346
+
1. Run the e2e suite against a kind cluster without kms encryption enabled.
347
+
2. Run the e2e suite against a kind cluster that has kms v2 encryption enabled (as defined below).
348
+
3. Compare `request_duration_seconds`, `request_terminations_total`, `request_aborts_total` API server metrics between the two runs. The acceptable delta should be less than 20%.
349
+
4. Observe metrics from the reference implementation to determine time taken at each step of the encryption/decryption process.
350
+
5. Observe API server startup time with and without kms encryption enabled.
357
351
352
+
- KMSv2 config would use the reference implementation
353
+
- Validate all resources are encrypted
354
+
- The "remote" kms would be a local encryption key
355
+
- that adds 100 ms latency
356
+
- that has rate limiting
357
+
- Key hierarchy would be used
358
358
359
359
### Graduation Criteria
360
360
361
-
Since the storage version API is still alpha, this KEP will simply aim to make it possible to have automated rotation when that API is enabled and has been updated to include the new fields. The rotation feature will first be scoped to a single API server and will not be part of the graduation criteria for this KEP because the storage version API will take time to mature. However, testing of rotation will be part of the graduation criteria to confirm that the right information is being made available to allow for automated rotation when the storage version API graduates.
362
-
363
361
#### Alpha
364
362
365
363
- Feature implemented behind a feature flag
366
364
- Initial unit and integration tests completed and enabled
367
365
368
366
#### Beta
369
367
370
-
TBD
368
+
- Feature is enabled by default
369
+
- All of the above documented tests are complete
370
+
- Precise documentation outlining how to build components that can do automatic rotation
371
+
- For example, how to write a controller that scrapes the API server metrics and uses that to trigger rotation with the storage migration controller.
372
+
- Reference implementation is ready
373
+
- Provide an example implementation using pkcs11
374
+
- Metrics in the implementation to gauge performance
375
+
- Documentation of the metrics
376
+
- Metrics in API server to gauge performance impact
371
377
372
378
#### GA
373
379
374
-
TBD
380
+
- Tracing is added to the reference implementation
381
+
- At least 2 KMSv2 plugin implementations are available
382
+
- We will gather feedback from these implementations to determine if API is sufficient
375
383
376
384
## Production Readiness Review Questionnaire
377
385
378
386
### Feature Enablement and Rollback
379
387
380
388
###### How can this feature be enabled / disabled in a live cluster?
381
389
382
-
<!--
383
-
Pick one of these and delete the rest.
384
-
-->
385
-
386
390
- Feature gate
387
391
- Feature gate name: `KMSv2`
388
392
- Components depending on the feature gate:
389
393
- kube-apiserver
390
394
395
+
**Alpha**
391
396
```go
392
397
FeatureSpec{
393
398
Default: false,
@@ -396,18 +401,34 @@ FeatureSpec{
396
401
}
397
402
```
398
403
404
+
**Beta**
405
+
```go
406
+
FeatureSpec{
407
+
Default: true,
408
+
LockToDefault: false,
409
+
PreRelease: featuregate.Beta,
410
+
}
411
+
```
412
+
399
413
###### Does enabling the feature change any default behavior?
400
414
401
-
No. The v2 API is new in the v1.25 release.
415
+
No. The v2 API is new in the v1.25 release. Furthermore, even with the feature enabled by default, the user needs to explicitly configure a KMSv2 provider to use this.
402
416
403
417
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
404
418
405
419
Yes, To disable encryption at rest using the v2 API:
406
-
1. Disable encryption at rest with KMS provider by running through these [steps](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#disabling-encryption-at-rest)
407
-
1. At the end of this step, all the data in etcd will be unencrypted.
408
-
2. Disable the `KMSv2` feature gate.
420
+
1. Add new `identity` provider at the top of encryption config
421
+
1. Restart kube-apiserver
422
+
1. Run storage migration to migrate all the existing encrypted data to use the `identity` provider
423
+
1. If running `kubectl get <resource> --all-namespaces -o json | kubectl replace -f -` to migrate data, the user can confirm that the migration is complete by observing the kube-apiserver metrics `apiserver_envelope_encryption_key_id_hash_last_timestamp_seconds` and `apiserver_envelope_encryption_key_id_hash_total`. These metrics will no longer contain the keyID hash of the old KEK after storage migration and kube-apiserver restart.
424
+
2. If running storage version migrator to migrate data, the user can confirm that the migration is complete by observing the conditions in `storageversionmigrations`. Refer to [doc](https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/master/USER_GUIDE.md#check-if-migration-has-completed) for more details. Using the storage version migrator is recommended.
425
+
1. Remove the KMS provider from the encryption config and restart kube-apiserver
426
+
2. At the end of these steps, all the data in etcd will be unencrypted.
427
+
428
+
More details are available [here](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#disabling-encryption-at-rest)
409
429
410
430
Disabling this gate without first doing a storage migration to use a different encryption at rest mechanism will result in data loss.
431
+
- For secrets that are mounted in pods, if the DEK used to encrypt the secret is not present in the kube-apiserver cache, the pods will fail to start as the secret will not be able to be decrypted.
411
432
412
433
Once the feature gate is disabled, if the plan is to use a different encryption at rest mechanism instead of KMS, then unset the `--encryption-provider-config` flag on the kube-apiserver.
413
434
@@ -419,72 +440,37 @@ After the feature is reenabled, if a v2 KMS provider is still configured in the
419
440
420
441
###### Are there any tests for feature enablement/disablement?
421
442
422
-
<!--
423
-
The e2e framework does not currently support enabling or disabling feature
424
-
gates. However, unit tests in each component dealing with managing data, created
425
-
with and without the feature, are necessary. At the very least, think about
426
-
conversion tests if API types are being modified.
427
-
428
-
Additionally, for features that are introducing a new API field, unit tests that
429
-
are exercising the `switch` of feature gate itself (what happens if I disable a
430
-
feature gate after having objects written with the new field) are also critical.
431
-
You can take a look at one potential example of such test in:
- We will add integration tests to validate the enablement/disablement flow.
435
444
- When the feature is disabled, data stored in etcd will no longer be encrypted using the external kms provider with v2 API.
436
445
- If the feature is disabled incorrectly (i.e without performing a storage migration), existing data that is encrypted with the external kms provider will be unable to be decrypted. This will cause list and get operations to fail for the resources that were encrypted.
437
446
438
447
### Rollout, Upgrade and Rollback Planning
439
448
440
-
<!--
441
-
This section must be completed when targeting beta to a release.
442
-
-->
443
-
444
-
This section is incomplete and will be updated before the beta milestone.
445
-
446
449
###### How can a rollout or rollback fail? Can it impact already running workloads?
447
450
448
-
<!--
449
-
Try to be as paranoid as possible - e.g., what if some components will restart
450
-
mid-rollout?
451
-
452
-
Be sure to consider highly-available clusters, where, for example,
453
-
feature flags will be enabled on some API servers and not others during the
454
-
rollout. Similarly, consider large clusters and how enablement/disablement
455
-
will rollout across nodes.
456
-
-->
457
-
458
451
- If a rollback of the feature is done without first doing a storage migration to use a different encryption at rest mechanism will result in data loss.
459
452
- Workloads relying on existing data in etcd will no longer be able to access it.
460
453
- The data can be retrieved by reenabling the feature gate or deleting and recreating the data.
454
+
- The rollout of the feature can fail if there are too many calls to the external kms provider.
455
+
- API server will not report healthy.
456
+
- For highly-available clusters, the feature can be enabled on some API servers only for read purpose.
457
+
- For rollout, add KMSv2 providers as read across all API servers first before adding the provider for write.
458
+
- For rollback, move KMSv2 providers from write to read position across all API servers.
461
459
462
460
###### What specific metrics should inform a rollback?
463
461
464
-
<!--
465
-
What signals should users be paying attention to when the feature is young
466
-
that might indicate a serious problem?
467
-
-->
468
-
469
-
This section is incomplete and will be updated before the beta milestone.
- After rollback is complete, you should no longer see the keyID metric `apiserver_envelope_encryption_key_id_hash_total` increment.
470
465
471
466
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
472
467
473
-
<!--
474
-
Describe manual testing that was done and the outcomes.
475
-
Longer term, we may want to require automated upgrade/rollback tests, but we
476
-
are missing a bunch of machinery and tooling and can't do that now.
477
-
-->
478
-
479
-
This section is incomplete and will be updated before the beta milestone.
468
+
This will be covered by integration tests.
480
469
481
470
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
482
471
483
-
<!--
484
-
Even if applying deprecation policies, they may still surprise some users.
485
-
-->
486
-
487
-
N/A
472
+
- The `cacheSize` field in `EncryptionConfiguration` is no longer valid for KMS v2.
473
+
- When KMSv2 is used without KMSv1 provider, the health endpoints don't individually identify for each KMS provider.
488
474
489
475
### Monitoring Requirements
490
476
@@ -493,7 +479,8 @@ N/A
493
479
-[x] Other (treat as last resort)
494
480
- Details:
495
481
- Logs in kube-apiserver, kms-plugin and KMS will be logged with the corresponding `key_id`, `annotations`, and `UID`.
496
-
- count of encryption/decryption requests by resource and version
482
+
- Number of times a keyID is used for encryption/decryption
483
+
- Metric recording the last time in seconds when a keyID was returned in the `StatusResponse` e.g. `apiserver_envelope_encryption_key_id_hash_status_last_timestamp_seconds{key_id_hash="sha256", provider_name="providerName"} 1.674865558833728e+09`
497
484
498
485
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
499
486
@@ -510,7 +497,7 @@ There should be no impact on the SLO with this change.
510
497
511
498
###### Does this feature depend on any specific services running in the cluster?
512
499
513
-
No.
500
+
This feature requires the KMS plugin to be running.
514
501
515
502
### Scalability
516
503
@@ -552,29 +539,16 @@ No.
552
539
553
540
## Implementation History
554
541
555
-
<!--
556
-
Major milestones in the lifecycle of a KEP should be tracked in this section.
557
-
Major milestones might include:
558
-
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
559
-
- the `Proposal` section being merged, signaling agreement on a proposed design
560
-
- the date implementation started
561
-
- the first Kubernetes release where an initial version of the KEP was available
562
-
- the version of Kubernetes where the KEP graduated to general availability
What other approaches did you consider, and why did you rule them out? These do
570
-
not need to be as detailed as the proposal, but should include enough
571
-
information to express the idea and why it was not acceptable.
572
-
-->
573
547
**Performance and rotation:**
574
548
575
549
We considered the follow approaches and each has its own drawbacks:
576
550
1.`cacheSize` field in `EncryptionConfiguration`. It is used by the API server to initialize a LRU cache of the given size with the encrypted ciphertext used as index. Having a higher value for the `cacheSize` will prevent calls to the plugin for decryption operations. However, this does not solve the issue with the number of calls to KMS plugin when encryption traffic is bursty.
577
-
2. Reduce the number of trips to KMS by caching DEKs by allowing one DEK to be used to encrypt multiple objects within the configured TTL period. One issue with this approach is it will be very hard to inform the API server to rotate the DEKs when a KEK has been rotated.
551
+
1. Reduce the number of trips to KMS by caching DEKs by allowing one DEK to be used to encrypt multiple objects within the configured TTL period. One issue with this approach is it will be very hard to inform the API server to rotate the DEKs when a KEK has been rotated.
578
552
579
553
**Observability**:
580
554
@@ -586,9 +560,5 @@ We considered using the `AuditID` from the kube-apiserver request that generated
586
560
587
561
## Infrastructure Needed
588
562
589
-
<!--
590
-
Use this section if you need things from the project/SIG. Examples include a
591
-
new subproject, repos requested, or GitHub details. Listing these here allows a
592
-
SIG to get the process for these resources started right away.
593
-
-->
594
563
We need a new git repo for the KMS plugin reference implementation. It will need to be synced from the k/k staging dir.
0 commit comments