Skip to content

Commit b1fdf4f

Browse files
authored
Merge pull request kubernetes#3804 from aramase/aramase/3299/beta
KEP-3299: Update kmsv2 kep for beta
2 parents 7f1eec4 + 02dbdd3 commit b1fdf4f

File tree

3 files changed

+102
-130
lines changed

3 files changed

+102
-130
lines changed
Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
kep-number: 3299
22
alpha:
33
approver: "@deads2k"
4+
beta:
5+
approver: "@deads2k"

keps/sig-auth/3299-kms-v2-improvements/README.md

Lines changed: 98 additions & 128 deletions
Original file line numberDiff line numberDiff line change
@@ -64,7 +64,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
6464
## Summary
6565

6666
This KEP proposes the new v2alpha1 `KeyManagementService` service contract to:
67-
- enable fully automated key rotation for the latest key
67+
- enable partially automated key rotation for the latest key without API server restarts
6868
- improve KMS plugin health check reliability
6969
- improve observability of envelop operations between kube-apiserver, KMS plugins and KMS
7070

@@ -113,22 +113,15 @@ Performance, Health Check, Observability and Rotation:
113113
- Watch on the `EncryptionConfiguration`
114114
- When changes are detected, process the `EncryptionConfiguration` resource, and add new transformers and update existing ones atomically.
115115
- If there is an issue with creating or updating any of the transformers, retain the current configuration in the kube-apiserver and generate an error in logs.
116-
- Enable fully automated rotation for `latest` key in KMS:
116+
- Enable partially automated rotation for `latest` key in KMS:
117117
> NOTE: Prerequisite: `EncryptionConfiguration` is set up to always use the `latest` key version in KMS and the values can be interpreted dynamically at runtime by the KMS plugin to hot reload the current write key. Rotation process sequence:
118-
- record initial key ID across all API servers (this could be recorded in the `StorageVersionStatus` as a new field)
118+
- record initial key ID across all API servers
119119
- cause key rotation in KMS (user action in the remote KMS)
120-
- observe the change across the stack (wait for convergence of `StorageVersionStatus`)
120+
- observe the change across the stack using metrics from API server
121121
- storage migration (run storage migrator)
122122

123123
## Design Details
124124

125-
<!--
126-
This section should contain enough information that the specifics of your
127-
change are understandable. This may include API specs (though not always
128-
required) or even code snippets. If there's any ambiguity about HOW your
129-
proposal will be implemented, this is the place to discuss them.
130-
-->
131-
132125
### v2 API
133126

134127
`EncryptionConfiguration` will be expanded to support the new v2 API:
@@ -143,7 +136,7 @@ index d7d68d2584d..84c1fa6546f 100644
143136
+ APIVersion string `json:"apiVersion"`
144137
```
145138

146-
Support key hierarchy in KMS plugin that generates local KEK and add v2alpha1 `KeyManagementService` proto service contract in Kubernetes to include `key_id`, `annotations`, and `status`.
139+
Support key hierarchy in KMS plugin that generates local KEK and add v2 `KeyManagementService` proto service contract in Kubernetes to include `key_id`, `annotations`, and `status`.
147140

148141
### Key Hierarchy
149142

@@ -242,34 +235,7 @@ message StatusResponse {
242235
}
243236
```
244237

245-
The `key_id` may be funneled into the storage version status as another field that API servers can attempt to gain consensus on:
246-
247-
```diff
248-
diff --git a/staging/src/k8s.io/api/apiserverinternal/v1alpha1/types.go b/staging/src/k8s.io/api/apiserverinternal/v1alpha1/types.go
249-
index bfa249e135c..e671fe599a9 100644
250-
--- a/staging/src/k8s.io/api/apiserverinternal/v1alpha1/types.go
251-
+++ b/staging/src/k8s.io/api/apiserverinternal/v1alpha1/types.go
252-
@@ -56,6 +56,8 @@ type StorageVersionStatus struct {
253-
// +optional
254-
CommonEncodingVersion *string `json:"commonEncodingVersion,omitempty" protobuf:"bytes,2,opt,name=commonEncodingVersion"`
255-
256-
+ CommonKeyID *string `json:"commonKeyID,omitempty" protobuf:"bytes,4,opt,name=commonKeyID"`
257-
+
258-
// The latest available observations of the storageVersion's state.
259-
// +optional
260-
// +listType=map
261-
@@ -77,6 +79,8 @@ type ServerStorageVersion struct {
262-
// The encodingVersion must be included in the decodableVersions.
263-
// +listType=set
264-
DecodableVersions []string `json:"decodableVersions,omitempty" protobuf:"bytes,3,opt,name=decodableVersions"`
265-
+
266-
+ KeyID *string `json:"keyID,omitempty" protobuf:"bytes,4,opt,name=keyID"`
267-
}
268-
269-
type StorageVersionConditionType string
270-
```
271-
272-
> NOTE: Since the storage version API is still alpha, this KEP will simply aim to make it possible to have automated rotation when that API is enabled and has been updated to include the new fields. The rotation feature will first be scoped to a single API server and will not be part of the graduation criteria for this KEP.
238+
The `key_id` will be funneled into API server metrics and audit annotations. *TODO: define metrics.*
273239

274240
### Observability
275241

@@ -337,57 +303,96 @@ sequenceDiagram
337303

338304
### Test Plan
339305

340-
[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
306+
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
341307

342308
##### Prerequisite testing updates
343309

344-
This section is incomplete and will be updated before the beta milestone.
345-
346310
##### Unit tests
347311

348-
This section is incomplete and will be updated before the beta milestone.
312+
- Validate key hierarchy behavior in reference implementation
313+
- Local KEK is generated for encryption
314+
- Local KEK is rotated after a period of time/number of operations
315+
- Remote KEK is only called when
316+
- Cache is empty, so remote KEK is needed to decrypt local KEK
317+
- Local KEK is generated or rotated
318+
- When key hierarchy is not used, remote KEK is called for every encryption/decryption
319+
- Ensure the logs and metrics are generated as expected
320+
- At least 75% code coverage
321+
- Staleness check based on keyID in the `StatusResponse`
322+
- Unit test for gRPC request/response validation
323+
- Serialize/Deserialize `EncryptedObject`
324+
- Validate keyID in `StatusResponse` and `EncryptResponse`
325+
- Validate annotations in `EncryptResponse`
326+
- Validate logs are generated with the correct `UID` and additional metadata
349327

350328
##### Integration tests
351329

352-
This section is incomplete and will be updated before the beta milestone.
330+
- Integration tests to validate
331+
- Encryption of custom resources and custom resource definitions
332+
- No-op writes cause rewrite of stale data (data that has correct schema but was encrypted with keyID that is not the latest)
333+
- Health checks
334+
- single health check for v2 at `/kms-providers`
335+
- individual health checks for v1 and v2 with `/kms-provider-0` and `/kms-provider-1`
336+
- Integration tests with base64 plugin to validate the encryption and decryption of data
337+
- Integration tests with reference implementation
338+
- Validate the encryption and decryption of data
339+
- Validate the functionality of reference implementation
340+
- Integration tests to check rotation is possible without restarting API server
341+
- Integration tests that exercise the feature enablement/disablement flow
353342

354343
##### e2e tests
355344

356-
This section is incomplete and will be updated before the beta milestone.
345+
With this e2e test suite, we want to do the following:
346+
1. Run the e2e suite against a kind cluster without kms encryption enabled.
347+
2. Run the e2e suite against a kind cluster that has kms v2 encryption enabled (as defined below).
348+
3. Compare `request_duration_seconds`, `request_terminations_total`, `request_aborts_total` API server metrics between the two runs. The acceptable delta should be less than 20%.
349+
4. Observe metrics from the reference implementation to determine time taken at each step of the encryption/decryption process.
350+
5. Observe API server startup time with and without kms encryption enabled.
357351

352+
- KMSv2 config would use the reference implementation
353+
- Validate all resources are encrypted
354+
- The "remote" kms would be a local encryption key
355+
- that adds 100 ms latency
356+
- that has rate limiting
357+
- Key hierarchy would be used
358358

359359
### Graduation Criteria
360360

361-
Since the storage version API is still alpha, this KEP will simply aim to make it possible to have automated rotation when that API is enabled and has been updated to include the new fields. The rotation feature will first be scoped to a single API server and will not be part of the graduation criteria for this KEP because the storage version API will take time to mature. However, testing of rotation will be part of the graduation criteria to confirm that the right information is being made available to allow for automated rotation when the storage version API graduates.
362-
363361
#### Alpha
364362

365363
- Feature implemented behind a feature flag
366364
- Initial unit and integration tests completed and enabled
367365

368366
#### Beta
369367

370-
TBD
368+
- Feature is enabled by default
369+
- All of the above documented tests are complete
370+
- Precise documentation outlining how to build components that can do automatic rotation
371+
- For example, how to write a controller that scrapes the API server metrics and uses that to trigger rotation with the storage migration controller.
372+
- Reference implementation is ready
373+
- Provide an example implementation using pkcs11
374+
- Metrics in the implementation to gauge performance
375+
- Documentation of the metrics
376+
- Metrics in API server to gauge performance impact
371377

372378
#### GA
373379

374-
TBD
380+
- Tracing is added to the reference implementation
381+
- At least 2 KMSv2 plugin implementations are available
382+
- We will gather feedback from these implementations to determine if API is sufficient
375383

376384
## Production Readiness Review Questionnaire
377385

378386
### Feature Enablement and Rollback
379387

380388
###### How can this feature be enabled / disabled in a live cluster?
381389

382-
<!--
383-
Pick one of these and delete the rest.
384-
-->
385-
386390
- Feature gate
387391
- Feature gate name: `KMSv2`
388392
- Components depending on the feature gate:
389393
- kube-apiserver
390394

395+
**Alpha**
391396
```go
392397
FeatureSpec{
393398
Default: false,
@@ -396,18 +401,34 @@ FeatureSpec{
396401
}
397402
```
398403

404+
**Beta**
405+
```go
406+
FeatureSpec{
407+
Default: true,
408+
LockToDefault: false,
409+
PreRelease: featuregate.Beta,
410+
}
411+
```
412+
399413
###### Does enabling the feature change any default behavior?
400414

401-
No. The v2 API is new in the v1.25 release.
415+
No. The v2 API is new in the v1.25 release. Furthermore, even with the feature enabled by default, the user needs to explicitly configure a KMSv2 provider to use this.
402416

403417
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
404418

405419
Yes, To disable encryption at rest using the v2 API:
406-
1. Disable encryption at rest with KMS provider by running through these [steps](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#disabling-encryption-at-rest)
407-
1. At the end of this step, all the data in etcd will be unencrypted.
408-
2. Disable the `KMSv2` feature gate.
420+
1. Add new `identity` provider at the top of encryption config
421+
1. Restart kube-apiserver
422+
1. Run storage migration to migrate all the existing encrypted data to use the `identity` provider
423+
1. If running `kubectl get <resource> --all-namespaces -o json | kubectl replace -f -` to migrate data, the user can confirm that the migration is complete by observing the kube-apiserver metrics `apiserver_envelope_encryption_key_id_hash_last_timestamp_seconds` and `apiserver_envelope_encryption_key_id_hash_total`. These metrics will no longer contain the keyID hash of the old KEK after storage migration and kube-apiserver restart.
424+
2. If running storage version migrator to migrate data, the user can confirm that the migration is complete by observing the conditions in `storageversionmigrations`. Refer to [doc](https://github.com/kubernetes-sigs/kube-storage-version-migrator/blob/master/USER_GUIDE.md#check-if-migration-has-completed) for more details. Using the storage version migrator is recommended.
425+
1. Remove the KMS provider from the encryption config and restart kube-apiserver
426+
2. At the end of these steps, all the data in etcd will be unencrypted.
427+
428+
More details are available [here](https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#disabling-encryption-at-rest)
409429

410430
Disabling this gate without first doing a storage migration to use a different encryption at rest mechanism will result in data loss.
431+
- For secrets that are mounted in pods, if the DEK used to encrypt the secret is not present in the kube-apiserver cache, the pods will fail to start as the secret will not be able to be decrypted.
411432

412433
Once the feature gate is disabled, if the plan is to use a different encryption at rest mechanism instead of KMS, then unset the `--encryption-provider-config` flag on the kube-apiserver.
413434

@@ -419,72 +440,37 @@ After the feature is reenabled, if a v2 KMS provider is still configured in the
419440

420441
###### Are there any tests for feature enablement/disablement?
421442

422-
<!--
423-
The e2e framework does not currently support enabling or disabling feature
424-
gates. However, unit tests in each component dealing with managing data, created
425-
with and without the feature, are necessary. At the very least, think about
426-
conversion tests if API types are being modified.
427-
428-
Additionally, for features that are introducing a new API field, unit tests that
429-
are exercising the `switch` of feature gate itself (what happens if I disable a
430-
feature gate after having objects written with the new field) are also critical.
431-
You can take a look at one potential example of such test in:
432-
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
433-
-->
434-
443+
- We will add integration tests to validate the enablement/disablement flow.
435444
- When the feature is disabled, data stored in etcd will no longer be encrypted using the external kms provider with v2 API.
436445
- If the feature is disabled incorrectly (i.e without performing a storage migration), existing data that is encrypted with the external kms provider will be unable to be decrypted. This will cause list and get operations to fail for the resources that were encrypted.
437446

438447
### Rollout, Upgrade and Rollback Planning
439448

440-
<!--
441-
This section must be completed when targeting beta to a release.
442-
-->
443-
444-
This section is incomplete and will be updated before the beta milestone.
445-
446449
###### How can a rollout or rollback fail? Can it impact already running workloads?
447450

448-
<!--
449-
Try to be as paranoid as possible - e.g., what if some components will restart
450-
mid-rollout?
451-
452-
Be sure to consider highly-available clusters, where, for example,
453-
feature flags will be enabled on some API servers and not others during the
454-
rollout. Similarly, consider large clusters and how enablement/disablement
455-
will rollout across nodes.
456-
-->
457-
458451
- If a rollback of the feature is done without first doing a storage migration to use a different encryption at rest mechanism will result in data loss.
459452
- Workloads relying on existing data in etcd will no longer be able to access it.
460453
- The data can be retrieved by reenabling the feature gate or deleting and recreating the data.
454+
- The rollout of the feature can fail if there are too many calls to the external kms provider.
455+
- API server will not report healthy.
456+
- For highly-available clusters, the feature can be enabled on some API servers only for read purpose.
457+
- For rollout, add KMSv2 providers as read across all API servers first before adding the provider for write.
458+
- For rollback, move KMSv2 providers from write to read position across all API servers.
461459

462460
###### What specific metrics should inform a rollback?
463461

464-
<!--
465-
What signals should users be paying attention to when the feature is young
466-
that might indicate a serious problem?
467-
-->
468-
469-
This section is incomplete and will be updated before the beta milestone.
462+
- Latency metrics `transformation_duration_seconds`
463+
- Transformation error count metric `apiserver_storage_transformation_duration_seconds_bucket{transformation_type="from_storage", transformer_prefix="k8s:enc:kms:v2:"}`
464+
- After rollback is complete, you should no longer see the keyID metric `apiserver_envelope_encryption_key_id_hash_total` increment.
470465

471466
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
472467

473-
<!--
474-
Describe manual testing that was done and the outcomes.
475-
Longer term, we may want to require automated upgrade/rollback tests, but we
476-
are missing a bunch of machinery and tooling and can't do that now.
477-
-->
478-
479-
This section is incomplete and will be updated before the beta milestone.
468+
This will be covered by integration tests.
480469

481470
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
482471

483-
<!--
484-
Even if applying deprecation policies, they may still surprise some users.
485-
-->
486-
487-
N/A
472+
- The `cacheSize` field in `EncryptionConfiguration` is no longer valid for KMS v2.
473+
- When KMSv2 is used without KMSv1 provider, the health endpoints don't individually identify for each KMS provider.
488474

489475
### Monitoring Requirements
490476

@@ -493,7 +479,8 @@ N/A
493479
- [x] Other (treat as last resort)
494480
- Details:
495481
- Logs in kube-apiserver, kms-plugin and KMS will be logged with the corresponding `key_id`, `annotations`, and `UID`.
496-
- count of encryption/decryption requests by resource and version
482+
- Number of times a keyID is used for encryption/decryption
483+
- Metric recording the last time in seconds when a keyID was returned in the `StatusResponse` e.g. `apiserver_envelope_encryption_key_id_hash_status_last_timestamp_seconds{key_id_hash="sha256", provider_name="providerName"} 1.674865558833728e+09`
497484

498485
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
499486

@@ -510,7 +497,7 @@ There should be no impact on the SLO with this change.
510497

511498
###### Does this feature depend on any specific services running in the cluster?
512499

513-
No.
500+
This feature requires the KMS plugin to be running.
514501

515502
### Scalability
516503

@@ -552,29 +539,16 @@ No.
552539

553540
## Implementation History
554541

555-
<!--
556-
Major milestones in the lifecycle of a KEP should be tracked in this section.
557-
Major milestones might include:
558-
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
559-
- the `Proposal` section being merged, signaling agreement on a proposed design
560-
- the date implementation started
561-
- the first Kubernetes release where an initial version of the KEP was available
562-
- the version of Kubernetes where the KEP graduated to general availability
563-
- when the KEP was retired or superseded
564-
-->
542+
- 2022-05-09: Initial KEP draft submitted.
543+
- 2022-09-09: KMSv2 Alpha (v1.25) implemented https://github.com/kubernetes/kubernetes/pull/111126
565544

566545
## Alternatives
567546

568-
<!--
569-
What other approaches did you consider, and why did you rule them out? These do
570-
not need to be as detailed as the proposal, but should include enough
571-
information to express the idea and why it was not acceptable.
572-
-->
573547
**Performance and rotation:**
574548

575549
We considered the follow approaches and each has its own drawbacks:
576550
1. `cacheSize` field in `EncryptionConfiguration`. It is used by the API server to initialize a LRU cache of the given size with the encrypted ciphertext used as index. Having a higher value for the `cacheSize` will prevent calls to the plugin for decryption operations. However, this does not solve the issue with the number of calls to KMS plugin when encryption traffic is bursty.
577-
2. Reduce the number of trips to KMS by caching DEKs by allowing one DEK to be used to encrypt multiple objects within the configured TTL period. One issue with this approach is it will be very hard to inform the API server to rotate the DEKs when a KEK has been rotated.
551+
1. Reduce the number of trips to KMS by caching DEKs by allowing one DEK to be used to encrypt multiple objects within the configured TTL period. One issue with this approach is it will be very hard to inform the API server to rotate the DEKs when a KEK has been rotated.
578552

579553
**Observability**:
580554

@@ -586,9 +560,5 @@ We considered using the `AuditID` from the kube-apiserver request that generated
586560

587561
## Infrastructure Needed
588562

589-
<!--
590-
Use this section if you need things from the project/SIG. Examples include a
591-
new subproject, repos requested, or GitHub details. Listing these here allows a
592-
SIG to get the process for these resources started right away.
593-
-->
594563
We need a new git repo for the KMS plugin reference implementation. It will need to be synced from the k/k staging dir.
564+
- repo created: https://github.com/kubernetes/kms

0 commit comments

Comments
 (0)