KEP-2831: update kubelet tracing KEP to target GA in 1.33 (kubernetes#5134)
* update kubelet tracing KEP to target GA in 1.33
* update to latest template
* add progress for otel self-observability metrics
* add explicit steps to test upgrade-downgrade-upgrade flow
-[Feature Enablement and Rollback](#feature-enablement-and-rollback)
27
29
-[Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
@@ -47,14 +49,23 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
47
49
-[X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
48
50
-[X] (R) KEP approvers have approved the KEP status as `implementable`
49
51
-[X] (R) Design details are appropriately documented
50
-
-[X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
52
+
-[X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
53
+
-[X] e2e Tests for all Beta API Operations (endpoints)
54
+
-[X] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
55
+
-[X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
51
56
-[X] (R) Graduation criteria is in place
57
+
-[X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
52
58
-[X] (R) Production readiness review completed
53
-
-[X] Production readiness review approved
59
+
-[X] (R) Production readiness review approved
54
60
-[X] "Implementation History" section is up-to-date for milestone
55
61
-[X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
56
62
-[X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
-[X] Feedback from users collected and incorporated over multiple releases
267
+
268
+
### Upgrade / Downgrade Strategy
269
+
270
+
Tracing works if the kubelet version supports the feature; otherwise the kubelet does not export spans. The feature does not affect the ability to upgrade or roll back kubelet versions.
271
+
272
+
### Version Skew Strategy
273
+
274
+
Version skew isn't applicable because this feature only involves the kubelet.
275
+
255
276
## Production Readiness Review Questionnaire
256
277
257
278
### Feature Enablement and Rollback
@@ -319,20 +340,37 @@ _This section must be completed when targeting beta graduation to a release._
319
340
320
341
321
342
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
322
-
Upgrades and rollbacks will be tested while feature-gate is experimental
343
+
344
+
Yes. These were tested on a 1.27 kind cluster by enabling, disabling, and re-enabling the feature-gate on the kubelet.
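
The enable, disable, re-enable cycle described above could be reproduced with a kubelet configuration along these lines. This is a minimal sketch, not the configuration actually used for the test: the endpoint and sampling rate are illustrative assumptions, and `KubeletTracing` is the feature gate for this KEP.

```yaml
# Sketch of a KubeletConfiguration used to toggle kubelet tracing.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Flip to false and restart the kubelet to disable,
  # then back to true to re-enable.
  KubeletTracing: true
tracing:
  # Assumed OTLP gRPC address of a collector running on the node.
  endpoint: localhost:4317
  # Assumed value: sample every span so the test is easy to observe.
  samplingRatePerMillion: 1000000
```

After each toggle, checking whether new kubelet spans reach the collector is one way to confirm the feature state.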
None. Operators can use the absence of traces, which is an observability signal in its own right.
350
386
351
-
##### Are there any missing metrics that would be useful to have to improve observability
352
-
To be determined.
387
+
##### Are there any missing metrics that would be useful to have to improve observability?
353
388
389
+
It would be helpful to have metrics about span generation and export: [opentelemetry-go issue #2547](https://github.com/open-telemetry/opentelemetry-go/issues/2547)
354
390
355
-
### Dependencies
391
+
There is progress on defining and implementing OpenTelemetry trace SDK self-observability metrics:
356
392
357
-
_This section must be completed when targeting beta graduation to a release._
393
+
* Proposal for names: https://github.com/open-telemetry/semantic-conventions/pull/1631
394
+
* Prototype for OpenTelemetry-Go: https://github.com/open-telemetry/opentelemetry-go/pull/6153
395
+
396
+
### Dependencies
358
397
359
398
###### Does this feature depend on any specific services running in the cluster?
360
399
361
400
Yes. In the current version of the proposal, users must run the [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector)
362
-
as a daemonset and configure a backend trace visualization tool (jaeger, zipkin, etc).
363
-
401
+
as a DaemonSet and configure a backend trace visualization tool (Jaeger, Zipkin, etc.). A wide variety of vendors and cloud providers also support OTLP.
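
For illustration, a collector running on each node might use a configuration like the one below to receive OTLP spans from the kubelet and forward them to a backend. This is a hedged sketch: the backend address is hypothetical, and a real deployment would also need TLS and resource settings suited to its environment.

```yaml
# Sketch of an OpenTelemetry Collector config for a per-node collector.
receivers:
  otlp:
    protocols:
      grpc:
        # Listen for OTLP gRPC traffic from the local kubelet.
        endpoint: 0.0.0.0:4317
exporters:
  otlp:
    # Hypothetical OTLP-compatible tracing backend.
    endpoint: tracing-backend.example.com:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```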
364
402
365
403
### Scalability
366
404
367
-
_For alpha, this section is encouraged: reviewers should consider these questions
368
-
and attempt to answer them._
369
-
370
-
_For beta, this section is required: reviewers must answer these questions._
371
-
372
-
_For GA, this section is required: approvers should be able to confirm the
373
-
previous answers based on experience in the field._
374
-
375
405
###### Will enabling / using this feature result in any new API calls?
376
406
377
407
This will not add any additional API calls.
@@ -401,28 +431,30 @@ previous answers based on experience in the field._
401
431
402
432
The tracing client library has a small, in-memory cache for outgoing spans.
403
433
434
+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
435
+
436
+
No.
437
+
404
438
### Troubleshooting
405
439
406
440
The Troubleshooting section currently serves the `Playbook` role. We may consider
407
441
splitting it into a dedicated `Playbook` document (potentially with some monitoring
408
442
details). For now, we leave it here.
409
443
410
-
_This section must be completed when targeting beta graduation to a release._
411
-
412
444
###### How does this feature react if the API server and/or etcd is unavailable?
413
445
414
446
No reaction specific to this feature if API server and/or etcd is unavailable.
415
447
416
448
###### What are other known failure modes?
417
449
418
-
-[The controller is misconfigured and cannot talk to the collector or the collector cannot send traces to the backend]
450
+
- [The kubelet is misconfigured and cannot talk to the collector or the kubelet cannot send traces to the backend]
419
451
- Detection: How can it be detected via metrics? Stated another way:
420
452
how can an operator troubleshoot without logging into a master or worker node?