@@ -38,10 +45,14 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [X] (R) KEP approvers have approved the KEP status as `implementable`
 - [X] (R) Design details are appropriately documented
-- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [X] e2e Tests for all Beta API Operations (endpoints)
+- [X] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [X] (R) Graduation criteria is in place
+- [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [X] (R) Production readiness review completed
-- [X] Production readiness review approved
+- [X] (R) Production readiness review approved
 - [X] "Implementation History" section is up-to-date for milestone
 - [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [X] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -162,10 +173,31 @@ As the community found in the [Metrics Stability Framework KEP](https://github.c

 ### Test Plan

+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
 We will test tracing added by this feature with an integration test. The
 integration test will verify that spans exported by the apiserver match what is
 This feature is upgraded or downgraded with the API Server. It is not otherwise impacted.
+
+### Version Skew Strategy
+
+This feature is not impacted by version skew. API Servers of different versions can each produce traces to provide observability signals independently.
+
 ## Production Readiness Review Questionnaire

 ### Feature Enablement and Rollback

-* **How can this feature be enabled / disabled in a live cluster?**
+###### How can this feature be enabled / disabled in a live cluster?
 - [X] Feature gate (also fill in values in `kep.yaml`)
 - Feature gate name: APIServerTracing
 - Components depending on the feature gate: kube-apiserver
@@ -199,62 +240,58 @@ GA
 - Will enabling / disabling the feature require downtime or reprovisioning
 of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). No.

-* **Does enabling the feature change any default behavior?**
+###### Does enabling the feature change any default behavior?
 No. The feature is disabled unless both the feature gate and `--opentelemetry-config-file` flag are set. When the feature is enabled, it doesn't change behavior from the users' perspective; it only adds tracing telemetry based on API Server requests.

-* **Can the feature be disabled once it has been enabled (i.e. can we roll back
-the enablement)?**
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 Yes.

-* **What happens if we reenable the feature if it was previously rolled back?**
+###### What happens if we reenable the feature if it was previously rolled back?
 It will start sending traces again. This will happen regardless of whether it was disabled by removing the `--opentelemetry-config-file` flag, or by disabling via feature gate.

-* **Are there any tests for feature enablement/disablement?**
+###### Are there any tests for feature enablement/disablement?
 [Integration tests](https://github.com/kubernetes/kubernetes/blob/5426da8f69c1d5fa99814526c1878aeb99b2456e/test/integration/apiserver/tracing/tracing_test.go) exist which enable the feature gate.

 ### Rollout, Upgrade and Rollback Planning

 _This section must be completed when targeting beta graduation to a release._

-* **How can a rollout fail? Can it impact already running workloads?**
+###### How can a rollout fail? Can it impact already running workloads?
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?
 * If APIServer tracing is rolled out with a high sampling rate, it is possible for it to have a performance impact on the API Server, which can have a variety of impacts on the cluster.

-* **What specific metrics should inform a rollback?**
+###### What specific metrics should inform a rollback?

 * API Server [SLOs](https://github.com/kubernetes/community/tree/master/sig-scalability/slos) are the signals that should guide a rollback. In particular, the [`apiserver_request_duration_seconds` and `apiserver_request_slo_duration_seconds`](apiserver_request_slo_duration_seconds) metrics would surface issues resulting in slower API Server responses.

-* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 Manually enabled the feature-gate and tracing, verified the apiserver in my cluster was reachable, and disabled the feature-gate and tracing in a dev cluster.

-* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
-fields of API types, flags, etc.?**
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 No.

 ### Monitoring Requirements

 _This section must be completed when targeting beta graduation to a release._

-* **How can an operator determine if the feature is in use by workloads?**
+###### How can an operator determine if the feature is in use by workloads?
 This is an operator-facing feature. Look for traces to see if tracing is enabled.

-* **What are the SLIs (Service Level Indicators) an operator can use to determine
-the health of the service?**
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 - OpenTelemetry does not currently expose metrics about the number of traces successfully sent: https://github.com/open-telemetry/opentelemetry-go/issues/2547

-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
 N/A

-* **Are there any missing metrics that would be useful to have to improve observability
-of this feature?**
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 N/A

 ### Dependencies

 _This section must be completed when targeting beta graduation to a release._

-* **Does this feature depend on any specific services running in the cluster?**
+###### Does this feature depend on any specific services running in the cluster?
 The feature itself (tracing in the API Server) does not depend on services running in the cluster. However, as with other signals (metrics, logs), collecting traces from the API Server requires a trace collection pipeline, which will differ depending on the cluster. The following is an example, and other OTLP-compatible collection mechanisms may be substituted for it. The impact of outages is likely to be the same, regardless of collection pipeline.

 - [OpenTelemetry Collector (optional)]
@@ -273,31 +310,27 @@ _For beta, this section is required: reviewers must answer these questions._
 _For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field._

-* **Will enabling / using this feature result in any new API calls?**
+###### Will enabling / using this feature result in any new API calls?
 This will not add any additional API calls.

-* **Will enabling / using this feature result in introducing new API types?**
+###### Will enabling / using this feature result in introducing new API types?
 This will introduce an API type for the configuration. This is only for
 loading configuration; users cannot create these objects.

-* **Will enabling / using this feature result in any new calls to the cloud
-provider?**
+###### Will enabling / using this feature result in any new calls to the cloud provider?
 Not directly. Cloud providers could choose to send traces to their managed
 trace backends, but this requires them to set up a telemetry pipeline as
 described above.

-* **Will enabling / using this feature result in increasing size or count of
-the existing API objects?**
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 No.

-* **Will enabling / using this feature result in increasing time taken by any
-operations covered by [existing SLIs/SLOs]?**
+###### Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?
 It will increase API Server request latency by a negligible amount (<1 microsecond)
 for encoding and decoding the trace context from headers, and recording spans
 in memory. Exporting spans is not in the critical path.

-* **Will enabling / using this feature result in non-negligible increase of
-resource usage (CPU, RAM, disk, IO, ...) in any components?**
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 The tracing client library has a small, in-memory cache for outgoing spans. Based on current benchmarks, a full cache could use as much as 5 MB of memory.

 ### Troubleshooting
@@ -308,17 +341,17 @@ details). For now, we leave it here.

 _This section must be completed when targeting beta graduation to a release._

-* **How does this feature react if the API server and/or etcd is unavailable?**
+###### How does this feature react if the API server and/or etcd is unavailable?
 This feature does not have a dependency on the API Server or etcd (it is built into the API Server).

-* **What are other known failure modes?**
+###### What are other known failure modes?
 - [Trace endpoint misconfigured, or unavailable]
 - Detection: No traces processed by trace ingestion pipeline
 - Mitigations: None
 - Diagnostics: API Server logs containing: "traces exporter is disconnected from the server"
 - Testing: The feature will simply not work if misconfigured. It doesn't seem worth verifying.

-* **What steps should be taken if SLOs are not being met to determine the problem?**
+###### What steps should be taken if SLOs are not being met to determine the problem?
@@ -332,6 +365,10 @@ _This section must be completed when targeting beta graduation to a release._
 * KEP scoped down to only API Server traces on 5/1/2020
 * Updated PRR section 2/8/2021

+## Drawbacks
+
+Depending on the chosen sampling rate, tracing can increase CPU and memory usage by a small amount, and can also add a negligible amount of latency to API Server requests, when enabled.