Commit 6b652d9

update API Server Tracing to latest KEP template
1 parent 1241f79 commit 6b652d9

2 files changed

+71
-34
lines changed

keps/sig-instrumentation/647-apiserver-tracing/README.md

Lines changed: 70 additions & 33 deletions
@@ -17,7 +17,13 @@
1717
- [APIServer Configuration and EgressSelectors](#apiserver-configuration-and-egressselectors)
1818
- [Controlling use of the OpenTelemetry library](#controlling-use-of-the-opentelemetry-library)
1919
- [Test Plan](#test-plan)
20+
- [Prerequisite testing updates](#prerequisite-testing-updates)
21+
- [Unit tests](#unit-tests)
22+
- [Integration tests](#integration-tests)
23+
- [e2e tests](#e2e-tests)
2024
- [Graduation requirements](#graduation-requirements)
25+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
26+
- [Version Skew Strategy](#version-skew-strategy)
2127
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
2228
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
2329
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
@@ -26,6 +32,7 @@
2632
- [Scalability](#scalability)
2733
- [Troubleshooting](#troubleshooting)
2834
- [Implementation History](#implementation-history)
35+
- [Drawbacks](#drawbacks)
2936
- [Alternatives considered](#alternatives-considered)
3037
- [Introducing a new EgressSelector type](#introducing-a-new-egressselector-type)
3138
- [Other OpenTelemetry Exporters](#other-opentelemetry-exporters)
@@ -38,10 +45,14 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
3845
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
3946
- [X] (R) KEP approvers have approved the KEP status as `implementable`
4047
- [X] (R) Design details are appropriately documented
41-
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
48+
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
49+
- [X] e2e Tests for all Beta API Operations (endpoints)
50+
- [X] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
51+
- [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
4252
- [X] (R) Graduation criteria is in place
53+
- [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
4354
- [X] (R) Production readiness review completed
44-
- [X] Production readiness review approved
55+
- [X] (R) Production readiness review approved
4556
- [X] "Implementation History" section is up-to-date for milestone
4657
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
4758
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
@@ -162,10 +173,31 @@ As the community found in the [Metrics Stability Framework KEP](https://github.c
162173

163174
### Test Plan
164175

176+
[X] I/we understand the owners of the involved components may require updates to
177+
existing tests to make this code solid enough prior to committing the changes necessary
178+
to implement this enhancement.
179+
165180
We will test tracing added by this feature with an integration test. The
166181
integration test will verify that spans exported by the apiserver match what is
167182
expected from the request.
168183

184+
##### Prerequisite testing updates
185+
186+
None.
187+
188+
##### Unit tests
189+
190+
- `staging/src/k8s.io/apiserver/pkg/server/options/tracing_test.go`: `10/10/2021`
191+
- `staging/src/k8s.io/component-base/tracing/api/v1/config_test.go`: `10/10/2021`
192+
193+
##### Integration tests
194+
195+
- `test/integration/apiserver/tracing/tracing_test.go`
196+
197+
##### e2e tests
198+
199+
Not Required.
200+
169201
## Graduation requirements
170202

171203
Alpha
@@ -184,11 +216,20 @@ Beta
184216

185217
GA
186218

219+
220+
### Upgrade / Downgrade Strategy
221+
222+
This feature is upgraded or downgraded with the API Server. It is not otherwise impacted.
223+
224+
### Version Skew Strategy
225+
226+
This feature is not impacted by version skew. API Servers of different versions can each produce traces to provide observability signals independently.
227+
187228
## Production Readiness Review Questionnaire
188229

189230
### Feature Enablement and Rollback
190231

191-
* **How can this feature be enabled / disabled in a live cluster?**
232+
###### How can this feature be enabled / disabled in a live cluster?
192233
- [X] Feature gate (also fill in values in `kep.yaml`)
193234
- Feature gate name: APIServerTracing
194235
- Components depending on the feature gate: kube-apiserver
@@ -199,62 +240,58 @@ GA
199240
- Will enabling / disabling the feature require downtime or reprovisioning
200241
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). No.
201242

202-
* **Does enabling the feature change any default behavior?**
243+
###### Does enabling the feature change any default behavior?
203244
No. The feature is disabled unless both the feature gate and `--opentelemetry-config-file` flag are set. When the feature is enabled, it doesn't change behavior from the users' perspective; it only adds tracing telemetry based on API Server requests.
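
The enablement rule above can be sketched in Go as follows. This is a minimal illustration, not the kube-apiserver implementation: `tracingEnabled` and its parameters are hypothetical names standing in for the real feature-gate and flag plumbing, and the file path is a placeholder.

```go
package main

import "fmt"

// tracingEnabled is a hypothetical helper that mirrors the rule stated above:
// tracing is active only when the APIServerTracing feature gate is on AND a
// config file path was passed via --opentelemetry-config-file.
func tracingEnabled(featureGateOn bool, configFile string) bool {
	return featureGateOn && configFile != ""
}

func main() {
	fmt.Println(tracingEnabled(true, ""))                   // gate only: false
	fmt.Println(tracingEnabled(false, "/etc/tracing.yaml")) // flag only: false
	fmt.Println(tracingEnabled(true, "/etc/tracing.yaml"))  // both set: true
}
```

Because either condition alone leaves tracing off, removing the flag or turning off the gate are both valid ways to disable the feature.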
204245

205-
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
206-
the enablement)?**
246+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
207247
Yes.
208248

209-
* **What happens if we reenable the feature if it was previously rolled back?**
249+
###### What happens if we reenable the feature if it was previously rolled back?
210250
It will start sending traces again. This will happen regardless of whether it was disabled by removing the `--opentelemetry-config-file` flag, or by disabling via feature gate.
211251

212-
* **Are there any tests for feature enablement/disablement?**
252+
###### Are there any tests for feature enablement/disablement?
213253
[Integration tests](https://github.com/kubernetes/kubernetes/blob/5426da8f69c1d5fa99814526c1878aeb99b2456e/test/integration/apiserver/tracing/tracing_test.go) exist which enable the feature gate.
214254

215255
### Rollout, Upgrade and Rollback Planning
216256

217257
_This section must be completed when targeting beta graduation to a release._
218258

219-
* **How can a rollout fail? Can it impact already running workloads?**
259+
###### How can a rollout fail? Can it impact already running workloads?
220260
Try to be as paranoid as possible - e.g., what if some components will restart
221261
mid-rollout?
222262
* If APIServer tracing is rolled out with a high sampling rate, it is possible for it to have a performance impact on the api server, which can have a variety of impacts on the cluster.
223263

224-
* **What specific metrics should inform a rollback?**
264+
###### What specific metrics should inform a rollback?
225265

226266
* API Server [SLOs](https://github.com/kubernetes/community/tree/master/sig-scalability/slos) are the signals that should guide a rollback. In particular, the `apiserver_request_duration_seconds` and `apiserver_request_slo_duration_seconds` metrics would surface issues resulting in slower API Server responses.
227267

228-
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
268+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
229269
Manually enabled the feature-gate and tracing, verified the apiserver in my cluster was reachable, and disabled the feature-gate and tracing in a dev cluster.
230270

231-
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
232-
fields of API types, flags, etc.?**
271+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
233272
No.
234273

235274
### Monitoring Requirements
236275

237276
_This section must be completed when targeting beta graduation to a release._
238277

239-
* **How can an operator determine if the feature is in use by workloads?**
278+
###### How can an operator determine if the feature is in use by workloads?
240279
This is an operator-facing feature. Look for traces to see if tracing is enabled.
241280

242-
* **What are the SLIs (Service Level Indicators) an operator can use to determine
243-
the health of the service?**
281+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
244282
- OpenTelemetry does not currently expose metrics about the number of traces successfully sent: https://github.com/open-telemetry/opentelemetry-go/issues/2547
245283

246-
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
284+
###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
247285
N/A
248286

249-
* **Are there any missing metrics that would be useful to have to improve observability
250-
of this feature?**
287+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
251288
N/A
252289

253290
### Dependencies
254291

255292
_This section must be completed when targeting beta graduation to a release._
256293

257-
* **Does this feature depend on any specific services running in the cluster?**
294+
###### Does this feature depend on any specific services running in the cluster?
258295
The feature itself (tracing in the API Server) does not depend on services running in the cluster. However, as with other signals (metrics, logs), collecting traces from the API Server requires a trace collection pipeline, which will differ depending on the cluster. The following is an example, and other OTLP-compatible collection mechanisms may be substituted for it. The impact of outages is likely to be the same, regardless of collection pipeline.
259296

260297
- [OpenTelemetry Collector (optional)]
@@ -273,31 +310,27 @@ _For beta, this section is required: reviewers must answer these questions._
273310
_For GA, this section is required: approvers should be able to confirm the
274311
previous answers based on experience in the field._
275312

276-
* **Will enabling / using this feature result in any new API calls?**
313+
###### Will enabling / using this feature result in any new API calls?
277314
This will not add any additional API calls.
278315

279-
* **Will enabling / using this feature result in introducing new API types?**
316+
###### Will enabling / using this feature result in introducing new API types?
280317
This will introduce an API type for the configuration. This is only for
281318
loading configuration; users cannot create these objects.
282319

283-
* **Will enabling / using this feature result in any new calls to the cloud
284-
provider?**
320+
###### Will enabling / using this feature result in any new calls to the cloud provider?
285321
Not directly. Cloud providers could choose to send traces to their managed
286322
trace backends, but this requires them to set up a telemetry pipeline as
287323
described above.
288324

289-
* **Will enabling / using this feature result in increasing size or count of
290-
the existing API objects?**
325+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
291326
No.
292327

293-
* **Will enabling / using this feature result in increasing time taken by any
294-
operations covered by [existing SLIs/SLOs]?**
328+
###### Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?
295329
It will increase API Server request latency by a negligible amount (<1 microsecond)
296330
for encoding and decoding the trace context from headers, and recording spans
297331
in memory. Exporting spans is not in the critical path.
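
As an illustration of why decoding the trace context is cheap, the following Go sketch parses a W3C Trace Context `traceparent` header using only the standard library. It assumes the W3C format is the propagation format in use (the text above does not name one), and `parseTraceparent` and `spanContext` are hypothetical names. Validation is simplified: for example, it does not reject all-zero IDs or uppercase hex, both of which the specification forbids.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// spanContext holds the fields carried by a traceparent header.
type spanContext struct {
	TraceID [16]byte
	SpanID  [8]byte
	Sampled bool
}

// parseTraceparent decodes "version-traceid-parentid-flags" (version 00 only).
func parseTraceparent(h string) (spanContext, error) {
	var sc spanContext
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return spanContext{}, errors.New("unsupported traceparent")
	}
	if len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return spanContext{}, errors.New("malformed traceparent field length")
	}
	if _, err := hex.Decode(sc.TraceID[:], []byte(parts[1])); err != nil {
		return spanContext{}, err
	}
	if _, err := hex.Decode(sc.SpanID[:], []byte(parts[2])); err != nil {
		return spanContext{}, err
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return spanContext{}, err
	}
	sc.Sampled = flags[0]&0x01 == 0x01 // lowest bit is the "sampled" flag
	return sc, nil
}

func main() {
	sc, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(hex.EncodeToString(sc.TraceID[:]), sc.Sampled)
}
```

The whole operation is a string split plus three small hex decodes, with no I/O, which is consistent with the per-request cost described above being negligible.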
298332

299-
* **Will enabling / using this feature result in non-negligible increase of
300-
resource usage (CPU, RAM, disk, IO, ...) in any components?**
333+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
301334
The tracing client library has a small, in-memory cache for outgoing spans. Based on current benchmarks, a full cache could use as much as 5 MB of memory.
302335

303336
### Troubleshooting
@@ -308,17 +341,17 @@ details). For now, we leave it here.
308341

309342
_This section must be completed when targeting beta graduation to a release._
310343

311-
* **How does this feature react if the API server and/or etcd is unavailable?**
344+
###### How does this feature react if the API server and/or etcd is unavailable?
312345
This feature does not have a dependency on the API Server or etcd (it is built into the API Server).
313346

314-
* **What are other known failure modes?**
347+
###### What are other known failure modes?
315348
- [Trace endpoint misconfigured, or unavailable]
316349
- Detection: No traces processed by trace ingestion pipeline
317350
- Mitigations: None
318351
- Diagnostics: API Server logs containing: "traces exporter is disconnected from the server"
319352
- Testing: The feature will simply not work if misconfigured. It doesn't seem worth verifying.
320353

321-
* **What steps should be taken if SLOs are not being met to determine the problem?**
354+
###### What steps should be taken if SLOs are not being met to determine the problem?
322355

323356
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
324357
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
@@ -332,6 +365,10 @@ _This section must be completed when targeting beta graduation to a release._
332365
* KEP scoped down to only API Server traces on 5/1/2020
333366
* Updated PRR section 2/8/2021
334367

368+
## Drawbacks
369+
370+
Depending on the chosen sampling rate, tracing can increase CPU and memory usage by a small amount, and can also add a negligible amount of latency to API Server requests, when enabled.
371+
335372
## Alternatives considered
336373

337374
### Introducing a new EgressSelector type

keps/sig-instrumentation/647-apiserver-tracing/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ approvers:
2020
see-also:
2121
replaces:
2222
stage: beta
23-
last-updated: 2022-09-19
23+
last-updated: 2022-09-29
2424
latest-milestone: "v1.26"
2525
milestone:
2626
alpha: "v1.22"
