keps/sig-instrumentation/647-apiserver-tracing/README.md
Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [X] (R) Production readiness review completed
- [X] Production readiness review approved
- [X] "Implementation History" section is up-to-date for milestone
- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
## Summary

Alpha

- [X] Implement tracing of incoming and outgoing http/grpc requests in the kube-apiserver
- [X] Integration testing of tracing

Beta
It will start sending traces again. This will happen regardless of whether it was disabled by removing the `--opentelemetry-config-file` flag, or by disabling via feature gate.

* **Are there any tests for feature enablement/disablement?**

  [Unit tests](https://github.com/kubernetes/kubernetes/blob/5426da8f69c1d5fa99814526c1878aeb99b2456e/test/integration/apiserver/tracing/tracing_test.go) exist which enable the feature gate.
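For context, a minimal sketch of the file referenced by `--opentelemetry-config-file` is shown below. This is illustrative only: the API group/version (`apiserver.config.k8s.io/v1alpha1`), the `TracingConfiguration` kind, and the `endpoint` and `samplingRatePerMillion` fields are assumptions based on this KEP's design and may differ between releases.

```yaml
# Illustrative sketch only; field names and API version assumed from the KEP design.
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# OTLP gRPC endpoint of a local collector (default OTLP port 4317 assumed).
endpoint: localhost:4317
# Sample ~1 in 10,000 requests; keep this low to limit apiserver overhead.
samplingRatePerMillion: 100
```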
### Rollout, Upgrade and Rollback Planning

_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
  Try to be as paranoid as possible - e.g., what if some components will restart
  mid-rollout?

  * If API Server tracing is rolled out with a high sampling rate, it can have a performance impact on the API server, which in turn can affect the cluster in a variety of ways.

* **What specific metrics should inform a rollback?**

  * API Server [SLOs](https://github.com/kubernetes/community/tree/master/sig-scalability/slos) are the signals that should guide a rollback. In particular, the `apiserver_request_duration_seconds` and `apiserver_request_slo_duration_seconds` metrics would surface issues resulting in slower API Server responses.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

  Manually enabled the feature-gate and tracing, verified the apiserver in my cluster was reachable, and disabled the feature-gate and tracing in a dev cluster.
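The manual test above amounts to toggling the feature gate and config flag on the kube-apiserver. A hedged sketch of the flags involved (the gate name `APIServerTracing` is an assumption; verify against the release you are running):

```shell
# Illustrative only: enable tracing on the kube-apiserver (names assumed from this KEP).
kube-apiserver \
  --feature-gates=APIServerTracing=true \
  --opentelemetry-config-file=/etc/kubernetes/tracing-config.yaml
# Rollback: restart without the config flag, or with --feature-gates=APIServerTracing=false.
```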

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?**

  No.
### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**

  This is an operator-facing feature. Look for traces to see if tracing is enabled.

* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**

  - OpenTelemetry does not currently expose metrics about the number of traces successfully sent: https://github.com/open-telemetry/opentelemetry-go/issues/2547

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

  N/A

* **Are there any missing metrics that would be useful to have to improve observability of this feature?**

  N/A
### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**

  The feature itself (tracing in the API Server) does not depend on services running in the cluster. However, as with other signals (metrics, logs), collecting traces from the API Server requires a trace collection pipeline, which will differ depending on the cluster. The following is an example, and other OTLP-compatible collection mechanisms may be substituted for it. The impact of outages is likely to be the same regardless of collection pipeline.

  - [OpenTelemetry Collector (optional)]
    - Usage description: Deploy the collector as a sidecar container to the API Server, and route traces to your backend of choice.
    - Impact of its outage on the feature: Spans will continue to be collected by the kube-apiserver, but may be lost before they reach the trace backend.
    - Impact of its degraded performance or high-error rates on the feature: Spans may be lost before they reach the trace backend.
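As an illustration of the sidecar pattern described above, a minimal OpenTelemetry Collector configuration could look like the following. This is a sketch, not a recommended deployment: the `otlp` receiver and `logging` exporter are standard Collector components, but the port and backend choice are assumptions.

```yaml
# Minimal sketch of a collector sidecar config; swap the exporter
# for your tracing backend of choice (e.g. Jaeger or another OTLP endpoint).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # where the kube-apiserver sends spans (assumed port)
exporters:
  logging: {}                    # placeholder backend: writes spans to stdout
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging]
```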
### Scalability

* **Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?**

  The tracing client library has a small, in-memory cache for outgoing spans. Based on current benchmarks, a full cache could use as much as 5 MB of memory.
### Troubleshooting

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

  This feature does not have a dependency on the API Server or etcd (it is built into the API Server).

* **What are other known failure modes?**

  - [Trace endpoint misconfigured, or unavailable]
    - Detection: No traces processed by trace ingestion pipeline
    - Mitigations: None
    - Diagnostics: API Server logs containing: "traces exporter is disconnected from the server"
    - Testing: The feature will simply not work if misconfigured. It doesn't seem worth verifying.

* **What steps should be taken if SLOs are not being met to determine the problem?**