Commit 24ef1ae

Merge pull request kubernetes#3161 from dashpole/tracing_124
KEP-647: Update apiserver tracing KEP to beta for 1.24
2 parents 66ea524 + 05a4135 commit 24ef1ae

3 files changed: 34 additions, 54 deletions

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 647
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"

keps/sig-instrumentation/647-apiserver-tracing/README.md

Lines changed: 28 additions & 51 deletions
@@ -40,11 +40,11 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [X] (R) Design details are appropriately documented
 - [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
 - [X] (R) Graduation criteria is in place
-- [x] (R) Production readiness review completed
-- [ ] Production readiness review approved
+- [X] (R) Production readiness review completed
+- [X] Production readiness review approved
 - [X] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
 ## Summary
 
@@ -170,8 +170,8 @@ expected from the request.
 
 Alpha
 
-- [] Implement tracing of incoming and outgoing http/grpc requests in the kube-apiserver
-- [] Integration testing of tracing
+- [X] Implement tracing of incoming and outgoing http/grpc requests in the kube-apiserver
+- [X] Integration testing of tracing
 
 Beta
 
@@ -209,7 +209,7 @@ GA
 It will start sending traces again. This will happen regardless of whether it was disabled by removing the `--opentelemetry-config-file` flag, or by disabling via feature gate.
 
 * **Are there any tests for feature enablement/disablement?**
-Unit tests switching feature gates will be added.
+[Unit tests](https://github.com/kubernetes/kubernetes/blob/5426da8f69c1d5fa99814526c1878aeb99b2456e/test/integration/apiserver/tracing/tracing_test.go) exist which enable the feature gate.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -218,67 +218,48 @@ _This section must be completed when targeting beta graduation to a release._
 * **How can a rollout fail? Can it impact already running workloads?**
 Try to be as paranoid as possible - e.g., what if some components will restart
 mid-rollout?
+* If APIServer tracing is rolled out with a high sampling rate, it is possible for it to have a performance impact on the api server, which can have a variety of impacts on the cluster.
 
 * **What specific metrics should inform a rollback?**
 
+* API Server [SLOs](https://github.com/kubernetes/community/tree/master/sig-scalability/slos) are the signals that should guide a rollback. In particular, the [`apiserver_request_duration_seconds` and `apiserver_request_slo_duration_seconds`](apiserver_request_slo_duration_seconds) metrics would surface issues resulting in slower API Server responses.
+
 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
-Describe manual testing that was done and the outcomes.
-Longer term, we may want to require automated upgrade/rollback tests, but we
-are missing a bunch of machinery and tooling and can't do that now.
+Manually enabled the feature-gate and tracing, verified the apiserver in my cluster was reachable, and disabled the feature-gate and tracing in a dev cluster.
 
 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
-Even if applying deprecation policies, they may still surprise some users.
+No.
 
 ### Monitoring Requirements
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **How can an operator determine if the feature is in use by workloads?**
-Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
-checking if there are objects with field X set) may be a last resort. Avoid
-logs or events for this purpose.
+This is an operator-facing feature. Look for traces to see if tracing is enabled.
 
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
-- [ ] Metrics
-- Metric name:
-- [Optional] Aggregation method:
-- Components exposing the metric:
-- [ ] Other (treat as last resort)
-- Details:
+- OpenTelemetry does not currently expose metrics about the number of traces successfully sent: https://github.com/open-telemetry/opentelemetry-go/issues/2547
 
 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-At a high level, this usually will be in the form of "high percentile of SLI
-per day <= X". It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-- per-day percentage of API calls finishing with 5XX errors <= 1%
-- 99% percentile over day of absolute value from (job creation time minus expected
-job creation time) for cron job <= 10%
-- 99,9% of /health requests per day finish with 200 code
+N/A
 
 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
+N/A
 
 ### Dependencies
 
 _This section must be completed when targeting beta graduation to a release._
 
 * **Does this feature depend on any specific services running in the cluster?**
-Think about both cluster-level services (e.g. metrics-server) as well
-as node-level agents (e.g. specific version of CRI). Focus on external or
-optional services that are needed. For example, if this feature depends on
-a cloud provider API, or upon an external software-defined storage or network
-control plane.
+The feature itself (tracing in the API Server) does not depend on services running in the cluster. However, like with other signals (metrics, logs), collecting traces from the API Server requires a trace collection pipeline, which will differ depending on the cluster. The following is an example, and other OTLP-compatible collection mechanisms may be substituted for it. The impact of outages are likely to be the same, regardless of collection pipeline.
 
-For each of these, fill in the following—thinking about running existing user workloads
-and creating new ones, as well as about cluster-level services (e.g. DNS):
-- [Dependency name]
-- Usage description:
-- Impact of its outage on the feature:
-- Impact of its degraded performance or high-error rates on the feature:
+- [OpenTelemetry Collector (optional)]
+- Usage description: Deploy the collector as a sidecar container to the API Server, and route traces to your backend of choice.
+- Impact of its outage on the feature: Spans will continue to be collected by the kube-apiserver, but may be lost before they reach the trace backend.
+- Impact of its degraded performance or high-error rates on the feature: Spans may be lost before they reach the trace backend.
 
 
 ### Scalability
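
The rollout answer added in this hunk warns that a high sampling rate can affect API server performance. A rough sketch of the arithmetic behind that concern follows, assuming the sampling rate is expressed per million requests (as with the tracing configuration's `samplingRatePerMillion` setting). The function name and the request rates are hypothetical and chosen only for illustration; they are not measurements from the KEP.

```python
def expected_sampled_requests(requests_per_second: float,
                              sampling_rate_per_million: int) -> float:
    """Estimate how many requests per second would be sampled for tracing.

    Hypothetical helper: the per-million unit is an assumption about how
    the sampling rate is configured.
    """
    if not 0 <= sampling_rate_per_million <= 1_000_000:
        raise ValueError("sampling rate must be between 0 and 1,000,000")
    return requests_per_second * sampling_rate_per_million / 1_000_000


if __name__ == "__main__":
    # Illustrative numbers only: an API server handling 2000 requests/s.
    for rate in (100, 10_000, 1_000_000):
        sampled = expected_sampled_requests(2000, rate)
        print(f"rate={rate}/million: about {sampled:g} sampled requests/s")
```

Under these assumed numbers, raising the rate from 100 to 1,000,000 per million moves the server from sampling a fraction of a request per second to tracing every request, which is the situation the rollout answer cautions against.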
@@ -316,7 +297,7 @@ operations covered by [existing SLIs/SLOs]?**
 
 * **Will enabling / using this feature result in non-negligible increase of
 resource usage (CPU, RAM, disk, IO, ...) in any components?**
-The tracing client library has a small, in-memory cache for outgoing spans.
+The tracing client library has a small, in-memory cache for outgoing spans. Based on current benchmarks, a full cache could use as much as 5 Mb of memory.
 
 ### Troubleshooting
 
@@ -327,18 +308,14 @@ details). For now, we leave it here.
 _This section must be completed when targeting beta graduation to a release._
 
 * **How does this feature react if the API server and/or etcd is unavailable?**
+This feature does not have a dependency on the API Server or etcd (it is built into the API Server).
 
 * **What are other known failure modes?**
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-- Detection: How can it be detected via metrics? Stated another way:
-how can an operator troubleshoot without logging into a master or worker node?
-- Mitigations: What can be done to stop the bleeding, especially for already
-running user workloads?
-- Diagnostics: What are the useful log messages and their required logging
-levels that could help debug the issue?
-Not required until feature graduated to beta.
-- Testing: Are there any tests for failure mode? If not, describe why.
+- [Trace endpoint misconfigured, or unavailable]
+- Detection: No traces processed by trace ingestion pipeline
+- Mitigations: None
+- Diagnostics: API Server logs containing: "traces exporter is disconnected from the server"
+- Testing: The feature will simply not work if misconfigured. It doesn't seem worth verifying.
 
 * **What steps should be taken if SLOs are not being met to determine the problem?**
 
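
The failure mode added in this hunk is diagnosed through an API Server log message. A minimal sketch of checking already-collected log lines for that message might look like the following. Only the quoted message text comes from the KEP; the function name and the sample log lines are hypothetical, and how the logs are gathered is left out.

```python
# Message text quoted in the KEP's "Diagnostics" entry.
DISCONNECT_MESSAGE = "traces exporter is disconnected from the server"


def count_exporter_disconnects(log_lines):
    """Count log lines that report a disconnected trace exporter.

    Hypothetical helper: expects an iterable of log lines as strings.
    """
    return sum(1 for line in log_lines if DISCONNECT_MESSAGE in line)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "I0118 starting kube-apiserver",
        "E0118 traces exporter is disconnected from the server",
        "I0118 serving securely",
    ]
    print(count_exporter_disconnects(sample))
```

A non-zero count would suggest the trace endpoint is misconfigured or unavailable, matching the "Detection" entry's symptom of no traces reaching the ingestion pipeline.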

keps/sig-instrumentation/647-apiserver-tracing/kep.yaml

Lines changed: 4 additions & 3 deletions
@@ -21,11 +21,12 @@ prr-approvers:
   - "@wojtek-t"
 see-also:
 replaces:
-stage: alpha
-last-updated: 2021-07-15
-latest-milestone: "v1.22"
+stage: beta
+last-updated: 2022-01-18
+latest-milestone: "v1.24"
 milestone:
   alpha: "v1.22"
+  beta: "v1.24"
 feature-gates:
   - name: APIServerTracing
     disable-supported: true
