Commit c1cfece

Merge pull request kubernetes#2428 from dashpole/tracing_121
Move tracing to 1.21 milestone
2 parents 1c3bc02 + ec249ae commit c1cfece

File tree

3 files changed (+174, -61 lines)

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+kep-number: 647
+alpha:
+  approver: "@wojtek-t"

keps/sig-instrumentation/647-apiserver-tracing/README.md

Lines changed: 169 additions & 59 deletions
@@ -15,7 +15,13 @@
 - [Controlling use of the OpenTelemetry library](#controlling-use-of-the-opentelemetry-library)
 - [Test Plan](#test-plan)
 - [Graduation requirements](#graduation-requirements)
-- [Production Readiness Survey](#production-readiness-survey)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Alternatives considered](#alternatives-considered)
 - [Introducing a new EgressSelector type](#introducing-a-new-egressselector-type)
@@ -168,64 +174,167 @@ GA
 
 - [] Tracing e2e tests are promoted to conformance tests
 
-## Production Readiness Survey
-
-* Feature enablement and rollback
-  - How can this feature be enabled / disabled in a live cluster? **Feature-gate: APIServerTracing. The API Server must be restarted to enable/disable exporting spans.**
-  - Can the feature be disabled once it has been enabled (i.e., can we roll
-    back the enablement)? **Yes, the feature gate can be disabled in the API Server**
-  - Will enabling / disabling the feature require downtime for the control
-    plane? **Yes, the API Server must be restarted with the feature-gate disabled.**
-  - Will enabling / disabling the feature require downtime or reprovisioning
-    of a node? **No.**
-  - What happens if a cluster with this feature enabled is rolled back? What happens if it is subsequently upgraded again? **Rolling back this feature would break the added tracing telemetry, but would not affect the cluster.**
-  - Are there tests for this? **No. The feature hasn't been developed yet.**
-* Scalability
-  - Will enabling / using the feature result in any new API calls? **No.**
-    Describe them with their impact keeping in mind the [supported limits][]
-    (e.g. 5000 nodes per cluster, 100 pods/s churn) focusing mostly on:
-    - components listing and/or watching resources they didn't before
-    - API calls that may be triggered by changes of some Kubernetes
-      resources (e.g. update object X based on changes of object Y)
-    - periodic API calls to reconcile state (e.g. periodic fetching state,
-      heartbeats, leader election, etc.)
-  - Will enabling / using the feature result in supporting new API types? **No**
-    How many objects of that type will be supported (and how that translates
-    to limitations for users)?
-  - Will enabling / using the feature result in increasing size or count
-    of the existing API objects? **No.**
-  - Will enabling / using the feature result in increasing time taken
-    by any operations covered by [existing SLIs/SLOs][] (e.g. by adding
-    additional work, introducing new steps in between, etc.)? **Yes. It will increase API Server request latency by a negligible amount (<1 microsecond) for encoding and decoding the trace context from headers, and recording spans in memory. Exporting spans is not in the critical path.**
-    Please describe the details if so.
-  - Will enabling / using the feature result in non-negligible increase
-    of resource usage (CPU, RAM, disk IO, ...) in any components?
-    Things to keep in mind include: additional in-memory state, additional
-    non-trivial computations, excessive access to disks (including increased
-    log volume), significant amount of data sent and/or received over
-    network, etc. Think through this in both small and large cases, again
-    with respect to the [supported limits][]. **The tracing client library has a small, in-memory cache for outgoing spans.**
-* Rollout, Upgrade, and Rollback Planning
-* Dependencies
-  - Does this feature depend on any specific services running in the cluster
-    (e.g., a metrics service)? **Yes. In the current version of the proposal, users can run the [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector) to configure which backend (e.g. jaeger, zipkin, etc.) they want telemetry sent to.**
-  - How does this feature respond to complete failures of the services on
-    which it depends? **Traces will stop being exported, and components will store spans in memory until the buffer is full. After the buffer fills up, spans will be dropped.**
-  - How does this feature respond to degraded performance or high error rates
-    from services on which it depends? **If the bi-directional grpc streaming connection to the collector cannot be established or is broken, the controller retries the connection every 5 minutes (by default).**
-* Monitoring requirements
-  - How can an operator determine if the feature is in use by workloads? **Operators are generally expected to have access to the trace backend.**
-  - How can an operator determine if the feature is functioning properly?
-  - What are the service level indicators an operator can use to determine the
-    health of the service? **Error rate of sending traces in the API Server and OpenTelemetry collector.**
-  - What are reasonable service level objectives for the feature? **Not entirely sure, but I would expect at least 99% of spans to be sent successfully, if not more.**
-* Troubleshooting
-  - What are the known failure modes? **The API Server is misconfigured, and cannot talk to the collector. The collector is misconfigured, and can't send traces to the backend.**
-  - How can those be detected via metrics or logs? **Logs from the component or agent, based on the failure mode.**
-  - What are the mitigations for each of those failure modes? **None. You must correctly configure the collector for tracing to work.**
-  - What are the most useful log messages and what logging levels do they require? **All errors are useful, and are logged as errors (no logging levels required). Failure to initialize exporters (in both controller and collector) and failures exporting metrics are the most useful. Errors are logged for each failed attempt to establish a connection to the collector.**
-  - What steps should be taken if SLOs are not being met to determine the
-    problem? **Look at API Server and collector logs.**
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [X] Feature gate (also fill in values in `kep.yaml`)
+    - Feature gate name: APIServerTracing
+    - Components depending on the feature gate: kube-apiserver
+  - [X] Other
+    - Describe the mechanism: Specify a configuration file using the `--opentelemetry-config-file` API Server flag.
+    - Will enabling / disabling the feature require downtime of the control
+      plane? Yes, it will require restarting the API Server.
+    - Will enabling / disabling the feature require downtime or reprovisioning
+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). No.
+
+* **Does enabling the feature change any default behavior?**
+  No. The feature is disabled unless both the feature gate and the `--opentelemetry-config-file` flag are set. When the feature is enabled, it doesn't change behavior from the users' perspective; it only adds tracing telemetry based on API Server requests.
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?**
+  Yes.
+
+* **What happens if we reenable the feature if it was previously rolled back?**
+  It will start sending traces again. This will happen regardless of whether it was disabled by removing the `--opentelemetry-config-file` flag, or by disabling the feature gate.
+
+* **Are there any tests for feature enablement/disablement?**
+  Unit tests switching feature gates will be added.
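The enablement rule above (feature gate AND config-file flag, tracing off otherwise) can be sketched as a tiny predicate. This is an illustrative hypothetical, not the actual kube-apiserver wiring, and the file path shown is made up:

```go
package main

import "fmt"

// tracingEnabled mirrors the enablement rule described in the KEP: the
// APIServerTracing feature gate must be on AND a config file must be
// supplied via --opentelemetry-config-file. (Hypothetical sketch only.)
func tracingEnabled(featureGateOn bool, configFile string) bool {
	return featureGateOn && configFile != ""
}

func main() {
	fmt.Println(tracingEnabled(true, ""))                       // gate on, no config file: disabled
	fmt.Println(tracingEnabled(false, "/etc/otel/config.yaml")) // config file set, gate off: disabled
	fmt.Println(tracingEnabled(true, "/etc/otel/config.yaml"))  // both set: enabled
}
```

Either knob alone leaves tracing off, which is why rollback only needs one of the two to be removed.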
+### Rollout, Upgrade and Rollback Planning
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g., what if some components will restart
+  mid-rollout?
+
+* **What specific metrics should inform a rollback?**
+
+* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling and can't do that now.
+
+* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
+  fields of API types, flags, etc.?**
+  Even if applying deprecation policies, they may still surprise some users.
+
+### Monitoring Requirements
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
+  checking if there are objects with field X set) may be a last resort. Avoid
+  logs or events for this purpose.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to determine
+  the health of the service?**
+  - [ ] Metrics
+    - Metric name:
+    - [Optional] Aggregation method:
+    - Components exposing the metric:
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  At a high level, this usually will be in the form of "high percentile of SLI
+  per day <= X". It's impossible to provide comprehensive guidance, but at the very
+  high level (needs more precise definitions) those may be things like:
+  - per-day percentage of API calls finishing with 5XX errors <= 1%
+  - 99% percentile over day of absolute value from (job creation time minus expected
+    job creation time) for cron job <= 10%
+  - 99.9% of /health requests per day finish with 200 code
+
+* **Are there any missing metrics that would be useful to have to improve observability
+  of this feature?**
+  Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
+  implementation difficulties, etc.).
+
+### Dependencies
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
+
+  For each of these, fill in the following—thinking about running existing user workloads
+  and creating new ones, as well as about cluster-level services (e.g. DNS):
+  - [Dependency name]
+    - Usage description:
+    - Impact of its outage on the feature:
+    - Impact of its degraded performance or high-error rates on the feature:
+
+### Scalability
+
+_For alpha, this section is encouraged: reviewers should consider these questions
+and attempt to answer them._
+
+_For beta, this section is required: reviewers must answer these questions._
+
+_For GA, this section is required: approvers should be able to confirm the
+previous answers based on experience in the field._
+
+* **Will enabling / using this feature result in any new API calls?**
+  This will not add any additional API calls.
+
+* **Will enabling / using this feature result in introducing new API types?**
+  This will introduce an API type for the configuration. This is only for
+  loading configuration; users cannot create these objects.
+
+* **Will enabling / using this feature result in any new calls to the cloud
+  provider?**
+  Not directly. Cloud providers could choose to send traces to their managed
+  trace backends, but this requires them to set up a telemetry pipeline as
+  described above.
+
+* **Will enabling / using this feature result in increasing size or count of
+  the existing API objects?**
+  No.
+
+* **Will enabling / using this feature result in increasing time taken by any
+  operations covered by [existing SLIs/SLOs]?**
+  It will increase API Server request latency by a negligible amount (<1 microsecond)
+  for encoding and decoding the trace context from headers, and recording spans
+  in memory. Exporting spans is not in the critical path.
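The per-request cost referred to above is essentially string handling on an incoming trace header. As a rough illustration, here is a hand-rolled parse of a W3C `traceparent` header (version-traceid-spanid-flags); the real API Server delegates this to the OpenTelemetry propagation package, so this hypothetical helper only shows why the work is sub-microsecond:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent extracts the trace ID, span ID, and sampled flag from a
// W3C traceparent header of the form "00-<32 hex>-<16 hex>-<2 hex flags>".
// Illustrative only; not the propagation code kube-apiserver actually uses.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	traceID, spanID, sampled, err := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(traceID, spanID, sampled, err)
}
```

Decoding is a few string splits and length checks per request; encoding the outgoing header is comparable, which is consistent with the negligible latency claim.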
+
+* **Will enabling / using this feature result in non-negligible increase of
+  resource usage (CPU, RAM, disk, IO, ...) in any components?**
+  The tracing client library has a small, in-memory cache for outgoing spans.
+
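The span cache mentioned above behaves like a small bounded queue that drops rather than blocks when the exporter falls behind (for example, when the collector is unreachable, as described in the dependencies discussion). A rough sketch under that assumption, with hypothetical names and sizes rather than the real client-library internals:

```go
package main

import "fmt"

// spanBuffer models the bounded in-memory cache for outgoing spans: a
// fixed-size buffered channel. Enqueueing never blocks the request path;
// when the buffer is full, new spans are counted and dropped.
type spanBuffer struct {
	ch      chan string // span names stand in for real span data
	dropped int
}

func newSpanBuffer(size int) *spanBuffer {
	return &spanBuffer{ch: make(chan string, size)}
}

// enqueue reports whether the span was buffered or dropped.
func (b *spanBuffer) enqueue(span string) bool {
	select {
	case b.ch <- span:
		return true
	default: // buffer full: drop rather than block the request
		b.dropped++
		return false
	}
}

func main() {
	b := newSpanBuffer(2) // size is illustrative
	for _, s := range []string{"list-pods", "get-node", "patch-deploy"} {
		fmt.Println(s, b.enqueue(s))
	}
	fmt.Println("dropped:", b.dropped)
}
```

This drop-on-full design is what bounds memory use when the collector is down: serving requests stays fast, at the cost of lost telemetry.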
+### Troubleshooting
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now, we leave it here.
+
+_This section must be completed when targeting beta graduation to a release._
+
+* **How does this feature react if the API server and/or etcd is unavailable?**
+
+* **What are other known failure modes?**
+  For each of them, fill in the following information by copying the below template:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until feature graduated to beta.
+    - Testing: Are there any tests for failure mode? If not, describe why.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+
+[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
+[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 
 ## Implementation History
 
@@ -234,6 +343,7 @@ GA
 * [Instrumentation of Kubernetes components for 1/24/2019 community demo](https://github.com/kubernetes/kubernetes/compare/master...dashpole:tracing)
 * KEP merged as provisional on 1/8/2020, including controller tracing
 * KEP scoped down to only API Server traces on 5/1/2020
+* Updated PRR section 2/8/2021
 
 ## Alternatives considered
 
keps/sig-instrumentation/647-apiserver-tracing/kep.yaml

Lines changed: 2 additions & 2 deletions

@@ -23,9 +23,9 @@ see-also:
 replaces:
 stage: alpha
 last-updated: 2020-10-14
-latest-milestone: "v1.20"
+latest-milestone: "v1.21"
 milestone:
-  alpha: "v1.20"
+  alpha: "v1.21"
 feature-gates:
   - name: APIServerTracing
     disable-supported: true
