|
8 | 8 | - [Goals](#goals)
|
9 | 9 | - [Non-Goals](#non-goals)
|
10 | 10 | - [Proposal](#proposal)
|
| 11 | + - [User Stories](#user-stories) |
| 12 | + - [Steady-State trace collection](#steady-state-trace-collection) |
| 13 | + - [On-Demand trace collection](#on-demand-trace-collection) |
11 | 14 | - [Tracing API Requests](#tracing-api-requests)
|
12 | 15 | - [Exporting Spans](#exporting-spans)
|
13 | 16 | - [Running the OpenTelemetry Collector](#running-the-opentelemetry-collector)
|
@@ -73,6 +76,27 @@ Along with metrics and logs, traces are a useful form of telemetry to aid with d
|
73 | 76 |
|
74 | 77 | ## Proposal
|
75 | 78 |
|
| 79 | +### User Stories |
| 80 | + |
| 81 | +Since this feature is for diagnosing problems with the API Server, it is targeted at cluster operators and cloud providers that manage Kubernetes control planes. |
| 82 | + |
| 83 | +For the following use cases, I can deploy an OpenTelemetry collector as a sidecar to the API Server and use the API Server's `--opentelemetry-config-file` flag with the default URL to have it send spans to the sidecar collector. Alternatively, I can point the API Server at an OpenTelemetry collector listening on a different port or URL if I need to. |
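
As a rough illustration of that wiring (not part of this proposal's API), the exporter setup could look like the sketch below, assuming the OpenTelemetry Go SDK's OTLP gRPC exporter; the `newExporter` helper, the package versions, and the `localhost:4317` default endpoint are assumptions, not something this KEP defines.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newExporter is a hypothetical helper: it sends spans to the collector at
// the given endpoint, falling back to a sidecar collector listening on the
// standard OTLP gRPC port when no endpoint is configured.
func newExporter(ctx context.Context, endpoint string) (sdktrace.SpanExporter, error) {
	if endpoint == "" {
		endpoint = "localhost:4317" // assumed default: sidecar collector on the same host
	}
	return otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		// A localhost sidecar typically doesn't need TLS; remote collectors would.
		otlptracegrpc.WithInsecure(),
	)
}
```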
| 84 | + |
| 85 | +#### Steady-State trace collection |
| 86 | + |
| 87 | +As a cluster operator or cloud provider, I would like to collect traces for API requests to the API Server to help debug a variety of control-plane problems. I can set the `SamplingRatePerMillion` in the configuration file to a non-zero number to have spans collected for a small fraction of requests. I can then search span metadata to find a trace which exhibits the symptoms I am looking to debug. Even for issues which occur non-deterministically, a low sampling rate is generally still enough to surface a representative trace over time. |
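
How the ratio is applied is defined later in this proposal; as a hedged sketch using the OpenTelemetry Go SDK, a `SamplingRatePerMillion` value could be translated into a parent-based, trace-ID-ratio sampler so that requests which arrive already sampled are always traced. The `samplerFor` helper and the `int32` field type are illustrative assumptions.

```go
package tracing

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// samplerFor is a hypothetical helper that maps the configured
// SamplingRatePerMillion to an OpenTelemetry sampler. ParentBased honors an
// incoming sampled traceparent (see the on-demand story below), while
// TraceIDRatioBased samples the remaining traffic at the configured fraction.
func samplerFor(samplingRatePerMillion int32) sdktrace.Sampler {
	return sdktrace.ParentBased(
		sdktrace.TraceIDRatioBased(float64(samplingRatePerMillion) / 1_000_000),
	)
}
```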
| 88 | + |
| 89 | +#### On-Demand trace collection |
| 90 | + |
| 91 | +As a cluster operator or cloud provider, I would like to collect a trace for a specific request to the API Server. This will often happen when debugging a live problem. In such cases, I don't want to raise the `SamplingRatePerMillion` to sample a high percentage of requests, which would be expensive and collect many traces I don't care about. I also don't want to restart the API Server, since that may itself fix the problem I am trying to debug. Instead, I can make sure the incoming request to the API Server is sampled. The tooling to do this easily doesn't exist today, but could be added in the future. |
| 92 | + |
| 93 | +For example, to trace a request to list nodes with trace ID `4bf92f3577b34da6a3ce929d0e0e4737` and the sampled flag set (the W3C `traceparent` header requires a non-zero parent span ID, so an arbitrary placeholder is used even though there is no real parent span): |
| 94 | + |
| 95 | +```bash |
| 96 | +kubectl proxy --port=8080 & |
| 97 | +curl http://localhost:8080/api/v1/nodes -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4737-00f067aa0ba902b7-01" |
| 98 | +``` |
| 99 | + |
76 | 100 | ### Tracing API Requests
|
77 | 101 |
|
78 | 102 | We will wrap the API Server's HTTP server and HTTP clients with [otelhttp](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/master/instrumentation/net/http/otelhttp) to get spans for incoming and outgoing HTTP requests. This generates spans for all sampled incoming requests and propagates context with all client requests. For incoming requests, this would go below [WithRequestInfo](https://github.com/kubernetes/kubernetes/blob/9eb097c4b07ea59c674a69e19c1519f0d10f2fa8/staging/src/k8s.io/apiserver/pkg/server/config.go#L676) in the filter stack: it must run after authentication and authorization but before the panic filter, and it is closest in function to the WithRequestInfo filter.
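
As a hedged sketch of what that wrapping could look like (the helper names `withTracing` and `wrapClientTransport` and the `"KubernetesAPI"` operation name are illustrative, not the actual filter code):

```go
package tracing

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// withTracing sketches the incoming-request side: wrapping the handler chain
// so that sampled requests produce a server span and the configured
// propagators extract any incoming trace context.
func withTracing(inner http.Handler) http.Handler {
	return otelhttp.NewHandler(inner, "KubernetesAPI")
}

// wrapClientTransport sketches the outgoing side: client requests (e.g. to
// etcd or admission webhooks) get client spans and propagate trace context.
func wrapClientTransport(base http.RoundTripper) http.RoundTripper {
	return otelhttp.NewTransport(base)
}
```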
|
|