Commit ddc0c48

Merge pull request kubernetes#2526 from dashpole/apiserver_tracing_updates
Apiserver tracing updates
2 parents ea7746f + dea4c18 commit ddc0c48

1 file changed

+37
-28
lines changed
  • keps/sig-instrumentation/647-apiserver-tracing

keps/sig-instrumentation/647-apiserver-tracing/README.md

Lines changed: 37 additions & 28 deletions
@@ -8,6 +8,9 @@
 - [Goals](#goals)
 - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
+- [User Stories](#user-stories)
+- [Steady-State trace collection](#steady-state-trace-collection)
+- [On-Demand trace collection](#on-demand-trace-collection)
 - [Tracing API Requests](#tracing-api-requests)
 - [Exporting Spans](#exporting-spans)
 - [Running the OpenTelemetry Collector](#running-the-opentelemetry-collector)
@@ -73,6 +76,27 @@ Along with metrics and logs, traces are a useful form of telemetry to aid with d
 
 ## Proposal
 
+### User Stories
+
+Since this feature is for diagnosing problems with the Kube-API Server, it is targeted at Cluster Operators and Cloud Vendors that manage kubernetes control-planes.
+
+For the following use-cases, I can deploy an OpenTelemetry collector as a sidecar to the API Server. I can use the API Server's `--opentelemetry-config-file` flag with the default URL to make the API Server send its spans to the sidecar collector. Alternatively, I can point the API Server at an OpenTelemetry collector listening on a different port or URL if I need to.
+
+#### Steady-State trace collection
+
+As a cluster operator or cloud provider, I would like to collect traces for API requests to the API Server to help debug a variety of control-plane problems. I can set the `SamplingRatePerMillion` in the configuration file to a non-zero number to have spans collected for a small fraction of requests. Depending on the symptoms I need to debug, I can search span metadata to find a trace which displays the symptoms I am looking to debug. Even for issues which occur non-deterministically, a low sampling rate is generally still enough to surface a representative trace over time.
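
The KEP text does not say how the per-million rate is turned into a per-request decision. As a minimal sketch, assuming a deterministic decision derived from the trace ID (the general approach taken by ratio-based samplers in OpenTelemetry SDKs), the arithmetic could look like the following. The helper `shouldSample` and the constant `perMillion` are hypothetical names, not part of the proposal.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// perMillion is the denominator implied by the field name SamplingRatePerMillion.
const perMillion = 1_000_000

// shouldSample is a hypothetical helper, not part of the KEP. It derives the
// decision from the trace ID so that the same trace always gets the same answer.
func shouldSample(traceIDHex string, samplingRatePerMillion int32) (bool, error) {
	raw, err := hex.DecodeString(traceIDHex)
	if err != nil || len(raw) != 16 {
		return false, fmt.Errorf("trace ID must be 32 hex characters, got %q", traceIDHex)
	}
	switch {
	case samplingRatePerMillion <= 0:
		return false, nil // the documented default of 0 samples nothing
	case samplingRatePerMillion >= perMillion:
		return true, nil
	}
	// Map the low 8 bytes of the trace ID onto one of a million buckets and
	// sample the first samplingRatePerMillion of them.
	bucket := binary.BigEndian.Uint64(raw[8:]) % perMillion
	return bucket < uint64(samplingRatePerMillion), nil
}

func main() {
	for _, rate := range []int32{0, 1000, 1_000_000} {
		sampled, err := shouldSample("4bf92f3577b34da6a3ce929d0e0e4737", rate)
		if err != nil {
			panic(err)
		}
		fmt.Printf("rate=%d per million -> sampled=%t\n", rate, sampled)
	}
}
```

A rate of 1000 corresponds to roughly 0.1% of traces, which is the kind of low steady-state rate this user story has in mind.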
+
+#### On-Demand trace collection
+
+As a cluster operator or cloud provider, I would like to collect a trace for a specific request to the API Server. This will often happen when debugging a live problem. In such cases, I don't want to raise the `SamplingRatePerMillion` to collect a high percentage of requests, which would be expensive and collect many things I don't care about. I also don't want to restart the API Server, which may fix the problem I am trying to debug. Instead, I can make sure the incoming request to the API Server is sampled. The tooling to do this easily doesn't exist today, but could be added in the future.
+
+For example, to trace a request to list nodes, with traceid=4bf92f3577b34da6a3ce929d0e0e4737, no parent span, and sampled=true:
+
+```bash
+kubectl proxy --port=8080 &
+curl http://localhost:8080/api/v1/nodes -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4737-0000000000000000-01"
+```
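
Since the tooling mentioned above does not exist yet, the following is only a sketch of how a client could mint a fresh `traceparent` value instead of reusing a fixed one. It follows the W3C Trace Context layout `version-traceid-parentid-flags`. One caveat: as I read that specification, an all-zero trace ID or parent ID is treated as invalid, so this sketch generates a random non-zero parent ID rather than the zeros shown in the example. The function names `randomHexID` and `newSampledTraceparent` are hypothetical.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomHexID returns n random bytes as lowercase hex, retrying in the
// (vanishingly unlikely) case that every byte is zero.
func randomHexID(n int) (string, error) {
	buf := make([]byte, n)
	for {
		if _, err := rand.Read(buf); err != nil {
			return "", err
		}
		for _, b := range buf {
			if b != 0 {
				return hex.EncodeToString(buf), nil
			}
		}
	}
}

// newSampledTraceparent builds a version-00 traceparent header value with
// the sampled flag (01) set. Hypothetical helper, not part of the KEP.
func newSampledTraceparent() (string, error) {
	traceID, err := randomHexID(16) // 32 hex characters
	if err != nil {
		return "", err
	}
	parentID, err := randomHexID(8) // 16 hex characters
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("00-%s-%s-01", traceID, parentID), nil
}

func main() {
	header, err := newSampledTraceparent()
	if err != nil {
		panic(err)
	}
	fmt.Println("traceparent:", header)
}
```

The printed value can be passed to `curl -H` exactly as in the example, and the trace ID portion is what I would then search for in the tracing backend.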
+
 ### Tracing API Requests
 
 We will wrap the API Server's http server and http clients with [otelhttp](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/master/instrumentation/net/http/otelhttp) to get spans for incoming and outgoing http requests. This generates spans for all sampled incoming requests and propagates context with all client requests. For incoming requests, this would go below [WithRequestInfo](https://github.com/kubernetes/kubernetes/blob/9eb097c4b07ea59c674a69e19c1519f0d10f2fa8/staging/src/k8s.io/apiserver/pkg/server/config.go#L676) in the filter stack, as it must be after authentication and authorization, before the panic filter, and is closest in function to the WithRequestInfo filter.
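
To illustrate the wrapping pattern described above without depending on otelhttp, here is a standard-library-only sketch. It is a hand-rolled stand-in, not the otelhttp API: a server-side wrapper copies the incoming `traceparent` header into the request context, and a client-side `http.RoundTripper` copies it back onto outgoing requests. Real instrumentation would also create spans and give each outgoing request a new parent ID; this sketch forwards the header unchanged. All type and function names, and the `backend.invalid` URL, are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type traceparentKey struct{}

// withTraceContext stands in for the server-side wrapper: it stores the
// incoming traceparent header in the request context.
func withTraceContext(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if tp := r.Header.Get("traceparent"); tp != "" {
			r = r.WithContext(context.WithValue(r.Context(), traceparentKey{}, tp))
		}
		next.ServeHTTP(w, r)
	})
}

// propagatingTransport stands in for the client-side wrapper: it injects
// the traceparent found in the context into each outgoing request.
type propagatingTransport struct{ base http.RoundTripper }

func (t propagatingTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	if tp, ok := r.Context().Value(traceparentKey{}).(string); ok {
		r = r.Clone(r.Context()) // a RoundTripper must not mutate the caller's request
		r.Header.Set("traceparent", tp)
	}
	return t.base.RoundTrip(r)
}

// recordingTransport is a fake backend that records the header it receives,
// so the example needs no network access.
type recordingTransport struct{ seen string }

func (t *recordingTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	t.seen = r.Header.Get("traceparent")
	return &http.Response{StatusCode: http.StatusOK, Header: http.Header{}, Body: http.NoBody, Request: r}, nil
}

// propagate sends one request through the wrapped handler and returns the
// traceparent that the fake backend observed.
func propagate(incoming string) (string, error) {
	backend := &recordingTransport{}
	client := &http.Client{Transport: propagatingTransport{base: backend}}
	server := withTraceContext(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		out, err := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://backend.invalid/", nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		resp, err := client.Do(out)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		resp.Body.Close()
	}))
	in := httptest.NewRequest(http.MethodGet, "/api/v1/nodes", nil)
	if incoming != "" {
		in.Header.Set("traceparent", incoming)
	}
	rec := httptest.NewRecorder()
	server.ServeHTTP(rec, in)
	if rec.Code != http.StatusOK {
		return "", fmt.Errorf("handler returned status %d", rec.Code)
	}
	return backend.seen, nil
}

func main() {
	seen, err := propagate("00-4bf92f3577b34da6a3ce929d0e0e4737-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println("backend saw traceparent:", seen)
}
```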
@@ -108,52 +132,37 @@ The [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-co
 
 ### APIServer Configuration and EgressSelectors
 
-The API Server controls where traffic is sent using an [EgressSelector](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190226-network-proxy.md), and has separate controls for `Master`, `Cluster`, and `Etcd` traffic. As described above, we would like to support either sending telemetry to a url using the `Master` egress, or a service using the `Cluster` egress. To accomplish this, we will introduce a flag, `--opentelemetry-config-file`, that will point to the file that defines the opentelemetry exporter configuration. That file will have the following format:
+The API Server controls where traffic is sent using an [EgressSelector](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190226-network-proxy.md), and has separate controls for `ControlPlane`, `Cluster`, and `Etcd` traffic. As described above, we would like to support sending telemetry to a url using the `ControlPlane` egress. To accomplish this, we will introduce a flag, `--opentelemetry-config-file`, that will point to the file that defines the opentelemetry exporter configuration. That file will have the following format:
 
 ```golang
 // +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
 
-// OpenTelemetryClientConfiguration provides versioned configuration for opentelemetry clients.
-type OpenTelemetryClientConfiguration struct {
+// TracingConfiguration provides versioned configuration for tracing clients.
+type TracingConfiguration struct {
   metav1.TypeMeta `json:",inline"`
 
   // +optional
-  // URL of the collector that's running on the master.
-  // if URL is specified, APIServer uses the egressType Master when sending data to the collector.
-  URL *string `json:"url,omitempty" protobuf:"bytes,3,opt,name=url"`
-
-  // +optional
-  // Service that's the frontend of the collector deployment running in the cluster.
-  // If Service is specified, APIServer uses the egressType Cluster when sending data to the collector.
-  Service *ServiceReference `json:"service,omitempty" protobuf:"bytes,1,opt,name=service"`
-}
+  // URL of the collector that's running on the control-plane node.
+  // the APIServer uses the egressType ControlPlane when sending data to the collector.
+  // Defaults to localhost:4317
+  URL *string `json:"url,omitempty" protobuf:"bytes,1,opt,name=url"`
 
-// ServiceReference holds a reference to Service.legacy.k8s.io
-type ServiceReference struct {
-  // `namespace` is the namespace of the service.
-  // Required
-  Namespace string `json:"namespace" protobuf:"bytes,1,opt,name=namespace"`
-  // `name` is the name of the service.
-  // Required
-  Name string `json:"name" protobuf:"bytes,2,opt,name=name"`
-
-  // If specified, the port on the service.
-  // Defaults to 4317, the IANA reserved port for OpenTelemetry.
-  // `port` should be a valid port number (1-65535, inclusive).
   // +optional
-  Port *int32 `json:"port,omitempty" protobuf:"varint,3,opt,name=port"`
+  // SamplingRatePerMillion is the number of samples to collect per million spans.
+  // Defaults to 0.
+  SamplingRatePerMillion *int32 `json:"samplingRatePerMillion,omitempty" protobuf:"varint,2,opt,name=samplingRatePerMillion"`
 }
 ```
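
As a rough illustration of how the new `TracingConfiguration` defaults behave, the sketch below decodes a configuration document and fills in the documented defaults (`localhost:4317` and `0`). It makes several assumptions that are not in the KEP: it uses `encoding/json` rather than the API Server's own decoding machinery, it replaces `metav1.TypeMeta` with plain `apiVersion` and `kind` fields, the `apiVersion` value is a placeholder, and the range check on the sampling rate is my own addition. The names `tracingConfiguration` and `loadTracingConfiguration` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Defaults as documented in the struct comments of the proposal.
const (
	defaultURL                          = "localhost:4317"
	defaultSamplingRatePerMillion int32 = 0
)

// tracingConfiguration is a local stand-in for the proposed type, with
// metav1.TypeMeta flattened into its two JSON fields.
type tracingConfiguration struct {
	APIVersion             string  `json:"apiVersion,omitempty"`
	Kind                   string  `json:"kind,omitempty"`
	URL                    *string `json:"url,omitempty"`
	SamplingRatePerMillion *int32  `json:"samplingRatePerMillion,omitempty"`
}

// loadTracingConfiguration decodes the document and applies defaults for
// any optional field that was left unset.
func loadTracingConfiguration(data []byte) (tracingConfiguration, error) {
	var cfg tracingConfiguration
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, err
	}
	if cfg.URL == nil {
		u := defaultURL
		cfg.URL = &u
	}
	if cfg.SamplingRatePerMillion == nil {
		r := defaultSamplingRatePerMillion
		cfg.SamplingRatePerMillion = &r
	}
	// Assumption: a per-million rate only makes sense within [0, 1000000].
	if r := *cfg.SamplingRatePerMillion; r < 0 || r > 1_000_000 {
		return cfg, fmt.Errorf("samplingRatePerMillion must be between 0 and 1000000, got %d", r)
	}
	return cfg, nil
}

func main() {
	// The apiVersion below is a placeholder, not a real API group.
	doc := []byte(`{
		"apiVersion": "example.invalid/v1alpha1",
		"kind": "TracingConfiguration",
		"samplingRatePerMillion": 1000
	}`)
	cfg, err := loadTracingConfiguration(doc)
	if err != nil {
		panic(err)
	}
	fmt.Printf("url=%s samplingRatePerMillion=%d\n", *cfg.URL, *cfg.SamplingRatePerMillion)
}
```

Using pointer fields, as the proposed type does, is what lets the loader tell "unset" apart from an explicit zero before applying defaults.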
 
-If `--opentelemetry-config-file` is not specified, the API Server will not send any telemetry.
+If `--opentelemetry-config-file` is not specified, the API Server will not send any spans, even if incoming requests ask for sampling.
 
 ### Controlling use of the OpenTelemetry library
 
 As the community found in the [Metrics Stability Framework KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/1209-metrics-stability/kubernetes-control-plane-metrics-stability.md#kubernetes-control-plane-metrics-stability), having control over how the client libraries are used in kubernetes can enable maintainers to enforce policy and make broad improvements to the quality of telemetry. To enable future improvements to tracing, we will restrict the direct use of the OpenTelemetry library within the kubernetes code base, and provide wrapped versions of functions we wish to expose in a utility library.
 
 ### Test Plan
 
-We will e2e test this feature by deploying an OpenTelemetry Collector on the master, and configure it to export traces using the [stdout exporter](https://github.com/open-telemetry/opentelemetry-go/tree/master/exporters/stdout), which logs the spans in json format. We can then verify that the logs contain our expected traces when making calls to the API Server.
+We will e2e test this feature by deploying an OpenTelemetry Collector on the control-plane node, and configure it to export traces using the [stdout exporter](https://github.com/open-telemetry/opentelemetry-go/tree/master/exporters/stdout), which logs the spans in json format. We can then verify that the logs contain our expected traces when making calls to the API Server.
 
 ## Graduation requirements
 
@@ -349,7 +358,7 @@ _This section must be completed when targeting beta graduation to a release._
 
 ### Introducing a new EgressSelector type
 
-Instead of a configuration file to choose between a url on the `Master` network, or a service on the `Cluster` network, we considered introducing a new `OpenTelemetry` egress type, which could be configured separately. However, we aren't actually introducing a new destination for traffic, so it is more conventional to make use of existing egress types. We will also likely want to add additional configuration for the OpenTelemetry client in the future.
+Instead of a configuration file to choose between a url on the `ControlPlane` network, or a service on the `Cluster` network, we considered introducing a new `OpenTelemetry` egress type, which could be configured separately. However, we aren't actually introducing a new destination for traffic, so it is more conventional to make use of existing egress types. We will also likely want to add additional configuration for the OpenTelemetry client in the future.
 
 ### Other OpenTelemetry Exporters
 