---
layout: blog
title: "Boosting Kubernetes container runtime observability with OpenTelemetry"
date: 2022-12-01
slug: runtime-observability-opentelemetry
---

**Authors:** Sascha Grunert

When speaking about observability in the cloud native space, probably everyone
will mention [OpenTelemetry (OTEL)][otel] at some point in the conversation.
That's great, because the community needs standards to rely on for developing
all cluster components in the same direction. OpenTelemetry enables us to
combine logs, metrics, traces and other contextual information (called baggage)
into a single resource. Cluster administrators or software engineers can use
this resource to get a viewport of what is going on in the cluster over a
defined period of time. But how can Kubernetes itself make use of this
technology stack?

[otel]: https://opentelemetry.io

Kubernetes consists of multiple components, some of which are independent while
others are stacked together. Looking at the architecture from a container
runtime perspective, there are, from top to bottom:

- **kube-apiserver**: Validates and configures data for the API objects
- **kubelet**: Agent running on each node
- **CRI runtime**: Container Runtime Interface (CRI) compatible container
  runtime like [CRI-O][crio] or [containerd][containerd]
- **OCI runtime**: Lower level [Open Container Initiative (OCI)][oci] runtime
  like [runc][runc] or [crun][crun]
- **Linux kernel** or **Microsoft Windows**: Underlying operating system

[crio]: https://cri-o.io
[containerd]: https://containerd.io
[oci]: https://opencontainers.org
[runc]: https://github.com/opencontainers/runc
[crun]: https://github.com/containers/crun

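For example, you can check which CRI runtime (and version) each node in a
cluster is running. The commands below are only an illustration and assume that
`kubectl` is available on your machine and `crictl` on the node:

```shell
# The CONTAINER-RUNTIME column shows values like "cri-o://1.25.1" or
# "containerd://1.6.10" per node
kubectl get nodes -o wide

# On the node itself, crictl talks directly to the CRI runtime
sudo crictl version
```
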
That means if we encounter a problem with running containers in Kubernetes,
then we start looking at one of those components. Finding the root cause of
problems is one of the most time-consuming actions we face with the increased
architectural complexity of today's cluster setups. Even if we know the
component which seems to cause the issue, we still have to take the others into
account to maintain a mental timeline of the events which are going on. How do
we achieve that? Well, most folks will probably stick to scraping logs,
filtering them, and assembling them together across component borders. We also
have metrics, right? Correct, but bringing metrics values into correlation with
plain logs makes it even harder to track what is going on. Some metrics are
also not made for debugging purposes. They have been defined based on the end
user perspective of the cluster for linking usable alerts, and not for
developers debugging a cluster setup.

OpenTelemetry to the rescue: the project aims to combine signals such as
[traces][traces], [metrics][metrics] and [logs][logs] together to maintain the
right viewport on the cluster state.

[traces]: https://opentelemetry.io/docs/concepts/signals/traces
[metrics]: https://opentelemetry.io/docs/concepts/signals/metrics
[logs]: https://opentelemetry.io/docs/concepts/signals/logs

What is the current state of OpenTelemetry tracing in Kubernetes? From an API
server perspective, we have alpha support for tracing since Kubernetes v1.22,
which will graduate to beta in one of the upcoming releases. Unfortunately, the
beta graduation missed the v1.26 Kubernetes release. The design proposal can be
found in the [_API Server Tracing_ Kubernetes Enhancement Proposal
(KEP)][kep-647], which provides more information about it.

[kep-647]: https://github.com/kubernetes/enhancements/issues/647

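To give an impression of what enabling it involves: per the KEP, the alpha
feature requires the `APIServerTracing` feature gate on the kube-apiserver plus
a configuration file passed via `--tracing-config-file`. The endpoint and
sampling rate below are assumptions for a local demo setup rather than
recommended values:

```yaml
# Contents of the file referenced by --tracing-config-file
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# OTLP gRPC endpoint of an OpenTelemetry collector, assumed to run locally
endpoint: localhost:4317
# Sample every request; a production setup would use a much lower rate
samplingRatePerMillion: 1000000
```
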
The kubelet tracing part is tracked [in another KEP][kep-2831], which was
implemented in an alpha state in Kubernetes v1.25. A beta graduation is not
planned at the time of writing, but more may come in the v1.27 release cycle.
There are other side efforts going on besides both KEPs, for example [klog is
considering OTEL support][klog-otel], which would boost the observability by
linking log messages to existing traces. Within SIG Instrumentation and SIG
Node, we're also discussing [how to link the kubelet traces
together][issue-113414], because right now they're focused on the [gRPC][grpc]
calls between the kubelet and the CRI container runtime.

[kep-2831]: https://github.com/kubernetes/enhancements/issues/2831
[klog-otel]: https://github.com/kubernetes/klog/issues/356
[issue-113414]: https://github.com/kubernetes/kubernetes/issues/113414
[grpc]: https://grpc.io

CRI-O features OpenTelemetry tracing support [since v1.23.0][pr-4883] and is
continuously improving it, for example by [attaching the logs to the
traces][pr-6294] or extending the [spans to logical parts of the
application][pr-6343]. This helps users of the traces to gain the same
information as they would from parsing the logs, but with enhanced capabilities
for scoping and filtering on other OTEL signals. The CRI-O maintainers are also
working on a container monitoring replacement for [conmon][conmon], which is
called [conmon-rs][conmon-rs] and is written purely in [Rust][rust]. One
benefit of having a Rust implementation is being able to add features like
OpenTelemetry support, because the crates (libraries) for those already exist.
This allows a tight integration with CRI-O and lets consumers see the most
low-level tracing data of their containers.

[pr-4883]: https://github.com/cri-o/cri-o/pull/4883
[pr-6294]: https://github.com/cri-o/cri-o/pull/6294
[pr-6343]: https://github.com/cri-o/cri-o/pull/6343
[conmon]: https://github.com/containers/conmon
[conmon-rs]: https://github.com/containers/conmon-rs
[rust]: https://www.rust-lang.org

The [containerd][containerd] folks added tracing support in v1.6.0, which is
available [by using a plugin][containerd-docs]. Lower level OCI runtimes like
[runc][runc] or [crun][crun] feature no support for OTEL at all, and there does
not seem to be a plan for that. We always have to consider that there is a
performance overhead when collecting the traces as well as exporting them to a
data sink. I still think it would be worth evaluating how extended telemetry
collection could look in OCI runtimes. Let's see if the Rust OCI runtime
[youki][youki-1348] considers something like that in the future.

[containerd-docs]: https://github.com/containerd/containerd/blob/7def13d/docs/tracing.md
[youki-1348]: https://github.com/containers/youki/issues/1348

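For comparison, enabling tracing in containerd means configuring its tracing
plugins in `config.toml`. The snippet below is only a rough sketch based on the
linked containerd documentation; treat the endpoint and sampling values as
assumptions and double-check the exact keys for your containerd version:

```toml
version = 2

# Controls how containerd samples and labels its own traces
[plugins."io.containerd.internal.v1.tracing"]
  sampling_ratio = 1.0
  service_name = "containerd"

# Where the traces get exported to (an OTLP gRPC endpoint, assumed local)
[plugins."io.containerd.tracing.processor.v1.otlp"]
  endpoint = "localhost:4317"
  protocol = "grpc"
```
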
I'll show you how to give it a try. For my demo I'll stick to a stack with a
single local node that has runc, conmon-rs, CRI-O, and a kubelet. To enable
tracing in the kubelet, I need to apply the following `KubeletConfiguration`:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  samplingRatePerMillion: 1000000
```

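As a hint on how this gets wired up: the kubelet consumes such a file via its
`--config` flag, which is usually set in the kubelet's systemd unit or one of
its drop-ins. The path below is just an assumption:

```shell
# Assuming the KubeletConfiguration above was written to this path
kubelet --config=/etc/kubernetes/kubelet-config.yaml
```
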
|
A `samplingRatePerMillion` equal to one million will internally translate to
sampling everything. A similar configuration has to be applied to CRI-O; I can
either start the `crio` binary with `--enable-tracing` and
`--tracing-sampling-rate-per-million 1000000`, or use a drop-in configuration
like this:

```shell
cat /etc/crio/crio.conf.d/99-tracing.conf
```

```toml
[crio.tracing]
enable_tracing = true
tracing_sampling_rate_per_million = 1000000
```

To configure CRI-O to use conmon-rs, you require at least the latest CRI-O
v1.25.x and conmon-rs v0.4.0. Then a configuration drop-in like this can be
used to make CRI-O use conmon-rs:

```shell
cat /etc/crio/crio.conf.d/99-runtimes.conf
```

```toml
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
runtime_type = "pod"
monitor_path = "/path/to/conmonrs" # or will be looked up in $PATH
```

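After changing those drop-in files, CRI-O has to be restarted to pick them up.
Assuming it runs as a systemd service, that would be:

```shell
sudo systemctl restart crio
```
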
That's it! The default configuration will point to an [OpenTelemetry
collector][collector] [gRPC][grpc] endpoint of `localhost:4317`, which has to
be up and running as well. There are multiple ways to run the collector, as
[described in the docs][collector], but it's also possible to `kubectl proxy`
into an existing instance running within Kubernetes.

[collector]: https://opentelemetry.io/docs/collector/getting-started

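If you run a collector yourself, a minimal configuration that accepts OTLP over
gRPC on port 4317, prints the received spans for debugging, and forwards them
to a Jaeger instance could look roughly like the sketch below. The Jaeger
endpoint (and the assumption that its collector listens on port 14250) depends
on your local setup:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  # Prints the received telemetry to the collector's own log output
  logging:
    loglevel: debug
  # Forwards the traces to a locally running Jaeger instance (assumed)
  jaeger:
    endpoint: localhost:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, jaeger]
```
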
If everything is set up, then the collector should log that there are incoming
traces:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope go.opentelemetry.io/otel/sdk/tracer
Span #0
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a2a7714534c017e6
    ID             : 1d27dbaf38b9da8b
    Name           : github.com/cri-o/cri-o/server.(*Server).filterSandboxList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060325562 +0000 UTC
    End time       : 2022-11-15 09:50:20.060326291 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #1
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a837a005d4389579
    ID             : a2a7714534c017e6
    Name           : github.com/cri-o/cri-o/server.(*Server).ListPodSandbox
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060321973 +0000 UTC
    End time       : 2022-11-15 09:50:20.060330602 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #2
    Trace ID       : fae6742709d51a9b6606b6cb9f381b96
    Parent ID      : 3755d12b32610516
    ID             : 0492afd26519b4b0
    Name           : github.com/cri-o/cri-o/server.(*Server).filterContainerList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.0607746 +0000 UTC
    End time       : 2022-11-15 09:50:20.060795505 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Events:
SpanEvent #0
     -> Name: log
     -> Timestamp: 2022-11-15 09:50:20.060778668 +0000 UTC
     -> DroppedAttributesCount: 0
     -> Attributes::
          -> id: Str(adf791e5-2eb8-4425-b092-f217923fef93)
          -> log.message: Str(No filters were applied, returning full container list)
          -> log.severity: Str(DEBUG)
          -> name: Str(/runtime.v1.RuntimeService/ListContainers)
```

I can see that the spans have a trace ID and typically have a parent attached.
Events such as logs are part of the output as well. In the above case, the
kubelet is periodically triggering a `ListPodSandbox` RPC to CRI-O, caused by
the Pod Lifecycle Event Generator (PLEG). Displaying those traces can be done
via, for example, [Jaeger][jaeger]. When running the tracing stack locally, a
Jaeger instance should be exposed on `http://localhost:16686` by default.

[jaeger]: https://www.jaegertracing.io/

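One simple way to get such a local Jaeger instance is its all-in-one container
image. The invocation below is only an example; the image tag is an assumption
and the published ports follow the Jaeger defaults (16686 for the UI, 14250 for
the gRPC endpoint a collector can export to):

```shell
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 14250:14250 \
  jaegertracing/all-in-one:1.39
```
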
The `ListPodSandbox` requests are directly visible within the Jaeger UI:

That's not too exciting, so I'll run a workload directly via `kubectl`:

```shell
kubectl run -it --rm --restart=Never --image=alpine alpine -- echo hi
```

```
hi
pod "alpine" deleted
```

Looking now at Jaeger, we can see that we have traces for `conmonrs`, `crio` as
well as the `kubelet` for the `RunPodSandbox` and `CreateContainer` CRI RPCs:

The kubelet and CRI-O spans are connected to each other to make investigation
easier. If we now take a closer look at the spans, then we can see that CRI-O's
logs are correctly associated with the corresponding functionality. For
example, we can extract the container user from the traces like this:

The lower level spans of conmon-rs are also part of this trace. For example,
conmon-rs maintains an internal `read_loop` for handling IO between the
container and the end user. The logs for reading and writing bytes are part of
the span. The same applies to the `wait_for_exit_code` span, which tells us
that the container exited successfully with code `0`:

Having all that information at hand, alongside the filtering capabilities of
Jaeger, makes the whole stack a great solution for debugging container issues!
Mentioning the "whole stack" also reveals the biggest downside of the overall
approach: compared to parsing logs, it adds a noticeable overhead on top of the
cluster setup. Users have to maintain a sink like [Elasticsearch][elastic] to
persist the data, expose the Jaeger UI and possibly take the performance
drawback into account. Anyway, it's still one of the best ways to increase the
observability of Kubernetes.

[elastic]: https://www.elastic.co

Thank you for reading this blog post! I'm pretty sure we're looking into a
bright future for OpenTelemetry support in Kubernetes to make troubleshooting
simpler.