---
layout: blog
title: "Boosting Kubernetes container runtime observability with OpenTelemetry"
date: 2022-12-01
slug: runtime-observability-opentelemetry
---

**Authors:** Sascha Grunert

When speaking about observability in the cloud native space, probably everyone
will mention [OpenTelemetry (OTEL)][otel] at some point in the conversation.
That's great, because the community needs standards to rely on for developing
all cluster components in the same direction. OpenTelemetry enables us to
combine logs, metrics, traces and other contextual information (called baggage)
into a single resource. Cluster administrators or software engineers can use
this resource to get a view of what is going on in the cluster over a defined
period of time. But how can Kubernetes itself make use of this technology
stack?

[otel]: https://opentelemetry.io

Kubernetes consists of multiple components, some of which are independent while
others are stacked together. Looking at the architecture from a container
runtime perspective, the stack consists of, from top to bottom:

- **kube-apiserver**: Validates and configures data for the API objects
- **kubelet**: Agent running on each node
- **CRI runtime**: Container Runtime Interface (CRI) compatible container
  runtime like [CRI-O][crio] or [containerd][containerd]
- **OCI runtime**: Lower level [Open Container Initiative (OCI)][oci] runtime
  like [runc][runc] or [crun][crun]
- **Linux kernel** or **Microsoft Windows**: Underlying operating system

[crio]: https://cri-o.io
[containerd]: https://containerd.io
[oci]: https://opencontainers.org
[runc]: https://github.com/opencontainers/runc
[crun]: https://github.com/containers/crun

That means if we encounter a problem with running containers in Kubernetes, we
start looking at one of those components. Finding the root cause of a problem
is one of the most time-consuming tasks we face with the increased
architectural complexity of today's cluster setups. Even if we know which
component seems to cause the issue, we still have to take the others into
account to maintain a mental timeline of the events that are going on. How do
we achieve that? Well, most folks will probably stick to scraping logs,
filtering them and assembling them together across component borders. We also
have metrics, right? Correct, but correlating metric values with plain logs
makes it even harder to track what is going on. Some metrics are also not made
for debugging purposes. They have been defined based on the end user
perspective of the cluster, for linking usable alerts, and not for developers
debugging a cluster setup.

OpenTelemetry to the rescue: the project aims to combine signals such as
[traces][traces], [metrics][metrics] and [logs][logs] to maintain the right
view of the cluster state.

[traces]: https://opentelemetry.io/docs/concepts/signals/traces
[metrics]: https://opentelemetry.io/docs/concepts/signals/metrics
[logs]: https://opentelemetry.io/docs/concepts/signals/logs

What is the current state of OpenTelemetry tracing in Kubernetes? From an API
server perspective, we have alpha support for tracing since Kubernetes v1.22,
which will graduate to beta in one of the upcoming releases. Unfortunately, the
beta graduation missed the v1.26 Kubernetes release. The design proposal can be
found in the [_API Server Tracing_ Kubernetes Enhancement Proposal
(KEP)][kep-647], which provides more information about it.

[kep-647]: https://github.com/kubernetes/enhancements/issues/647

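If you want to give the alpha API server tracing a try right now, then a
minimal sketch of the configuration (assuming the `APIServerTracing` feature
gate is enabled and the file is passed to the kube-apiserver via
`--tracing-config-file`) could look like this:

```yaml
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# Export spans via OTLP gRPC to a collector; localhost:4317 is the default.
endpoint: localhost:4317
# Sample everything, analogous to the kubelet configuration shown below.
samplingRatePerMillion: 1000000
```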
The kubelet tracing part is tracked [in another KEP][kep-2831], which was
implemented in an alpha state in Kubernetes v1.25. A beta graduation is not
planned at the time of writing, but more may come in the v1.27 release cycle.
There are other side efforts going on besides both KEPs, for example [klog is
considering OTEL support][klog-otel], which would boost the observability by
linking log messages to existing traces. Within SIG Instrumentation and SIG
Node, we're also discussing [how to link the kubelet traces
together][issue-113414], because right now they're focused on the [gRPC][grpc]
calls between the kubelet and the CRI container runtime.

[kep-2831]: https://github.com/kubernetes/enhancements/issues/2831
[klog-otel]: https://github.com/kubernetes/klog/issues/356
[issue-113414]: https://github.com/kubernetes/kubernetes/issues/113414
[grpc]: https://grpc.io

CRI-O features OpenTelemetry tracing support [since v1.23.0][pr-4883] and is
working on continuously improving it, for example by [attaching the logs to the
traces][pr-6294] or extending the [spans to logical parts of the
application][pr-6343]. This helps users of the traces to gain the same
information as they would get from parsing the logs, but with the enhanced
capabilities of scoping and filtering across other OTEL signals. The CRI-O
maintainers are also working on a container monitoring replacement for
[conmon][conmon], which is called [conmon-rs][conmon-rs] and is purely written
in [Rust][rust]. One benefit of having a Rust implementation is being able to
add features like OpenTelemetry support, because the crates (libraries) for
those already exist. This allows a tight integration with CRI-O and lets
consumers see the most low-level tracing data from their containers.

[pr-4883]: https://github.com/cri-o/cri-o/pull/4883
[pr-6294]: https://github.com/cri-o/cri-o/pull/6294
[pr-6343]: https://github.com/cri-o/cri-o/pull/6343
[conmon]: https://github.com/containers/conmon
[conmon-rs]: https://github.com/containers/conmon-rs
[rust]: https://www.rust-lang.org

The [containerd][containerd] folks added tracing support in v1.6.0, which is
available [by using a plugin][containerd-docs]. Lower level OCI runtimes like
[runc][runc] or [crun][crun] feature no support for OTEL at all, and there does
not seem to be a plan for it. We always have to consider that there is a
performance overhead when collecting the traces as well as exporting them to a
data sink. I still think it would be worth evaluating how extended telemetry
collection could look in OCI runtimes. Let's see if the Rust OCI runtime
[youki][youki-1348] considers something like that in the future.

[containerd-docs]: https://github.com/containerd/containerd/blob/7def13d/docs/tracing.md
[youki-1348]: https://github.com/containers/youki/issues/1348

I'll show you how to give it a try. For my demo, I'll stick to a stack with a
single local node that has runc, conmon-rs, CRI-O, and a kubelet. To enable
tracing in the kubelet, I need to apply the following `KubeletConfiguration`:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  samplingRatePerMillion: 1000000
```

A `samplingRatePerMillion` equal to one million will internally translate to
sampling everything. A similar configuration has to be applied to CRI-O; I can
either start the `crio` binary with `--enable-tracing` and
`--tracing-sampling-rate-per-million 1000000`, or use a drop-in configuration
like this:

```shell
cat /etc/crio/crio.conf.d/99-tracing.conf
```

```toml
[crio.tracing]
enable_tracing = true
tracing_sampling_rate_per_million = 1000000
```

To configure CRI-O to use conmon-rs, you need at least CRI-O v1.25.x and
conmon-rs v0.4.0. Then a configuration drop-in like this can be used to make
CRI-O use conmon-rs:

```shell
cat /etc/crio/crio.conf.d/99-runtimes.conf
```

```toml
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
runtime_type = "pod"
monitor_path = "/path/to/conmonrs" # or will be looked up in $PATH
```

That's it! The default configuration will point to an [OpenTelemetry
collector][collector] [gRPC][grpc] endpoint of `localhost:4317`, which has to
be up and running as well. There are multiple ways to run the collector, as
[described in the docs][collector], but it's also possible to `kubectl proxy`
into an existing instance running within Kubernetes.

[collector]: https://opentelemetry.io/docs/collector/getting-started

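If you don't have a collector configuration at hand, then a minimal sketch
could be to accept OTLP via gRPC on port 4317 and just log the received spans.
The `jaeger` exporter part is an assumption on my side (it requires a collector
distribution that ships this exporter) and forwards the traces to a local
Jaeger instance on its gRPC port `14250`:

```yaml
receivers:
  otlp:
    protocols:
      grpc: # accepts OTLP via gRPC on port 4317 per default

exporters:
  # Print the received spans to the collector's own log output.
  logging:
    loglevel: debug
  # Assumption: forward the traces to a locally running Jaeger instance.
  jaeger:
    endpoint: localhost:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, jaeger]
```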
If everything is set up, then the collector should log that there are incoming
traces:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope go.opentelemetry.io/otel/sdk/tracer
Span #0
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a2a7714534c017e6
    ID             : 1d27dbaf38b9da8b
    Name           : github.com/cri-o/cri-o/server.(*Server).filterSandboxList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060325562 +0000 UTC
    End time       : 2022-11-15 09:50:20.060326291 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #1
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a837a005d4389579
    ID             : a2a7714534c017e6
    Name           : github.com/cri-o/cri-o/server.(*Server).ListPodSandbox
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060321973 +0000 UTC
    End time       : 2022-11-15 09:50:20.060330602 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #2
    Trace ID       : fae6742709d51a9b6606b6cb9f381b96
    Parent ID      : 3755d12b32610516
    ID             : 0492afd26519b4b0
    Name           : github.com/cri-o/cri-o/server.(*Server).filterContainerList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.0607746 +0000 UTC
    End time       : 2022-11-15 09:50:20.060795505 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Events:
SpanEvent #0
     -> Name: log
     -> Timestamp: 2022-11-15 09:50:20.060778668 +0000 UTC
     -> DroppedAttributesCount: 0
     -> Attributes::
          -> id: Str(adf791e5-2eb8-4425-b092-f217923fef93)
          -> log.message: Str(No filters were applied, returning full container list)
          -> log.severity: Str(DEBUG)
          -> name: Str(/runtime.v1.RuntimeService/ListContainers)
```

I can see that the spans have a trace ID and typically have a parent attached.
Events such as logs are part of the output as well. In the above case, the
kubelet is periodically triggering a `ListPodSandbox` RPC to CRI-O, caused by
the Pod Lifecycle Event Generator (PLEG). Displaying those traces can be done
via, for example, [Jaeger][jaeger]. When running the tracing stack locally, a
Jaeger instance should be exposed on `http://localhost:16686` by default.

[jaeger]: https://www.jaegertracing.io/

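If there is no Jaeger instance available at all, then one possible sketch is to
run Jaeger's all-in-one image as a Pod inside the cluster; the image tag is
only an example, while the `COLLECTOR_OTLP_ENABLED` environment variable and
the ports are taken from the Jaeger documentation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jaeger
spec:
  containers:
    - name: jaeger
      image: jaegertracing/all-in-one:1.39 # example tag
      env:
        # Lets Jaeger accept OTLP directly in addition to its native protocols.
        - name: COLLECTOR_OTLP_ENABLED
          value: "true"
      ports:
        - containerPort: 16686 # Jaeger UI
        - containerPort: 14250 # collector gRPC (target of the jaeger exporter)
        - containerPort: 4317 # OTLP gRPC
```

Running `kubectl port-forward pod/jaeger 16686:16686` would then expose the UI
on `http://localhost:16686`.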
The `ListPodSandbox` requests are directly visible within the Jaeger UI:

![ListPodSandbox RPC in the Jaeger UI](list_pod_sandbox.png)

That's not too exciting, so I'll run a workload directly via `kubectl`:

```shell
kubectl run -it --rm --restart=Never --image=alpine alpine -- echo hi
```

```
hi
pod "alpine" deleted
```

Looking now at Jaeger, we can see that we have traces for `conmonrs`, `crio` as
well as the `kubelet` for the `RunPodSandbox` and `CreateContainer` CRI RPCs:

![Container creation in the Jaeger UI](create_container.png)

The kubelet and CRI-O spans are connected to each other to make investigation
easier. If we now take a closer look at the spans, we can see that CRI-O's logs
are correctly associated with the corresponding functionality. For example, we
can extract the container user from the traces like this:

![CRI-O in the Jaeger UI](crio_spans.png)

The lower-level spans of conmon-rs are also part of this trace. For example,
conmon-rs maintains an internal `read_loop` for handling IO between the
container and the end user. The logs for reading and writing bytes are part of
the span. The same applies to the `wait_for_exit_code` span, which tells us
that the container exited successfully with code `0`:

![conmon-rs in the Jaeger UI](conmonrs_spans.png)

Having all that information at hand, side by side with the filtering
capabilities of Jaeger, makes the whole stack a great solution for debugging
container issues! Mentioning the "whole stack" also reveals the biggest
downside of the overall approach: compared to parsing logs, it adds a
noticeable overhead on top of the cluster setup. Users have to maintain a sink
like [Elasticsearch][elastic] to persist the data, expose the Jaeger UI and
possibly take the performance drawback into account. Anyway, it's still one of
the best ways to increase the observability of Kubernetes.

[elastic]: https://www.elastic.co

Thank you for reading this blog post! I'm pretty sure we're looking at a bright
future for OpenTelemetry support in Kubernetes to make troubleshooting simpler.