---
layout: blog
title: "Boosting Kubernetes container runtime observability with OpenTelemetry"
date: 2022-12-01
slug: runtime-observability-opentelemetry
---

**Authors:** Sascha Grunert

When speaking about observability in the cloud native space, probably everyone
will mention [OpenTelemetry (OTEL)][otel] at some point in the conversation.
That's great, because the community needs standards to rely on for developing
all cluster components in the same direction. OpenTelemetry enables us to
combine logs, metrics, traces and other contextual information (called baggage)
into a single resource. Cluster administrators or software engineers can use
this resource to get a viewport of what is going on in the cluster over a
defined period of time. But how can Kubernetes itself make use of this
technology stack?

[otel]: https://opentelemetry.io

Kubernetes consists of multiple components, some of which are independent while
others are stacked together. Looking at the architecture from a container
runtime perspective, there are, from top to bottom:

- **kube-apiserver**: Validates and configures data for the API objects
- **kubelet**: Agent running on each node
- **CRI runtime**: Container Runtime Interface (CRI) compatible container
  runtime like [CRI-O][crio] or [containerd][containerd]
- **OCI runtime**: Lower level [Open Container Initiative (OCI)][oci] runtime
  like [runc][runc] or [crun][crun]
- **Linux kernel** or **Microsoft Windows**: Underlying operating system

[crio]: https://cri-o.io
[containerd]: https://containerd.io
[oci]: https://opencontainers.org
[runc]: https://github.com/opencontainers/runc
[crun]: https://github.com/containers/crun

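For example, you can check which CRI runtime (and version) each node in a
cluster is running. The commands below are only an illustration and assume that
`kubectl` is available on your machine and `crictl` on the node:

```shell
# The CONTAINER-RUNTIME column shows values like "cri-o://1.25.1" or
# "containerd://1.6.10" per node
kubectl get nodes -o wide

# On the node itself, crictl talks directly to the CRI runtime
sudo crictl version
```
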
That means if we encounter a problem with running containers in Kubernetes,
then we start looking at one of those components. Finding the root cause of
problems is one of the most time-consuming actions we face with the increased
architectural complexity of today's cluster setups. Even if we know the
component which seems to cause the issue, we still have to take the others into
account to maintain a mental timeline of the events which are going on. How do
we achieve that? Well, most folks will probably stick to scraping logs,
filtering them, and assembling them together across component borders. We also
have metrics, right? Correct, but bringing metrics values into correlation with
plain logs makes it even harder to track what is going on. Some metrics are
also not made for debugging purposes. They have been defined based on the end
user perspective of the cluster for linking usable alerts, and not for
developers debugging a cluster setup.

OpenTelemetry to the rescue: the project aims to combine signals such as
[traces][traces], [metrics][metrics] and [logs][logs] together to maintain the
right viewport on the cluster state.

[traces]: https://opentelemetry.io/docs/concepts/signals/traces
[metrics]: https://opentelemetry.io/docs/concepts/signals/metrics
[logs]: https://opentelemetry.io/docs/concepts/signals/logs

What is the current state of OpenTelemetry tracing in Kubernetes? From an API
server perspective, we have alpha support for tracing since Kubernetes v1.22,
which will graduate to beta in one of the upcoming releases. Unfortunately, the
beta graduation missed the v1.26 Kubernetes release. The design proposal can be
found in the [_API Server Tracing_ Kubernetes Enhancement Proposal
(KEP)][kep-647], which provides more information about it.

[kep-647]: https://github.com/kubernetes/enhancements/issues/647

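To give an impression of what enabling it involves: per the KEP, the alpha
feature requires the `APIServerTracing` feature gate on the kube-apiserver plus
a configuration file passed via `--tracing-config-file`. The endpoint and
sampling rate below are assumptions for a local demo setup rather than
recommended values:

```yaml
# Contents of the file referenced by --tracing-config-file
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# OTLP gRPC endpoint of an OpenTelemetry collector, assumed to run locally
endpoint: localhost:4317
# Sample every request; a production setup would use a much lower rate
samplingRatePerMillion: 1000000
```
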
The kubelet tracing part is tracked [in another KEP][kep-2831], which was
implemented in an alpha state in Kubernetes v1.25. A beta graduation is not
planned at the time of writing, but more may come in the v1.27 release cycle.
There are other side efforts going on besides both KEPs, for example [klog is
considering OTEL support][klog-otel], which would boost the observability by
linking log messages to existing traces. Within SIG Instrumentation and SIG
Node, we're also discussing [how to link the kubelet traces
together][issue-113414], because right now they're focused on the [gRPC][grpc]
calls between the kubelet and the CRI container runtime.

[kep-2831]: https://github.com/kubernetes/enhancements/issues/2831
[klog-otel]: https://github.com/kubernetes/klog/issues/356
[issue-113414]: https://github.com/kubernetes/kubernetes/issues/113414
[grpc]: https://grpc.io

CRI-O features OpenTelemetry tracing support [since v1.23.0][pr-4883] and is
continuously improving it, for example by [attaching the logs to the
traces][pr-6294] or extending the [spans to logical parts of the
application][pr-6343]. This helps users of the traces to gain the same
information as they would from parsing the logs, but with enhanced capabilities
for scoping and filtering on other OTEL signals. The CRI-O maintainers are also
working on a container monitoring replacement for [conmon][conmon], which is
called [conmon-rs][conmon-rs] and is written purely in [Rust][rust]. One
benefit of having a Rust implementation is being able to add features like
OpenTelemetry support, because the crates (libraries) for those already exist.
This allows a tight integration with CRI-O and lets consumers see the most
low-level tracing data of their containers.

[pr-4883]: https://github.com/cri-o/cri-o/pull/4883
[pr-6294]: https://github.com/cri-o/cri-o/pull/6294
[pr-6343]: https://github.com/cri-o/cri-o/pull/6343
[conmon]: https://github.com/containers/conmon
[conmon-rs]: https://github.com/containers/conmon-rs
[rust]: https://www.rust-lang.org

The [containerd][containerd] folks added tracing support in v1.6.0, which is
available [by using a plugin][containerd-docs]. Lower level OCI runtimes like
[runc][runc] or [crun][crun] feature no support for OTEL at all, and there does
not seem to be a plan for that. We always have to consider that there is a
performance overhead when collecting the traces as well as exporting them to a
data sink. I still think it would be worth evaluating how extended telemetry
collection could look in OCI runtimes. Let's see if the Rust OCI runtime
[youki][youki-1348] considers something like that in the future.

[containerd-docs]: https://github.com/containerd/containerd/blob/7def13d/docs/tracing.md
[youki-1348]: https://github.com/containers/youki/issues/1348

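For comparison, enabling tracing in containerd means configuring its tracing
plugins in `config.toml`. The snippet below is only a rough sketch based on the
linked containerd documentation; treat the endpoint and sampling values as
assumptions and double-check the exact keys for your containerd version:

```toml
version = 2

# Controls how containerd samples and labels its own traces
[plugins."io.containerd.internal.v1.tracing"]
  sampling_ratio = 1.0
  service_name = "containerd"

# Where the traces get exported to (an OTLP gRPC endpoint, assumed local)
[plugins."io.containerd.tracing.processor.v1.otlp"]
  endpoint = "localhost:4317"
  protocol = "grpc"
```
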
I'll show you how to give it a try. For my demo I'll stick to a stack with a
single local node that has runc, conmon-rs, CRI-O, and a kubelet. To enable
tracing in the kubelet, I need to apply the following `KubeletConfiguration`:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  samplingRatePerMillion: 1000000
```

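As a hint on how this gets wired up: the kubelet consumes such a file via its
`--config` flag, which is usually set in the kubelet's systemd unit or one of
its drop-ins. The path below is just an assumption:

```shell
# Assuming the KubeletConfiguration above was written to this path
kubelet --config=/etc/kubernetes/kubelet-config.yaml
```
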
|
A `samplingRatePerMillion` equal to one million will internally translate to
sampling everything. A similar configuration has to be applied to CRI-O; I can
either start the `crio` binary with `--enable-tracing` and
`--tracing-sampling-rate-per-million 1000000`, or use a drop-in configuration
like this:

```shell
cat /etc/crio/crio.conf.d/99-tracing.conf
```

```toml
[crio.tracing]
enable_tracing = true
tracing_sampling_rate_per_million = 1000000
```

To configure CRI-O to use conmon-rs, you require at least the latest CRI-O
v1.25.x and conmon-rs v0.4.0. Then a configuration drop-in like this can be
used to make CRI-O use conmon-rs:

```shell
cat /etc/crio/crio.conf.d/99-runtimes.conf
```

```toml
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
runtime_type = "pod"
monitor_path = "/path/to/conmonrs" # or will be looked up in $PATH
```

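After changing those drop-in files, CRI-O has to be restarted to pick them up.
Assuming it runs as a systemd service, that would be:

```shell
sudo systemctl restart crio
```
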
That's it! The default configuration will point to an [OpenTelemetry
collector][collector] [gRPC][grpc] endpoint of `localhost:4317`, which has to
be up and running as well. There are multiple ways to run the collector, as
[described in the docs][collector], but it's also possible to `kubectl proxy`
into an existing instance running within Kubernetes.

[collector]: https://opentelemetry.io/docs/collector/getting-started

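If you run a collector yourself, a minimal configuration that accepts OTLP over
gRPC on port 4317, prints the received spans for debugging, and forwards them
to a Jaeger instance could look roughly like the sketch below. The Jaeger
endpoint (and the assumption that its collector listens on port 14250) depends
on your local setup:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  # Prints the received telemetry to the collector's own log output
  logging:
    loglevel: debug
  # Forwards the traces to a locally running Jaeger instance (assumed)
  jaeger:
    endpoint: localhost:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, jaeger]
```
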
If everything is set up, then the collector should log that there are incoming
traces:

```
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope go.opentelemetry.io/otel/sdk/tracer
Span #0
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a2a7714534c017e6
    ID             : 1d27dbaf38b9da8b
    Name           : github.com/cri-o/cri-o/server.(*Server).filterSandboxList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060325562 +0000 UTC
    End time       : 2022-11-15 09:50:20.060326291 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #1
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a837a005d4389579
    ID             : a2a7714534c017e6
    Name           : github.com/cri-o/cri-o/server.(*Server).ListPodSandbox
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060321973 +0000 UTC
    End time       : 2022-11-15 09:50:20.060330602 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #2
    Trace ID       : fae6742709d51a9b6606b6cb9f381b96
    Parent ID      : 3755d12b32610516
    ID             : 0492afd26519b4b0
    Name           : github.com/cri-o/cri-o/server.(*Server).filterContainerList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.0607746 +0000 UTC
    End time       : 2022-11-15 09:50:20.060795505 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Events:
SpanEvent #0
     -> Name: log
     -> Timestamp: 2022-11-15 09:50:20.060778668 +0000 UTC
     -> DroppedAttributesCount: 0
     -> Attributes::
          -> id: Str(adf791e5-2eb8-4425-b092-f217923fef93)
          -> log.message: Str(No filters were applied, returning full container list)
          -> log.severity: Str(DEBUG)
          -> name: Str(/runtime.v1.RuntimeService/ListContainers)
```

I can see that the spans have a trace ID and typically have a parent attached.
Events such as logs are part of the output as well. In the above case, the
kubelet is periodically triggering a `ListPodSandbox` RPC to CRI-O, caused by
the Pod Lifecycle Event Generator (PLEG). Displaying those traces can be done
via, for example, [Jaeger][jaeger]. When running the tracing stack locally, a
Jaeger instance should be exposed on `http://localhost:16686` by default.

[jaeger]: https://www.jaegertracing.io/

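One simple way to get such a local Jaeger instance is its all-in-one container
image. The invocation below is only an example; the image tag is an assumption
and the published ports follow the Jaeger defaults (16686 for the UI, 14250 for
the gRPC endpoint a collector can export to):

```shell
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 14250:14250 \
  jaegertracing/all-in-one:1.39
```
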
The `ListPodSandbox` requests are directly visible within the Jaeger UI:

That's not too exciting, so I'll run a workload directly via `kubectl`:

```shell
kubectl run -it --rm --restart=Never --image=alpine alpine -- echo hi
```

```
hi
pod "alpine" deleted
```

Looking now at Jaeger, we can see that we have traces for `conmonrs`, `crio` as
well as the `kubelet` for the `RunPodSandbox` and `CreateContainer` CRI RPCs:

The kubelet and CRI-O spans are connected to each other to make investigation
easier. If we now take a closer look at the spans, then we can see that CRI-O's
logs are correctly associated with the corresponding functionality. For
example, we can extract the container user from the traces like this:

The lower level spans of conmon-rs are also part of this trace. For example,
conmon-rs maintains an internal `read_loop` for handling IO between the
container and the end user. The logs for reading and writing bytes are part of
the span. The same applies to the `wait_for_exit_code` span, which tells us
that the container exited successfully with code `0`:

Having all that information at hand, alongside the filtering capabilities of
Jaeger, makes the whole stack a great solution for debugging container issues!
Mentioning the "whole stack" also reveals the biggest downside of the overall
approach: compared to parsing logs, it adds a noticeable overhead on top of the
cluster setup. Users have to maintain a sink like [Elasticsearch][elastic] to
persist the data, expose the Jaeger UI and possibly take the performance
drawback into account. Anyway, it's still one of the best ways to increase the
observability of Kubernetes.

[elastic]: https://www.elastic.co

Thank you for reading this blog post! I'm pretty sure we're looking into a
bright future for OpenTelemetry support in Kubernetes to make troubleshooting
simpler.