Commit 8902dfb

Proposal: Distributed tracing for chaos experiments (#4684)
* proposal: Distributed tracing for chaos experiments

  Signed-off-by: namkyu1999 <[email protected]>

* feat: add reference

  Signed-off-by: namkyu1999 <[email protected]>

* feat: add implementation PRs tab

  Signed-off-by: namkyu1999 <[email protected]>

* update: add a new pr

  Signed-off-by: namkyu1999 <[email protected]>

* fix: add a link

  Signed-off-by: namkyu1999 <[email protected]>

---------

Signed-off-by: namkyu1999 <[email protected]>
Co-authored-by: Saranya Jena <[email protected]>
1 parent aab8a5e commit 8902dfb

3 files changed: 105 additions, 0 deletions
@@ -0,0 +1,105 @@
| title                                     | authors                                      | creation-date | last-updated |
|-------------------------------------------|----------------------------------------------|---------------|--------------|
| Distributed tracing for chaos experiments | [@namkyu1999](https://github.com/namkyu1999) | 2024-06-01    | 2024-06-01   |

# Distributed tracing for chaos experiments

- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Use Cases](#use-cases)
- [Implementation Details](#implementation-details)
- [Risks and Mitigations](#risks-and-mitigations)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [References](#references)
- [Implementation PRs](#implementation-prs)

## Summary

This proposal suggests integrating the OpenTelemetry SDK into `chaos-operator` and `chaos-runner` to trace the performance of chaos experiments.

## Motivation

The adage `You can't manage what you don't measure` motivates this proposal. We already offer [monitoring metrics](https://github.com/litmuschaos/litmus/tree/master/monitoring) by exposing a `/metrics` endpoint, but metrics alone are not enough to measure the performance of chaos experiments. A single chaos experiment starts and completes many pods (for example Argo, probes, and the runner), and today we cannot tell which pod is causing a performance issue. Distributed tracing helps pinpoint where failures occur and what causes poor performance. It is a key tool for debugging and understanding complex systems.

I was also inspired by [Tekton](https://tekton.dev/)'s [distributed tracing proposal](https://github.com/tektoncd/community/blob/main/teps/0124-distributed-tracing-for-tasks-and-pipelines.md).

### Goals

- Integrate the OpenTelemetry SDK into chaos-operator, chaos-runner, and all other components that run during a chaos experiment.
- Implement OpenTelemetry tracing with Jaeger.
- Visualize chaos experiment steps in Jaeger.
- Add documentation to /monitoring and the Litmus docs.

### Non-Goals

- Changing the existing chaos-experiment structure.
- Changing the existing monitoring metrics.
- Changing the existing API.

## Proposal

### Use Cases

#### Use case 1 - LitmusChaos user

As a user, I want to see what is happening inside a chaos experiment so that I can trace its performance.

#### Use case 2 - OSS Developer

As a developer, I want to know where a performance issue occurs in a chaos experiment so that I can debug and fix it.

### Implementation Details

I plan to use the OpenTelemetry SDK, but the following points need to be considered.

In typical distributed tracing, components communicate over HTTP or gRPC, so they attach the trace context to the [request header](https://opentelemetry.io/docs/concepts/context-propagation/). In chaos experiments, however, we create resources through the Kubernetes API, so the trace context has to be passed by some means other than a request header.

I made a simple [demo](https://github.com/namkyu1999/async-trace) that shows how to pass the trace context to a child container through an environment variable.

![demo-arch](./images/distributed-tracing-demo-arch.png)

The demo has two containers: a parent container and a child container that the parent creates through the Docker client API. When it creates the child, the parent writes the trace context into an environment variable. The child reads the trace context from that environment variable and reports its spans to Jaeger. Both containers export their spans with the OpenTelemetry SDK, and because the spans share the same trace context, OpenTelemetry treats them as a single trace.

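
The parent-to-child handoff described above can be sketched roughly as follows. This is a standard-library-only illustration, not the demo's code and not the OpenTelemetry SDK: it assumes the context is serialized in the W3C `traceparent` format, and the variable name `TRACEPARENT` and all function names are hypothetical choices made for this sketch.

```python
import os
import re
import secrets

# Hypothetical variable name for this sketch; the proposal does not fix one.
TRACEPARENT_ENV = "TRACEPARENT"

# W3C traceparent, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Return a random 8-byte span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Serialize the parent's trace context into a traceparent string."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value: str) -> dict:
    """Parse a version-00 traceparent string; raise ValueError if malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_environment(trace_id: str, span_id: str, base_env=None) -> dict:
    """Parent side: build the environment handed to the child container."""
    env = dict(os.environ if base_env is None else base_env)
    env[TRACEPARENT_ENV] = format_traceparent(trace_id, span_id)
    return env


def start_child_span(env: dict) -> dict:
    """Child side: continue the parent's trace with a fresh span ID."""
    parent = parse_traceparent(env[TRACEPARENT_ENV])
    return {
        "trace_id": parent["trace_id"],              # same trace as the parent
        "span_id": new_span_id(),                    # new ID for the child's span
        "parent_span_id": parent["parent_span_id"],  # links child to parent
    }
```

Because the child keeps the parent's trace ID and records the parent's span ID as its own parent, a tracing backend can stitch both spans into one trace even though no request header was exchanged.
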
So I will use the same approach in chaos experiments: pass the trace context to the child container through an environment variable, and use the OpenTelemetry SDK to send trace data to the OpenTelemetry Collector.

Here is the implementation plan.
- Add the OpenTelemetry SDK to chaos-operator.
- Add the OpenTelemetry SDK to all components that run during a chaos experiment.
- Send trace data to the OpenTelemetry Collector.
- Visualize the chaos experiment steps in Jaeger.
- Add documentation to /monitoring and the Litmus docs.

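
For resources created through the Kubernetes API, the same idea amounts to adding an `env` entry to each container of the pod being created. The sketch below works on a plain dictionary shaped like a Pod manifest. The helper name `inject_trace_env` and the variable name `TRACEPARENT` are hypothetical, and the actual Litmus components may wire this differently.

```python
import copy

# Hypothetical variable name for this sketch; the proposal does not fix one.
TRACEPARENT_ENV = "TRACEPARENT"


def inject_trace_env(pod_manifest: dict, traceparent: str) -> dict:
    """Return a copy of the manifest with the trace context set on every container."""
    manifest = copy.deepcopy(pod_manifest)  # leave the caller's manifest untouched
    for container in manifest.get("spec", {}).get("containers", []):
        # Drop any stale entry first so the variable appears exactly once.
        env = [e for e in container.get("env", []) if e.get("name") != TRACEPARENT_ENV]
        env.append({"name": TRACEPARENT_ENV, "value": traceparent})
        container["env"] = env
    return manifest


# Illustrative manifest; the pod name, container name, and image are placeholders.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-runner"},
    "spec": {
        "containers": [
            {"name": "runner", "image": "example/runner:latest",
             "env": [{"name": "LOG_LEVEL", "value": "info"}]},
        ]
    },
}

traced_pod = inject_trace_env(
    example_pod, "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

Returning a modified copy keeps the injection step easy to skip when tracing is disabled, since the original manifest is never mutated.
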
After the implementation, the chaos experiment steps will be visualized in Jaeger like this:

![result-example](./images/distributed-tracing-example.png)

The API remains unchanged. Enabling tracing is entirely optional for the end user. If tracing is disabled, or is not configured with a correct tracing backend URL, the reconcilers function as usual. Therefore, this can be categorized as a non-breaking change.

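
One way to keep tracing optional is to fall back to a no-op tracer whenever no backend is configured, so the calling code never branches on whether tracing is on. The sketch below is an assumption-laden illustration, not the planned implementation: `NoopTracer`, `RecordingTracer`, `tracer_from_env`, and `reconcile` are hypothetical names, `RecordingTracer` only records span names in memory instead of exporting them, and reading `OTEL_EXPORTER_OTLP_ENDPOINT` (the standard OpenTelemetry exporter variable) as the on/off switch is a choice made for this example.

```python
import os
from contextlib import contextmanager


class NoopTracer:
    """Tracer used when tracing is disabled: spans cost almost nothing."""

    @contextmanager
    def start_span(self, name):
        yield None


class RecordingTracer:
    """Stand-in for a real exporting tracer; it only remembers finished spans."""

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield name
        finally:
            self.finished.append(name)


def tracer_from_env(env=None):
    """Pick a tracer: recording if an endpoint is configured, no-op otherwise."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    return RecordingTracer(endpoint) if endpoint else NoopTracer()


def reconcile(tracer, steps):
    """Run the same steps regardless of which tracer is in use."""
    done = []
    for step in steps:
        with tracer.start_span(step):
            done.append(step)
    return done
```

With this shape, `reconcile` returns the same result under either tracer, which is the property that makes the change non-breaking.
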
## Risks and Mitigations

Because the OpenTelemetry SDK performs additional work, it can add latency. To mitigate this, end users can disable the tracing feature.

## Upgrade / Downgrade Strategy

## Drawbacks

## Alternatives

## References
- [Environment Variables as Carrier for Inter-Process Propagation to transport context](https://github.com/open-telemetry/opentelemetry-specification/issues/740)
- [Tekton's distributed tracing proposal](https://github.com/tektoncd/community/blob/main/teps/0124-distributed-tracing-for-tasks-and-pipelines.md)
- [noop tracer](https://github.com/open-telemetry/opentelemetry-go/discussions/2659)

## Implementation PRs

| isMerged | PR                                                                       |
|----------|--------------------------------------------------------------------------|
| N        | [chaos-runner](https://github.com/litmuschaos/chaos-runner/pull/221)     |
| N        | [chaos-operator](https://github.com/litmuschaos/chaos-operator/pull/498) |
| N        | [litmus-go](https://github.com/litmuschaos/litmus-go/pull/706)           |
| N        | [chaos center](https://github.com/litmuschaos/litmus/pull/4746)          |