# OpenTelemetry Collector Processor Exploration

## Objective

To describe a user experience and strategies for configuring processors in the OpenTelemetry collector.

## Summary

The OpenTelemetry (OTel) collector is a tool for setting up pipelines that receive telemetry from an application and export it to an observability backend. A pipeline can include processing stages, which execute business logic on incoming telemetry before it is exported.

Over time, the collector has added various processors to satisfy different use cases, generally in an ad-hoc way to support each feature independently. We can improve the experience for collector users by consolidating these processing patterns into a consistent user experience. This can be supported by defining a query model for processors within the collector core, and likely also for use in SDKs, to simplify implementation and promote that consistent experience and best practices.

## Goals and non-goals

Goals:

- List the use cases for processing within the collector
- Consider what an ideal configuration experience for users could look like

Non-goals:

- Merge every processor into one. Many use cases overlap and generalize, but not all of them do.
- Produce a technical design or implementation of the configuration experience. This document is currently focused on user experience.

## Use cases for processing

### Telemetry mutation

Processors can be used to mutate the telemetry in the collector pipeline. OpenTelemetry SDKs collect detailed telemetry from applications, and it is common to have to mutate it into a shape appropriate for an individual use case.

Some types of mutation include:

- Remove a forbidden attribute such as `http.request.header.authorization`
- Reduce the cardinality of an attribute, such as translating an `http.target` value of `/user/123451/profile` to `/user/{userId}/profile`
- Decrease the size of the telemetry payload by removing large resource attributes such as `process.command_line`
- Filter out signals, such as by removing all telemetry with an `http.target` of `/health`
- Attach information from the resource to telemetry, for example adding certain resource fields as metric dimensions

The processors implementing this use case are `attributesprocessor`, `filterprocessor`, `metricstransformprocessor`, `resourceprocessor`, and `spanprocessor`.

### Metric generation

The collector may generate new metrics based on incoming telemetry. This can cover gaps in SDK coverage of metrics vs. spans, or create new metrics from existing ones to model the data better for backend-specific expectations.

- Create new metrics based on information in spans, for example a duration metric that is not yet implemented in the SDK
- Apply arithmetic between multiple incoming metrics to produce an output one, for example divide an `amount` by a `capacity` to create a `utilization` metric

The processors implementing this use case are `metricsgenerationprocessor` and `spanmetricsprocessor`.

### Grouping

Some processors are stateful, grouping telemetry over a window of time based on either a trace ID or an attribute value, or just batching in general.

- Batch incoming telemetry before sending it to exporters to reduce export requests
- Group spans by trace ID to allow tail sampling
- Group telemetry for the same path

The processors implementing this use case are `batchprocessor`, `groupbyattrprocessor`, and `groupbytraceprocessor`.

### Telemetry enrichment

OpenTelemetry SDKs focus on collecting application-specific data. They may also include resource detectors to populate environment-specific data, but the collector is commonly used to fill gaps in coverage of environment-specific data.

- Add information about a cloud provider to the `Resource` of all incoming telemetry

The processors implementing this use case are `k8sattributesprocessor` and `resourcedetectionprocessor`.

## Telemetry query language

Looking at the use cases, there are certain common features for telemetry mutation and metric generation:

- Identify the type of signal (span, metric, log, resource), or apply to all signals
- Navigate to a path within the telemetry to operate on it
- Define an operation, and possibly operation arguments

We can model these as a query language. In particular, the first two points can be shared among all processing operations, so that implementations of individual types of processing only need to provide the operators a user can invoke within an expression.

Telemetry is modeled in the collector as [`pdata`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/model/pdata), which is roughly a 1:1 mapping of the [OTLP protocol](https://github.com/open-telemetry/opentelemetry-proto/tree/main/opentelemetry/proto). This data can be navigated using field expressions, which are fields within the protocol separated by dots. For example, the status message of a span is `status.message`. A map lookup can include the key as a string, for example `attributes["http.status_code"]`.
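
To make field navigation concrete, here is a minimal sketch in Python of resolving such expressions against a nested structure. It is purely illustrative: the collector is written in Go, and `parse_path`/`resolve` are hypothetical names, not part of any collector API.

```python
def parse_path(expr):
    """Split a field expression like 'status.message' or
    'attributes["http.status_code"]' into a list of lookup keys.
    Bracketed map keys may themselves contain dots."""
    segments, i = [], 0
    while i < len(expr):
        if expr[i] == '.':
            i += 1
        elif expr[i] == '[':
            end = expr.index(']', i)
            segments.append(expr[i + 1:end].strip('"'))
            i = end + 1
        else:
            j = i
            while j < len(expr) and expr[j] not in '.[':
                j += 1
            segments.append(expr[i:j])
            i = j
    return segments

def resolve(telemetry, expr):
    """Walk nested dicts following the parsed path; None when absent."""
    value = telemetry
    for segment in parse_path(expr):
        if not isinstance(value, dict) or segment not in value:
            return None
        value = value[segment]
    return value
```

Note that the bracketed key `http.status_code` is treated as a single map key, not three dotted fields.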

Virtual fields can be defined for the `type` of a signal (`span`, `metric`, `log`, `resource`) and for the resource of a telemetry signal. For metrics, the structure presented for processing is the actual data points, e.g. `NumberDataPoint` or `HistogramDataPoint`, with information from higher levels such as `Metric` or the data type available as virtual fields.

Navigation can then be used with a simple expression language for identifying the telemetry to operate on.

`... where name = "GET /cats"`
`... where type = span and attributes["http.target"] = "/health"`
`... where resource.attributes["deployment"] = "canary"`
`... where type = metric and descriptor.metric_type = gauge`
`... where type = metric and descriptor.metric_name = "http.active_requests"`
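
A `where` clause of this shape can be evaluated with a small predicate sketch. Assumptions here: Python purely for illustration, telemetry modeled as nested dicts, and naive splitting on ` and ` and `=` (a real implementation would tokenize properly); `field` and `matches` are hypothetical names.

```python
def field(telemetry, expr):
    """Resolve a field expression such as 'resource.attributes["deployment"]'
    against nested dicts; returns None when the path is absent."""
    value, i = telemetry, 0
    while i < len(expr) and isinstance(value, dict):
        if expr[i] == '.':
            i += 1
        elif expr[i] == '[':
            end = expr.index(']', i)
            value = value.get(expr[i + 1:end].strip('"'))
            i = end + 1
        else:
            j = i
            while j < len(expr) and expr[j] not in '.[':
                j += 1
            value = value.get(expr[i:j])
            i = j
    return value if i >= len(expr) else None

def matches(telemetry, where):
    """Evaluate an ANDed list of equality comparisons from a 'where' clause.
    Naive split on ' and ' / '='; quoted literals lose their quotes."""
    for clause in where.split(" and "):
        lhs, rhs = (part.strip() for part in clause.split("=", 1))
        if field(telemetry, lhs) != rhs.strip('"'):
            return False
    return True
```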

Having selected the telemetry to operate on, any needed operations can be defined as functions. Known useful functions should be implemented within the collector itself; registration from extension modules should allow customization with contrib components; and in the future even user plugins could be supported, possibly through WASM, similar to work in [HTTP proxies](https://github.com/proxy-wasm/spec). The arguments to operations will primarily be field expressions, allowing the operation to mutate telemetry as needed.
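
The registration idea can be sketched as a name-to-function table that core and contrib modules populate. Everything here (the `register` decorator, the operation signatures) is a hypothetical illustration, not the collector's actual extension mechanism.

```python
# Hypothetical operator registry: core and contrib modules register
# functions by name, and the processor looks them up when parsing a query.
OPERATIONS = {}

def register(name):
    def wrap(fn):
        OPERATIONS[name] = fn
        return fn
    return wrap

@register("delete")
def delete(telemetry, key):
    """Remove a key from the telemetry's attributes map."""
    telemetry.get("attributes", {}).pop(key, None)

@register("set")
def set_value(telemetry, key, value):
    """Set an attribute on the telemetry."""
    telemetry.setdefault("attributes", {})[key] = value

def apply(telemetry, op_name, *args):
    """Dispatch an operation from a parsed query to its implementation."""
    OPERATIONS[op_name](telemetry, *args)
```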

### Examples

Remove a forbidden attribute such as `http.request.header.authorization` from all telemetry.

`delete(attributes["http.request.header.authorization"])`

Remove a forbidden attribute from spans only.

`delete(attributes["http.request.header.authorization"]) where type = span`

Remove all attributes except for some.

`keep(attributes, "http.method", "http.status_code") where type = metric`

Reduce the cardinality of an attribute.

`replace_wildcards("/user/*/list/*", "/user/{userId}/list/{listId}", attributes["http.target"])`

Reduce the cardinality of a span name.

`replace_wildcards("GET /user/*/list/*", "GET /user/{userId}/list/{listId}", name) where type = span`

Decrease the size of the telemetry payload by removing large resource attributes.

`delete(attributes["process.command_line"]) where type = resource`

Filter out signals, such as by removing all telemetry with an `http.target` of `/health`.

`drop() where attributes["http.target"] = "/health"`

Attach information from the resource to telemetry, for example adding certain resource fields as metric attributes.

`set(attributes["k8s_pod"], resource.attributes["k8s.pod.name"]) where type = metric`
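
To pin down the intended semantics of a couple of these functions, here is a Python sketch. The function names mirror the examples, but the signatures are simplified to operate on a plain attributes dict rather than a field expression.

```python
import fnmatch

def keep(attributes, *names):
    """Remove every attribute except the listed ones."""
    for key in list(attributes):
        if key not in names:
            del attributes[key]

def replace_wildcards(pattern, replacement, attributes, key):
    """Replace the value wholesale when it matches the glob pattern,
    e.g. '/user/*/list/*' -> '/user/{userId}/list/{listId}'."""
    value = attributes.get(key)
    if isinstance(value, str) and fnmatch.fnmatchcase(value, pattern):
        attributes[key] = replacement
```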

Stateful processing can also be modeled by the language. The processor implementation would set up the state while parsing the configuration.

Create a `duration` metric from spans, keeping selected attributes.

```
create_histogram("duration", end_time_nanos - start_time_nanos) where type = span
keep(attributes, "http.method") where type = metric and descriptor.metric_name = "duration"
```

Group spans by trace ID.

`group_by(trace_id, 2m) where type = span`

Create a utilization metric from base metrics. Because navigation expressions only operate on a single piece of telemetry, helper functions for reading values from other metrics need to be provided.

`create_gauge("pod.cpu.utilized", read_gauge("pod.cpu.usage") / read_gauge("node.cpu.limit")) where type = metric`

A configuration can combine many of these. Queries are executed in order. While performance may initially degrade compared to more specialized processors, the expectation is that over time the query processor's engine would improve to apply optimizations across queries, compile into machine code, etc.

```yaml
receivers:
  otlp:

exporters:
  otlp:

processors:
  query:
    # Assuming group_by is defined in a contrib extension module, not baked into the "query" processor
    extensions: [group_by]
    expressions:
      - drop() where attributes["http.target"] = "/health"
      - delete(attributes["http.request.header.authorization"])
      - replace_wildcards("/user/*/list/*", "/user/{userId}/list/{listId}", attributes["http.target"])
      - set(attributes["k8s_pod"], resource.attributes["k8s.pod.name"]) where type = metric
      - group_by(trace_id, 2m) where type = span

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [query]
      exporters: [otlp]
```
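
The in-order execution the config relies on can be sketched as a loop that threads each telemetry item through the query list, removing the item as soon as one query drops it. This is a hypothetical structure for illustration, not the processor's actual engine; the two example queries stand in for parsed expressions.

```python
def run_pipeline(items, queries):
    """Apply parsed queries in order; a query returning False drops the item."""
    out = []
    for item in items:
        alive = True
        for query in queries:
            if query(item) is False:  # drop() signals removal
                alive = False
                break
        if alive:
            out.append(item)
    return out

# Two hypothetical parsed queries: drop health checks, then scrub a header.
def drop_health(item):
    if item.get("attributes", {}).get("http.target") == "/health":
        return False

def scrub_auth(item):
    item.get("attributes", {}).pop("http.request.header.authorization", None)
```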

## Declarative configuration

The telemetry query language presents an SQL-like experience for defining telemetry transformations. It is made up of the three primary components described above, however, and so it could instead be presented declaratively, depending on what makes sense as a user experience.

```yaml
- type: span
  filter:
    match:
      path: status.code
      value: OK
  operation:
    name: drop
- type: all
  operation:
    name: delete
    args:
      - attributes["http.request.header.authorization"]
```

An implementation of the query language would likely parse expressions into this sort of structure anyway, so given an SQL-like implementation, supporting a YAML approach in addition would likely add little overhead.
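
That parse can be sketched as a small function that turns one expression into the declarative structure. This uses purely illustrative string splitting; a production implementation would use a real grammar, and `parse_expression` is a hypothetical name.

```python
def parse_expression(expr):
    """Split 'op(arg1, arg2) where clause' into a declarative dict."""
    if " where " in expr:
        call, where = expr.split(" where ", 1)
    else:
        call, where = expr, None
    name, _, rest = call.partition("(")
    # Naive comma split: fine for these examples, wrong for commas in strings.
    args = [a.strip().strip('"') for a in rest.rstrip(") ").split(",") if a.strip()]
    return {"operation": {"name": name.strip(), "args": args}, "where": where}
```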
