# OpenTelemetry Collector Processor Exploration

## Objective

To describe a user experience and strategies for configuring processors in the OpenTelemetry collector.

## Summary

The OpenTelemetry (OTel) collector is a tool for setting up pipelines that receive telemetry from an application and
export it to an observability backend. A pipeline can include processing stages, which execute various business logic
on incoming telemetry before it is exported.

Over time, the collector has added various processors to satisfy different use cases, generally in an ad-hoc way to
support each feature independently. We can improve the experience for collector users by consolidating these processing
patterns into a consistent user experience. This can be supported by defining a query model for processors within the
collector core, and likely also for use in SDKs, to simplify implementation and promote consistency and best practices.

## Goals and non-goals

Goals:
- List use cases for processing within the collector
- Consider what could be an ideal configuration experience for users

Non-goals:
- Merge every processor into one. Many use cases overlap and generalize, but not all of them
- Technical design or implementation of the configuration experience. This document currently focuses on user experience.

## Use cases for processing

### Telemetry mutation

Processors can be used to mutate telemetry in the collector pipeline. OpenTelemetry SDKs collect detailed telemetry
from applications, and it is common to have to transform it into a form appropriate for an individual use case.

Some types of mutation include:

- Remove a forbidden attribute such as `http.request.header.authorization`
- Reduce the cardinality of an attribute, for example translating an `http.target` value of `/user/123451/profile` to `/user/{userId}/profile`
- Decrease the size of the telemetry payload by removing large resource attributes such as `process.command_line`
- Filter out signals, for example by removing all telemetry with an `http.target` of `/health`
- Attach information from the resource to telemetry, for example adding certain resource fields as metric dimensions

The processors implementing this use case are `attributesprocessor`, `filterprocessor`, `metricstransformprocessor`,
`resourceprocessor`, and `spanprocessor`.
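
For comparison, these mutations are configured today per processor. A sketch of a current-style `attributesprocessor`
configuration (exact field names may vary across collector versions):

```yaml
processors:
  attributes:
    actions:
      # drop the forbidden header outright
      - key: http.request.header.authorization
        action: delete
```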

### Metric generation

The collector may generate new metrics based on incoming telemetry. This can cover gaps in SDK metric coverage relative
to spans, or create new metrics from existing ones to better model the data for backend-specific expectations.

- Create new metrics based on information in spans, for example a duration metric that is not yet implemented in the SDK
- Apply arithmetic to multiple incoming metrics to produce an output metric, for example divide an `amount` by a `capacity` to create a `utilization` metric

The processors implementing this use case are `metricsgenerationprocessor` and `spanmetricsprocessor`.
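
As a point of comparison, a hedged sketch of how `spanmetricsprocessor` might be configured today; the
`metrics_exporter` and `dimensions` field names are from memory and may differ by version:

```yaml
processors:
  spanmetrics:
    # exporter that receives the generated metrics
    metrics_exporter: otlp
    # span attributes promoted to metric dimensions
    dimensions:
      - name: http.method
```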

### Grouping

Some processors are stateful, grouping telemetry over a window of time based on either a trace ID or an attribute value,
or simply batching in general.

- Batch incoming telemetry before sending it to exporters, to reduce the number of export requests
- Group spans by trace ID to enable tail sampling
- Group telemetry for the same path

The processors implementing this use case are `batchprocessor`, `groupbyattrprocessor`, and `groupbytraceprocessor`.
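
For reference, the grouping processors are configured independently today; a sketch (field names such as
`wait_duration` may differ by version):

```yaml
processors:
  batch:
    # flush when either limit is reached
    timeout: 5s
    send_batch_size: 8192
  groupbytrace:
    # how long to buffer spans while waiting for the rest of the trace
    wait_duration: 2s
```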

### Telemetry enrichment

OpenTelemetry SDKs focus on collecting application-specific data. They may also include resource detectors to populate
environment-specific data, but the collector is commonly used to fill gaps in coverage of environment-specific data.

- Add information about the cloud provider environment to the `Resource` of all incoming telemetry

The processors implementing this use case are `k8sattributesprocessor` and `resourcedetectionprocessor`.
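
A sketch of current-style enrichment configuration (detector names and defaults vary by environment and collector
version):

```yaml
processors:
  resourcedetection:
    # detectors run in order; the first detected value wins
    detectors: [env, ec2]
    timeout: 2s
  # defaults are usually sufficient for the Kubernetes attributes processor
  k8sattributes:
```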

## Telemetry query language

Looking at the use cases, telemetry mutation and metric generation share certain common features:

- Identify the type of signal (span, metric, log, resource), or apply to all signals
- Navigate to a path within the telemetry to operate on it
- Define an operation, and possibly operation arguments

We can model these as a query language. In particular, the first two points can be shared among all processing
operations, so that implementations of individual types of processing only need to provide the operators a user can
invoke within an expression.

Telemetry is modeled in the collector as [`pdata`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/model/pdata),
which is roughly a 1:1 mapping of the [OTLP protocol](https://github.com/open-telemetry/opentelemetry-proto/tree/main/opentelemetry/proto).
This data can be navigated using field expressions, which are fields within the protocol separated by dots. For example,
the status message of a span is `status.message`. A map lookup can include the key as a string, for example `attributes["http.status_code"]`.
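
To make the navigation concrete, here is a minimal illustrative sketch, not part of any proposed implementation, that
parses a field expression into path segments and resolves it against telemetry modeled as nested dicts; the function
names are hypothetical:

```python
import re

# A segment is either a bare field name or a quoted map key in brackets.
_SEGMENT = re.compile(r'([A-Za-z_]\w*)|\["([^"]+)"\]')

def parse_path(expr):
    """Split a field expression such as 'status.message' or
    'resource.attributes["deployment"]' into its path segments."""
    return [name or key for name, key in _SEGMENT.findall(expr)]

def resolve(telemetry, expr):
    """Walk a nested-dict model of telemetry along a field expression."""
    value = telemetry
    for segment in parse_path(expr):
        value = value[segment]
    return value
```

For example, `resolve(span, 'attributes["http.status_code"]')` looks up the status code on a dict-modeled span; a real
implementation would walk `pdata` structures instead.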

Virtual fields can be defined for the `type` of a signal (`span`, `metric`, `log`, `resource`) and for the resource of
a telemetry signal. For metrics, the structure presented for processing is the actual data points, e.g.
`NumberDataPoint` or `HistogramDataPoint`, with information from higher levels like `Metric` or the data type available
as virtual fields.

Navigation can then be combined with a simple expression language for identifying the telemetry to operate on.

- `... where name = "GET /cats"`
- `... where type = span and attributes["http.target"] = "/health"`
- `... where resource.attributes["deployment"] = "canary"`
- `... where type = metric and descriptor.metric_type = gauge`
- `... where type = metric and descriptor.metric_name = "http.active_requests"`
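
As an illustration only, equality conditions joined by `and` could be evaluated against telemetry modeled as nested
dicts; this sketch and its names are hypothetical:

```python
def matches(telemetry, conditions):
    """Evaluate an 'and'-joined list of equality conditions, each given as a
    (path segments, expected value) pair, against nested-dict telemetry."""
    for path, expected in conditions:
        value = telemetry
        for segment in path:
            if not isinstance(value, dict) or segment not in value:
                return False
            value = value[segment]
        if value != expected:
            return False
    return True

# where type = span and attributes["http.target"] = "/health"
health_check = [(["type"], "span"), (["attributes", "http.target"], "/health")]
```

With this model, `matches({"type": "span", "attributes": {"http.target": "/health"}}, health_check)` returns `True`.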

Having selected the telemetry to operate on, any needed operations can be defined as functions. Known useful functions
should be implemented within the collector itself; registration from extension modules should allow customization with
contrib components; and in the future user plugins could even be allowed, possibly through WASM, similar to work in
[HTTP proxies](https://github.com/proxy-wasm/spec). The arguments to operations will primarily be field expressions,
allowing the operation to mutate telemetry as needed.

### Examples

Remove a forbidden attribute such as `http.request.header.authorization` from all telemetry.

`delete(attributes["http.request.header.authorization"])`

Remove a forbidden attribute from spans only.

`delete(attributes["http.request.header.authorization"]) where type = span`

Remove all attributes except for some.

`keep(attributes, "http.method", "http.status_code") where type = metric`

Reduce the cardinality of an attribute.

`replace_wildcards("/user/*/list/*", "/user/{userId}/list/{listId}", attributes["http.target"])`

Reduce the cardinality of a span name.

`replace_wildcards("GET /user/*/list/*", "GET /user/{userId}/list/{listId}", name) where type = span`

Decrease the size of the telemetry payload by removing large resource attributes.

`delete(attributes["process.command_line"]) where type = resource`

Filter out signals, for example by removing all telemetry with an `http.target` of `/health`.

`drop() where attributes["http.target"] = "/health"`

Attach information from the resource to telemetry, for example adding certain resource fields as metric attributes.

`set(attributes["k8s_pod"], resource.attributes["k8s.pod.name"]) where type = metric`
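
The mutation operations above can be sketched against the same nested-dict model; `delete`, `keep`, and `set` here are
hypothetical stand-ins for functions that would operate on `pdata` in a real implementation (`set_attr` avoids
shadowing Python's built-in `set`):

```python
def delete(attributes, key):
    """delete(attributes[key]): drop a single attribute if present."""
    attributes.pop(key, None)

def keep(attributes, *keys):
    """keep(attributes, ...): drop every attribute not listed."""
    for key in list(attributes):
        if key not in keys:
            del attributes[key]

def set_attr(attributes, key, value):
    """set(attributes[key], value): insert or overwrite an attribute."""
    attributes[key] = value
```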

Stateful processing can also be modeled in the language. The processor implementation would set up the state while
parsing the configuration.

Create a duration metric with an attribute copied from a span.

```
create_histogram("duration", end_time_nanos - start_time_nanos) where type = span
keep(attributes, "http.method") where type = metric and descriptor.metric_name = "duration"
```

Group spans by trace ID.

`group_by(trace_id, 2m) where type = span`

Create a utilization metric from base metrics. Because navigation expressions only operate on a single piece of
telemetry, helper functions for reading values from other metrics need to be provided.

`create_gauge("pod.cpu.utilized", read_gauge("pod.cpu.usage") / read_gauge("node.cpu.limit")) where type = metric`

Many processing operations can be combined in one configuration; queries are executed in order. While performance may
initially degrade compared to more specialized processors, the expectation is that over time the query processor's
engine would improve to apply optimizations across queries, compile into machine code, etc.

```yaml
receivers:
  otlp:

exporters:
  otlp:

processors:
  query:
    # Assuming group_by is defined in a contrib extension module, not baked into the "query" processor
    extensions: [group_by]
    expressions:
      - drop() where attributes["http.target"] = "/health"
      - delete(attributes["http.request.header.authorization"])
      - replace_wildcards("/user/*/list/*", "/user/{userId}/list/{listId}", attributes["http.target"])
      - set(attributes["k8s_pod"], resource.attributes["k8s.pod.name"]) where type = metric
      - group_by(trace_id, 2m) where type = span

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [query]
      exporters: [otlp]
```

## Declarative configuration

The telemetry query language presents an SQL-like experience for defining telemetry transformations. It is made up of
the three primary components described above, however, and could instead be presented declaratively, depending on what
makes sense as a user experience.

```yaml
- type: span
  filter:
    match:
      path: status.code
      value: OK
  operation:
    name: drop
- type: all
  operation:
    name: delete
    args:
      - attributes["http.request.header.authorization"]
```

An implementation of the query language would likely parse expressions into this sort of structure, so given an
SQL-like implementation, supporting a YAML approach in addition would likely add little overhead.
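
To illustrate the correspondence, here is a naive, hypothetical sketch, deliberately ignoring nested parentheses and
quoting inside arguments, that splits a query of the form `op(args) where condition` into declarative components:

```python
import re

# name, parenthesized argument list, and an optional trailing where clause
_QUERY = re.compile(r'^(?P<name>\w+)\((?P<args>.*)\)(?:\s+where\s+(?P<cond>.+))?$')

def to_declarative(query):
    """Split 'op(args) where condition' into a declarative structure."""
    m = _QUERY.match(query.strip())
    if m is None:
        raise ValueError(f"unparseable query: {query!r}")
    decl = {
        "operation": {
            "name": m.group("name"),
            # naive comma split; real parsing must respect nesting and quotes
            "args": [a.strip() for a in m.group("args").split(",") if a.strip()],
        }
    }
    if m.group("cond"):
        decl["where"] = m.group("cond")
    return decl
```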