oteps/4333-recording-exceptions-on-logs.md
Lines changed: 40 additions & 16 deletions
@@ -6,15 +6,18 @@ This OTEP provides guidance on how to record exceptions using OpenTelemetry logs

 Today OTel supports recording exceptions using span events available through Trace API. Outside of OTel world, exceptions are usually recorded by user apps and libraries using logging libraries and may be recorded as OTel logs via logging bridge.

-Log-based exception events have the following advantages over span events:
+Exceptions recorded on logs have the following advantages over span events:
 - they can be recorded for operations that don't have any tracing instrumentation
 - they can be sampled along with or separately from spans
 - they can have different severity levels to reflect how critical the exception is
 - they are already reported natively by many frameworks and libraries

-Exception events are essential for troubleshooting. Regardless of how they are recorded, they could be noisy:
-- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading - individual occurrence of transient errors are not necessarily indicative of a problem.
-- exception events can be huge due to stack traces. They can frequently reach several KBs resulting in high costs associated with ingesting and storing exception events. It's also common to log exceptions multiple times while they bubble up leading to duplication and aggravating the verbosity problem.
+Recording exception on logs is essential for troubleshooting. But regardless of how they are recorded, they could be noisy:
+- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading -
+individual occurrence of transient errors are not necessarily indicative of a problem.
+- exception stack traces can be huge. Corresponding attribute value can frequently reach several KBs resulting in high costs
+associated with ingesting and storing such logs. It's also common to log exceptions multiple times while they bubble up
+leading to duplication and aggravating the verbosity problem.

 In this OTEP, we'll provide guidance around recording exceptions that minimizes duplication, allows to reduce noise with configuration and
 allows to capture exceptions in absence of a recorded span.
@@ -29,7 +32,7 @@ This guidance applies to general-purpose instrumentations including native ones.
 This rule ensures that exception logs can be recorded independently from traces and covers cases when no span exists,
 or when the corresponding span is not recorded.

-2. Exception should be logged with appropriate severity depending on the available context.
+2. An exception should be logged with appropriate severity depending on the available context.

 - Exceptions that don't indicate any issue should be recorded with severity not higher than `Info`.
 - Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.
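
The two severity rules visible in this hunk can be illustrated with a minimal sketch. It uses only Python's standard `logging` module rather than any OpenTelemetry API. The helper names `pick_level` and `log_exception` and their parameters are hypothetical, and the fallback to `ERROR` for all other exceptions is an assumption, not something stated in the lines shown here.

```python
import logging


def pick_level(*, indicates_issue: bool, transient: bool) -> int:
    """Return a stdlib log level for an exception (hypothetical helper)."""
    if not indicates_issue:
        # Exceptions that don't indicate any issue: not higher than Info.
        return logging.INFO
    if transient:
        # Transient errors, even on the last try: not higher than Warning.
        return logging.WARNING
    # Assumption: everything else is treated as an error.
    return logging.ERROR


def log_exception(logger, exc, *, indicates_issue=True, transient=False):
    """Log `exc` once, with a level chosen from the available context."""
    level = pick_level(indicates_issue=indicates_issue, transient=transient)
    # exc_info attaches the exception type, message and traceback to the record.
    logger.log(level, "operation failed: %s", exc, exc_info=exc)
    return level
```

Since the guidance says "not higher than", a caller with more context could still choose a lower level than the one this sketch returns.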
@@ -169,29 +172,50 @@ TODO
 1. Breaking change for any component following existing [exception guidance](https://github.com/open-telemetry/opentelemetry-specification/blob/a265ae0628177be25dc477ea8fe200f1c825b871/specification/trace/exceptions.md) which recommends recording exceptions as span events in every instrumentation that detects them.

 **Mitigation:**
-- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log events conversion, but that's not enough - instrumentations will have to change their behavior to report exception events with appropriate severity (or stop reporting exceptions).
+- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion,
+but that's not enough - instrumentations will have to change their behavior to report exception logs
+with appropriate severity (or stop reporting them).
 - TODO: document opt-in mechanism similar to `OTEL_SEMCONV_STABILITY_OPT_IN`

-2. Recording exceptions as log-based events would result in UX degradation for users leveraging trace-only backends such as Jaeger.
+1. Recording exceptions as log-based events would result in UX degradation for users
+leveraging trace-only backends such as Jaeger.

 **Mitigation:**
-- OpenTelemetry API and/or SDK may provide span events -> log events conversion. See also [Event vision OTEP](https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events).
+- OpenTelemetry API and/or SDK may provide span events -> log events conversion.
+See also [Event vision OTEP](./0265-event-vision.md#relationship-to-span-events).

 ## Prior art and alternatives

 Alternatives:

-1. Record exceptions only when exception is handled (or remains unhandled). This relies on the user applications to log them correctly and consistently, it also makes it impossible to add context available deep in the stack where exception happened.
-2. Record exception events whenever exception is detected (even if exception is handled or rethrown), use additional attributes and/or severity so that it can be filtered out by the processing pipeline. This OTEP does not prevent evolution in this direction.
-3. (Variation of 2) Exception stack traces are the most problematic in terms of volume. We can record exception type and message whenever caller feels like recording exception information and only record stacktrace when the exception is thrown. This OTEP does not prevent evolution in this direction.
-4. OTel may deduplicate exception events and mark exception instances as logged (augment exception instance or keep a small cache of recently logged exceptions). This can potentially mitigate the problem for existing application when it logs exceptions extensively. We should still provide guidance for the greenfield applications and libraries to optimize logging.
+1. Record exceptions only when exception is handled (or remains unhandled). This relies
+on the user applications to log them correctly and consistently, it also makes
+it impossible to add context available deep in the stack where exception happened.
+2. Record exception events whenever exception is detected (even if exception is handled or rethrown),
+use additional attributes and/or severity so that it can be filtered out by the processing pipeline.
+This OTEP does not prevent evolution in this direction.
+3. (Variation of 2) Exception stack traces are the most problematic in terms of volume.
+We can record exception type and message whenever caller feels like recording exception information
+and only record stacktrace when the exception is thrown.
+This OTEP does not prevent evolution in this direction.
+4. OTel may deduplicate exception events by marking exception instances as logged
+(augment exception instance or keep a small cache of recently logged exceptions).
+This can potentially mitigate the problem for existing application when it logs exceptions extensively.
+We should still provide optimal guidance for the greenfield applications and libraries.

 ## Open questions

-1. This OTEP assumes that client libraries (in general) are already instrumented with logs natively. It's valid for some environments (e.g. .NET, Java, or Python)
-which have standard (or widely used) structured logging libraries. In languages and ecosystem without common logging libraries, we cannot rely on exceptions
-to be logged where they are thrown.
-As a result instrumentation libraries may need to log exceptions every time they see them, resulting in possible duplication.
+1. This OTEP assumes that the majority of client libraries are already instrumented
+with logs natively. It should be the case for some environments (e.g. .NET, Java,
+Python, Golang, or Rust) which have standard or widely used structured logging
+libraries. In languages and ecosystem without common logging libraries,
+we cannot rely on exceptions to be logged where they are thrown.
+
+As a result instrumentation libraries may need to log exceptions every time
+they see them, resulting in possible duplication.
+
+Are we aware of environments where we don't have widely available logging libs
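
Alternative 4 under "Prior art and alternatives" (deduplicating by marking exception instances as logged) could look roughly like the sketch below. It is an illustration built on Python's standard `logging` module, not an OpenTelemetry SDK feature. The class name `LogExceptionOnce` and the marker attribute `_exception_logged` are hypothetical, and keeping the repeated record while dropping its exception details is an assumed policy choice.

```python
import logging

_MARK = "_exception_logged"  # hypothetical marker attribute set on exception instances


class LogExceptionOnce(logging.Filter):
    """Keep exception details only the first time a given instance is logged."""

    def filter(self, record: logging.LogRecord) -> bool:
        exc = record.exc_info[1] if record.exc_info else None
        if exc is None:
            return True  # nothing to deduplicate
        if getattr(exc, _MARK, False):
            # Seen before: keep the message, drop the (potentially large) stack trace.
            record.exc_info = None
            return True
        try:
            setattr(exc, _MARK, True)  # "augment exception instance"
        except AttributeError:
            pass  # some exception types do not accept new attributes
        return True
```

Attached with `logger.addFilter(LogExceptionOnce())`, the filter lets an exception that is logged again while it bubbles up carry its stack trace only on the first record. A small cache of recently logged exceptions, the other option the alternative mentions, would avoid mutating the exception object at the cost of extra bookkeeping.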