Commit e092095

Author: Liudmila Molkova
Commit message: cleanups
1 parent 20ef4b0 commit e092095

1 file changed, +40 -16 lines changed

oteps/4333-recording-exceptions-on-logs.md

Lines changed: 40 additions & 16 deletions
@@ -6,15 +6,18 @@ This OTEP provides guidance on how to record exceptions using OpenTelemetry logs

 Today OTel supports recording exceptions using span events available through Trace API. Outside of OTel world, exceptions are usually recorded by user apps and libraries using logging libraries and may be recorded as OTel logs via logging bridge.

-Log-based exception events have the following advantages over span events:
+Exceptions recorded on logs have the following advantages over span events:
 - they can be recorded for operations that don't have any tracing instrumentation
 - they can be sampled along with or separately from spans
 - they can have different severity levels to reflect how critical the exception is
 - they are already reported natively by many frameworks and libraries

-Exception events are essential for troubleshooting. Regardless of how they are recorded, they could be noisy:
-- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading - individual occurrence of transient errors are not necessarily indicative of a problem.
-- exception events can be huge due to stack traces. They can frequently reach several KBs resulting in high costs associated with ingesting and storing exception events. It's also common to log exceptions multiple times while they bubble up leading to duplication and aggravating the verbosity problem.
+Recording exception on logs is essential for troubleshooting. But regardless of how they are recorded, they could be noisy:
+- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading -
+individual occurrence of transient errors are not necessarily indicative of a problem.
+- exception stack traces can be huge. Corresponding attribute value can frequently reach several KBs resulting in high costs
+associated with ingesting and storing such logs. It's also common to log exceptions multiple times while they bubble up
+leading to duplication and aggravating the verbosity problem.

 In this OTEP, we'll provide guidance around recording exceptions that minimizes duplication, allows to reduce noise with configuration and
 allows to capture exceptions in absence of a recorded span.
@@ -29,7 +32,7 @@ This guidance applies to general-purpose instrumentations including native ones.
 This rule ensures that exception logs can be recorded independently from traces and covers cases when no span exists,
 or when the corresponding span is not recorded.

-2. Exception should be logged with appropriate severity depending on the available context.
+2. An exception should be logged with appropriate severity depending on the available context.

 - Exceptions that don't indicate any issue should be recorded with severity not higher than `Info`.
 - Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.
@@ -169,29 +172,50 @@ TODO
 1. Breaking change for any component following existing [exception guidance](https://github.com/open-telemetry/opentelemetry-specification/blob/a265ae0628177be25dc477ea8fe200f1c825b871/specification/trace/exceptions.md) which recommends recording exceptions as span events in every instrumentation that detects them.

 **Mitigation:**
-- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log events conversion, but that's not enough - instrumentations will have to change their behavior to report exception events with appropriate severity (or stop reporting exceptions).
+- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log-based events conversion,
+but that's not enough - instrumentations will have to change their behavior to report exception logs
+with appropriate severity (or stop reporting them).
 - TODO: document opt-in mechanism similar to `OTEL_SEMCONV_STABILITY_OPT_IN`

-2. Recording exceptions as log-based events would result in UX degradation for users leveraging trace-only backends such as Jaeger.
+1. Recording exceptions as log-based events would result in UX degradation for users
+leveraging trace-only backends such as Jaeger.

 **Mitigation:**
-- OpenTelemetry API and/or SDK may provide span events -> log events conversion. See also [Event vision OTEP](https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events).
+- OpenTelemetry API and/or SDK may provide span events -> log events conversion.
+See also [Event vision OTEP](./0265-event-vision.md#relationship-to-span-events).

 ## Prior art and alternatives

 Alternatives:

-1. Record exceptions only when exception is handled (or remains unhandled). This relies on the user applications to log them correctly and consistently, it also makes it impossible to add context available deep in the stack where exception happened.
-2. Record exception events whenever exception is detected (even if exception is handled or rethrown), use additional attributes and/or severity so that it can be filtered out by the processing pipeline. This OTEP does not prevent evolution in this direction.
-3. (Variation of 2) Exception stack traces are the most problematic in terms of volume. We can record exception type and message whenever caller feels like recording exception information and only record stacktrace when the exception is thrown. This OTEP does not prevent evolution in this direction.
-4. OTel may deduplicate exception events and mark exception instances as logged (augment exception instance or keep a small cache of recently logged exceptions). This can potentially mitigate the problem for existing application when it logs exceptions extensively. We should still provide guidance for the greenfield applications and libraries to optimize logging.
+1. Record exceptions only when exception is handled (or remains unhandled). This relies
+on the user applications to log them correctly and consistently, it also makes
+it impossible to add context available deep in the stack where exception happened.
+2. Record exception events whenever exception is detected (even if exception is handled or rethrown),
+use additional attributes and/or severity so that it can be filtered out by the processing pipeline.
+This OTEP does not prevent evolution in this direction.
+3. (Variation of 2) Exception stack traces are the most problematic in terms of volume.
+We can record exception type and message whenever caller feels like recording exception information
+and only record stacktrace when the exception is thrown.
+This OTEP does not prevent evolution in this direction.
+4. OTel may deduplicate exception events by marking exception instances as logged
+(augment exception instance or keep a small cache of recently logged exceptions).
+This can potentially mitigate the problem for existing application when it logs exceptions extensively.
+We should still provide optimal guidance for the greenfield applications and libraries.

 ## Open questions

-1. This OTEP assumes that client libraries (in general) are already instrumented with logs natively. It's valid for some environments (e.g. .NET, Java, or Python)
-which have standard (or widely used) structured logging libraries. In languages and ecosystem without common logging libraries, we cannot rely on exceptions
-to be logged where they are thrown.
-As a result instrumentation libraries may need to log exceptions every time they see them, resulting in possible duplication.
+1. This OTEP assumes that the majority of client libraries are already instrumented
+with logs natively. It should be the case for some environments (e.g. .NET, Java,
+Python, Golang, or Rust) which have standard or widely used structured logging
+libraries. In languages and ecosystem without common logging libraries,
+we cannot rely on exceptions to be logged where they are thrown.
+
+As a result instrumentation libraries may need to log exceptions every time
+they see them, resulting in possible duplication.
+
+Are we aware of environments where we don't have widely available logging libs
+making this OTEP less relevant for them?

 ## Future possibilities

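Note on alternative 4 from the "Prior art and alternatives" hunk: the "augment exception instance" variant of deduplication could look roughly like the sketch below. It is an illustration only, not something the OTEP or any OpenTelemetry SDK defines. The helper name `log_exception_once`, the marker attribute `_sketch_already_logged`, and the default `WARNING` severity are all assumptions made for the example; it uses only Python's standard `logging` module.

```python
import logging

# Hypothetical attribute name used to mark an exception instance as already logged.
_LOGGED_MARKER = "_sketch_already_logged"


def log_exception_once(logger, exc, message, level=logging.WARNING):
    """Log `exc` with its stack trace unless this same instance was logged before.

    Returns True if a log record was emitted, False if it was skipped.
    """
    if getattr(exc, _LOGGED_MARKER, False):
        # The instance was already recorded lower in the stack; skip the duplicate.
        return False
    # Pass an explicit exc_info tuple so the stack trace travels with the record.
    logger.log(level, message, exc_info=(type(exc), exc, exc.__traceback__))
    # Python exception instances accept extra attributes, so the marker
    # stays with the instance while it bubbles up through callers.
    setattr(exc, _LOGGED_MARKER, True)
    return True
```

A caller higher in the stack that catches the same instance and calls `log_exception_once` again gets `False` and emits nothing, so the stack trace is stored once. The "small cache of recently logged exceptions" variant mentioned in the same alternative would replace the marker attribute with a bounded lookup structure.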