
Commit 20ef4b0

author
Liudmila Molkova
committed
feedback and cleanup
1 parent 6abb88d commit 20ef4b0

1 file changed

+52
-42
lines changed

oteps/4333-recording-exceptions-on-logs.md

Lines changed: 52 additions & 42 deletions
@@ -1,10 +1,10 @@
1-
# Recording exceptions and errors with OpenTelemetry
1+
# Recording exceptions and errors on logs
22

33
This OTEP provides guidance on how to record exceptions using OpenTelemetry logs, focusing on minimizing duplication and providing context to reduce noise.
44

55
## Motivation
66

7-
OTel recommends recording exceptions using span events available through Trace API. Outside of OTel world, exceptions are usually recorded by user apps and libraries using logging libraries.
7+
Today OTel supports recording exceptions using span events available through the Trace API. Outside of the OTel world, exceptions are usually recorded by user apps and libraries using logging libraries and may be recorded as OTel logs via a logging bridge.
88

99
Log-based exception events have the following advantages over span events:
1010
- they can be recorded for operations that don't have any tracing instrumentation
@@ -13,56 +13,52 @@ Log-based exception events have the following advantages over span events:
1313
- they are already reported natively by many frameworks and libraries
1414

1515
Exception events are essential for troubleshooting. Regardless of how they are recorded, they could be noisy:
16-
- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading - individual occurrence of transitive errors are not indicative of any problems.
17-
- exception events can be huge due to stack traces. They can frequently reach several KBs resulting in high costs associated with ingesting and storing exception events. It's also common to log exceptions multiple times while they bubble up leading to duplication and aggravating verbosity problem.
16+
- distributed applications experience transient errors at a rate proportional to their scale, and exceptions in logs could be misleading: individual occurrences of transient errors are not necessarily indicative of a problem.
17+
- exception events can be huge due to stack traces. They can frequently reach several KBs resulting in high costs associated with ingesting and storing exception events. It's also common to log exceptions multiple times while they bubble up leading to duplication and aggravating the verbosity problem.
1818

19-
In this OTEP, we'll provide the guidance around recording exceptions that minimizes duplication, allows to reduce noise with configuration and capture exceptions in absence of a recorded span.
19+
In this OTEP, we'll provide guidance around recording exceptions that minimizes duplication, allows reducing noise with configuration, and
20+
allows capturing exceptions in the absence of a recorded span.
2021

2122
This guidance applies to general-purpose instrumentations including native ones. Application developers should consider following it as a starting point, but they are expected to adjust it to their needs.
2223

2324
## Guidance
2425

25-
1. Exceptions should be recorded as [log-based `exception` events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
26+
1. Exceptions should be recorded as [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
27+
or [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md)
2628

27-
This rule ensures that exception events can be recorded independently from traces and cover cases when no span exists,
29+
This rule ensures that exception logs can be recorded independently from traces and covers cases when no span exists,
2830
or when the corresponding span is not recorded.
2931

30-
2. Exception event should be logged with appropriate severity depending on the available context.
32+
2. Exceptions should be logged with appropriate severity depending on the available context.
3133

3234
- Exceptions that don't indicate any issue should be recorded with severity not higher than `Info`.
3335
- Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.
3436

35-
This rule enables typical log level mechanisms to control exception event volume.
37+
This rule enables typical logging mechanisms to control log volume.
3638

37-
3. Exception event should be recorded when the exception instance is created and thrown for the first time.
39+
3. An exception log should be recorded when the exception instance is created and thrown for the first time.
3840
This includes new exception instances that wrap other exception(s).
3941

40-
This rule ensures that exception event is recorded at least once for each exception thrown.
42+
This rule ensures that an exception log is recorded at least once for each exception thrown.
4143

42-
4. Exception events should not be recorded when exception is handled or rethrown as is, except the following cases:
43-
- exceptions handled in global exception handlers (see p5 below)
44-
- exceptions from the code that doesn't record exception events in the compatible with OTel way.
44+
4. An exception log should not be recorded when an exception is handled or rethrown as is, except in the following cases:
45+
- exceptions handled in global exception handlers (see #5 below)
46+
- exceptions from code that doesn't record exception logs in a way that is compatible with OTel.
4547

46-
This rule ensures that exception event is recorded at most once for each *handled* exception.
48+
This rule ensures that an exception log is recorded at most once for each *handled* exception.
4749

48-
5. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually create local root spans, should record exception events for unhandled exceptions with `Error` severity and [`exception.escaped`](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/attributes-registry/exception.md#exception-escaped) flag set to `true`.
50+
5. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually create local root spans, should record logs
51+
for unhandled exceptions with `Error` severity and [`exception.escaped`](https://github.com/open-telemetry/semantic-conventions/blob/v1.29.0/docs/attributes-registry/exception.md) flag set to `true`.
4952

50-
Some runtimes and frameworks provide global exception handler which could be used to record exception events. The priority should be given to the instrumentation point where the operation context is available.
53+
Some runtimes and frameworks provide a global exception handler that can be used to record exception logs. Priority should be given to the instrumentation point where the operation context is available.
5154

52-
<!-- TODO: do we need an `unhandled` attribute instead of `exception.escaped`? -->
55+
<!-- TODO: do we need an `exception.unhandled` attribute instead of `exception.escaped`? -->
5356

5457
This allows recording unhandled exceptions with proper severity and distinguishing them from handled ones.
5558

56-
6. User applications and instrumentations are encouraged to put additional attributes on exception events to describe the context exception was thrown in. They are also encouraged to define their own error events and enrich them with `exception.*` attributes.
57-
58-
### Log configuration scenarios
59-
60-
OpenTelemetry SDK should provide configuration options allowing (but not limited to):
61-
62-
- Record unhandled exceptions only
63-
- Record exceptions based on the log severity
64-
- Record exception events, but omit the stack trace based on (at least) the log level. It should be possible to optimize instrumentation and avoid collecting the attribute. See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that records stack trace conditionally.
65-
- Record all available exceptions with all the details
59+
6. When recording exceptions on logs, user applications and instrumentations are encouraged to put additional attributes
60+
to describe the context that the exception was thrown in.
61+
They are also encouraged to define their own error events and enrich them with `exception.*` attributes.
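
The severity rules in items 2 and 5 above can be sketched roughly as follows. This is a minimal illustration, not part of the OTEP or of the OpenTelemetry API: the class `ExceptionSeveritySketch`, the `Kind` classification, and the `Entry` record stand-in are all hypothetical names. The sketch picks the upper bound allowed by each rule (`Info` for exceptions that don't indicate an issue, `Warning` for transient errors) and uses `Error` plus an escaped flag for unhandled exceptions.

```java
// Hypothetical sketch: none of these types belong to the OpenTelemetry API.
class ExceptionSeveritySketch {
    enum Severity { INFO, WARN, ERROR }

    /** Illustrative classification made by the code that records the exception. */
    enum Kind { EXPECTED, TRANSIENT, UNHANDLED }

    /** In-memory stand-in for an emitted log record. */
    static final class Entry {
        final Severity severity;
        final String exceptionType;
        final boolean escaped; // mirrors the `exception.escaped` attribute

        Entry(Severity severity, String exceptionType, boolean escaped) {
            this.severity = severity;
            this.exceptionType = exceptionType;
            this.escaped = escaped;
        }
    }

    /** Returns the highest severity the guidance allows for the given kind. */
    static Severity severityFor(Kind kind) {
        switch (kind) {
            case EXPECTED:
                return Severity.INFO;  // no issue indicated: not higher than Info
            case TRANSIENT:
                return Severity.WARN;  // transient error: not higher than Warning
            default:
                return Severity.ERROR; // unhandled exception: Error
        }
    }

    /** Builds the record; only unhandled exceptions are marked as escaped. */
    static Entry record(Throwable ex, Kind kind) {
        return new Entry(severityFor(kind), ex.getClass().getName(), kind == Kind.UNHANDLED);
    }

    public static void main(String[] args) {
        Entry entry = record(new IllegalStateException("boom"), Kind.UNHANDLED);
        System.out.println(entry.severity + " escaped=" + entry.escaped);
    }
}
```

Keeping the severity decision in one place lets ordinary log-level configuration filter exception logs without touching the recording sites.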
6662

6763
### Examples
6864

@@ -104,7 +100,7 @@ try {
104100
#### Recording exceptions inside the library (native instrumentation)
105101

106102
It's a common practice to record exceptions using logging libraries. Client libraries that are natively instrumented with OpenTelemetry should
107-
leverage OTel Events/Logs API for their logging purposes.
103+
leverage OTel Events/Logs API for their exception logging purposes.
108104

109105
```java
110106
public class MyClient {
@@ -124,7 +120,7 @@ public class MyClient {
124120

125121
MyClientException ex = new MyClientException(response.statusCode(), readErrorInfo(response));
126122

127-
logger.eventBuilder("exception")
123+
logger.logRecordBuilder()
128124
.setSeverity(Severity.INFO)
129125
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
130126
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
@@ -147,10 +143,10 @@ public class Connection {
147143
try {
148144
return socketChannel.write(content);
149145
} catch (Throwable ex) {
150-
// we're rethrowing an exception here since the underlying
151-
// platform code may or may not record exception logs depending on JRE,
146+
// we're re-throwing the exception here, but still recording it on logs
147+
// since the underlying platform code may or may not record exception logs depending on JRE,
152148
// configuration, and other implementation details
153-
logger.eventBuilder("exception")
149+
logger.logRecordBuilder()
154150
.setSeverity(Severity.INFO)
155151
.addAttribute("connection.id", this.getId())
156152
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
@@ -170,7 +166,7 @@ TODO
170166

171167
## Trade-offs and mitigations
172168

173-
1. Breaking change to existing [exception guidance](https://github.com/open-telemetry/opentelemetry-specification/blob/a265ae0628177be25dc477ea8fe200f1c825b871/specification/trace/exceptions.md) which recommends recording exceptions as span events in every instrumentation that detects them.
169+
1. Breaking change for any component following existing [exception guidance](https://github.com/open-telemetry/opentelemetry-specification/blob/a265ae0628177be25dc477ea8fe200f1c825b871/specification/trace/exceptions.md) which recommends recording exceptions as span events in every instrumentation that detects them.
174170

175171
**Mitigation:**
176172
- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log events conversion, but that's not enough - instrumentations will have to change their behavior to report exception events with appropriate severity (or stop reporting exceptions).
@@ -179,26 +175,40 @@ TODO
179175
2. Recording exceptions as log-based events would result in UX degradation for users leveraging trace-only backends such as Jaeger.
180176

181177
**Mitigation:**
182-
- OpenTelemetry API and/or SDK in the future may provide span events -> log events conversion. See also [Event vision OTEP](https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events).
178+
- OpenTelemetry API and/or SDK may provide span events -> log events conversion. See also [Event vision OTEP](https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events).
183179

184180
## Prior art and alternatives
185181

186182
Alternatives:
187183

188-
1. Record exceptions only when exception is handled (or remains unhandled). This relies on the user applications to log them correctly and consistently, it also makes it impossible to add context available deep in the stack when exception happens.
184+
1. Record exceptions only when the exception is handled (or remains unhandled). This relies on user applications to log them correctly and consistently. It also makes it impossible to add context available deep in the stack where the exception happened.
189185
2. Record exception events whenever exception is detected (even if exception is handled or rethrown), use additional attributes and/or severity so that it can be filtered out by the processing pipeline. This OTEP does not prevent evolution in this direction.
190-
3. (Variation of 2) Exception stack traces are the most problematic in terms of volume. We can record exception type and message whenever caller feels like recording exception event. This OTEP does not prevent evolution in this direction.
191-
4. OTel may deduplicate exception events and mark exception instances as logged (augment exception instance or keep a small cache of recently logged exceptions). This can potentially mitigate the problem for existing application when it and its dependencies log exceptions extensively. It does not though help with guidance for the greenfield applications and libraries.
186+
3. (Variation of 2) Exception stack traces are the most problematic in terms of volume. We can record the exception type and message whenever the caller chooses to record exception information, and only record the stack trace when the exception is thrown. This OTEP does not prevent evolution in this direction.
187+
4. OTel may deduplicate exception events and mark exception instances as logged (augment the exception instance or keep a small cache of recently logged exceptions). This can potentially mitigate the problem for an existing application when it logs exceptions extensively. We should still provide guidance for greenfield applications and libraries to optimize logging.
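
The "small cache of recently logged exceptions" idea in alternative 4 could look roughly like the sketch below. It is only an illustration under stated assumptions: `LoggedExceptionTrackerSketch` and `shouldLog` are hypothetical names, not OpenTelemetry APIs, and the weak set relies on exception classes not overriding `equals`/`hashCode` (the default for `Throwable`), so that entries are effectively tracked per instance.

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

// Hypothetical sketch: not an OpenTelemetry API.
class LoggedExceptionTrackerSketch {
    // Weak keys let the entry disappear once the exception instance is garbage collected.
    private final Set<Throwable> logged =
        Collections.synchronizedSet(
            Collections.newSetFromMap(new WeakHashMap<Throwable, Boolean>()));

    /** Returns true only the first time a given exception instance is seen. */
    boolean shouldLog(Throwable ex) {
        return logged.add(ex);
    }

    public static void main(String[] args) {
        LoggedExceptionTrackerSketch tracker = new LoggedExceptionTrackerSketch();
        RuntimeException ex = new RuntimeException("boom");
        System.out.println(tracker.shouldLog(ex)); // first sighting
        System.out.println(tracker.shouldLog(ex)); // same instance rethrown as is
    }
}
```

A new exception instance that wraps the original is treated as not yet logged, which matches the guidance that wrapping exceptions are recorded when they are created.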
192188

193189
## Open questions
194190

195-
1. This OTEP assumes that client libraries (in general) are already instrumented with logs natively. It's valid for some environments (e.g. .NET, Java, Python, ?) which have standard or widely used structured logging libraries. Are there environments where it's not the case?
196-
- E.g. in JS it's not common to depend on the logging lib.
191+
1. This OTEP assumes that client libraries (in general) are already instrumented with logs natively. It's valid for some environments (e.g. .NET, Java, or Python)
192+
which have standard (or widely used) structured logging libraries. In languages and ecosystems without common logging libraries, we cannot rely on exceptions
193+
to be logged where they are thrown.
194+
As a result, instrumentation libraries may need to log exceptions every time they see them, resulting in possible duplication.
197195

198196
## Future possibilities
199197

200-
1. OpenTelemetry API should be extended to provide convenience methods to
198+
1. OpenTelemetry should provide configuration options and APIs allowing (but not limited to):
199+
200+
- Record unhandled exceptions only
201+
- Record exceptions based on the log severity
202+
- Record exception logs, but omit the stack trace based on (at least) the log level.
203+
See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that records stack trace conditionally.
204+
- Record all available exceptions with all the details
205+
206+
It should be possible to optimize instrumentation and avoid collecting exception information
207+
(such as stack trace) when the corresponding exception log is not going to be recorded.
208+
209+
2. OpenTelemetry API should be extended to provide convenience methods to
201210
- record log-based event from exception instance
202211
- attach exception information to any event or log
203212

204-
2. Log-based events allow to capture exception stack trace as structured event body instead of a string attribute. It can be easier to process and consume exception events with structured stack traces.
213+
3. Exception stacktraces can be recorded in structured form instead of their
214+
string representation. It may be easier to process and consume them in this form.
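
The first future possibility above (recording by severity and omitting the stack trace below a threshold) might be sketched as follows. This is an assumption-laden illustration rather than a proposed API: `StackTracePolicySketch`, its two thresholds, and `attributesFor` are hypothetical. Only the attribute names `exception.type`, `exception.message`, and `exception.stacktrace` come from the existing semantic conventions.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: not an OpenTelemetry API or configuration model.
class StackTracePolicySketch {
    enum Severity { INFO, WARN, ERROR }

    private final Severity minRecorded;       // records below this severity are dropped
    private final Severity minWithStackTrace; // stack trace is collected at or above this

    StackTracePolicySketch(Severity minRecorded, Severity minWithStackTrace) {
        this.minRecorded = minRecorded;
        this.minWithStackTrace = minWithStackTrace;
    }

    /** Returns exception attributes, or empty when the record would not be emitted. */
    Optional<Map<String, String>> attributesFor(Throwable ex, Severity severity) {
        if (severity.compareTo(minRecorded) < 0) {
            return Optional.empty(); // nothing is collected for dropped records
        }
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("exception.type", ex.getClass().getName());
        if (ex.getMessage() != null) {
            attrs.put("exception.message", ex.getMessage());
        }
        if (severity.compareTo(minWithStackTrace) >= 0) {
            // The stack trace is rendered only when it will actually be recorded.
            StringWriter buffer = new StringWriter();
            ex.printStackTrace(new PrintWriter(buffer, true));
            attrs.put("exception.stacktrace", buffer.toString());
        }
        return Optional.of(attrs);
    }

    public static void main(String[] args) {
        StackTracePolicySketch policy = new StackTracePolicySketch(Severity.WARN, Severity.ERROR);
        RuntimeException ex = new RuntimeException("boom");
        System.out.println(policy.attributesFor(ex, Severity.INFO).isPresent());
        System.out.println(policy.attributesFor(ex, Severity.WARN).get().keySet());
    }
}
```

Checking the thresholds before touching the exception is what allows instrumentation to skip the comparatively expensive stack trace rendering when the log is not going to be recorded.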
