Skip to content

Commit 71cef76

Browse files
author
Liudmila Molkova
committed
OTEP: Recording exceptions and errors with OpenTelemetry
1 parent b9b6d8f commit 71cef76

File tree

1 file changed

+204
-0
lines changed

1 file changed

+204
-0
lines changed
Lines changed: 204 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,204 @@
1+
# Recording exceptions and errors with OpenTelemetry
2+
3+
This OTEP provides guidance on how to record exceptions using OpenTelemetry logs focusing on minimizing duplication and providing context to reduce the noise.
4+
5+
## Motivation
6+
7+
OTel recommends recording exceptions using span events available through Trace API. Outside of OTel world, exceptions are usually recorded by user apps and libraries using logging libraries.
8+
9+
Log-based exception events have the following advantages over span events:
10+
- they can be recorded for operations that don't have any tracing instrumentation
11+
- they can be sampled along with or separately from spans
12+
- they can have different severity levels to reflect how critical the exception is
13+
- they are already reported natively by many frameworks and libraries
14+
15+
Exception events are essential for troubleshooting. Regardless of how they are recorded, they could be noisy:
16+
- distributed applications experience transient errors at the rate proportional to their scale and exceptions in logs could be misleading - individual occurrence of transitive errors are not indicative of any problems.
17+
- exception events can be huge due to stack traces. They can frequently reach several KBs resulting in high costs associated with ingesting and storing exception events. It's also common to log exceptions multiple times while they bubble up leading to duplication and aggravating verbosity problem.
18+
19+
In this OTEP, we'll provide the guidance around recording exceptions that minimizes duplication, allows to reduce noise with configuration and capture exceptions in absence of a recorded span.
20+
21+
This guidance applies to general-purpose instrumentations including native ones. Application developers should consider following it as a starting point, but they are expected to adjust it to their needs.
22+
23+
## Guidance
24+
25+
1. Exceptions should be recorded as [log-based `exception` events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
26+
27+
This rule ensures that exception events can be recorded independently from traces and cover cases when no span exists,
28+
or when the corresponding span is not recorded.
29+
30+
2. Exception event should be logged with appropriate severity depending on the available context.
31+
32+
- Exceptions that don't indicate any issue should be recorded with severity not higher than `Info`.
33+
- Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.
34+
35+
This rule enables typical log level mechanisms to control exception event volume.
36+
37+
3. Exception event should be recorded when the exception instance is created and thrown for the first time.
38+
This includes new exception instances that wrap other exception(s).
39+
40+
This rule ensures that exception event is recorded at least once for each exception thrown.
41+
42+
4. Exception events should not be recorded when exception is handled or rethrown as is, except the following cases:
43+
- exceptions handled in global exception handlers (see p5 below)
44+
- exceptions from the code that doesn't record exception events in the compatible with OTel way.
45+
46+
This rule ensures that exception event is recorded at most once for each *handled* exception.
47+
48+
5. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually create local root spans, should record exception events for unhandled exceptions with `Error` severity and [`exception.escaped`](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/attributes-registry/exception.md#exception-escaped) flag set to `true`.
49+
50+
Some runtimes and frameworks provide global exception handler which could be used to record exception events. The priority should be given to the instrumentation point where the operation context is available.
51+
52+
<!-- TODO: do we need an `unhandled` attribute instead of `exception.escaped`? -->
53+
54+
This allows to record unhandled exception with proper severity and distinguish them from handled ones.
55+
56+
6. User applications and instrumentations are encouraged to put additional attributes on exception events to describe the context exception was thrown in. They are also encouraged to define their own error events and enrich them with `exception.*` attributes.
57+
58+
### Log configuration scenarios
59+
60+
OpenTelemetry SDK should provide configuration options allowing (but not limited to):
61+
62+
- Record unhandled exceptions only
63+
- Record exceptions based on the log severity
64+
- Record exception events, but omit the stack trace based on (at least) the log level. It should be possible to optimize instrumentation and avoid collecting the attribute. See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that records stack trace conditionally.
65+
- Record all available exceptions with all the details
66+
67+
### Examples
68+
69+
#### Catching exception from client library call
70+
71+
If underlying client library is already recording exceptions using OTel-compatible logger facade, the caller should not record exception event.
72+
73+
```java
74+
StorageClient client = createClient(endpoint, credential);
75+
...
76+
try {
77+
// let's assume that the underlying client library is already
78+
// recording exceptions using OTel-compatible logger facade.
79+
BinaryData content = client.download(contentId);
80+
81+
return response(content, HttpStatus.OK);
82+
} catch (ContentNotFoundException ex) {
83+
// we don't record exception here, but may record a log record without exception info
84+
logger.logRecordBuilder()
85+
// let's assume it's expected that some content can disappear
86+
.severityNumber(Severity.INFO)
87+
.addAttribute(AttributeKey.stringKey("error.type"), ex.getClass().getCanonicalName())
88+
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
89+
.emit();
90+
91+
return response(HttpStatus.NOT_FOUND);
92+
} catch (NotAuthorizedException ex) {
93+
// let's assume it's really unexpected - service lost access to the underlying storage
94+
logger.logRecordBuilder()
95+
.severityNumber(Severity.ERROR)
96+
.addAttribute(AttributeKey.stringKey("error.type"), ex.getClass().getCanonicalName())
97+
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
98+
.emit();
99+
100+
return response(HttpStatus.INTERNAL_SERVER_ERROR);
101+
}
102+
```
103+
104+
#### Recording exceptions inside the library (native instrumentation)
105+
106+
It's a common practice to record exceptions using logging libraries. Client libraries that are natively instrumented with OpenTelemetry should
107+
leverage OTel Events/Logs API for their logging purposes.
108+
109+
```java
110+
public class MyClient {
111+
112+
private final Logger logger;
113+
...
114+
public BinaryData getContent(String contentId) {
115+
116+
// let's assume as have logging interceptor in our
117+
// http client pipeline and exception are logged there
118+
// or HTTP client already records exceptions using logger
119+
// supported by OTel
120+
HttpResponse response = client.get(contentId);
121+
if (response.statusCode == 200) {
122+
return readContent(response);
123+
}
124+
125+
MyClientException ex = new MyClientException(response.statusCode(), readErrorInfo(response));
126+
127+
logger.eventBuilder("exception")
128+
.setSeverity(Severity.INFO)
129+
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
130+
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
131+
.addAttribute(AttributeKey.stringKey("exception.name"), ex.getMessage())
132+
.addAttribute(AttributeKey.stringKey("exception.stacktrace"), getStackTrace(ex))
133+
.emit();
134+
}
135+
}
136+
```
137+
138+
Some libraries or runtimes don't record exception logs or record them in ways that are not compatible with OpenTelemetry. In this case, it's recommended to
139+
record exception event when the exception is rethrown.
140+
141+
```java
142+
public class Connection {
143+
144+
private final Logger logger;
145+
...
146+
public long send(ByteBuffer content) {
147+
try {
148+
return socketChannel.write(content);
149+
} catch (Throwable ex) {
150+
// we're rethrowing an exception here since the underlying
151+
// platform code may or may not record exception logs depending on JRE,
152+
// configuration, and other implementation details
153+
logger.eventBuilder("exception")
154+
.setSeverity(Severity.INFO)
155+
.addAttribute("connection.id", this.getId())
156+
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
157+
.addAttribute(AttributeKey.stringKey("exception.name"), ex.getMessage())
158+
.addAttribute(AttributeKey.stringKey("exception.stacktrace"), getStackTrace(ex))
159+
.emit();
160+
161+
throw ex;
162+
}
163+
}
164+
}
165+
```
166+
167+
#### HTTP server instrumentation
168+
169+
TODO
170+
171+
## Trade-offs and mitigations
172+
173+
1. Breaking change to existing [exception guidance](https://github.com/open-telemetry/opentelemetry-specification/blob/a265ae0628177be25dc477ea8fe200f1c825b871/specification/trace/exceptions.md) which recommends recording exceptions as span events in every instrumentation that detects them.
174+
175+
**Mitigation:**
176+
- OpenTelemetry API and/or SDK in the future may provide opt-in span events -> log events conversion, but that's not enough - instrumentations will have to change their behavior to report exception events with appropriate severity (or stop reporting exceptions).
177+
- TODO: document opt-in mechanism similar to `OTEL_SEMCONV_STABILITY_OPT_IN`
178+
179+
2. Recording exceptions as log-based events would result in UX degradation for users leveraging trace-only backends such as Jaeger.
180+
181+
**Mitigation:**
182+
- OpenTelemetry API and/or SDK in the future may provide span events -> log events conversion. See also [Event vision OTEP](https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0265-event-vision.md#relationship-to-span-events).
183+
184+
## Prior art and alternatives
185+
186+
Alternatives:
187+
188+
1. Record exceptions only when exception is handled (or remains unhandled). This relies on the user applications to log them correctly and consistently, it also makes it impossible to add context available deep in the stack when exception happens.
189+
2. Record exception events whenever exception is detected (even if exception is handled or rethrown), use additional attributes and/or severity so that it can be filtered out by the processing pipeline. This OTEP does not prevent evolution in this direction.
190+
3. (Variation of 2) Exception stack traces are the most problematic in terms of volume. We can record exception type and message whenever caller feels like recording exception event. This OTEP does not prevent evolution in this direction.
191+
4. OTel may deduplicate exception events and mark exception instances as logged (augment exception instance or keep a small cache of recently logged exceptions). This can potentially mitigate the problem for existing application when it and its dependencies log exceptions extensively. It does not though help with guidance for the greenfield applications and libraries.
192+
193+
## Open questions
194+
195+
1. This OTEP assumes that client libraries (in general) are already instrumented with logs natively. It's valid for some environments (e.g. .NET, Java, Python, ?) which have standard or widely used structured logging libraries. Are there environments where it's not the case?
196+
- E.g. in JS it's not common to depend on the logging lib.
197+
198+
## Future possibilities
199+
200+
1. OpenTelemetry API should be extended to provide convenience methods to
201+
- record log-based event from exception instance
202+
- attach exception information to any event or log
203+
204+
2. Log-based events allow to capture exception stack trace as structured event body instead of a string attribute. It can be easier to process and consume exception events with structured stack traces.

0 commit comments

Comments
 (0)