Skip to content

Commit 569313a

Browse files
author
Liudmila Molkova
committed
feedback: recording all exceptions by default is too noisy
1 parent e092095 commit 569313a

File tree

1 file changed

+95
-85
lines changed

1 file changed

+95
-85
lines changed

oteps/4333-recording-exceptions-on-logs.md

Lines changed: 95 additions & 85 deletions
Original file line numberDiff line numberDiff line change
@@ -26,116 +26,147 @@ This guidance applies to general-purpose instrumentations including native ones.
2626

2727
## Guidance
2828

29-
1. Exceptions should be recorded as [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
30-
or [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md)
29+
This guidance boils down to the following:
3130

32-
This rule ensures that exception logs can be recorded independently from traces and covers cases when no span exists,
33-
or when the corresponding span is not recorded.
31+
- we should record full exception details including stack traces only for unhandled exceptions (by default).
32+
- we should log error details and context when the error happens. These records don't need need to include exception stack traces unless this exception is unhandled.
33+
- we should avoid logging the same error as it propagates up through the stack.
34+
- we should log errors with appropriate severity ranging from `Trace` to `Fatal`.
3435

35-
2. An exception should be logged with appropriate severity depending on the available context.
36+
> [!NOTE]
37+
>
38+
> Based on this guidance non-native instrumentations should record exceptions in top-level instrumentations only (#2 above)
3639
37-
- Exceptions that don't indicate any issue should be recorded with severity not higher than `Info`.
38-
- Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.
40+
### Details
3941

40-
This rule enables typical logging mechanisms to control logs volume.
42+
1. Exceptions should be recorded as [logs](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/exceptions/exceptions-logs.md)
43+
or [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/events.md)
4144

42-
3. An exception log should be recorded when the exception instance is created and thrown for the first time.
43-
This includes new exception instances that wrap other exception(s).
45+
2. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually create local root spans, should record logs
46+
for unhandled exceptions with `Error` severity and [`exception.escaped`](https://github.com/open-telemetry/semantic-conventions/blob/v1.29.0/docs/attributes-registry/exception.md) flag set to `true`.
4447

45-
This rule ensures that an exception log is recorded at least once for each exception thrown.
48+
<!-- TODO: do we need an `exception.unhandled` attribute instead of `exception.escaped`? -->
49+
Some runtimes and frameworks provide global exception handler that can be used to record exception logs. Priority should be given to the instrumentation point where the operation context is available.
4650

47-
4. An exception log should not be recorded when an exception is handled or rethrown as is, except the following cases:
48-
- exceptions handled in global exception handlers (see #5 below)
49-
- exceptions from code that doesn't record exception logs in a way that is compatible with OTel.
51+
3. It's recommended to record exception stack traces only for unhandled exceptions in cases outlined in #2 above.
5052

51-
This rule ensures that an exception log is recorded at most once for each *handled* exception.
53+
4. Native instrumentations should record log describing an error and the context it happened in
54+
when this error is detected. Corresponding log record should not contain exception stack
55+
traces (if an exception was thrown/caught) unless such exceptions usually remain unhandled.
5256

53-
5. Instrumentations for incoming requests, message processing, background job execution, or others that wrap user code and usually create local root spans, should record logs
54-
for unhandled exceptions with `Error` severity and [`exception.escaped`](https://github.com/open-telemetry/semantic-conventions/blob/v1.29.0/docs/attributes-registry/exception.md) flag set to `true`.
57+
5. An error should be logged with appropriate severity depending on the available context.
5558

56-
Some runtimes and frameworks provide global exception handler that can be used to record exception logs. Priority should be given to the instrumentation point where the operation context is available.
59+
- Error that don't indicate any issue should be recorded with severity not higher than `Info`.
60+
- Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.
61+
- Unhandled exceptions that don't result in application shutdown should be recorded with severity `Error`
62+
- Errors that result in application shutdown should be recorded with severity `Fatal`
5763

58-
<!-- TODO: do we need an `exception.unhandled` attribute instead of `exception.escaped`? -->
64+
6. Instrumentations should not log errors or exceptions that are handled or
65+
are propagated as is, except ones handled in global exception handlers (see #2 below)
5966

60-
This allows to record unhandled exception with proper severity and distinguish them from handled ones.
67+
If a new exception is created based on the original one or a new details about the error become available,
68+
instrumentation may record another error (without stack trace)
6169

62-
6. When recording exception on logs, user applications and instrumentations are encouraged to put additional attributes
70+
7. When recording exception on logs, user applications and instrumentations are encouraged to put additional attributes
6371
to describe the context that the exception was thrown in.
6472
They are also encouraged to define their own error events and enrich them with `exception.*` attributes.
6573

66-
### Examples
74+
## API changes
6775

68-
#### Catching exception from client library call
76+
It should not be an instrumentation library concern to decide whether exception stack trace should be recorded or not.
77+
Library may write logs providing exception instance through a log bridge and not be aware of this guidance.
6978

70-
If underlying client library is already recording exceptions using OTel-compatible logger facade, the caller should not record exception event.
79+
It also maybe desirable by some vendors/apps to record all the exception details.
80+
81+
OTel Logs API should provide additional methods that enrich log record with exception details such as
82+
`addException(exception)` (`addUnhandledException`, etc), similar to [RecordException](../specification/trace/api.md?plain=1#L682)
83+
method on span.
84+
85+
OTel SDK should implement such methods and set exception attributes based on configuration
86+
and in the optimal way. This would also provide a more scalable way to evolve this guidance
87+
and extend configuration options if necessary.
88+
89+
### Examples
90+
91+
#### Catching exception from client library in a user application
7192

7293
```java
7394
StorageClient client = createClient(endpoint, credential);
7495
...
7596
try {
76-
// let's assume that the underlying client library is already
77-
// recording exceptions using OTel-compatible logger facade.
7897
BinaryData content = client.download(contentId);
7998

8099
return response(content, HttpStatus.OK);
81100
} catch (ContentNotFoundException ex) {
82101
// we don't record exception here, but may record a log record without exception info
83102
logger.logRecordBuilder()
84-
// let's assume it's expected that some content can disappear
85-
.severityNumber(Severity.INFO)
86-
.addAttribute(AttributeKey.stringKey("error.type"), ex.getClass().getCanonicalName())
87103
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
104+
.severityNumber(Severity.INFO)
105+
// let's assume it's expected that some content can disappear
106+
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
107+
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
108+
// ideally we should provide the following method for convenience, optimization,
109+
// and to support different behavior behind config options
110+
//.addException(ex)
88111
.emit();
89112

90113
return response(HttpStatus.NOT_FOUND);
91114
} catch (NotAuthorizedException ex) {
92115
// let's assume it's really unexpected - service lost access to the underlying storage
116+
// since we're returning error response without an exception, we
93117
logger.logRecordBuilder()
94118
.severityNumber(Severity.ERROR)
95-
.addAttribute(AttributeKey.stringKey("error.type"), ex.getClass().getCanonicalName())
96119
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
120+
// ideally we should provide the following method for convenience, optimization,
121+
// and to support different behavior behind config options
122+
//.addException(ex)
123+
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
124+
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
97125
.emit();
98126

99127
return response(HttpStatus.INTERNAL_SERVER_ERROR);
100128
}
101129
```
102130

103-
#### Recording exceptions inside the library (native instrumentation)
131+
#### Recording error inside the library (native instrumentation)
104132

105133
It's a common practice to record exceptions using logging libraries. Client libraries that are natively instrumented with OpenTelemetry should
106134
leverage OTel Events/Logs API for their exception logging purposes.
107135

108136
```java
109-
public class MyClient {
137+
public class StorageClient {
110138

111139
private final Logger logger;
112140
...
113141
public BinaryData getContent(String contentId) {
114-
115-
// let's assume as have logging interceptor in our
116-
// http client pipeline and exception are logged there
117-
// or HTTP client already records exceptions using logger
118-
// supported by OTel
119-
HttpResponse response = client.get(contentId);
120-
if (response.statusCode == 200) {
121-
return readContent(response);
122-
}
123-
124-
MyClientException ex = new MyClientException(response.statusCode(), readErrorInfo(response));
125-
126-
logger.logRecordBuilder()
127-
.setSeverity(Severity.INFO)
142+
HttpResponse response = client.get(contentId);
143+
if (response.statusCode() == 200) {
144+
return readContent(response);
145+
}
146+
147+
logger.logRecordBuilder()
148+
// we may set different levels depending on the status code, but in general
149+
// we expect caller to handle the error, so this is at most warning
150+
.setSeverity(Severity.WARN)
128151
.addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId)
152+
.addAttribute(AttributeKey.stringKey("http.response.status_code"), response.statusCode())
153+
// ideally we should provide the following method for convenience, optimization,
154+
// and to support different behavior behind config options
155+
//.addException(ex)
129156
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
130-
.addAttribute(AttributeKey.stringKey("exception.name"), ex.getMessage())
131-
.addAttribute(AttributeKey.stringKey("exception.stacktrace"), getStackTrace(ex))
157+
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
132158
.emit();
159+
160+
if (response.statusCode() == 404) {
161+
throw new ContentNotFoundException(readErrorInfo(response));
162+
}
163+
164+
...
133165
}
134166
}
135167
```
136168

137-
Some libraries or runtimes don't record exception logs or record them in ways that are not compatible with OpenTelemetry. In this case, it's recommended to
138-
record exception event when the exception is rethrown.
169+
Network level errors are part of normal life, we should consider using low severity for them
139170

140171
```java
141172
public class Connection {
@@ -146,15 +177,12 @@ public class Connection {
146177
try {
147178
return socketChannel.write(content);
148179
} catch (Throwable ex) {
149-
// we're re-throwing the exception here, but still recording it on logs
150-
// since the underlying platform code may or may not record exception logs depending on JRE,
151-
// configuration, and other implementation details
152180
logger.logRecordBuilder()
181+
// we'll retry it, so it's Info or lower
153182
.setSeverity(Severity.INFO)
154183
.addAttribute("connection.id", this.getId())
155184
.addAttribute(AttributeKey.stringKey("exception.type"), ex.getClass().getCanonicalName())
156-
.addAttribute(AttributeKey.stringKey("exception.name"), ex.getMessage())
157-
.addAttribute(AttributeKey.stringKey("exception.stacktrace"), getStackTrace(ex))
185+
.addAttribute(AttributeKey.stringKey("exception.message"), ex.getMessage())
158186
.emit();
159187

160188
throw ex;
@@ -163,7 +191,7 @@ public class Connection {
163191
}
164192
```
165193

166-
#### HTTP server instrumentation
194+
#### HTTP server instrumentation/global exception handler
167195

168196
TODO
169197

@@ -188,51 +216,33 @@ TODO
188216

189217
Alternatives:
190218

191-
1. Record exceptions only when exception is handled (or remains unhandled). This relies
192-
on the user applications to log them correctly and consistently, it also makes
193-
it impossible to add context available deep in the stack where exception happened.
194-
2. Record exception events whenever exception is detected (even if exception is handled or rethrown),
195-
use additional attributes and/or severity so that it can be filtered out by the processing pipeline.
196-
This OTEP does not prevent evolution in this direction.
197-
3. (Variation of 2) Exception stack traces are the most problematic in terms of volume.
198-
We can record exception type and message whenever caller feels like recording exception information
199-
and only record stacktrace when the exception is thrown.
200-
This OTEP does not prevent evolution in this direction.
201-
4. OTel may deduplicate exception events by marking exception instances as logged
219+
1. Deduplicate exception info on logs. We can mark exception instances as logged
202220
(augment exception instance or keep a small cache of recently logged exceptions).
203221
This can potentially mitigate the problem for existing application when it logs exceptions extensively.
204222
We should still provide optimal guidance for the greenfield applications and libraries.
205223

206-
## Open questions
207-
208-
1. This OTEP assumes that the majority of client libraries are already instrumented
209-
with logs natively. It should be the case for some environments (e.g. .NET, Java,
210-
Python, Golang, or Rust) which have standard or widely used structured logging
211-
libraries. In languages and ecosystem without common logging libraries,
212-
we cannot rely on exceptions to be logged where they are thrown.
224+
2. Log full exception info only when exception is thrown for the first time
225+
(including new exceptions wrapping original ones). This results in at-most-once
226+
logging, but even this is known to be problematic since absolute majority of exceptions
227+
are handled.
228+
It also relies on the assumption that most libraries will follow this guidance.
213229

214-
As a result instrumentation libraries may need to log exceptions every time
215-
they see them, resulting in possible duplication.
230+
## Open questions
216231

217-
Are we aware of environments where we don't have widely available logging libs
218-
making this OTEP less relevant for them?
232+
TBD
219233

220234
## Future possibilities
221235

222236
1. OpenTelemetry should provide configuration options and APIs allowing (but not limited) to:
223237

224-
- Record unhandled exceptions only
225-
- Record exceptions based on the log severity
238+
- Record unhandled exceptions only (the default documented in this guidance)
239+
- Record exception info based on the log severity
226240
- Record exception logs, but omit the stack trace based on (at least) the log level.
227241
See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that records stack trace conditionally.
228242
- Record all available exceptions with all the details
229243

230244
It should be possible to optimize instrumentation and avoid collecting exception information
231245
(such as stack trace) when the corresponding exception log is not going to be recorded.
232246

233-
2. OpenTelemetry API should be extended to provide convenience methods to
234-
- record log-based event from exception instance
235-
- attach exception information to any event or log
236-
237-
3. Exception stacktraces can be recorded in structured form instead of their
247+
2. Exception stack traces can be recorded in structured form instead of their
238248
string representation. It may be easier to process and consume them in this form.

0 commit comments

Comments
 (0)