OTEP: Recording exceptions as log records #4333
Draft
lmolkova
wants to merge
35
commits into
open-telemetry:main
from
lmolkova:exceptions-on-logs-otep
Commits
215667e
OTEP: Recording exceptions and errors with OpenTelemetry
80f0706
filename
d521375
feedback and cleanup
1e4b690
cleanups
805a6d3
feedback: recording all exceptions by default is too noisy
2f3e07a
minor fixes
8226b94
minor fixes
96b5dfa
clean up
7702ad8
changelog and lint
573176e
more cleanups and lint
5f7d3c8
ore fixes
bab37fb
more cleanups and lint
afa9714
Apply suggestions from code review
725ae77
feedback
30a0745
remove the note
4152d95
Apply suggestions from code review
4eb90e3
feedback: more details on severity, language-specific, replacement fo…
da4ba16
more feedback, define error/exception
578354e
lint
fe3b6ae
Apply suggestions from code review
e3db414
up
f6d4c0d
add migration section, update trade-offs and minitations, clean up
9f88bee
review, another round
ffedeb8
toc
3fa84ba
more nits
a187861
more nits
e5b2d23
Update oteps/4333-recording-exceptions-on-logs.md
ec9c199
Apply suggestions from code review
1336f01
Apply suggestions from code review
f165e90
clarifications
0ea191d
more nits
76921c8
clean up migration
0b361ca
remove sections that are no longer necessary, clean up
ec29ff0
remove sections that are no longer necessary, clean up
0be3bc3
lint
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,377 @@ | ||
| # Recording exceptions and errors in logs | ||
|
|
||
| <!-- toc --> | ||
|
|
||
| - [Motivation](#motivation) | ||
| - [Guidance](#guidance) | ||
| * [Details](#details) | ||
| - [API changes](#api-changes) | ||
| - [SDK changes](#sdk-changes) | ||
| - [Examples](#examples) | ||
| * [Logging errors from client library in a user application](#logging-errors-from-client-library-in-a-user-application) | ||
| * [Logging errors inside the natively instrumented library](#logging-errors-inside-the-natively-instrumented-library) | ||
| * [Logging errors in messaging processor](#logging-errors-in-messaging-processor) | ||
| + [Natively instrumented library](#natively-instrumented-library) | ||
| + [Instrumentation library](#instrumentation-library) | ||
| - [Prototypes](#prototypes) | ||
| - [Prior art and alternatives](#prior-art-and-alternatives) | ||
| - [Future possibilities](#future-possibilities) | ||
|
|
||
| <!-- tocstop --> | ||
|
|
||
| This OTEP continues the [Span Event API deprecation plan OTEP](./4430-span-event-api-deprecation-plan.md) | |
| and provides guidance on how to record errors and exceptions using OpenTelemetry Logs, | ||
| focusing on minimizing duplication and providing context to reduce noise. | ||
|
Review comment on lines +22 to +24: Just ran across another benefit, as it will allow users to capture all exceptions (if they want) even when doing sampling on traces. |
||
|
|
||
| > [!NOTE] | ||
| > Throughout this OTEP, the terms exception and error are defined as follows: | ||
| > | ||
| > - *Error* refers to a general concept describing any non-success condition, | ||
| > which may manifest as an exception, non-successful status code, or an invalid | ||
| > response. | ||
| > - *Exception* specifically refers to runtime exceptions and their associated stack traces. | ||
| ## Motivation | ||
|
|
||
| Today, OTel supports recording *exceptions* using span events available through | ||
| the Trace API that is [being deprecated](./4430-span-event-api-deprecation-plan.md). | ||
| Outside the OTel world, *exceptions* and *errors* are usually recorded by user apps | ||
| and libraries using logging libraries, and may be recorded as OTel logs via a logging bridge. | ||
|
|
||
| Recording errors is essential for troubleshooting, but they can be noisy: | ||
|
|
||
| - Distributed applications experience transient errors at a rate proportional to their scale, and | ||
| errors in logs can be misleading. Individual occurrences of transient errors | ||
| are not necessarily indicative of a problem. | ||
| - Exception stack traces can be huge. The corresponding attribute value can frequently reach several KBs, resulting in high costs | ||
| associated with ingesting and storing them. It's also common to log errors multiple times | ||
| as they bubble up, leading to duplication and aggravating the verbosity problem. | ||
| - Severity depends on the context and, in the general case, is not known at the time the error | ||
| occurs since errors are frequently handled (suppressed, retried, ignored) by the caller. | ||
|
|
||
| In this OTEP, we'll provide guidance around recording errors that minimizes duplication, | ||
| allows reducing noise with configuration, and allows capturing errors in the | ||
| absence of a recorded span. | ||
|
|
||
| This guidance applies to general-purpose instrumentations, including natively | ||
| instrumented libraries. | ||
|
|
||
| Application developers should consider following it as a starting point, but | ||
| they are encouraged to adjust it to their needs. | ||
|
|
||
| ## Guidance | ||
|
|
||
| This guidance boils down to the following: | ||
|
|
||
| Instrumentations SHOULD record error information along with relevant context as | ||
| a log record with appropriate severity. | ||
|
|
||
| Instrumentations SHOULD set severity to `Error` or higher only when the log describes a | ||
| problem affecting application functionality, availability, performance, security, or | ||
| another aspect that is important for the given type of application. | ||
|
|
||
| When an instrumentation records an exception, it SHOULD provide | ||
| the whole exception instance to the OTel SDK so the SDK can record it fully or | ||
| partially based on the provided configuration. The default SDK behavior SHOULD | ||
| be to record exception stack traces when logging exceptions at `Error` or higher severity. | ||
|
|
||
| ### Details | ||
|
|
||
| 1. Errors SHOULD be recorded as [logs](https://github.com/open-telemetry/semantic-conventions/blob/v1.36.0/docs/exceptions/exceptions-logs.md) | ||
| or as [log-based events](https://github.com/open-telemetry/semantic-conventions/blob/v1.36.0/docs/general/events.md). | ||
|
|
||
| 2. Instrumentations for incoming requests, message processing, background job execution, or others that wrap application code and usually | ||
| create local root spans, SHOULD record logs for unhandled errors with `Error` severity. | ||
|
|
||
| Some runtimes provide a global exception handler that can be used to log exceptions. | ||
| Priority should be given to the instrumentation point where the operation context is available. | ||
| Language SIGs are encouraged to provide runtime-specific guidance. For example, here is the | ||
| [.NET guidance](https://github.com/open-telemetry/opentelemetry-dotnet/blob/610045298873397e55e0df6cd777d4901ace1f63/docs/trace/reporting-exceptions/README.md#unhandled-exception) | ||
| for recording exceptions on traces. | ||
|
|
||
| 3. Natively instrumented libraries SHOULD record a log describing an error and the context in which it occurred | ||
| as soon as the error is detected (or where the most context is available). | ||
|
|
||
| 4. It is NOT RECOMMENDED to record the same error as it propagates through the call stack, or | ||
| to attach the same instance of an exception to multiple log records. | ||
|
|
||
| 5. An error SHOULD be logged with appropriate severity depending on the available context. | ||
|
|
||
| - Errors that don't indicate actual issues SHOULD be recorded with | ||
| severity not higher than `Info`. | ||
|
|
||
| Such errors can be used to control application logic and have a minor impact, if any, | ||
| on application functionality, availability, or performance (beyond the performance hit introduced | ||
| if an exception is used to control application logic). | ||
|
|
||
| Examples: | ||
|
|
||
| - An error is returned when checking for optional dependency or resource existence. | ||
| - An exception is thrown on the server when the client disconnects before reading | ||
| the full response from the server. | ||
|
|
||
| - Errors that are expected to be retried or handled by the caller or another | ||
| layer of the component SHOULD be recorded with severity not higher than `Warn`. | ||
|
|
||
| Such errors represent transient failures that are common and expected in | ||
| distributed applications. They typically increase the latency of individual | ||
| operations and have a minor impact on overall application availability. | ||
|
|
||
| Examples: | ||
|
|
||
| - An attempt to connect to the required remote dependency times out. | ||
| - A remote dependency returns a 401 "Unauthorized" response code. | ||
| - Writing data to a file results in an IO exception. | ||
| - A remote dependency returns a 503 "Service Unavailable" response 5 consecutive times, | |
| retry attempts are exhausted, and the corresponding operation fails. | |
|
|
||
| - Unhandled (by the application code) errors that don't result in application | ||
| shutdown SHOULD be recorded with severity `Error`. | ||
|
|
||
| These errors are not expected and may indicate a bug in the application logic | ||
| that this application instance was not able to recover from, or a gap in the error | ||
| handling logic. | ||
|
|
||
| Examples: | ||
|
|
||
| - A background job terminates with an exception. | ||
| - An HTTP framework error handler catches an exception thrown by the application code. | ||
|
|
||
| Note: Some frameworks use exceptions as a communication mechanism when a request fails. For example, | ||
| Spring users can throw a [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html) | |
| to return an unsuccessful status code. Such exceptions represent errors already handled by the application code. | |
| Application code, in this case, is expected to log this at the appropriate severity. | ||
| General-purpose instrumentation MAY record such errors, but at a severity not higher than `Warn`. | ||
|
|
||
| - Errors that result in application shutdown SHOULD be recorded with severity `Fatal`. | ||
|
|
||
| Examples: | ||
|
|
||
| - The application detects an invalid configuration at startup and shuts down. | ||
| - The application encounters a (presumably) terminal error, such as an out-of-memory condition. | ||
|
|
||
| 6. When recording exceptions/errors in logs, applications and instrumentations are encouraged to add attributes | |
| to describe the context in which the exception/error occurred. | ||
| They are also encouraged to define their own events and enrich them with exception/error details. | ||
|
|
||
| 7. The OTel SDK SHOULD record exception stack traces on logs with severity `Error` or higher and drop | ||
| them on logs with lower severity. It SHOULD allow users to change the threshold. | ||
|
|
||
| See [logback exception config](https://logback.qos.ch/manual/layouts.html#ex) for an example of configuration that | ||
| records stack traces conditionally. | ||
|
|
||
| 8. Instrumentation libraries that record exceptions using span events SHOULD gracefully migrate | ||
| to log-based exceptions following the migration path outlined in the [Span Event API deprecation plan OTEP](./4430-span-event-api-deprecation-plan.md). | ||
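
The severity rules in item 5 can be read as a small mapping from what is known about an error to the highest severity the guidance recommends. The following is a minimal sketch of that reading, not part of the proposal: `SeverityGuidance`, its nested `Severity` and `ErrorOutcome` enums, and `maxSeverityFor` are hypothetical names, and a real instrumentation would use the severity type from the OTel Logs API instead of the local enum.

```java
/** Hypothetical helper that restates the severity guidance as a lookup. */
final class SeverityGuidance {

    /** Local stand-in for the OTel Logs API severity type (assumption). */
    enum Severity { INFO, WARN, ERROR, FATAL }

    /** Hypothetical classification of what is known when the error is logged. */
    enum ErrorOutcome {
        NOT_AN_ISSUE,       // e.g. optional resource missing, client disconnected early
        HANDLED_BY_CALLER,  // e.g. transient failure that is retried or handled elsewhere
        UNHANDLED,          // escaped application code, application keeps running
        CAUSES_SHUTDOWN     // e.g. invalid configuration detected at startup
    }

    private SeverityGuidance() {}

    /** Returns the highest severity recommended for the given outcome. */
    static Severity maxSeverityFor(ErrorOutcome outcome) {
        switch (outcome) {
            case NOT_AN_ISSUE:
                return Severity.INFO;
            case HANDLED_BY_CALLER:
                return Severity.WARN;
            case UNHANDLED:
                return Severity.ERROR;
            case CAUSES_SHUTDOWN:
                return Severity.FATAL;
            default:
                throw new IllegalArgumentException("unknown outcome: " + outcome);
        }
    }
}
```

The first two cases are upper bounds ("not higher than"), so an instrumentation is free to log below the returned value when it has more context.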
|
|
||
| ## API changes | ||
|
|
||
| > [!NOTE] | ||
| > | ||
| > It should not be an instrumentation concern to decide whether an exception stack trace | ||
| > should be recorded or not. | ||
| > | ||
| > A natively instrumented library may write logs providing an exception instance | ||
| > through a log bridge and not be aware of this guidance. | ||
| > | ||
| > It also may be desirable for some vendors/apps to record all exception details at all levels. | ||
|
|
||
| The OTel Logs API SHOULD provide methods that enrich log records with exception details, such as | |
| `setException(exception)`, similar to the [RecordException](../specification/trace/api.md#record-exception) method on spans. | |
|
|
||
| The OTel SDK, based on the log severity and configuration, SHOULD record exception details fully or partially. | ||
|
|
||
| The signature of the method is to be determined by each language and can be overloaded | ||
| as appropriate. | ||
|
|
||
| It MUST be possible to efficiently set exception and error information on a log record based on configuration | ||
| and without using the `setException` method. | ||
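
To illustrate the split of responsibilities described above (instrumentation hands over the whole exception, the SDK decides how much of it to keep), here is a minimal sketch. It is illustrative only: `ExceptionAttributeExtractor`, its nested `Severity` enum, and the constructor threshold are hypothetical and do not reflect any existing SDK API. Only the attribute names `exception.type`, `exception.message`, and `exception.stacktrace` come from the semantic conventions.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical SDK-side component that decides which exception details to record. */
final class ExceptionAttributeExtractor {

    /** Local stand-in for the OTel severity type, ordered from lowest to highest. */
    enum Severity { TRACE, DEBUG, INFO, WARN, ERROR, FATAL }

    private final Severity stackTraceThreshold;

    /** The threshold is user-configurable; the guidance suggests ERROR as the default. */
    ExceptionAttributeExtractor(Severity stackTraceThreshold) {
        this.stackTraceThreshold = stackTraceThreshold;
    }

    /** Always records type and message; records the stack trace only at or above the threshold. */
    Map<String, String> extract(Severity severity, Throwable exception) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("exception.type", exception.getClass().getName());
        if (exception.getMessage() != null) {
            attributes.put("exception.message", exception.getMessage());
        }
        if (severity.compareTo(stackTraceThreshold) >= 0) {
            StringWriter writer = new StringWriter();
            exception.printStackTrace(new PrintWriter(writer, true));
            attributes.put("exception.stacktrace", writer.toString());
        }
        return attributes;
    }
}
```

With this shape, a vendor or application that wants stack traces at every level would only lower the threshold; instrumentation code would not change.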
|
|
||
| ## SDK changes | ||
|
|
||
| TODO: we should consider whether exception instances should reach the log processing pipeline, | |
| where their processing can be customized, or whether this should be done via a separate concept such as an exception | |
| customizer. | |
|
|
||
| ## Examples | ||
|
|
||
| ### Logging errors from client library in a user application | ||
|
|
||
| ```java | ||
| StorageClient client = createClient(endpoint, credential); | ||
| ... | ||
| try { | ||
| BinaryData content = client.getContent(contentId); | |
|
|
||
| return response(content, HttpStatus.OK); | ||
| } catch (ContentNotFoundException ex) { | ||
| // expected case: emit a log record and let the SDK decide how much exception detail to keep | |
| logger.logRecordBuilder() | ||
| .addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId) | ||
| // let's assume it's expected that some content can disappear | ||
| .setSeverity(Severity.INFO) | |
| // by default SDK will only populate `exception.type` and `exception.message` | ||
| // since severity is `INFO`, but it should not be an instrumentation library | ||
| // concern | ||
| .setException(ex) | ||
| .emit(); | ||
|
|
||
| return response(HttpStatus.NOT_FOUND); | ||
| } catch (ForbiddenException ex) { | ||
| logger.logRecordBuilder() | ||
| // let's assume it's really unexpected for this application - the service does not have access to the underlying storage. | ||
| .setSeverity(Severity.ERROR) | |
| .addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId) | ||
| // by default SDK will record stack trace for this exception since the severity is ERROR | ||
| .setException(ex) | ||
| .emit(); | ||
|
|
||
| return response(HttpStatus.INTERNAL_SERVER_ERROR); | ||
| } | ||
| ``` | ||
|
|
||
| ### Logging errors inside the natively instrumented library | ||
|
|
||
| It's a common practice to record errors using logging libraries. Client libraries that are natively instrumented with OpenTelemetry should | ||
| leverage the OTel Events/Logs API for their exception logging purposes. | ||
|
|
||
| ```java | ||
| public class StorageClient { | ||
|
|
||
| private final Logger logger; | ||
| ... | ||
| public BinaryData getContent(String contentId) { | ||
| HttpResponse response = client.get(contentId); | ||
| if (response.statusCode() == 200) { | ||
| return readContent(response); | ||
| } | ||
|
|
||
| logger.logRecordBuilder() | ||
| // In general we don't know if it's an error - we expect the caller | ||
| // to handle it and decide. So this is a warning (at most). | ||
| // If the exception thrown below remains unhandled, it'd be logged by the global handler. | ||
| .setSeverity(Severity.WARN) | ||
| .addAttribute(AttributeKey.stringKey("com.example.content.id"), contentId) | ||
| .addAttribute(AttributeKey.longKey("http.response.status_code"), (long) response.statusCode()) | |
| .setBody("Unexpected HTTP response") | ||
| .emit(); | ||
|
|
||
| if (response.statusCode() == 404) { | ||
| throw new ContentNotFoundException(readErrorInfo(response)); | ||
| } | ||
|
|
||
| ... | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| Network-level errors are part of normal life; we should consider using low severity for them. | ||
|
|
||
| ```java | ||
| public class NetworkClient { | ||
|
|
||
| private final Logger logger; | ||
| ... | ||
| public long send(ByteBuffer content) { | ||
| try { | ||
| return socketChannel.write(content); | ||
| } catch (SocketException ex) { | ||
| logger.logRecordBuilder() | ||
| // we'll retry it, so it's info or lower. | ||
| // we'll write a warn for the overall operation if retries are exhausted. | ||
| .setSeverity(Severity.INFO) | ||
| .addAttribute("connection.id", this.getId()) | ||
| .setException(ex) | |
| .setBody("Failed to send content") | ||
| .emit(); | ||
|
|
||
| throw ex; | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
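
The comments in the example above imply a pattern: each failed attempt is logged at `Info`, and a single `Warn` is logged for the overall operation once retries are exhausted. A minimal sketch of that pattern follows. `RetryingSender`, `LogEntry`, and the in-memory `log` list are hypothetical stand-ins for a real client and for the OTel Logs API, used here only to keep the example self-contained.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical retry wrapper that logs attempts at INFO and exhaustion at WARN. */
final class RetryingSender {

    /** Minimal stand-in for an emitted log record. */
    static final class LogEntry {
        final String severity;
        final String body;

        LogEntry(String severity, String body) {
            this.severity = severity;
            this.body = body;
        }
    }

    /** Collected records; a real implementation would emit them through a logger. */
    final List<LogEntry> log = new ArrayList<>();

    <T> T sendWithRetries(Callable<T> attempt, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (IOException ex) {
                // Transient failure that will be retried: low severity.
                last = ex;
                log.add(new LogEntry("INFO", "Attempt " + i + " failed: " + ex.getMessage()));
            } catch (Exception ex) {
                // Non-IO failures are not retried in this sketch.
                throw new IllegalStateException(ex);
            }
        }
        // One record for the overall operation; the caller decides whether it is an error.
        log.add(new LogEntry("WARN", "Retries exhausted after " + maxAttempts + " attempts"));
        throw last;
    }
}
```

The last exception is rethrown unchanged, so the caller (or a global handler) can still log it at `Error` if it remains unhandled.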
|
|
||
| ### Logging errors in messaging processor | ||
|
|
||
| #### Natively instrumented library | ||
|
|
||
| In this example, application code provides the callback to the messaging processor to | ||
| execute for each message. | ||
|
|
||
| ```java | ||
| MessagingProcessorClient processorClient = new MessagingClientBuilder() | ||
| .endpoint(endpoint) | ||
| .queueName(queueName) | ||
| .processor() | ||
| .processMessage(messageContext -> processMessage(messageContext)) | ||
| .buildProcessorClient(); | ||
|
|
||
| processorClient.start(); | ||
| ``` | ||
|
|
||
| The `MessagingProcessorClient` implementation should catch exceptions thrown by the `processMessage` callback and log them similarly to | ||
| this example: | ||
|
|
||
| ```java | ||
| MessageContext context = retrieveNext(); | ||
| try { | ||
| processMessage.accept(context); | ||
| } catch (Throwable t) { | ||
| // This natively instrumented library may use the OTel log API or another logging library such as slf4j. | ||
| // Here we use Error severity since this exception was not handled by the application code. | ||
| logger.atError() | ||
| .addKeyValue("messaging.message.id", context.getMessageId()) | |
| ... | ||
| .setCause(t) | |
| .log("Message processing failed"); | ||
| // error handling logic ... | ||
| } | ||
| ``` | ||
|
|
||
| If this instrumentation supports tracing, it should capture the error in the scope of the processing | ||
| span. | ||
|
|
||
| #### Instrumentation library | ||
|
|
||
| In this example, we leverage the Spring Kafka `RecordInterceptor` extensibility point that allows us to | ||
| listen to exceptions that remained unhandled. | ||
|
|
||
| ```java | ||
| import org.springframework.kafka.listener.RecordInterceptor; | ||
| final class InstrumentedRecordInterceptor<K, V> implements RecordInterceptor<K, V> { | ||
| ... | ||
|
|
||
| @Override | ||
| public void failure(ConsumerRecord<K, V> record, Exception exception, Consumer<K, V> consumer) { | ||
| // we should capture this error in the scope of the processing span (or pass its context explicitly). | ||
| logger.logRecordBuilder() | ||
| .setSeverity(Severity.ERROR) | ||
| .addAttribute("messaging.message.id", record.getId()) | ||
| ... | ||
| .setException(exception) | |
| .setBody("Consumer error") | ||
| .emit(); | ||
| // .. | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| See the [corresponding Java (tracing) instrumentation](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/spring/spring-kafka-2.7/library/src/main/java/io/opentelemetry/instrumentation/spring/kafka/v2_7/InstrumentedRecordInterceptor.java) for details. | ||
|
|
||
| ## Prototypes | ||
|
|
||
| TODO (at least two prototypes: one in a language that has exceptions and another in a language that does not). | |
|
|
||
| ## Prior art and alternatives | ||
|
|
||
| Alternatives: | ||
|
|
||
| 1. Deduplicate exception info by marking exception instances as logged. | ||
| This can potentially mitigate the problem for existing applications when they log exceptions extensively. | ||
| We should still provide optimal guidance for greenfield applications and libraries, | ||
| covering the wider problem of recording errors. | ||
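
As a rough illustration of the alternative above, exception instances could be tracked in a weak set so that a second attempt to log the same instance is detected. This is a sketch under assumptions, not a proposed API: `LoggedExceptionTracker` and `markLogged` are hypothetical names.

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

/** Hypothetical tracker that remembers which exception instances were already logged. */
final class LoggedExceptionTracker {

    // Weak keys allow tracked exceptions to be garbage-collected. Throwable does not
    // override equals/hashCode, so membership is by instance identity unless a
    // subclass overrides them.
    private final Set<Throwable> logged =
        Collections.synchronizedSet(
            Collections.newSetFromMap(new WeakHashMap<Throwable, Boolean>()));

    /** Returns true the first time an instance is seen and false on later calls. */
    boolean markLogged(Throwable exception) {
        return logged.add(exception);
    }
}
```

A logging layer could call `markLogged` before attaching exception details and skip them when it returns `false`. Wrapped exceptions (a new instance with the original as its cause) would not be caught by this check.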
|
|
||
| ## Future possibilities | ||
|
|
||
| Exception stack traces can be recorded in structured form instead of their | ||
| string representation. It may be easier to process and consume them in this form. | ||
| This is out of scope for this OTEP. | ||
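
For readers wondering what a structured form might look like, the sketch below converts a Java stack trace into a list of per-frame maps. It is purely illustrative: `StructuredStackTrace` and the field names `class`, `method`, `file`, and `line` are hypothetical and are not defined by this OTEP or by the semantic conventions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical converter from a Throwable's stack trace to a structured form. */
final class StructuredStackTrace {

    private StructuredStackTrace() {}

    /** Returns one map per stack frame, outermost call last, in the JVM's own order. */
    static List<Map<String, Object>> frames(Throwable exception) {
        List<Map<String, Object>> frames = new ArrayList<>();
        for (StackTraceElement element : exception.getStackTrace()) {
            Map<String, Object> frame = new LinkedHashMap<>();
            frame.put("class", element.getClassName());
            frame.put("method", element.getMethodName());
            frame.put("file", element.getFileName());   // may be null when unknown
            frame.put("line", element.getLineNumber()); // negative when unavailable
            frames.add(frame);
        }
        return frames;
    }
}
```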