Skip to content

Commit 259eaac

Browse files
committed
docs: Add ADR dir and error handling ADR
1 parent cd9bbcf commit 259eaac

File tree

2 files changed

+102
-0
lines changed

2 files changed

+102
-0
lines changed

docs/adr/001_error_handling.md

Lines changed: 97 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,97 @@
1+
# Error handling patterns in public API interfaces
2+
3+
4+
## Context and Problem Statement
5+
There is uncertainty around how to model errors in in the `opentelemetry-rust` public API interfaces - that is, APIs facing the consumers. At the time of writing this is an important issue to resolve as moving beginning to move the signals towards RC and eventually a stable release is an urgent priority.
6+
7+
The situation is as follows; a concrete example is given, but the issue holds across various public traits, in particular the exporters:
8+
9+
* A given public interface in `opentelemetry-sdk`,such as [trait LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L115)
10+
* ... exposes multiple discrete actions with logically disjoint error types (e.g. [export](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L133-L136) and [shutdown](https://github.com/open-telemetry/opentelemetry-rust/blob/3ec4c186ad22944b208ae7c3d38435e735a85c8e/opentelemetry-sdk/src/logs/export.rs#L139) - that is, the class of errors returned for each of these actions are foreseeably very different, as is the callers reaction to them
11+
* ... is implemented by multiple concrete types such as `InMemoryLogExporter`, `OtlpLogExporter`, `StdOutLogExporter` that have different error requirements - for instance, an `OtlpLogExporter` will experience network failures, an `InMemoryLogExporter` will not
12+
* Potentially has operations on the API that, either in the direct implementation, or in a derived utility that utilises the direct implementation, call _multiple_ API actions and therefore need to return an aggregated log type
13+
14+
Today, we have a situation where a single error type is used per API-trait, and some methods simply swallow their errors. In the example above of `LogExporter`, `shutdown` swallows errors, and `export` returns the `LogError` type, a type that could conceptually be thought of as belonging to the entire trait, not a particular method. For the exporters, the [opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) tells us that they need to indicate success or failure, with a distinction made between 'failed' and 'timed out'.
15+
16+
There are also similar examples in the builders for providers and exports.
17+
18+
## Related Work
19+
20+
* #2564
21+
* #2561
22+
* #2381
23+
24+
## Considered Options
25+
26+
**Option 1: Continue as is**
27+
Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors.
28+
29+
**Option 2: Extend trait-wide error type to all methods on trait**
30+
In this option we keep the existing error type, add it to the remaining methods on the trait, and extend the error type to include errors covering the new error conditions. This will mean that callers will have to know how and when to discard errors from a particular API call based on an understanding of which subset of errors that particular call can make.
31+
32+
Conversely, it will reduce the number of error types in the code base.
33+
34+
**Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type**
35+
36+
For example, in the above we'd have something like:
37+
```rust
38+
pub trait LogExporter {
39+
40+
fn export(...) -> Result<..., ExportError>;
41+
fn shutdown(...) -> Result<..., ShutdownError>
42+
}
43+
44+
// Concrete errors for an export operation
45+
pub enum ExportError {
46+
// The distinction between failure and timed out is part of the OTEL spec
47+
// we need to meet.
48+
49+
ExportFailed,
50+
51+
ExportTimedOut(Duration),
52+
53+
// Allow impls to box up errors that can't be logically mapped
54+
// back to one of the APIs errors
55+
#[error("Unknown error (should not occur): {source:?}")]
56+
Unknown {
57+
source: Box<dyn std::error::Error + Send + Sync>,
58+
},
59+
}
60+
61+
// Aggregate error type for convenience
62+
// **Note**: This will be added in response to need, not pre-emptively
63+
#[derive(Debug, thiserror::Error)]
64+
pub enum LogError {
65+
#[error("Export error: {0}")]
66+
InitError(#[from] ExportError),
67+
68+
#[error("Shutdown error: {0}")]
69+
ShutdownError(#[from] ShutdownError),
70+
}
71+
72+
// A downcast helper for callers that need to work with impl-specific
73+
// unknown errors concretely
74+
impl ExportError {
75+
/// Attempt to downcast the inner `source` error to a specific type `T`
76+
pub fn downcast_ref<T: std::error::Error + 'static>(&self) -> Option<&T> {
77+
if let ExportError::Unknown { source } = self {
78+
source.downcast_ref::<T>()
79+
} else {
80+
None
81+
}
82+
}
83+
}
84+
```
85+
86+
## Decision Outcome
87+
88+
Chosen option: **"Option 3: Introduce an error-type per fallible operation, aggregate these into a single trait-wide error type"**
89+
90+
### Consequences
91+
92+
* Good, because callers can handle focussed errors with focussed remediation
93+
* Good, because implementors of the `pub trait`s can box up custom errors in a fashion that follow's [canonical's error and panic discipline](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) guide, by avoiding type erasure of impl-specific errors
94+
* Good, because the per-trait error type (`LogError` for `LogExporter` above) provides consumers of the trait that hit multiple methods in a single method an error type they can use
95+
* Bad, because there's more code than a single error type
96+
* Bad, because a caller may need to use `downcast_ref` if they have a known trait impl and want to handle a `Unknown` error
97+

docs/adr/README.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
# Architectural Decision Records
2+
3+
This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions.
4+
5+
* [001 - Error Handling](001_error_handling.md)

0 commit comments

Comments
 (0)