Skip to content

Commit f152a0a

Browse files
committed
Update ADR format to be more prescriptive
1 parent f2a9bf8 commit f152a0a

File tree

2 files changed

+66
-69
lines changed

2 files changed

+66
-69
lines changed

docs/adr/001_error_handling.md

Lines changed: 65 additions & 68 deletions
Original file line numberDiff line numberDiff line change
@@ -1,107 +1,104 @@
11
# Error handling patterns in public API interfaces
2-
32
## Date
4-
17 Feb 2025
5-
6-
## Accepted Option
3+
27 Feb 2025
74

8-
**Option 3**
5+
## Summary
96

10-
## Context and Problem Statement
11-
There is uncertainty around how to model errors in the `opentelemetry-rust` public API interfaces - that is, APIs that are exposed to users of the project's published crates. This is for example the case with the exporter traits - [SpanExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/trace/export.rs#L18), [LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/logs/export.rs#L115), and [PushMetricExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/metrics/exporter.rs#L11) which form part of the API surface of `opentelemetry-sdk`.
7+
This ADR describes the general pattern we will follow when modelling errors in public API interfaces - that is, APIs that are exposed to users of the project's published crates. . It summarises the discussion and final option from [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571); for more context check out that issue.
128

139
We will focus on the exporter traits in this example, but the outcome should be applied to _all_ public traits and their fallible operations.
1410

15-
There are various ways to handle errors on trait methods, including swallowing them and logging, panicing, returning a shared global error, or returning a method-specific error. We strive for consistency, and we want to be sure that we've put enough thought into what this looks like that we don't have to make breaking interface changes unecessarily in the future.
16-
17-
This was discussed extensively in [#2571](https://github.com/open-telemetry/opentelemetry-rust/issues/2571).
18-
19-
20-
## Related Work
21-
22-
* [#2564](https://github.com/open-telemetry/opentelemetry-rust/issues/2564)
23-
* [#2561](https://github.com/open-telemetry/opentelemetry-rust/issues/2561)
24-
* [#2381](https://github.com/open-telemetry/opentelemetry-rust/issues/2381)
11+
These include [SpanExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/trace/export.rs#L18), [LogExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/logs/export.rs#L115), and [PushMetricExporter](https://github.com/open-telemetry/opentelemetry-rust/blob/eca1ce87084c39667061281e662d5edb9a002882/opentelemetry-sdk/src/metrics/exporter.rs#L11) which form part of the API surface of `opentelemetry-sdk`.
2512

26-
## Considered Options
13+
There are various ways to handle errors on trait methods, including swallowing them and logging, panicing, returning a shared global error, or returning a method-specific error. We strive for consistency, and we want to be sure that we've put enough thought into what this looks like that we don't have to make breaking interface changes unecessarily in the future.
2714

28-
**Option 1: Continue as is**
29-
Continue the status quo, returning a mix of either nothing or the trait-wide error type. This is inconsistent and limits the caller's ability to handle errors.
15+
## Design Guidance
3016

31-
**Option 2: Extend trait-wide error type to all methods on trait**
32-
In this option we have an error type per trait regardless of the potential error paths for the individual methods. Concretely if `fn (a)` can return `Failure1` and `Failure2`, and `fn (b)` can return `FailureC`, we have a unified error type that contains `Failure`, `Failure2`, and `Failure3`.
17+
### 1. Panics are only appropriate at startup
18+
Failures _after_ startup should not panic, instead returning errors to the caller where appropriate, _or_ logging an error if not appropriate.
19+
Some of the opentelemetry SDK interfaces are dictated by the specification in way such that they may not return errors.
3320

34-
This will mean that callers will have to know how and when to discard errors from a particular API call based on an understanding of which subset of errors that particular call can make.
21+
### 2. Consolidate error types within a trait where we can, let them diverge when we can't**
3522

36-
Conversely, it will reduce the number of error types in the code base.
23+
We aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_.
24+
Here's an example:
3725

38-
**Option 3: Use shared errors where practical across signals, devolve into individual errors per operation if they need to diverge**
26+
```rust
27+
enum ErrorOne {
28+
TooBig,
29+
TooSmall,
30+
}
3931

40-
Here we aim to consolidate error types where possible _without indicating a function may return more errors than it can actually return_. Conversely in **Option 2**, a caller of either of the example functions is forced to handle or discard all errors. In this case, we choose to sacrifice the single error and diverge into a separate error for each operation.
32+
enum ErrorTwo {
33+
TooLong,
34+
TooShort
35+
}
4136

42-
Our preference for error types is thus:
37+
trait MyTrait {
38+
fn action_one() -> Result<(), ErrorOne>;
4339

44-
1. Consolidated error that covers all methods of a particular "trait type" (e.g., signal export) and method
45-
1. Devolves into error type per method of a particular trait type (e.g., `SdkShutdownResult`, `SdkExportResult`) _if the error types need to diverge_
46-
1. May alternatively devolve into error type per signal (e.g., `SpanExporter`) if the _signals diverge_
40+
// Action two and three share the same error type.
41+
// We do not introduce a common error MyTraitError for all operations, as this would
42+
// force all methods on the trait to indicate they return errors they do not return,
43+
// complicating things for the caller.
44+
fn action_two() -> Result<(), ErrorTwo>;
45+
fn action_three() -> Result<(), ErrorTwo>;
46+
}
47+
```
48+
## 3. Consolidate error types between signals where we can, let them diverge where we can't
4749

48-
This approach generalises across both **signals** and **trait methods**. For example, returning to our exporter traits, we have a trait that looks the same for each signal, with the same three methods. Upon closer inspection ([#2600](https://github.com/open-telemetry/opentelemetry-rust/issues/2600)), the potential error set is the same both between the methods *and* between the signals; this means we can use a single shared error type across both axes:
50+
Consider the `Exporter`s mentioned earlier. Each of them has the same failure indicators - as dicated by the OpenTelemetry spec - and we will
51+
share the error types accordingly:
4952

5053
```rust
5154

52-
#[derive(Error, Debug)]
55+
/// opentelemetry-sdk::error
5356

54-
// Errors that can occur during SDK operations export(), force_flush() and shutdown().
57+
#[derive(Error, Debug)]
5558
pub enum OTelSdkError {
56-
57-
/// All error types in here may be returned by any of the operations
58-
/// of the exporters, on any of their APIs.
59-
/// If this were not the case, we would not be able to use a shared error.
6059
#[error("Shutdown already invoked")]
6160
AlreadyShutdown,
62-
63-
// ... Other errors ...
64-
65-
/// The error message is intended for logging purposes only and should not
66-
/// be used to make programmatic decisions. It is implementation-specific
67-
/// and subject to change without notice. Consumers of this error should not
68-
/// rely on its content beyond logging.
61+
6962
#[error("Operation failed: {0}")]
7063
InternalFailure(String),
64+
65+
/** ... additional errors ... **/
7166
}
7267

7368
pub type OTelSdkResult = Result<(), OTelSdkError>;
74-
```
7569

76-
... which the traits themselves use:
70+
/// signal-specific exporter traits all share the same
71+
/// result types for the exporter operations.
7772

78-
```rust
73+
// pub trait LogExporter {
74+
// pub trait SpanExporter {
75+
pub trait PushMetricExporter {
7976

80-
//
81-
// The actionable errors returned by the exporter traits are effectively
82-
// the same for all operations; we can use a single shared error.
83-
//
77+
fn export(&self, /* ... */) -> OtelSdkResult;
78+
fn force_flush(&self, /* ... */ ) -> OTelSdkResult;
79+
fn shutdown(&self, /* ... */ ) -> OTelSdkResult;
80+
```
8481

85-
use opentelemetry_sdk::error::OTelSdkResult;
82+
If this were _not_ the case - if we needed to mark an extra error for instance for `LogExporter` that the caller could reasonably handle -
83+
we would let that error traits diverge at that point.
8684

87-
pub trait LogExporter {
88-
fn export(...) -> OtelSdkResult;
89-
fn shutdown(...) -> OtelSdkResult;
90-
fn force_flush(...) -> OTelSdkResult;
91-
}
85+
### 4. Box custom errors where a savvy caller may be able to handle them, stringify them if not
9286

93-
// ...
87+
Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates that the error types _are not actionable_ by the caller.
9488

95-
pub trait SpanExporter {
96-
fn export(...) -> OtelSdkResult;
97-
fn shutdown(...) -> OtelSdkResult;
98-
fn force_flush(...) -> OTelSdkResult;
99-
}
89+
If the caller may potentially recover from an error, we will follow the generally-accepted best practice (e.g., see [canonical's guide](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error:
10090

91+
```rust
92+
#[derive(Debug, Error)]
93+
pub enum MyError {
94+
#[error("Error one occurred")]
95+
ErrorOne,
96+
97+
#[error("Operation failed: {source}")]
98+
OtherError {
99+
#[from]
100+
source: Box<dyn Error + Send + Sync>,
101+
},
102+
}
101103
```
102104

103-
### When to box custom errors
104-
105-
Note above that we do not box anything into `InternalFailure`. Our rule here is that if the caller cannot reasonably be expected to handle a particular error variant, we will use a simplified interface that returns only a descriptive string. In the concrete example we are using with the exporters, we have a [strong signal in the opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/sdk.md#export) that indicates concretely that the error types are not actionable by the caller.
106-
107-
If the caller may potentially recover from an error, we will follow [canonical's rust best practices](https://canonical.github.io/rust-best-practices/error-and-panic-discipline.html) and instead preserve the nested error.

docs/adr/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
# Architectural Decision Records
22

3-
This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions. You can read more about ADRs [here](https://adr.github.io/).
3+
This directory contains architectural decision records made for the opentelemetry-rust project. These allow us to consolidate discussion, options, and outcomes, around key architectural decisions.
44

55
* [001 - Error Handling](001_error_handling.md)

0 commit comments

Comments
 (0)