Clean up some instances of mishandled error logging by sruggier · Pull Request #9923 · oxidecomputer/omicron

sruggier · 2026-02-25T22:35:50Z

This cleans up handling of DdmError and ResolveError. It also updates any downstream code that was previously relying on the redundancy in DdmError's Display implementation, such that it instead relies on the implementation of slog::Value from the slog-error-chain crate.

Related to #9804 and #9803.

sruggier · 2026-02-25T22:44:37Z

To clarify, I did this because I was looking for something to work on that would be useful, but not so urgent that I'd end up racing with someone else, and also feasible for me to tackle without any hand-holding. So far, it has felt like a good choice, but if I'm clashing with someone else's work, or this is a misguided thing to pursue for whatever other reason, please let me know. Otherwise, I plan to do a bit more similar work, and either update this PR or make another one, depending on whether anyone has reviewed it yet.

sruggier · 2026-02-25T22:56:41Z

@jgallagher I don't have permission to request reviewers, and I also don't know who to ask, in general. I think you can help me with that, though, so I'm pinging you here. If I understood #9804 correctly, you'd be interested in merging this 🙂

jgallagher · 2026-02-25T23:01:34Z

@jgallagher I don't have permission to request reviewers, and I also don't know who to ask, in general. I think you can help me with that, though, so I'm pinging you here. If I understood #9804 correctly, you'd be interested in merging this 🙂

I am, thanks! I agree that this is useful work that is pretty off-the-beaten-path. I do have a branch that updates a bunch of log sites within sled-agent that I haven't gotten polished enough to open a PR yet, but I don't think it overlaps with these changes very much.

sled-agent/src/bootstrap/bootstore_setup.rs

sruggier · 2026-02-25T23:09:24Z

I'm probably done for today, but if you're willing to push your work in progress to a branch (without making a PR) at some point in the near future, I can either steer clear of it or carry it forward, depending on what you think makes sense.

While I was auditing how EarlyNetworkSetupError is logged, I noticed that omicron_sled_agent::services::Error is a bit of a target-rich environment as far as this kind of cleanup is concerned, so I'm curious to know if I'd be conflicting with your changes if I start chipping away at some of the other errors it references.

jgallagher · 2026-02-25T23:11:41Z

I was hitting mostly the logging sites and not the error definitions themselves, which is why I was guessing we wouldn't have too much overlap. My WIP branch is here: https://github.com/oxidecomputer/omicron/tree/john/log-more-errors-llm; it's probably pretty close to ready-to-PR shape, and I'll try to get that up some time soon.

sruggier · 2026-02-25T23:41:32Z

Sure, but don't feel like you have to expedite that on my behalf. If I end up having to deal with some merge conflicts, it's not the end of the world.

Is there a particular reason not to derive SlogInlineError on the respective enums instead of using InlineErrorChain while logging? I like the idea of rendering the error with something other than Display, to make it impossible to compile code that will accidentally fail to log the entire chain.

Admittedly, for the places that are only calling to_string, there's no more convenient alternative to that now, but one could add a new method on InlineErrorChain, and call that instead.

As described in docs/error-types-and-logging.adoc, recursing into a source's Display implementation can result in duplicated error text in log output, so this commit updates the error attributes on DdmError's variants to stop doing this. Unfortunately, this type of change is breaking, in terms of how callers must log an error, but the difference isn't visible to the type system, and therefore won't result in compilation errors. Accordingly, this commit also updates any callers that were previously relying on the old Display implementation, so they walk the error chain instead.

Similar to the last commit, this updates the #[error(...)] attribute on the DnsResolver variant on EarlyNetworkSetupError, to avoid recursing into the Display implementation of ResolveError. All callers appear to be handling EarlyNetworkSetupError correctly already, so this commit doesn't make any other changes.

sruggier · 2026-02-26T15:38:54Z

I took a look at how the EarlyNetworkError variant is consumed, and ended up concluding that the full chain is dropped on this line. I was going to ask if that was intentional, but I see your branch fixes it, as of now, so I'm assuming it's safe to fix, and I'll look at cleaning up the other omicron_sled_agent::Error variants.

jgallagher · 2026-02-26T16:10:22Z

Is there a particular reason not to derive SlogInlineError on the respective enums instead of using InlineErrorChain while logging? I like the idea of rendering the error with something other than Display, to make it impossible to compile code that will accidentally fail to log the entire chain.

No, not really. I was focusing on the logging sites because even if we derive SlogInlineError, if the logging sites still use some variant of err.to_string() or {err:?} we still won't get the full chain. But coming from the other direction (adding the derive and then updating the logging sites) is fine too.

Admittedly, for the places that are only calling to_string, there's no more convenient alternative to that now, but one could add a new method on InlineErrorChain, and call that instead.

I'm not sure what you mean here - could you expand on that?

sruggier · 2026-02-26T16:37:58Z

Admittedly, for the places that are only calling to_string, there's no more convenient alternative to that now, but one could add a new method on InlineErrorChain, and call that instead.

I'm not sure what you mean here - could you expand on that?

Sure: it seems like the main problem here is that from a call site's perspective, to_string is overloaded, and we don't know whether the resulting string is going to contain the full error chain or not, because it depends on how the error type is implemented. The code is going to compile either way. Having a separate method to call would let us do something like this:

e => HttpError::for_internal_error(e.to_full_error_chain_string()),

or this:

e => HttpError::for_internal_error(e.display_full_error_chain().to_string())

and then that line would fail to compile if the underlying error type doesn't provide the method.

I just belatedly realized that deriving SlogInlineError doesn't implement Display, though, so this is a bit more of a logical leap from what the slog-error-chain provides right now.

jgallagher · 2026-02-26T16:42:05Z

The code is going to compile either way. Having a separate method to call would let us do something like this: ...

Ahh, got it. Yes, I think if we want to make more widespread use of the derive, having an explicit method like this that makes it obvious we're getting the full chain would be great.

I just belatedly realized that deriving SlogInlineError doesn't implement Display, though, so this is a bit more of a logical leap from what the slog-error-chain provides right now.

Yeah, Display is coming from implementing std::error::Error. Right now the derive only implements a handful of existing traits. Seems perfectly reasonable to me that we could add another trait that's exported in slog-error-chain that provides a to_full_error_chain_string() method, though, and make the derive also implement it.

sruggier · 2026-02-26T17:26:38Z

I suppose the logical conclusion of that approach is a trait that gets automatically implemented on other types that implement std::error::Error, such that bringing it in scope makes it possible to call a separate method on any error:

use slog_inline_error::SomeTrait
...
e => HttpError::for_internal_error(e.to_full_error_chain_string())

That wouldn't be materially different from how InlineErrorChain works now, though, so I'm questioning whether the added convenience is worth it. I guess the answer is probably yes, but the improvement is more marginal than I first thought.

sruggier · 2026-02-26T19:00:24Z

Looks like omicron-omdb::test_all_output test_omdb_success_cases failed because the output didn't match. I'm assuming that's unrelated to the diff here, and not something I can do anything about, but correct me if I'm wrong.

jgallagher · 2026-02-26T19:11:33Z

You're correct; that's a known-but-annoying test flake (#9230).

sruggier · 2026-02-26T22:28:19Z

As linked above, I ended up making a PR at oxidecomputer/slog-error-chain#20, whic uses a blanket implementation to provide all Errors with as_inline_error_chain and as_array_error_chain methods, making this possible:

e => HttpError::for_internal_error(e.as_inline_error_chain().to_string())

This style can be used with slog macros as well, so I think it's worth including. I don't consider the extra call to to_string to be too much of a burden, but it's always possible to add a second method, like to_inline_error_chain_string, if anyone has opinions about that.

jgallagher reviewed Feb 25, 2026

View reviewed changes

sled-agent/src/bootstrap/bootstore_setup.rs Outdated Show resolved Hide resolved

sruggier force-pushed the pr/error-logging-fixes branch from 03ceec1 to 627ef3b Compare February 25, 2026 23:48

sruggier added 2 commits February 26, 2026 09:44

sruggier force-pushed the pr/error-logging-fixes branch from 627ef3b to 889deab Compare February 26, 2026 14:45

sruggier mentioned this pull request Feb 26, 2026

Use blanket impls to add convenience methods to all errors oxidecomputer/slog-error-chain#20

Open

jgallagher merged commit 5b2c074 into oxidecomputer:main Feb 26, 2026
17 checks passed

Conversation

sruggier commented Feb 25, 2026

Uh oh!

sruggier commented Feb 25, 2026

Uh oh!

sruggier commented Feb 25, 2026

Uh oh!

jgallagher commented Feb 25, 2026

Uh oh!

Uh oh!

sruggier commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgallagher commented Feb 25, 2026

Uh oh!

sruggier commented Feb 25, 2026

Uh oh!

sruggier commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jgallagher commented Feb 26, 2026

Uh oh!

sruggier commented Feb 26, 2026

Uh oh!

jgallagher commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sruggier commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sruggier commented Feb 26, 2026

Uh oh!

jgallagher commented Feb 26, 2026

Uh oh!

sruggier commented Feb 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sruggier commented Feb 25, 2026 •

edited

Loading

sruggier commented Feb 26, 2026 •

edited

Loading

jgallagher commented Feb 26, 2026 •

edited

Loading

sruggier commented Feb 26, 2026 •

edited

Loading