Skip to content

Clean up some instances of mishandled error logging#9923

Merged
jgallagher merged 2 commits intooxidecomputer:mainfrom
sruggier:pr/error-logging-fixes
Feb 26, 2026
Merged

Clean up some instances of mishandled error logging#9923
jgallagher merged 2 commits intooxidecomputer:mainfrom
sruggier:pr/error-logging-fixes

Conversation

@sruggier
Copy link
Contributor

This cleans up handling of DdmError and ResolveError. It also updates any downstream code that was previously relying on the redundancy in DdmError's Display implementation, such that it instead relies on the implementation of slog::Value from the slog-error-chain crate.

Related to #9804 and #9803.

@sruggier
Copy link
Contributor Author

To clarify, I did this because I was looking for something to work on that would be useful, but not so urgent that I'd end up racing with someone else, and also feasible for me to tackle without any hand-holding. So far, it has felt like a good choice, but if I'm clashing with someone else's work, or this is a misguided thing to pursue for whatever other reason, please let me know. Otherwise, I plan to do a bit more similar work, and either update this PR or make another one, depending on whether anyone has reviewed it yet.

@sruggier
Copy link
Contributor Author

@jgallagher I don't have permission to request reviewers, and I also don't know who to ask, in general. I think you can help me with that, though, so I'm pinging you here. If I understood #9804 correctly, you'd be interested in merging this 🙂

@jgallagher
Copy link
Contributor

@jgallagher I don't have permission to request reviewers, and I also don't know who to ask, in general. I think you can help me with that, though, so I'm pinging you here. If I understood #9804 correctly, you'd be interested in merging this 🙂

I am, thanks! I agree that this is useful work that is pretty off-the-beaten-path. I do have a branch that updates a bunch of log sites within sled-agent that I haven't gotten polished enough to open a PR yet, but I don't think it overlaps with these changes very much.

@sruggier
Copy link
Contributor Author

sruggier commented Feb 25, 2026

I'm probably done for today, but if you're willing to push your work in progress to a branch (without making a PR) at some point in the near future, I can either steer clear of it or carry it forward, depending on what you think makes sense.

While I was auditing how EarlyNetworkSetupError is logged, I noticed that omicron_sled_agent::services::Error is a bit of a target-rich environment as far as this kind of cleanup is concerned, so I'm curious to know if I'd be conflicting with your changes if I start chipping away at some of the other errors it references.

@jgallagher
Copy link
Contributor

I was hitting mostly the logging sites and not the error definitions themselves, which is why I was guessing we wouldn't have too much overlap. My WIP branch is here: https://github.com/oxidecomputer/omicron/tree/john/log-more-errors-llm; it's probably pretty close to ready-to-PR shape, and I'll try to get that up some time soon.

@sruggier
Copy link
Contributor Author

Sure, but don't feel like you have to expedite that on my behalf. If I end up having to deal with some merge conflicts, it's not the end of the world.

Is there a particular reason not to derive SlogInlineError on the respective enums instead of using InlineErrorChain while logging? I like the idea of rendering the error with something other than Display, to make it impossible to compile code that will accidentally fail to log the entire chain.

Admittedly, for the places that are only calling to_string, there's no more convenient alternative to that now, but one could add a new method on InlineErrorChain, and call that instead.

@sruggier sruggier force-pushed the pr/error-logging-fixes branch from 03ceec1 to 627ef3b Compare February 25, 2026 23:48
As described in docs/error-types-and-logging.adoc, recursing into a
source's Display implementation can result in duplicated error text in
log output, so this commit updates the error attributes on DdmError's
variants to stop doing this.

Unfortunately, this type of change is breaking, in terms of how callers
must log an error, but the difference isn't visible to the type system,
and therefore won't result in compilation errors. Accordingly, this
commit also updates any callers that were previously relying on the old
Display implementation, so they walk the error chain instead.
Similar to the last commit, this updates the #[error(...)] attribute on
the DnsResolver variant on EarlyNetworkSetupError, to avoid recursing
into the Display implementation of ResolveError.

All callers appear to be handling EarlyNetworkSetupError correctly
already, so this commit doesn't make any other changes.
@sruggier sruggier force-pushed the pr/error-logging-fixes branch from 627ef3b to 889deab Compare February 26, 2026 14:45
@sruggier
Copy link
Contributor Author

sruggier commented Feb 26, 2026

I took a look at how the EarlyNetworkError variant is consumed, and ended up concluding that the full chain is dropped on this line. I was going to ask if that was intentional, but I see your branch fixes it, as of now, so I'm assuming it's safe to fix, and I'll look at cleaning up the other omicron_sled_agent::Error variants.

@jgallagher
Copy link
Contributor

Is there a particular reason not to derive SlogInlineError on the respective enums instead of using InlineErrorChain while logging? I like the idea of rendering the error with something other than Display, to make it impossible to compile code that will accidentally fail to log the entire chain.

No, not really. I was focusing on the logging sites because even if we derive SlogInlineError, if the logging sites still use some variant of err.to_string() or {err:?} we still won't get the full chain. But coming from the other direction (adding the derive and then updating the logging sites) is fine too.

Admittedly, for the places that are only calling to_string, there's no more convenient alternative to that now, but one could add a new method on InlineErrorChain, and call that instead.

I'm not sure what you mean here - could you expand on that?

@sruggier
Copy link
Contributor Author

Admittedly, for the places that are only calling to_string, there's no more convenient alternative to that now, but one could add a new method on InlineErrorChain, and call that instead.

I'm not sure what you mean here - could you expand on that?

Sure: it seems like the main problem here is that from a call site's perspective, to_string is overloaded, and we don't know whether the resulting string is going to contain the full error chain or not, because it depends on how the error type is implemented. The code is going to compile either way. Having a separate method to call would let us do something like this:

e => HttpError::for_internal_error(e.to_full_error_chain_string()),

or this:

e => HttpError::for_internal_error(e.display_full_error_chain().to_string())

and then that line would fail to compile if the underlying error type doesn't provide the method.

I just belatedly realized that deriving SlogInlineError doesn't implement Display, though, so this is a bit more of a logical leap from what the slog-error-chain provides right now.

@jgallagher
Copy link
Contributor

jgallagher commented Feb 26, 2026

The code is going to compile either way. Having a separate method to call would let us do something like this: ...

Ahh, got it. Yes, I think if we want to make more widespread use of the derive, having an explicit method like this that makes it obvious we're getting the full chain would be great.

I just belatedly realized that deriving SlogInlineError doesn't implement Display, though, so this is a bit more of a logical leap from what the slog-error-chain provides right now.

Yeah, Display is coming from implementing std::error::Error. Right now the derive only implements a handful of existing traits. Seems perfectly reasonable to me that we could add another trait that's exported in slog-error-chain that provides a to_full_error_chain_string() method, though, and make the derive also implement it.

@sruggier
Copy link
Contributor Author

sruggier commented Feb 26, 2026

I suppose the logical conclusion of that approach is a trait that gets automatically implemented on other types that implement std::error::Error, such that bringing it in scope makes it possible to call a separate method on any error:

use slog_inline_error::SomeTrait
...
e => HttpError::for_internal_error(e.to_full_error_chain_string())

That wouldn't be materially different from how InlineErrorChain works now, though, so I'm questioning whether the added convenience is worth it. I guess the answer is probably yes, but the improvement is more marginal than I first thought.

@sruggier
Copy link
Contributor Author

Looks like omicron-omdb::test_all_output test_omdb_success_cases failed because the output didn't match. I'm assuming that's unrelated to the diff here, and not something I can do anything about, but correct me if I'm wrong.

@jgallagher
Copy link
Contributor

You're correct; that's a known-but-annoying test flake (#9230).

@sruggier
Copy link
Contributor Author

As linked above, I ended up making a PR at oxidecomputer/slog-error-chain#20, whic uses a blanket implementation to provide all Errors with as_inline_error_chain and as_array_error_chain methods, making this possible:

e => HttpError::for_internal_error(e.as_inline_error_chain().to_string())

This style can be used with slog macros as well, so I think it's worth including. I don't consider the extra call to to_string to be too much of a burden, but it's always possible to add a second method, like to_inline_error_chain_string, if anyone has opinions about that.

@jgallagher jgallagher merged commit 5b2c074 into oxidecomputer:main Feb 26, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants