-
Notifications
You must be signed in to change notification settings - Fork 421
Introduce structured, span-based observability through Logger interface #3766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
👋 I see @tnull was un-assigned. |
d115f88
to
40a6022
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3766 +/- ##
==========================================
+ Coverage 88.94% 88.99% +0.05%
==========================================
Files 173 174 +1
Lines 123876 124808 +932
Branches 123876 124808 +932
==========================================
+ Hits 110177 111077 +900
+ Misses 11248 11247 -1
- Partials 2451 2484 +33
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
f7c48f7
to
bd59e5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, okay so there was def a miscommunication when we spoke about switching to an RAII span concept 😅. At a high level I wasn't anticipate we'd give spans for a total HTLC forwards - I was assuming we'd only do spans for basically function runtime in LDK - "how long did it take us to process events", "how long did it take us to process the pending forwards", etc. From there, we might give spans for things like monitor updates (including the parent idea, which is cool), so you could answer questions like "how much of the time spent processing pending forwards is spent in monitor updates".
By focusing on answering questions about specific function runtime the "parent" thing is super well-defined. But once we switch to non-functional spans, it becomes a bit murky - a monitor update might block several payment forwards, so should it have all of them as "parents"?
Of course using only functional blocks as spans makes it less able to answer the types of questions I think you specifically want, eg "how long is it taking us to forward most HTLCs", but as it lies its hard to take that and answer "why is it taking us long to forward some HTLCs".
We could also split the difference and have functional spans for LDK things, but also non-functional spans for outside-LDK things - ie the above spans plus a span for how long each async monitor update took to persist as well as how long it took to get a response from a peer once we send a message that needs a response. Sadly the latter is a bit annoying to build, commitment_signed -> revoke_and_ack latency is pretty easy to measure, though once we get the RAA we're still waiting on another commitment_signed in many cases. There may be a way to do it specific to an HTLC kinda like you have here but just around messages themselves.
WDYT?
cltv_expiry: u32, | ||
payment_hash: PaymentHash, | ||
state_wrapper: InboundHTLCStateWrapper, | ||
span: BoxedSpan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh man its a bit weird to start storing user objects in all our structs. I was kinda anticipating an RAII struct of our own that calls an end method on the Logger
, though thinking about it now its definitely obvious that letting the user define the struct is easier for them (I guess otherwise they'd have to keep a map of pending spans and drop the opentelemetry::Tracer
by hand?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah exactly, this was to avoid reconciliation on the user side.
Haha no worries, I imagined it it wasn't the same but wanted to write this up as a starting point. I agree on the difference between function scope spans and higher-level state machine spans and how they answer different questions. Given the amount of logic that depends on messages, timers, and asynchronous persistence completions, I think we would need both complementing each other to have a complete understanding.
If we're building higher-level state machine spans on top of messages outside of LDK as in https://discord.com/channels/915026692102316113/978829624635195422/1296877062015029258, would it lead to rebuilding/duplicating something similar to the Lightning state machine to get the full picture? It seems like I'm also curious what we should optimize for. I'm assuming we would like to have maximal observability, while minimizing overhead. Is maintaining span objects in the Lightning state machine too invasive? Would there be a more minimal event API that can be added to Having |
Oops, was travelling last week.
I think this is somewhat the key point - we want to be able to answer questions like how long its taking to send/receive/forward an HTLC but also importantly why. That may imply we need a bit of both - functional spans for CPU-blocked blocks (like
This still wouldn't quite answer the "why it it taking so long to forward HTLCs", but maybe if we joined it with blame on wait-time inside a channel:
From there we could answer the above question by looking at how long individual state transitions took - from It's pretty gnarly to actually answer the why, here, sadly, though - you have to track the HTLC:channel relation and then track time in the channel "states" during the time the HTLC is in different states...I suppose we could break out the HTLC state transitions into channel sub-states like
but that would require calling Of course this is all somewhat orthogonal to whether to do things RAII - either we require the user do the mapping on their side or we store the RAII objects and forward them around. Personally I prefer to not store user objects in our state machine, but I admit I don't really have a strong argument for not doing so - there's not really much difference in terms of runtime switching, but it does seem maybe-useful for the user to get a |
🔔 1st Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 2nd Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 3rd Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 4th Reminder Hey @tnull! This PR has been waiting for your review. |
fn log(&self, record: Record); | ||
/// Indicates the start of a span of computation. | ||
/// The returned object will be dropped when the span ends. | ||
fn start(&self, span: Span, parent: Option<&Self::UserSpan>) -> Self::UserSpan; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what you thought of my comment above? The simplified "parent" approach here isn't gonna allow us to answer the "why" question on an HTLC, and is kinda confusing (like, I'd naively expect a parent OutboundHTLC
span on a WaitingOnMonitorPersist
span because the outbound HTLC is waiting on a monitor persist for some part of its life).
At a minimum we need to provide enough info to see the mapping to "real" parents, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking a tree of spans would allow us to fully account for the entire lifetime duration of an HTLC, given enough spans. It would indeed not contain real/unique channel monitor persistence spans due to the HTLC:channel relation, it would be only from the perspective of each HTLC. Would that miss part of the "why"?
The tree-based approach was chosen to be compatible with tracing implementations and get out-of-the-box visualizations, but additional relationships could be represented as id attributes on the Span
variants I think. To your point about the mapping, perhaps HTLC traces and channel monitor persistence traces can be created independently, and correlated based on monitor update id attributes.
I'd naively expect a parent OutboundHTLC span on a WaitingOnMonitorPersist span because the outbound HTLC is waiting on a monitor persist for some part of its life
This is mostly the approach I took for the latest commits. Is the confusing part that each HTLC would have its own denormalized WaitingOnMonitorPersist
span?
Apologies for the delay here, a project at work came up that needed to be prioritized. I rebased the PR and added additional
Right, the HTLC:channel relation can't be represented simply, but you can indeed break out the spans from the viewpoint of an HTLC, which feels like the right approach to me. If an HTLC needs to wait for a remote revocation, an HTLC-specific span can be started at that point in time until it arrives, even though other HTLCs might have started waiting for that revocation beforehand. These spans would be specifically from the viewpoint of each HTLC, but I believe it could explain the why? When complete, the The spans from the viewpoint of each HTLC might be complemented by other metrics/spans, such as measuring the duration of |
This change allows users to create hierarchical span objects through the Logger interface for specific computations, such as the handling of HTLCs. These span objects will be held in LDK across the corresponding lifetimes before being dropped, providing insight in durations and latencies.
🔔 5th Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 6th Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 7th Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 8th Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 9th Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 10th Reminder Hey @tnull! This PR has been waiting for your review. |
🔔 11th Reminder Hey @tnull! This PR has been waiting for your review. |
We discussed this a bit at the meeting last Monday, and I suggested looking into a more event-based approach, which is simpler to track on the LDK end, and, I think, done right, doesn't require a lot more tracking on the client end to get the same results, but also might allow other types of results for users who want it. I'm happy to discuss further here if you prefer the more tracing approach we have here. |
This change allows users to create hierarchical span objects through the
Logger
interface for specific computations, such as the handling of HTLCs. These span objects will be held in LDK across the corresponding lifetimes before being dropped, providing insight in durations and latencies.This API is designed to be compatible with https://docs.rs/opentelemetry/latest/opentelemetry/trace/trait.Tracer.html, but can also be used more directly to derive time-based metrics or to log durations.
Hierarchical RAII spans are currently added for:
functional_tests.rs
).