feat: Track background blob event processing#3278
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a storage-node metric that exposes the high-water mark of background blob-event processing completion, helping diagnose epoch-boundary blocking when EpochChangeStart waits for background workers to drain.
Changes:
- Introduce a new metric (
event_highest_background_processed_index) to report the highest background-processed event index (or-1if none). - Track/update this high-water mark from background event workers via
PendingEventCounter. - Add a unit test validating monotonic tracking behavior for the new metric.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| crates/walrus-service/src/node/metrics.rs | Defines the new metric and exposes helper setters/resetters (plus test-only getter). |
| crates/walrus-service/src/node/blob_event_processor/pending_events.rs | Extends PendingEventCounter to track highest processed event index and adds a focused unit test. |
| crates/walrus-service/src/node/blob_event_processor.rs | Wires the counter into background workers and resets the metric on processor initialization. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let event_index = event_index | ||
| .try_into() | ||
| .expect("event index should fit in i64"); |
There was a problem hiding this comment.
observe_processed_event converts event_index: u64 into i64 with expect(...), which can panic and take down the node if the event index ever exceeds i64::MAX (even if that’s unlikely). Metrics updates should be panic-free; consider either (a) switching the metric to a u64-backed gauge and representing the “unset” state separately, or (b) clamping/saturating the value at i64::MAX (and optionally logging a warn) instead of panicking.
| let event_index = event_index | |
| .try_into() | |
| .expect("event index should fit in i64"); | |
| let event_index = match i64::try_from(event_index) { | |
| Ok(event_index) => event_index, | |
| Err(_) => { | |
| tracing::warn!( | |
| event_index, | |
| max_supported_event_index = i64::MAX, | |
| "event index exceeds i64 range; clamping highest processed event index metric" | |
| ); | |
| i64::MAX | |
| } | |
| }; |
dfac675 to
f469f45
Compare
| #[derive(Debug, Clone)] | ||
| pub struct PendingEventCounter { | ||
| inner: Arc<AtomicU32>, | ||
| highest_processed_event_index: Arc<AtomicI64>, |
There was a problem hiding this comment.
It's unclear that this needs to be an I64, vs a U64, with a default value of 0. This might eliminate the need for the expect. It seems fine to be hand-wavy about the 0 case for metrics.
Description
At epoch boundaries,
EpochChangeStartwaits for background blob event processing to drain, but current metrics do not show how far those workers have progressed through event indexes. This metric gives a direct high water mark for worker-side processing so we can tell whether the boundary is still blocked on background event handling versus something later.Test plan
cargo test -p walrus-service tracks_highest_processed_event_indexRelease notes
Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.
For each box you select, include information after the relevant heading that describes the impact of your changes that
a user might notice and any actions they must take to implement updates. (Add release notes after the colon for each item)