chore(buffers): Rework gauge metric handling#23561
Conversation
Note that the tests on the state of the gauges have been dropped as they now would only test the proper operation of the metrics recorder.
Corresponds to the existing `buffer_discarded_events_total` counter and `buffer_byte_size` gauge.
The "safe" conversion from u64 values into f64 clamped integers that cannot be exactly represented to the maximum safe integer value. These converted values were always used to `set` a gauge value. When looking at that gauge, however, retaining the magnitude is more important than keeping it exactly representable as an integer, so the safe conversion is actually lossy in the wrong direction.
There was a problem hiding this comment.
Pull Request Overview
This pull request refactors the gauge metric handling in the buffer system to eliminate the need for a shared BUFFER_COUNTERS DashMap, improving reliability and adding better labeling. The changes shift from delta-based gauge updates to absolute value tracking using a current metric state.
Key changes:
- Removes the shared
BUFFER_COUNTERSDashMap and replaces it with absolute value tracking - Adds
buffer_idlabels to buffer gauge metrics for better disambiguation - Implements thread-safe counter updates with overflow protection using saturating arithmetic
- Introduces a
currentstate tracker inBufferUsageDatato maintain accurate gauge values
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
lib/vector-buffers/src/lib.rs |
Removes the now-unused cast_utils module reference |
lib/vector-buffers/src/internal_events.rs |
Refactors events to include buffer_id labels and use absolute gauge values instead of delta updates |
lib/vector-buffers/src/cast_utils.rs |
Removes the entire file containing safe casting utilities that are no longer needed |
lib/vector-buffers/src/buffer_usage_data.rs |
Adds current state tracking and thread-safe counter updates with overflow protection |
graphcareful
left a comment
There was a problem hiding this comment.
Just want to leave a note, this PR was super easy to review when looking at it commit at a time.
|
LGTM, I closed my PR, this solution looks better. |
@vparfonov perhaps you can help us verify with your setup that this PR yields the same metrics as |
pront
left a comment
There was a problem hiding this comment.
LGTM. It would also be nice to get validation per #23561 (comment)
Sure, I will test tomorrow morning and back here with results |
|
@vparfonov this testing you are able to do, would there be any way that could be worked into a unit test? It would be good to add that as evidence things are working as intended. |
👍 thanks
we can do it as a follow-up if possible. |
Summary
This is a step-by-step refactoring of the changes made in #23453 and #23507 in order to remove the need for the shared
BUFFER_COUNTERSdashmap. While the whole change is rather large, I attempted to keep each commit from altering externally-visible behavior, with the exception of additional labels in the buffer create gauges. As such, reviewing each individual commit is likely more useful than reviewing the whole code change.If you compare this PR to the code base starting immediately before the two PRs referenced above (have to manually scroll down to the buffer source files, the direct GitHub link doesn't quite work), the improvements are evident:
buffer_idlabel to disambiguate the metrics with the same ID but different stages. Note that the buffers are created in the context of a component and, as such, should (may?) already have the component ID in their tags. If not, this is a real bug, maybe even the real bug. Since thebuffer_idcorresponds to a component ID, maybe that particular tag name is less than ideal.currentvalues set, allowing us tosetthe gauges instead ofincrementingthem, resulting in more reliable values there.saturating_add, which prevents them from overflowing, and aclamp, which prevents them from going negative.Note: This is a rework of #23542 to fix some bugs I introduced in my line of changes, so see that PR for more context.
Vector configuration
N/A
How did you test this PR?
Tested using the unit tests, all of the above should be behavior-preserving.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References
Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.cargo fmt --allcargo clippy --workspace --all-targets -- -D warningscargo nextest run --workspace(alternatively, you can runcargo test --all)git merge origin masterandgit push.Cargo.lock), pleaserun
cargo vdev build licensesto regenerate the license inventory and commit the changes (if any). More details here.