chore(buffers): Rework counter/gauge increment handling#23542
chore(buffers): Rework counter/gauge increment handling#23542
Conversation
The counters that come in from the internal events can never contain negative values, so handling them is completely unnecessary.
Note that the tests on the state of the gauges have been dropped as they now would only test the proper operation of the metrics recorder.
This ends up dropping all the new tests, as they just worked on the `BUFFER_COUNTERS` map which was able to be removed with this change.
The values in the `total_*` counters are manipulated identically to the other fields, so we can drop them now.
pront
left a comment
There was a problem hiding this comment.
The gauges have their values set instead of incremented from the same values. This is the part that really doesn't make sense to me, but is an outflow of all the refactoring.
This is concerning but also it an existing issue.
@bruceg |
The main reason to use |
Are you saying that we are seeing values that are wrapping around an |
If I traced the logic right, the upstream values for those increments are always positive as they are communicated in unsigned integers. Did I mix up a subtraction? |
"You mentioned this doesn't deal with negatives, but I point to |
Maybe my previous note was a bit unclear. The change from increment() to set() isn't about a single operation going negative. It's about making the metric more robust. The old increment-based approach is fragile because it assumes we'll never miss an event and get them in correct time. With set(), we're reporting the actual, current state. This makes the metric self-healing and immune to missed events or race conditions. |
|
I think I found a small but critical issue. The gauge is not tracking the total buffer's size, it's only tracking the size of the most recent batch of events. For reference, here is the approach from my PR: https://github.com/vectordotdev/vector/pull/23518/files#diff-69337e368c06496cca89f39bdf229fb2ebe3c0413c40f3fc0cd9d19d4fdea177R53 |
|
I do appreciate the discussion here. Let's allow enough time to get this right. While we are busy preparing the next release, I have one time sensitive question. Are any of the findings here indicate that we have a gap on |
I believe we can go ahead, but @bruceg voice will be more valuable |
Granted, yes. However, the values being added are always converted from unsigned and so start as positive, so the
Agree 100% here. My concern is that it appears, when all the refactoring is done, that they are
I think, given the above, there has been a regression merged albeit one that is not tested. It would be great to set up some test cases that assert the correct behavior, though, to be able to say for sure. The test above is a good start but unfortunately does not validate that the buffer mechanism itself adjusts the gauges to the correct values. |
|
FYI: To mitigate risk, we’ll revert the merged changes in #23556. |
What regression do you mean? The solution that was merged into |
Do you have a configuration or other test case that can reproduce the original problem of negative metrics reported? I understand that may be difficult if it is a race that produces them, but within the buffers crate we can control the two halves to make that less of an issue. |
|
After re-examining all three patch sets, I noticed that I made a mistake in this step by dropping several negations. That skewed all the following refactoring steps and resulted in a broken result, which makes sense of my confusion at the start. I will have to rework this to resolve that. I still believe there is a much simpler solution hiding in the resulting code, but I can see that the current code is more correct in two aspects than the original gauges:
|
|
I will open a new PR when I have finished fixing the refactoring. I have now discovered two separate issues, the second of which is harder to compensate for. |
Summary
This is a step-by-step refactoring of the changes made in #23453 and #23507 in order to remove the need for the shared
BUFFER_COUNTERSdashmap. While the whole change is rather large, I attempted to keep each commit from altering externally-visible behavior, with the exception of additional labels in the buffer create gauges. As such, reviewing each individual commit is likely more useful than reviewing the whole code change.However, this leaves a bit of a conundrum: If we look at the changes to
lib/vector-buffers/srcstarting immediately before the above changes (have to manually scroll down to the buffer source files, the direct GitHub link doesn't quite work), the result doesn't entirely make sense:buffer_idadded to disambiguate metrics with the same ID but different stages. I thought, however, that the buffers would have been created in the context of a component and, as such, would already have the component ID in their tags. If not, this is a real bug, maybe even the real bug. Since the buffer_id corresponds to a component ID, maybe that particular tag name is less than ideal.Vector configuration
N/A
How did you test this PR?
Tested using the unit tests, all of the above should be behavior-preserving.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References
Notes
@vectordotdev/vectorto reach out to us regarding this PR.pre-pushhook, please see this template.cargo fmt --allcargo clippy --workspace --all-targets -- -D warningscargo nextest run --workspace(alternatively, you can runcargo test --all)git merge origin masterandgit push.Cargo.lock), pleaserun
cargo vdev build licensesto regenerate the license inventory and commit the changes (if any). More details here.