Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
The following panic occurred while running the OP worker built against Vector commit dbc805a. I cannot reproduce it consistently, but it appears to be triggered when the Datadog metrics sink processes a histogram metric.
thread 'observability-pipelines-worker' (9158939) panicked at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/value.rs:349:21:
attempt to subtract with overflow
stack backtrace:
0: __rustc::rust_begin_unwind
at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
1: core::panicking::panic_fmt
at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
2: core::panicking::panic_const::panic_const_sub_overflow
at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:175:17
3: vector_core::event::metric::value::MetricValue::subtract
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/value.rs:349:21
4: vector_core::event::metric::data::MetricData::subtract
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/data.rs:148:20
5: vector_core::event::metric::Metric::subtract
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/mod.rs:388:19
6: vector::sinks::util::buffer::metrics::normalize::MetricSet::absolute_to_incremental
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/buffer/metrics/normalize.rs:633:27
7: vector::sinks::util::buffer::metrics::normalize::MetricSet::make_incremental
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/buffer/metrics/normalize.rs:574:42
8: <vector::sinks::datadog::metrics::normalizer::DatadogMetricsNormalizer as vector::sinks::util::buffer::metrics::normalize::MetricNormalize>::normalize
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/datadog/metrics/normalizer.rs:27:18
9: vector::sinks::util::buffer::metrics::normalize::MetricNormalizer<N>::normalize
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/buffer/metrics/normalize.rs:173:25
10: <vector::sinks::util::normalizer::Normalizer<St,N> as futures_core::stream::Stream>::poll_next
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/normalizer.rs:63:63
11: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/fuse.rs:53:39
12: <vector_stream::partitioned_batcher::PartitionedBatcher<St,Prt,KT,C,F,B> as futures_core::stream::Stream>::poll_next
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-stream/src/partitioned_batcher.rs:273:40
13: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/fuse.rs:53:39
14: <vector_stream::concurrent_map::ConcurrentMap<St,T> as futures_core::stream::Stream>::poll_next
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-stream/src/concurrent_map.rs:69:44
15: <futures_util::stream::stream::map::Map<St,F> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/map.rs:58:47
16: <futures_util::stream::stream::map::Map<St,F> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/map.rs:58:47
17: <futures_util::stream::stream::flatten::Flatten<St,<St as futures_core::stream::Stream>::Item> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/flatten.rs:55:65
18: <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/lib.rs:97:35
19: <futures_util::stream::stream::filter_map::FilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/filter_map.rs:79:68
20: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/fuse.rs:53:39
21: <futures_util::stream::stream::ready_chunks::ReadyChunks<St> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/ready_chunks.rs:40:40
22: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-core-0.3.31/src/stream.rs:130:33
23: futures_util::stream::stream::StreamExt::poll_next_unpin
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/mod.rs:1638:24
24: <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/next.rs:32:21
25: vector_stream::driver::Driver<St,Svc>::run::{{closure}}::{{closure}}
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/macros/select.rs:707:49
26: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/future/poll_fn.rs:151:9
27: vector_stream::driver::Driver<St,Svc>::run::{{closure}}
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-stream/src/driver.rs:123:13
28: vector::sinks::datadog::metrics::sink::DatadogMetricsSink<S>::run_inner::{{closure}}
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/datadog/metrics/sink.rs:148:14
29: <vector::sinks::datadog::metrics::sink::DatadogMetricsSink<S> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/datadog/metrics/sink.rs:164:31
30: <core::pin::Pin<P> as core::future::future::Future>::poll
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/future/future.rs:133:9
31: <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/sink.rs:181:30
32: <core::pin::Pin<P> as core::future::future::Future>::poll
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/future/future.rs:133:9
33: vector_core::sink::VectorSink::run::{{closure}}
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/sink.rs:24:55
34: vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/topology/builder.rs:670:18
35: <vector::topology::task::Task as core::future::future::Future>::poll
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/topology/task.rs:92:29
36: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:299:9
37: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:44
38: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:274:9
39: std::panicking::catch_unwind::do_call
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:590:40
40: ___rust_try
41: std::panicking::catch_unwind
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:553:19
42: std::panic::catch_unwind
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
43: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:9
44: vector::topology::handle_errors::{{closure}}
at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/topology/mod.rs:67:10
45: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tracing-0.1.44/src/instrument.rs:321:15
46: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/runtime/task/core.rs:365:24
47: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/loom/std/unsafe_cell.rs:16:9
48: tokio::runtime::task::core::Core<T,S>::poll
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/runtime/task/core.rs:354:30
49: tokio::runtime::task::harness::poll_future::{{closure}}
at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/runtime/task/harness.rs:535:30
50: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:274:9
51: std::panicking::catch_unwind::do_call
at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:590:40
... (truncated)
I noticed that this part of Vector was recently modified by #23374 and investigated the panic with Claude, which produced the following analysis:
How This Happens
This can occur when:
- Histogram bucket redistribution: A metric's value can shift between buckets between scrapes (e.g., if latency decreases, values move to lower buckets)
- Counter resets: If the source restarted, individual bucket counts may be lower than the previous scrape
- Non-monotonic sources: Some histogram sources don't guarantee monotonic bucket counts
What the panic is:
The panic occurs in MetricValue::subtract at line 349:
b1.count -= b2.count; // attempt to subtract with overflow
This happens when an AggregatedHistogram's individual bucket count in the new metric is less than the corresponding bucket count in the stored reference metric.
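A minimal sketch of this failure mode, assuming bucket counts are unsigned 64-bit integers. The `Bucket` struct and the `bucket_delta` helper are hypothetical stand-ins for illustration, not Vector's actual types. In a build with overflow checks enabled, `a -= b` on a `u64` panics when `b > a`; `checked_sub` reports the same condition as `None` instead.

```rust
/// Hypothetical stand-in for one histogram bucket (not Vector's real type).
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    upper_limit: f64,
    count: u64,
}

/// Returns the increment from `prev` to `curr`, or `None` when the current
/// count is lower than the stored one (the case where `-=` would overflow).
fn bucket_delta(curr: &Bucket, prev: &Bucket) -> Option<u64> {
    curr.count.checked_sub(prev.count)
}
```

With a stored count of 5 and an incoming count of 3, `bucket_delta` returns `None`, which is exactly the situation in which the plain subtraction panics.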
Why PR #23374 exposes this bug:
The key change in PR #23374 is the introduction of LRU-based cache eviction with optional capacity policies (max_bytes, max_events):
Before PR #23374:
- The MetricSet used an IndexMap without capacity limits
- Reference metrics were retained indefinitely (or until TTL expired)
- The reference metric always tracked the last seen absolute value
After PR #23374:
- The MetricSet uses an LRU cache with capacity-based eviction
- When max_bytes or max_events limits are hit, LRU entries are evicted
- If a reference metric is evicted, the next incoming metric becomes the new reference
Critical issue: Suppose a reference metric is evicted due to capacity limits, and a newer metric with lower bucket counts (after a source restart or counter reset) arrives first and becomes the new reference. If an older or stale reference is then somehow restored, or a time-skewed metric arrives, the subtraction overflows.
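The eviction behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ReferenceStore`, its `HashMap` backing, and the `evict` method are hypothetical and stand in for the real `MetricSet` and its LRU policy; the method name `absolute_to_incremental` is borrowed from the stack trace, but its body here is invented for illustration and uses a single `u64` instead of a histogram.

```rust
use std::collections::HashMap;

/// Toy reference store keyed by series name (hypothetical; not Vector's MetricSet).
struct ReferenceStore {
    refs: HashMap<String, u64>,
}

impl ReferenceStore {
    fn new() -> Self {
        Self { refs: HashMap::new() }
    }

    /// Stores `absolute` as the new reference and returns the increment over
    /// the previous reference. Returns `None` when there was no reference
    /// (first sighting or after eviction) or when the value went down.
    fn absolute_to_incremental(&mut self, series: &str, absolute: u64) -> Option<u64> {
        let previous = self.refs.insert(series.to_string(), absolute)?;
        absolute.checked_sub(previous)
    }

    /// Simulates capacity-based eviction of a single entry.
    fn evict(&mut self, series: &str) {
        self.refs.remove(series);
    }
}
```

After `evict`, the next value for that series produces no increment and simply becomes the new reference, so every later subtraction is made against whatever value happened to arrive first.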
The underlying bug (pre-existing):
The real bug is in MetricValue::subtract in value.rs. The guard condition at line 341 only checks:
buckets
if *count >= *count2 // total count check
&& buckets.len() == buckets2.len()
&& buckets.iter().zip(buckets2.iter()).all(|(b1, b2)| b1.upper_limit == b2.upper_limit)
It does NOT check that each b1.count >= b2.count, so individual bucket underflows can occur even when the total count check passes.
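To make the gap concrete, here is a small sketch that mirrors the three checks quoted above on a hypothetical `Bucket` type (not Vector's actual struct) and constructs an input that passes all of them while one bucket would still underflow. The helper names `guard_as_described`, `any_bucket_would_underflow`, and `example` are invented for this illustration.

```rust
/// Hypothetical stand-in for one histogram bucket (not Vector's real type).
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    upper_limit: f64,
    count: u64,
}

/// Mirrors the three checks quoted above: total count, length, upper limits.
fn guard_as_described(count: u64, buckets: &[Bucket], count2: u64, buckets2: &[Bucket]) -> bool {
    count >= count2
        && buckets.len() == buckets2.len()
        && buckets
            .iter()
            .zip(buckets2.iter())
            .all(|(b1, b2)| b1.upper_limit == b2.upper_limit)
}

/// True if any incoming bucket count is lower than the stored one.
fn any_bucket_would_underflow(buckets: &[Bucket], buckets2: &[Bucket]) -> bool {
    buckets.iter().zip(buckets2.iter()).any(|(b1, b2)| b1.count < b2.count)
}

/// Returns (new count, new buckets, stored count, stored buckets).
fn example() -> (u64, Vec<Bucket>, u64, Vec<Bucket>) {
    let stored = vec![
        Bucket { upper_limit: 1.0, count: 10 },
        Bucket { upper_limit: 5.0, count: 2 },
    ];
    let incoming = vec![
        Bucket { upper_limit: 1.0, count: 4 }, // lower than the stored 10
        Bucket { upper_limit: 5.0, count: 12 },
    ];
    (16, incoming, 12, stored)
}
```

For this input the total count grows from 12 to 16 and the bucket limits match, so the guard passes, yet the first bucket drops from 10 to 4 and `4 - 10` overflows.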
Likely scenario triggering this:
- Capacity-based eviction removes a reference metric
- A new metric comes in and becomes the reference
- Another instance of that metric arrives with a different bucket distribution (values redistributed across buckets, or individual bucket counters reset)
- The total count may be higher, but individual bucket counts can be lower
- Panic on subtraction
Fix:
The fix should be in MetricValue::subtract, which could do one of the following:
- Add per-bucket count validation to the guard condition
- Use saturating_sub for bucket counts
- Return false (triggering metric reinitialization) when any bucket would underflow
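A rough sketch of how the per-bucket validation and the "return false" options could combine, assuming a simplified histogram shape. The `Histogram` and `Bucket` types and this `subtract` method are hypothetical and only illustrate the idea; they are not the code in value.rs.

```rust
/// Hypothetical stand-in for one histogram bucket (not Vector's real type).
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    upper_limit: f64,
    count: u64,
}

/// Hypothetical simplified aggregated histogram.
#[derive(Debug, Clone, PartialEq)]
struct Histogram {
    buckets: Vec<Bucket>,
    count: u64,
    sum: f64,
}

impl Histogram {
    /// Subtracts `other` in place. Returns false and leaves `self` untouched
    /// when the shapes differ or any count would underflow, so the caller
    /// can treat the incoming metric as a fresh reference instead.
    fn subtract(&mut self, other: &Histogram) -> bool {
        let safe = self.count >= other.count
            && self.buckets.len() == other.buckets.len()
            && self.buckets.iter().zip(&other.buckets).all(|(b1, b2)| {
                // Added check: every bucket must be at least as large.
                b1.upper_limit == b2.upper_limit && b1.count >= b2.count
            });
        if !safe {
            return false;
        }
        for (b1, b2) in self.buckets.iter_mut().zip(&other.buckets) {
            b1.count -= b2.count; // cannot underflow after the check above
        }
        self.count -= other.count;
        self.sum -= other.sum;
        true
    }
}
```

Validating before mutating keeps the histogram unchanged on failure. `saturating_sub` would also avoid the panic, but it would clamp the affected bucket to zero, so the emitted bucket counts might no longer add up to the total count.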
Configuration
Version
0.52.0
Debug Output
Example Data
No response
Additional Context
No response