Panic in DatadogMetricsSink: "attempt to subtract with overflow" #24415

@gwenaskell

Description

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

The following panic occurred while running the Observability Pipelines (OP) worker with Vector hash dbc805a. I cannot reproduce it consistently, but it appears to be caused by the processing of a histogram metric in the Datadog metrics sink.

thread 'observability-pipelines-worker' (9158939) panicked at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/value.rs:349:21:
attempt to subtract with overflow
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
   2: core::panicking::panic_const::panic_const_sub_overflow
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:175:17
   3: vector_core::event::metric::value::MetricValue::subtract
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/value.rs:349:21
   4: vector_core::event::metric::data::MetricData::subtract
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/data.rs:148:20
   5: vector_core::event::metric::Metric::subtract
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/event/metric/mod.rs:388:19
   6: vector::sinks::util::buffer::metrics::normalize::MetricSet::absolute_to_incremental
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/buffer/metrics/normalize.rs:633:27
   7: vector::sinks::util::buffer::metrics::normalize::MetricSet::make_incremental
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/buffer/metrics/normalize.rs:574:42
   8: <vector::sinks::datadog::metrics::normalizer::DatadogMetricsNormalizer as vector::sinks::util::buffer::metrics::normalize::MetricNormalize>::normalize
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/datadog/metrics/normalizer.rs:27:18
   9: vector::sinks::util::buffer::metrics::normalize::MetricNormalizer<N>::normalize
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/buffer/metrics/normalize.rs:173:25
  10: <vector::sinks::util::normalizer::Normalizer<St,N> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/util/normalizer.rs:63:63
  11: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/fuse.rs:53:39
  12: <vector_stream::partitioned_batcher::PartitionedBatcher<St,Prt,KT,C,F,B> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-stream/src/partitioned_batcher.rs:273:40
  13: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/fuse.rs:53:39
  14: <vector_stream::concurrent_map::ConcurrentMap<St,T> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-stream/src/concurrent_map.rs:69:44
  15: <futures_util::stream::stream::map::Map<St,F> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/map.rs:58:47
  16: <futures_util::stream::stream::map::Map<St,F> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/map.rs:58:47
  17: <futures_util::stream::stream::flatten::Flatten<St,<St as futures_core::stream::Stream>::Item> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/flatten.rs:55:65
  18: <futures_util::stream::stream::FlatMap<St,U,F> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/lib.rs:97:35
  19: <futures_util::stream::stream::filter_map::FilterMap<St,Fut,F> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/filter_map.rs:79:68
  20: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/fuse.rs:53:39
  21: <futures_util::stream::stream::ready_chunks::ReadyChunks<St> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/ready_chunks.rs:40:40
  22: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-core-0.3.31/src/stream.rs:130:33
  23: futures_util::stream::stream::StreamExt::poll_next_unpin
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/mod.rs:1638:24
  24: <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/stream/stream/next.rs:32:21
  25: vector_stream::driver::Driver<St,Svc>::run::{{closure}}::{{closure}}
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/macros/select.rs:707:49
  26: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/future/poll_fn.rs:151:9
  27: vector_stream::driver::Driver<St,Svc>::run::{{closure}}
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-stream/src/driver.rs:123:13
  28: vector::sinks::datadog::metrics::sink::DatadogMetricsSink<S>::run_inner::{{closure}}
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/datadog/metrics/sink.rs:148:14
  29: <vector::sinks::datadog::metrics::sink::DatadogMetricsSink<S> as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/sinks/datadog/metrics/sink.rs:164:31
  30: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/future/future.rs:133:9
  31: <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/sink.rs:181:30
  32: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/future/future.rs:133:9
  33: vector_core::sink::VectorSink::run::{{closure}}
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/lib/vector-core/src/sink.rs:24:55
  34: vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/topology/builder.rs:670:18
  35: <vector::topology::task::Task as core::future::future::Future>::poll
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/topology/task.rs:92:29
  36: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:299:9
  37: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:44
  38: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:274:9
  39: std::panicking::catch_unwind::do_call
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:590:40
  40: ___rust_try
  41: std::panicking::catch_unwind
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:553:19
  42: std::panic::catch_unwind
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
  43: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:9
  44: vector::topology::handle_errors::{{closure}}
             at /.../.cargo/git/checkouts/vector-7010c25277c07669/dbc805a/src/topology/mod.rs:67:10
  45: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tracing-0.1.44/src/instrument.rs:321:15
  46: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/runtime/task/core.rs:365:24
  47: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/loom/std/unsafe_cell.rs:16:9
  48: tokio::runtime::task::core::Core<T,S>::poll
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/runtime/task/core.rs:354:30
  49: tokio::runtime::task::harness::poll_future::{{closure}}
             at /.../.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.48.0/src/runtime/task/harness.rs:535:30
  50: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:274:9
  51: std::panicking::catch_unwind::do_call
             at /.../.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/std/src/panicking.rs:590:40

... (truncated)

I noticed that this part of Vector was recently modified by #23374. I investigated the panic with Claude, which produced the following analysis:

How This Happens

This can occur in the following situations:

  • Histogram bucket redistribution: the distribution of values across buckets can shift between scrapes (e.g., if latency decreases, more values land in lower buckets)
  • Counter resets: if the source restarted, individual bucket counts may be lower than in the previous scrape
  • Non-monotonic sources: some histogram sources don't guarantee monotonic bucket counts

What the panic is:

The panic occurs in MetricValue::subtract at line 349:

b1.count -= b2.count;  // attempt to subtract with overflow

This happens when an AggregatedHistogram's individual bucket count in the new metric is less than the corresponding bucket count in the stored reference metric.
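The panic message itself comes from Rust's integer overflow checks: when they are enabled (the default in debug builds, and opt-in for other profiles through the `overflow-checks` setting), `-=` on an unsigned integer panics if the result would be negative. A minimal sketch, assuming bucket counts are `u64`; `bucket_delta` is a hypothetical helper, not a Vector function:

```rust
/// Difference between the current and previous count of a single bucket.
///
/// `current - previous` would panic with "attempt to subtract with overflow"
/// under overflow checks when `previous > current`; `checked_sub` reports
/// that case as `None` instead.
fn bucket_delta(current: u64, previous: u64) -> Option<u64> {
    current.checked_sub(previous)
}
```

Calling `bucket_delta(3, 5)` yields `None`, which is the case the plain `-=` on line 349 cannot represent.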

Why PR #23374 exposes this bug:
The key change in PR #23374 is the introduction of LRU-based cache eviction with optional capacity policies (max_bytes, max_events):

Before PR #23374:

  • The MetricSet used an IndexMap without capacity limits
  • Reference metrics were retained indefinitely (or until TTL expired)
  • The reference metric always tracked the last seen absolute value

After PR #23374:

  • The MetricSet uses an LRU cache with capacity-based eviction
  • When max_bytes or max_events limits are hit, LRU entries are evicted
  • If a reference metric is evicted, the next incoming metric becomes the new reference
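The effect of an eviction on the absolute-to-incremental conversion can be illustrated with a deliberately simplified sketch. Everything here is a hypothetical stand-in rather than Vector's `MetricSet`: `ReferenceStore` keeps one `u64` per series in a plain `HashMap`, and `evict` simulates a capacity-based eviction by removing the entry.

```rust
use std::collections::HashMap;

/// Hypothetical reference store: last absolute value seen per series.
struct ReferenceStore {
    last_seen: HashMap<String, u64>,
}

impl ReferenceStore {
    fn new() -> Self {
        Self { last_seen: HashMap::new() }
    }

    /// Returns the increment since the stored reference, or `None` when no
    /// usable reference exists (first sighting, after an eviction, or a value
    /// below the reference). The incoming value always becomes the new
    /// reference.
    fn absolute_to_incremental(&mut self, series: &str, value: u64) -> Option<u64> {
        let delta = self
            .last_seen
            .get(series)
            .and_then(|previous| value.checked_sub(*previous));
        self.last_seen.insert(series.to_string(), value);
        delta
    }

    /// Simulates capacity-based eviction of a single series.
    fn evict(&mut self, series: &str) {
        self.last_seen.remove(series);
    }
}
```

In this sketch, the first value seen after `evict` only re-seeds the reference, so no increment is produced for it; the following value is then compared against that new reference.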

Critical issue: after a capacity-based eviction, the next metric to arrive becomes the reference. If a later metric for the same series then carries a lower count in some bucket (after a source restart or counter reset, or because a stale or time-skewed metric arrives out of order), the subtraction overflows.

The underlying bug (pre-existing):
The real bug is in MetricValue::subtract in value.rs. The guard condition at line 341 checks only the total count, the number of buckets, and the bucket upper limits:

if *count >= *count2  // total count check
    && buckets.len() == buckets2.len()
    && buckets.iter().zip(buckets2.iter()).all(|(b1, b2)| b1.upper_limit == b2.upper_limit)

It does NOT check that each b1.count >= b2.count, so individual bucket underflows can occur even when the total count check passes.
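The gap can be shown with concrete numbers. The sketch below assumes a simplified `Bucket` type with an `f64` upper limit and a `u64` count; `quoted_guard_passes`, `first_underflowing_bucket`, and `buckets` are hypothetical helpers that mirror the three quoted conditions, not Vector's actual code.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    upper_limit: f64,
    count: u64,
}

/// Builds buckets from `(upper_limit, count)` pairs.
fn buckets(pairs: &[(f64, u64)]) -> Vec<Bucket> {
    pairs
        .iter()
        .map(|&(upper_limit, count)| Bucket { upper_limit, count })
        .collect()
}

/// Mirrors the quoted guard: total count, bucket count, matching limits.
fn quoted_guard_passes(count: u64, buckets: &[Bucket], count2: u64, buckets2: &[Bucket]) -> bool {
    count >= count2
        && buckets.len() == buckets2.len()
        && buckets
            .iter()
            .zip(buckets2.iter())
            .all(|(b1, b2)| b1.upper_limit == b2.upper_limit)
}

/// Index of the first bucket whose count is lower than the reference, if any.
fn first_underflowing_bucket(buckets: &[Bucket], buckets2: &[Bucket]) -> Option<usize> {
    buckets
        .iter()
        .zip(buckets2.iter())
        .position(|(b1, b2)| b1.count < b2.count)
}
```

With a reference of counts 5, 3, 2 (total 10) and a new metric of counts 4, 6, 3 (total 13) over the same limits, `quoted_guard_passes` returns `true` while `first_underflowing_bucket` returns `Some(0)`: the total grew, yet the first bucket would underflow.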

Likely scenario triggering this:

  • Capacity-based eviction removes a reference metric
  • A new metric comes in and becomes the reference
  • Another instance of that metric arrives with different bucket distribution (redistribution of values across buckets, or counter reset of individual buckets)
  • The total count may be higher, but individual bucket counts can be lower
  • Panic on subtraction

Fix:

The fix should be in MetricValue::subtract, which should do one of the following:

  • Add per-bucket count validation to the guard condition
  • Use saturating_sub for bucket counts
  • Return false (triggering metric reinitialization) when any bucket would underflow
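One possible shape for the first and third options combined is sketched below. It is an illustration under assumptions, not a patch against value.rs: `Bucket` is a simplified stand-in type and `subtract_buckets` is a hypothetical function name.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    upper_limit: f64,
    count: u64,
}

/// Subtracts `previous` from `current` in place.
///
/// Returns `false` and leaves `current` untouched if the bucket layouts
/// differ or if any bucket count would underflow.
fn subtract_buckets(current: &mut [Bucket], previous: &[Bucket]) -> bool {
    if current.len() != previous.len() {
        return false;
    }
    // Validate every bucket before mutating anything, so a failed
    // subtraction never leaves `current` half-updated.
    let mut deltas = Vec::with_capacity(current.len());
    for (b1, b2) in current.iter().zip(previous) {
        if b1.upper_limit != b2.upper_limit {
            return false;
        }
        match b1.count.checked_sub(b2.count) {
            Some(delta) => deltas.push(delta),
            None => return false,
        }
    }
    for (b1, delta) in current.iter_mut().zip(deltas) {
        b1.count = delta;
    }
    true
}
```

Compared with `saturating_sub`, which would clamp an underflowing bucket to zero and silently under-report, returning `false` lets the caller treat the incoming metric as a fresh reference.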

Version

0.52.0

References

#23374

Labels

type: bug (a code-related bug)
