
add burst monitoring for global task injector#85

Open
zyguan wants to merge 8 commits into tikv:master from zyguan:dev/observability-2

Conversation


zyguan (Contributor) commented Feb 10, 2026

This PR adds burst monitoring for the global task injector to track maximum enqueue throughput (tasks/sec) since the last scrape.

  • Add MaxGauge and MaxGaugeVec types that track maximum values and reset on scrape/collection.
  • Implement BurstMonitor that calculates throughput based on task sampling and records it in QUEUE_CORE_BURST_THROUGHPUT.
  • Wire BurstMonitor into the QueueCore and expose configuration through the Builder.
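The reported quantity is simple at its core: once a sample window of enqueues completes, the monitor divides the sample size by the elapsed wall time. A standalone sketch of that arithmetic (the function name is hypothetical and not the PR's actual code):

```rust
use std::time::Instant;

// Hypothetical illustration of the value a burst monitor would report:
// a fixed number of sampled enqueues divided by the elapsed window time.
fn burst_throughput(sample_size: u64, window_start: Instant) -> f64 {
    let elapsed = window_start.elapsed().as_secs_f64();
    if elapsed <= 0.0 {
        return 0.0; // guard against a zero-length window
    }
    sample_size as f64 / elapsed // tasks per second
}
```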

Summary by CodeRabbit

  • New Features

    • Optional burst monitoring for task queues to sample enqueue bursts and report peak per-burst throughput between scrapes.
    • Configurable controls for burst monitoring (per-worker multiplier, minimum sample size).
    • New Prometheus metric exposing max enqueue throughput since last scrape; values reset after collection.
  • Tests

    • Added tests validating burst throughput sampling, metric reporting accuracy, and reset behavior (note: duplicate test entry present).

Signed-off-by: zyguan <zhongyangguan@gmail.com>

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Adds burst monitoring: a MaxGauge metric/vector for max enqueue throughput, a BurstMonitor sampling on enqueues, Builder support to enable monitoring, QueueCore updated to accept an optional BurstMonitor, and tests exercising throughput metrics and reset behavior.

Changes

  • Metrics Infrastructure (src/metrics.rs): Added MaxGauge, MaxGaugeVecBuilder, MaxGaugeVec (atomic max-tracking, reset-on-collect), Collector/Metric impls, tests, and the public static QUEUE_CORE_BURST_THROUGHPUT.
  • Builder & Config (src/pool/builder.rs): Added BurstMonitorConfig, a burst_monitoring: Option<...> field on Builder, enable_burst_monitoring(...), and updated freeze_with_queue to construct and pass an optional BurstMonitor into QueueCore::new.
  • Burst Monitoring & QueueCore (src/pool/spawn.rs): Added BurstMonitor (constructor and on_enqueue sampling), a now_ns() helper, made QueueCore carry Option<BurstMonitor>, updated the constructor signature, and invoked sampling from push.
  • Tests & Call Sites (src/pool/tests.rs, src/queue/priority.rs): Added test_burst_monitoring (duplicated) asserting throughput via QUEUE_CORE_BURST_THROUGHPUT; updated QueueCore::new(..., None) call sites to match the new constructor signature.
  • Minor Refactors (src/queue.rs, src/task/future.rs): Derived Default for QueueType (removing the manual impl); adjusted thread_local! initialization in task/future.rs to the const block syntax.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Task Submission
    participant QueueCore
    participant BurstMonitor
    participant Metrics as MaxGaugeVec
    participant Scraper as Prometheus Scraper

    Client->>QueueCore: push(task)
    QueueCore->>BurstMonitor: on_enqueue(active_workers) (if Some)

    rect rgba(100,200,100,0.5)
    Note over BurstMonitor: sample timing & counts, compute throughput
    BurstMonitor->>Metrics: observe(throughput_value)
    end

    Scraper->>Metrics: collect()
    rect rgba(100,100,200,0.5)
    Note over Metrics: emit max since last scrape and reset
    Metrics-->>Scraper: metric_data
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I watch the queues and count each sprint and hop,
Peaks tucked in MaxGauge until the scraper's stop,
I sample bursts and whisper numbers new,
Then clear my tally — ready for the next queue,
🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 64.29%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'add burst monitoring for global task injector' accurately and directly describes the main objective of the PR, which is to implement burst monitoring.



No actionable comments were generated in the recent review. 🎉




coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/metrics.rs`:
- Around lines 120-129: MaxGauge::observe currently reads self.gauge.get() and then sets it, a TOCTOU race that can regress the max. Replace the read-then-set with a lock-free compare-and-swap loop on an AtomicU64 storing f64 bits: load the current bits, compute new_bits = f64::to_bits(v), and if v is larger than the current value, attempt compare_exchange (with appropriate Orderings), retrying on failure until it succeeds or the current value is already >= v. Use f64::from_bits where needed and operate on the gauge's underlying AtomicU64 (or change the gauge to hold one) so concurrent observers cannot overwrite a larger value with a smaller one.
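The fix described above can be sketched as a standalone type (names are hypothetical; the PR's real code wraps a prometheus Gauge). The float comparison is only sound here because throughput values are non-negative, a caveat the later review comments return to:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal max-tracking cell: an f64 stored as its bit pattern in an
// AtomicU64, updated with a CAS retry loop so a concurrent smaller
// observation can never overwrite a larger one.
struct MaxCell {
    bits: AtomicU64,
}

impl MaxCell {
    fn new() -> Self {
        Self {
            bits: AtomicU64::new(0f64.to_bits()),
        }
    }

    fn observe(&self, v: f64) {
        let new_bits = v.to_bits();
        let mut current = self.bits.load(Ordering::Relaxed);
        // Retry until we install the larger value or see that the
        // stored value is already >= v.
        while f64::from_bits(current) < v {
            match self.bits.compare_exchange_weak(
                current,
                new_bits,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => break,
                Err(actual) => current = actual,
            }
        }
    }

    fn get(&self) -> f64 {
        f64::from_bits(self.bits.load(Ordering::Relaxed))
    }
}
```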

In `@src/pool/tests.rs`:
- Around lines 299-332: The timing-sensitive assertions in test_burst_monitoring are too strict and can flake under CI. Make the throughput checks more robust by either increasing the sleep durations around spawn_n to reduce relative jitter (e.g., doubling the sleep times) or widening the assertion tolerance on QUEUE_CORE_BURST_THROUGHPUT (the gauge checks after the spawn_n sequences) to a safer margin (e.g., ±50 or a percentage-based threshold), so the asserts no longer fail from small OS scheduling delays during the 200ms/400ms windows.
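A percentage-based tolerance, as suggested, could look like this hypothetical helper (not part of the PR):

```rust
// Returns true when `actual` is within `pct` percent of `expected`.
// Using a relative rather than absolute margin keeps the test meaningful
// across different throughput magnitudes.
fn within_pct(actual: f64, expected: f64, pct: f64) -> bool {
    (actual - expected).abs() <= expected.abs() * pct / 100.0
}
```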
🧹 Nitpick comments (3)
src/metrics.rs (1)

131-149: Reset-on-read in both Collector::collect and Metric::metric — clarify intended usage.

Both collect() (Line 137-138) and metric() (Line 145-146) reset the gauge to 0.0 after reading. When MaxGaugeVec::collect() is called, the inner MetricVec calls metric() on each child MaxGauge, which triggers the reset. This is the correct path.

However, the standalone Collector::collect() on MaxGauge is also callable. If someone inadvertently registers a single MaxGauge as a Collector and it's also part of a MaxGaugeVec, both paths would race to reset. Consider adding a doc comment clarifying that MaxGauge instances obtained from MaxGaugeVec should only be reset via the vec's collect().

src/pool/spawn.rs (1)

61-97: Relaxed ordering on sample_target and start_ns can cause stale reads across threads.

When Thread A (the one with new_count == 1) stores sample_target and start_ns with Relaxed ordering (lines 69-70), another Thread B that subsequently reaches the target check (line 73-76) or reads start_ns (line 78) is not guaranteed to observe those stores — Relaxed provides no cross-thread visibility ordering.

In practice, this is likely benign because:

  1. If target is read as 0 (stale), the thread returns early — no harm.
  2. A stale start_ns would only slightly skew the throughput measurement.

However, if correctness of the throughput measurement matters more than the comment suggests, consider using Release on the stores in the new_count == 1 branch and Acquire on the corresponding loads. The count fetch_add already acts as a synchronization point, but Relaxed on the ancillary fields doesn't piggyback on it.

Given the documented "good enough" stance, this is likely acceptable as-is.
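The Release/Acquire pairing described above, in a minimal standalone form (the Window type and field names mirror the review discussion, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Illustrative window state: start time is published before the target.
struct Window {
    start_ns: AtomicU64,
    sample_target: AtomicU64,
}

// A writer stores start_ns, then sample_target with Release; a reader
// that Acquire-loads a nonzero sample_target is then guaranteed to also
// observe the start_ns written before it (a happens-before edge).
fn publish_and_read() -> u64 {
    let w = Arc::new(Window {
        start_ns: AtomicU64::new(0),
        sample_target: AtomicU64::new(0),
    });
    let writer = {
        let w = Arc::clone(&w);
        thread::spawn(move || {
            w.start_ns.store(123, Ordering::Relaxed);
            // Release fences the earlier start_ns store.
            w.sample_target.store(8, Ordering::Release);
        })
    };
    writer.join().unwrap();
    if w.sample_target.load(Ordering::Acquire) != 0 {
        w.start_ns.load(Ordering::Relaxed)
    } else {
        0
    }
}
```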

src/pool/builder.rs (1)

223-238: Silently ignoring per_worker_multiplier == 0 is overly restrictive.

The guard per_worker_multiplier > 0 && min_sample_size > 0 rejects per_worker_multiplier = 0, but that value would still produce a valid configuration: the sample target would always be max(min_sample_size, 0 * workers) = min_sample_size. Only min_sample_size == 0 is truly problematic (would cause the monitor to fire on every enqueue).

Consider either relaxing the check to only require min_sample_size > 0, or documenting that both must be positive.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/metrics.rs`:
- Around lines 194-199: Clippy warns that MaxGaugeVecBuilder should implement Default. Add an impl Default for MaxGaugeVecBuilder whose default() returns Self::new(), which satisfies the lint while keeping the existing pub fn new() constructor.
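The suggested fix is mechanical; a self-contained sketch follows (the builder body is a stub added only so the snippet compiles on its own, not the PR's real fields):

```rust
// Stubbed-out builder standing in for the PR's MaxGaugeVecBuilder.
pub struct MaxGaugeVecBuilder {
    namespace: Option<String>,
}

impl MaxGaugeVecBuilder {
    pub fn new() -> Self {
        Self { namespace: None }
    }
}

// The clippy-suggested pattern: Default simply forwards to new(),
// keeping a single canonical construction path.
impl Default for MaxGaugeVecBuilder {
    fn default() -> Self {
        Self::new()
    }
}
```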
🧹 Nitpick comments (1)
src/pool/spawn.rs (1)

61-97: Relaxed ordering between sample_target/start_ns stores and loads across threads is technically unsound but practically benign.

When the thread that sets new_count == 1 stores sample_target (Line 69) then start_ns (Line 70) with Relaxed ordering, a later thread that observes sample_target != 0 is not formally guaranteed to also see the updated start_ns. On weakly-ordered architectures (e.g., ARM), it's theoretically possible to read a stale start_ns of 0.

The impact is minimal: elapsed_ns would be inflated (time since process start), yielding an artificially low throughput that MaxGauge would ignore if a higher value was already recorded. This self-corrects on the next sampling window.

If you ever want to tighten this, changing the sample_target store to Release and its load (Line 73) to Acquire would establish a happens-before edge that covers start_ns.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>

Copilot AI left a comment


Pull request overview

Adds optional “burst” monitoring for the global task injector to report peak enqueue throughput (tasks/sec) since the last Prometheus scrape, and wires it through the pool builder/QueueCore.

Changes:

  • Introduce MaxGauge / MaxGaugeVec to track a maximum value and reset on collection.
  • Add BurstMonitor and emit QUEUE_CORE_BURST_THROUGHPUT from QueueCore::push.
  • Expose burst monitoring configuration via Builder and add tests.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/metrics.rs: Adds MaxGauge/MaxGaugeVec and the yatp_queue_core_burst_throughput metric.
  • src/pool/spawn.rs: Implements BurstMonitor and extends QueueCore to optionally track enqueue throughput.
  • src/pool/builder.rs: Adds builder configuration and wires BurstMonitor construction into QueueCore::new.
  • src/pool/tests.rs: Adds a unit test for burst monitoring behavior.
  • src/queue/priority.rs: Updates tests to pass the new QueueCore::new(..., burst_monitor) parameter.
  • src/queue.rs: Simplifies the QueueType default via #[derive(Default)].
  • src/task/future.rs: Adjusts thread_local! initialization style.


Comment on lines +315 to +316
```rust
let value = metric.metric().get_gauge().get_value();
assert!(value > 100.0); // the above loop should be executed within 1s
```

Copilot AI Feb 10, 2026


This assertion is timing-based (value > 100.0 assuming the enqueue loop completes within 1s) and is likely to be flaky on slower/contended CI runners. Consider asserting only that a positive, finite value is reported and that it resets after a scrape/collect (e.g. call metric.metric()/collect() twice and assert the second is 0), or make the threshold much more forgiving and/or control time in the monitor for the test.

Suggested change

```diff
-let value = metric.metric().get_gauge().get_value();
-assert!(value > 100.0); // the above loop should be executed within 1s
+let value1 = metric.metric().get_gauge().get_value();
+assert!(value1.is_finite());
+assert!(value1 > 0.0);
+let value2 = metric.metric().get_gauge().get_value();
+assert_eq!(value2, 0.0);
```

Comment on lines +123 to +131
```rust
/// Wraps a `Gauge` to create a `MaxGauge`. The `Gauge` should not be used directly after being wrapped, otherwise
/// the maximum tracking will be broken.
pub fn wrap(gauge: Gauge) -> Self {
    let val = gauge.get().to_bits();
    Self {
        gauge,
        max_val: Arc::new(AtomicU64::new(val)),
    }
}
```

Copilot AI Feb 10, 2026


MaxGauge::wrap seeds max_val from the wrapped Gauge's current value (typically 0). That makes the max tracking incorrect for metrics that can legitimately be negative (e.g. observing -1.0 will never update the max if the seed is 0). Consider initializing the internal max to a sentinel like -inf/"no value yet" instead of the gauge's current value, and only exporting 0 when there have been no observations since the last scrape.

Comment on lines +165 to +166
```rust
let val = self.max_val.swap(0f64.to_bits(), Ordering::Relaxed);
f64::from_bits(val)
```

Copilot AI Feb 10, 2026


MaxGauge::take resets the tracked max to 0. This breaks correctness for gauges whose values can be negative: after a scrape/reset, any negative observations will be ignored because 0 remains the maximum. Use a sentinel (e.g. f64::NEG_INFINITY) or a separate "has_value" flag for the reset state, and translate the sentinel to 0 only at collection time if you want the exported default to be 0.

Suggested change

```diff
-let val = self.max_val.swap(0f64.to_bits(), Ordering::Relaxed);
-f64::from_bits(val)
+let prev = self
+    .max_val
+    .swap(f64::NEG_INFINITY.to_bits(), Ordering::Relaxed);
+let val = f64::from_bits(prev);
+if val == f64::NEG_INFINITY {
+    0.0
+} else {
+    val
+}
```

src/metrics.rs Outdated
Comment on lines +143 to +151
```rust
match self.max_val.compare_exchange_weak(
    current,
    v.to_bits(),
    Ordering::Relaxed,
    Ordering::Relaxed,
) {
    Ok(_) => {
        self.gauge.set(v);
        break;
```

Copilot AI Feb 10, 2026


MaxGauge::observe updates the underlying Gauge directly, while collect()/metric() also mutate the same Gauge during a scrape/reset. Without synchronization, a concurrent observe() can race with collect() and cause a scrape to include values from after the reset window. To preserve "max since last scrape" semantics, consider not mutating the underlying gauge in observe() (only update the atomic max), and have collect()/metric() render from the atomic value, or add a lightweight lock around the reset+export path.
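One way to follow the "render from the atomic" suggestion is to make the reset path a single atomic swap, so no separate gauge mutation can race with it; a minimal sketch with a hypothetical free function:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Atomically takes the current max (stored as f64 bits) and resets it
// to 0 in one step. Because the read and the reset are a single swap,
// a concurrent observe() either lands wholly before this scrape or
// wholly in the next window; it cannot straddle the reset.
fn take_max(max_bits: &AtomicU64) -> f64 {
    f64::from_bits(max_bits.swap(0f64.to_bits(), Ordering::Relaxed))
}
```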

Comment on lines +69 to +78
```rust
    self.sample_target.store(target, Ordering::Relaxed);
    self.start_ns.store(now_ns(), Ordering::Relaxed);
}

let target = self.sample_target.load(Ordering::Relaxed);
if target == 0 || new_count < target {
    return;
}

let start_ns = self.start_ns.load(Ordering::Relaxed);
```

Copilot AI Feb 10, 2026


BurstMonitor::on_enqueue uses only Relaxed orderings when publishing sample_target/start_ns on the first enqueue and reading them on subsequent enqueues. Under contention, other threads are allowed to observe stale values (e.g. start_ns == 0), which can skew throughput calculations significantly. Consider using Release on the stores and Acquire on the loads (or setting start_ns via CAS) so the window start/target become visible consistently once initialized.

Suggested change

```diff
-    self.sample_target.store(target, Ordering::Relaxed);
-    self.start_ns.store(now_ns(), Ordering::Relaxed);
-}
-let target = self.sample_target.load(Ordering::Relaxed);
-if target == 0 || new_count < target {
-    return;
-}
-let start_ns = self.start_ns.load(Ordering::Relaxed);
+    self.sample_target.store(target, Ordering::Release);
+    self.start_ns.store(now_ns(), Ordering::Release);
+}
+let target = self.sample_target.load(Ordering::Acquire);
+if target == 0 || new_count < target {
+    return;
+}
+let start_ns = self.start_ns.load(Ordering::Acquire);
```
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@cfzjywxk cfzjywxk requested review from cfzjywxk and you06 February 14, 2026 02:00
