
add burst monitoring for global task injector#85

Open
zyguan wants to merge 8 commits into tikv:master from zyguan:dev/observability-2

Conversation


zyguan (Contributor) commented Feb 10, 2026

This PR adds burst monitoring for the global task injector to track maximum enqueue throughput (tasks/sec) since the last scrape.

  • Add MaxGauge and MaxGaugeVec types that track maximum values and reset on scrape/collection.
  • Implement BurstMonitor that calculates throughput based on task sampling and records it in QUEUE_CORE_BURST_THROUGHPUT.
  • Wire BurstMonitor into the QueueCore and expose configuration through the Builder.
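The reported quantity is simple at its core: once a sample window of enqueues completes, the monitor divides the sample size by the elapsed wall time. A standalone sketch of that arithmetic (the function name is hypothetical and not the PR's actual code):

```rust
use std::time::Instant;

// Hypothetical illustration of the value a burst monitor would report:
// a fixed number of sampled enqueues divided by the elapsed window time.
fn burst_throughput(sample_size: u64, window_start: Instant) -> f64 {
    let elapsed = window_start.elapsed().as_secs_f64();
    if elapsed <= 0.0 {
        return 0.0; // guard against a zero-length window
    }
    sample_size as f64 / elapsed // tasks per second
}
```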

Summary by CodeRabbit

  • New Features

    • Optional burst monitoring for task queues to sample enqueue bursts and report peak per-burst throughput between scrapes.
    • Configurable controls for burst monitoring (per-worker multiplier, minimum sample size).
    • New Prometheus metric exposing max enqueue throughput since last scrape; values reset after collection.
  • Tests

    • Added tests validating burst throughput sampling, metric reporting accuracy, and reset behavior (note: duplicate test entry present).

Signed-off-by: zyguan <zhongyangguan@gmail.com>

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Adds burst monitoring: a MaxGauge metric/vector for max enqueue throughput, a BurstMonitor sampling on enqueues, Builder support to enable monitoring, QueueCore updated to accept an optional BurstMonitor, and tests exercising throughput metrics and reset behavior.

Changes

  • Metrics Infrastructure (src/metrics.rs): Added MaxGauge, MaxGaugeVecBuilder, MaxGaugeVec (atomic max-tracking, reset-on-collect), Collector/Metric impls, tests, and the public static QUEUE_CORE_BURST_THROUGHPUT.
  • Builder & Config (src/pool/builder.rs): Added BurstMonitorConfig, a burst_monitoring: Option<...> field on Builder, enable_burst_monitoring(...), and updated freeze_with_queue to construct and pass an optional BurstMonitor into QueueCore::new.
  • Burst Monitoring & QueueCore (src/pool/spawn.rs): Added BurstMonitor (constructor and on_enqueue sampling), a now_ns() helper, made QueueCore carry Option<BurstMonitor>, updated the constructor signature, and invoked sampling from push.
  • Tests & Call Sites (src/pool/tests.rs, src/queue/priority.rs): Added test_burst_monitoring (duplicated) asserting throughput via QUEUE_CORE_BURST_THROUGHPUT; updated QueueCore::new(..., None) call sites to match the new constructor signature.
  • Minor Refactors (src/queue.rs, src/task/future.rs): Derived Default for QueueType (removing the manual impl); adjusted thread_local! initialization in task/future.rs to the const block syntax.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Task Submission
    participant QueueCore
    participant BurstMonitor
    participant Metrics as MaxGaugeVec
    participant Scraper as Prometheus Scraper

    Client->>QueueCore: push(task)
    QueueCore->>BurstMonitor: on_enqueue(active_workers) (if Some)

    rect rgba(100,200,100,0.5)
    Note over BurstMonitor: sample timing & counts, compute throughput
    BurstMonitor->>Metrics: observe(throughput_value)
    end

    Scraper->>Metrics: collect()
    rect rgba(100,100,200,0.5)
    Note over Metrics: emit max since last scrape and reset
    Metrics-->>Scraper: metric_data
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I watch the queues and count each sprint and hop,
Peaks tucked in MaxGauge until the scraper's stop,
I sample bursts and whisper numbers new,
Then clear my tally — ready for the next queue,
🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 64.29%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'add burst monitoring for global task injector' accurately and directly describes the main objective of the PR, which is to implement burst monitoring.



No actionable comments were generated in the recent review. 🎉




coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/metrics.rs`:
- Around lines 120-129: MaxGauge::observe currently reads self.gauge.get() and then sets it, a TOCTOU race that can regress the max. Replace the read-then-set with a lock-free compare-and-swap loop on an AtomicU64 storing f64 bits: load the current bits, compute new_bits = f64::to_bits(v), and if v is larger than the current value, attempt compare_exchange (with appropriate Orderings), retrying on failure until it succeeds or the current value is already >= v. Use f64::from_bits where needed and operate on the gauge's underlying AtomicU64 (or change the gauge to hold one) so concurrent observers cannot overwrite a larger value with a smaller one.
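The fix described above can be sketched as a standalone type (names are hypothetical; the PR's real code wraps a prometheus Gauge). The float comparison is only sound here because throughput values are non-negative, a caveat the later review comments return to:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal max-tracking cell: an f64 stored as its bit pattern in an
// AtomicU64, updated with a CAS retry loop so a concurrent smaller
// observation can never overwrite a larger one.
struct MaxCell {
    bits: AtomicU64,
}

impl MaxCell {
    fn new() -> Self {
        Self {
            bits: AtomicU64::new(0f64.to_bits()),
        }
    }

    fn observe(&self, v: f64) {
        let new_bits = v.to_bits();
        let mut current = self.bits.load(Ordering::Relaxed);
        // Retry until we install the larger value or see that the
        // stored value is already >= v.
        while f64::from_bits(current) < v {
            match self.bits.compare_exchange_weak(
                current,
                new_bits,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => break,
                Err(actual) => current = actual,
            }
        }
    }

    fn get(&self) -> f64 {
        f64::from_bits(self.bits.load(Ordering::Relaxed))
    }
}
```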

In `@src/pool/tests.rs`:
- Around lines 299-332: The timing-sensitive assertions in test_burst_monitoring are too strict and can flake under CI. Make the throughput checks more robust by either increasing the sleep durations around spawn_n to reduce relative jitter (e.g., doubling the sleep times) or widening the assertion tolerance on QUEUE_CORE_BURST_THROUGHPUT (the gauge checks after the spawn_n sequences) to a safer margin (e.g., ±50 or a percentage-based threshold), so the asserts no longer fail from small OS scheduling delays during the 200ms/400ms windows.
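A percentage-based tolerance, as suggested, could look like this hypothetical helper (not part of the PR):

```rust
// Returns true when `actual` is within `pct` percent of `expected`.
// Using a relative rather than absolute margin keeps the test meaningful
// across different throughput magnitudes.
fn within_pct(actual: f64, expected: f64, pct: f64) -> bool {
    (actual - expected).abs() <= expected.abs() * pct / 100.0
}
```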
🧹 Nitpick comments (3)
src/metrics.rs (1)

131-149: Reset-on-read in both Collector::collect and Metric::metric — clarify intended usage.

Both collect() (Line 137-138) and metric() (Line 145-146) reset the gauge to 0.0 after reading. When MaxGaugeVec::collect() is called, the inner MetricVec calls metric() on each child MaxGauge, which triggers the reset. This is the correct path.

However, the standalone Collector::collect() on MaxGauge is also callable. If someone inadvertently registers a single MaxGauge as a Collector and it's also part of a MaxGaugeVec, both paths would race to reset. Consider adding a doc comment clarifying that MaxGauge instances obtained from MaxGaugeVec should only be reset via the vec's collect().

src/pool/spawn.rs (1)

61-97: Relaxed ordering on sample_target and start_ns can cause stale reads across threads.

When Thread A (the one with new_count == 1) stores sample_target and start_ns with Relaxed ordering (lines 69-70), another Thread B that subsequently reaches the target check (line 73-76) or reads start_ns (line 78) is not guaranteed to observe those stores — Relaxed provides no cross-thread visibility ordering.

In practice, this is likely benign because:

  1. If target is read as 0 (stale), the thread returns early — no harm.
  2. A stale start_ns would only slightly skew the throughput measurement.

However, if correctness of the throughput measurement matters more than the comment suggests, consider using Release on the stores in the new_count == 1 branch and Acquire on the corresponding loads. The count fetch_add already acts as a synchronization point, but Relaxed on the ancillary fields doesn't piggyback on it.

Given the documented "good enough" stance, this is likely acceptable as-is.
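The Release/Acquire pairing described above, in a minimal standalone form (the Window type and field names mirror the review discussion, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Illustrative window state: start time is published before the target.
struct Window {
    start_ns: AtomicU64,
    sample_target: AtomicU64,
}

// A writer stores start_ns, then sample_target with Release; a reader
// that Acquire-loads a nonzero sample_target is then guaranteed to also
// observe the start_ns written before it (a happens-before edge).
fn publish_and_read() -> u64 {
    let w = Arc::new(Window {
        start_ns: AtomicU64::new(0),
        sample_target: AtomicU64::new(0),
    });
    let writer = {
        let w = Arc::clone(&w);
        thread::spawn(move || {
            w.start_ns.store(123, Ordering::Relaxed);
            // Release fences the earlier start_ns store.
            w.sample_target.store(8, Ordering::Release);
        })
    };
    writer.join().unwrap();
    if w.sample_target.load(Ordering::Acquire) != 0 {
        w.start_ns.load(Ordering::Relaxed)
    } else {
        0
    }
}
```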

src/pool/builder.rs (1)

223-238: Silently ignoring per_worker_multiplier == 0 is overly restrictive.

The guard per_worker_multiplier > 0 && min_sample_size > 0 rejects per_worker_multiplier = 0, but that value would still produce a valid configuration: the sample target would always be max(min_sample_size, 0 * workers) = min_sample_size. Only min_sample_size == 0 is truly problematic (would cause the monitor to fire on every enqueue).

Consider either relaxing the check to only require min_sample_size > 0, or documenting that both must be positive.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/metrics.rs`:
- Around lines 194-199: Clippy warns that MaxGaugeVecBuilder should implement Default. Add an impl Default for MaxGaugeVecBuilder whose default() returns Self::new(), which satisfies the lint while keeping the existing pub fn new() constructor.
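The suggested fix is mechanical; a self-contained sketch follows (the builder body is a stub added only so the snippet compiles on its own, not the PR's real fields):

```rust
// Stubbed-out builder standing in for the PR's MaxGaugeVecBuilder.
pub struct MaxGaugeVecBuilder {
    namespace: Option<String>,
}

impl MaxGaugeVecBuilder {
    pub fn new() -> Self {
        Self { namespace: None }
    }
}

// The clippy-suggested pattern: Default simply forwards to new(),
// keeping a single canonical construction path.
impl Default for MaxGaugeVecBuilder {
    fn default() -> Self {
        Self::new()
    }
}
```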
🧹 Nitpick comments (1)
src/pool/spawn.rs (1)

61-97: Relaxed ordering between sample_target/start_ns stores and loads across threads is technically unsound but practically benign.

When the thread that sets new_count == 1 stores sample_target (Line 69) then start_ns (Line 70) with Relaxed ordering, a later thread that observes sample_target != 0 is not formally guaranteed to also see the updated start_ns. On weakly-ordered architectures (e.g., ARM), it's theoretically possible to read a stale start_ns of 0.

The impact is minimal: elapsed_ns would be inflated (time since process start), yielding an artificially low throughput that MaxGauge would ignore if a higher value was already recorded. This self-corrects on the next sampling window.

If you ever want to tighten this, changing the sample_target store to Release and its load (Line 73) to Acquire would establish a happens-before edge that covers start_ns.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>

Copilot AI left a comment


Pull request overview

Adds optional “burst” monitoring for the global task injector to report peak enqueue throughput (tasks/sec) since the last Prometheus scrape, and wires it through the pool builder/QueueCore.

Changes:

  • Introduce MaxGauge / MaxGaugeVec to track a maximum value and reset on collection.
  • Add BurstMonitor and emit QUEUE_CORE_BURST_THROUGHPUT from QueueCore::push.
  • Expose burst monitoring configuration via Builder and add tests.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/metrics.rs: Adds MaxGauge/MaxGaugeVec and the yatp_queue_core_burst_throughput metric.
  • src/pool/spawn.rs: Implements BurstMonitor and extends QueueCore to optionally track enqueue throughput.
  • src/pool/builder.rs: Adds builder configuration and wires BurstMonitor construction into QueueCore::new.
  • src/pool/tests.rs: Adds a unit test for burst monitoring behavior.
  • src/queue/priority.rs: Updates tests to pass the new QueueCore::new(..., burst_monitor) parameter.
  • src/queue.rs: Simplifies the QueueType default via #[derive(Default)].
  • src/task/future.rs: Adjusts thread_local! initialization style.


Comment on lines +315 to +316
```rust
let value = metric.metric().get_gauge().get_value();
assert!(value > 100.0); // the above loop should be executed within 1s
```

Copilot AI Feb 10, 2026


This assertion is timing-based (value > 100.0 assuming the enqueue loop completes within 1s) and is likely to be flaky on slower/contended CI runners. Consider asserting only that a positive, finite value is reported and that it resets after a scrape/collect (e.g. call metric.metric()/collect() twice and assert the second is 0), or make the threshold much more forgiving and/or control time in the monitor for the test.

Suggested change

```diff
-let value = metric.metric().get_gauge().get_value();
-assert!(value > 100.0); // the above loop should be executed within 1s
+let value1 = metric.metric().get_gauge().get_value();
+assert!(value1.is_finite());
+assert!(value1 > 0.0);
+let value2 = metric.metric().get_gauge().get_value();
+assert_eq!(value2, 0.0);
```

Comment on lines +123 to +131
```rust
/// Wraps a `Gauge` to create a `MaxGauge`. The `Gauge` should not be used directly after being wrapped, otherwise
/// the maximum tracking will be broken.
pub fn wrap(gauge: Gauge) -> Self {
    let val = gauge.get().to_bits();
    Self {
        gauge,
        max_val: Arc::new(AtomicU64::new(val)),
    }
}
```

Copilot AI Feb 10, 2026


MaxGauge::wrap seeds max_val from the wrapped Gauge's current value (typically 0). That makes the max tracking incorrect for metrics that can legitimately be negative (e.g. observing -1.0 will never update the max if the seed is 0). Consider initializing the internal max to a sentinel like -inf/"no value yet" instead of the gauge's current value, and only exporting 0 when there have been no observations since the last scrape.

Comment on lines +165 to +166
```rust
let val = self.max_val.swap(0f64.to_bits(), Ordering::Relaxed);
f64::from_bits(val)
```

Copilot AI Feb 10, 2026


MaxGauge::take resets the tracked max to 0. This breaks correctness for gauges whose values can be negative: after a scrape/reset, any negative observations will be ignored because 0 remains the maximum. Use a sentinel (e.g. f64::NEG_INFINITY) or a separate "has_value" flag for the reset state, and translate the sentinel to 0 only at collection time if you want the exported default to be 0.

Suggested change

```diff
-let val = self.max_val.swap(0f64.to_bits(), Ordering::Relaxed);
-f64::from_bits(val)
+let prev = self
+    .max_val
+    .swap(f64::NEG_INFINITY.to_bits(), Ordering::Relaxed);
+let val = f64::from_bits(prev);
+if val == f64::NEG_INFINITY {
+    0.0
+} else {
+    val
+}
```

src/metrics.rs Outdated
Comment on lines +143 to +151
```rust
match self.max_val.compare_exchange_weak(
    current,
    v.to_bits(),
    Ordering::Relaxed,
    Ordering::Relaxed,
) {
    Ok(_) => {
        self.gauge.set(v);
        break;
```

Copilot AI Feb 10, 2026


MaxGauge::observe updates the underlying Gauge directly, while collect()/metric() also mutate the same Gauge during a scrape/reset. Without synchronization, a concurrent observe() can race with collect() and cause a scrape to include values from after the reset window. To preserve "max since last scrape" semantics, consider not mutating the underlying gauge in observe() (only update the atomic max), and have collect()/metric() render from the atomic value, or add a lightweight lock around the reset+export path.
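One way to follow the "render from the atomic" suggestion is to make the reset path a single atomic swap, so no separate gauge mutation can race with it; a minimal sketch with a hypothetical free function:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Atomically takes the current max (stored as f64 bits) and resets it
// to 0 in one step. Because the read and the reset are a single swap,
// a concurrent observe() either lands wholly before this scrape or
// wholly in the next window; it cannot straddle the reset.
fn take_max(max_bits: &AtomicU64) -> f64 {
    f64::from_bits(max_bits.swap(0f64.to_bits(), Ordering::Relaxed))
}
```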

Comment on lines +69 to +78
```rust
    self.sample_target.store(target, Ordering::Relaxed);
    self.start_ns.store(now_ns(), Ordering::Relaxed);
}

let target = self.sample_target.load(Ordering::Relaxed);
if target == 0 || new_count < target {
    return;
}

let start_ns = self.start_ns.load(Ordering::Relaxed);
```

Copilot AI Feb 10, 2026


BurstMonitor::on_enqueue uses only Relaxed orderings when publishing sample_target/start_ns on the first enqueue and reading them on subsequent enqueues. Under contention, other threads are allowed to observe stale values (e.g. start_ns == 0), which can skew throughput calculations significantly. Consider using Release on the stores and Acquire on the loads (or setting start_ns via CAS) so the window start/target become visible consistently once initialized.

Suggested change

```diff
-    self.sample_target.store(target, Ordering::Relaxed);
-    self.start_ns.store(now_ns(), Ordering::Relaxed);
-}
-let target = self.sample_target.load(Ordering::Relaxed);
-if target == 0 || new_count < target {
-    return;
-}
-let start_ns = self.start_ns.load(Ordering::Relaxed);
+    self.sample_target.store(target, Ordering::Release);
+    self.start_ns.store(now_ns(), Ordering::Release);
+}
+let target = self.sample_target.load(Ordering::Acquire);
+if target == 0 || new_count < target {
+    return;
+}
+let start_ns = self.start_ns.load(Ordering::Acquire);
```
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@cfzjywxk cfzjywxk requested review from cfzjywxk and you06 February 14, 2026 02:00
