
fix(metrics): add replication task latency histogram #7684

Merged
zawadzkidiana merged 9 commits into cadence-workflow:master from zawadzkidiana:diana/histogram-task-ack
Feb 26, 2026

Conversation


@zawadzkidiana zawadzkidiana commented Feb 6, 2026

What changed?
Added an exponential histogram for replication task latency in TaskAckManager alongside the existing timer (task_latency_ns)

Why?
Replication task latency was tracked only via timers, which limits distribution analysis. Adding a histogram retains the existing timer while enabling bucketed/percentile insights in M3/Grafana without changing processing logic.

How did you test it?
Verified metric emission in uMonitor.

Potential risks
Low risk: additional metrics emission on the replication ack path; minimal overhead.

Release notes
Replication now emits a task_latency_ns histogram for replication task latency. CadenceCDNC-17610

Documentation Changes
N/A

@zawadzkidiana zawadzkidiana changed the title from "refactor: task-ack - add replication task latency histogram" to "chore: add replication task latency histogram" Feb 6, 2026
@zawadzkidiana zawadzkidiana force-pushed the diana/histogram-task-ack branch from 8db2480 to 59dd869 Compare February 9, 2026 18:22

gitar-bot bot commented Feb 9, 2026

Analyzing CI failures

Code Review 👍 Approved with suggestions 0 resolved / 1 finding

Clean observability enhancement adding exponential histograms for replication task-ack metrics. It follows the established dual-emit patterns correctly; one minor finding from the prior review remains: missing tests for the clampInt64ToInt helper.

💡 Quality: Missing unit tests for clampInt64ToInt helper

📄 service/history/replication/task_ack_manager.go:235

The new clampInt64ToInt helper has three distinct branches (negative/zero, overflow, happy path) but no unit tests. Since this function handles edge cases like int64-to-int overflow and negative clamping, a small table-driven test would be valuable to document and verify the intended behavior.

Suggested test cases:

package replication

import (
	"math"
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestClampInt64ToInt(t *testing.T) {
	tests := []struct {
		name string
		in   int64
		want int
	}{
		{"negative", -1, 0},
		{"zero", 0, 0},
		{"small positive", 42, 42},
		{"max int", int64(math.MaxInt), math.MaxInt},
		{"overflow", math.MaxInt64, math.MaxInt},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			assert.Equal(t, tt.want, clampInt64ToInt(tt.in))
		})
	}
}
Rules ❌ No requirements met

Repository Rules

PR Description Quality Standards: **[What changed?]** Update to reflect all 6 histogram metrics added (task latency plus 5 count-based metrics for lag, fetched, returned). **[Why?]** Expand to explain why distribution analysis is needed and what operational problem this solves. **[How did you test it?]** Replace "Verified metric emission in uMonitor" with specific, copyable commands or queries.
Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@zawadzkidiana zawadzkidiana force-pushed the diana/histogram-task-ack branch from 72a5c2b to 94f4f2c Compare February 9, 2026 21:24
@zawadzkidiana zawadzkidiana force-pushed the diana/histogram-task-ack branch from 811185a to cfac903 Compare February 10, 2026 21:11
@zawadzkidiana zawadzkidiana force-pushed the diana/histogram-task-ack branch 2 times, most recently from b52020f to cef3eb5 Compare February 13, 2026 03:21
@cadence-workflow cadence-workflow deleted a comment from gitar-bot bot Feb 17, 2026
zawadzkidiana and others added 3 commits February 25, 2026 16:23
…ed diff

Dual-emit exponential histograms alongside existing timers for
ReplicationTasksLag, ReplicationTasksReturned, and ReplicationTasksReturnedDiff
to support histogram migration. ReplicationTasksLag uses the already-defined
ExponentialReplicationTasksLag (duration-based Mid1ms24h). Returned and
ReturnedDiff use new integer histograms (Mid1To16k) with _counts suffix.
All six metric names added to HistogramMigrationMetrics allowlist.

Signed-off-by: Diana Zawadzki <dzawa@live.de>
Address review feedback by removing task_latency_ns histogram and using
ExponentialTaskProcessingLatency in task_ack_manager. Also remove stale
histogram-migration allowlist entries for task_latency/task_latency_ns.

Signed-off-by: zawadzki <zawadzki@uber.com>
Signed-off-by: Diana Zawadzki <dzawa@live.de>
@zawadzkidiana zawadzkidiana force-pushed the diana/histogram-task-ack branch from 1d18b34 to 6a32d1c Compare February 26, 2026 00:28
@zawadzkidiana zawadzkidiana changed the title from "chore: add replication task latency histogram" to "fix(metrics): add replication task latency histogram" Feb 26, 2026
Signed-off-by: Diana Zawadzki <dzawa@live.de>
Signed-off-by: Diana Zawadzki <dzawa@live.de>
Add integer histogram dual-emission for ReplicationTasksFetched and ReplicationTasksLagRaw,
and include their names in HistogramMigrationMetrics for config consistency checks.

Signed-off-by: Diana Zawadzki <dzawa@live.de>
…ager

Signed-off-by: Diana Zawadzki <dzawa@live.de>
Signed-off-by: Diana Zawadzki <dzawa@live.de>
Signed-off-by: Diana Zawadzki <dzawa@live.de>

gitar-bot bot commented Feb 26, 2026

Code Review ✅ Approved 4 resolved / 4 findings

All previously identified issues (wrong metric constant, merge conflicts, missing config entries, mismatched histogram method) have been resolved. Metric names, bucket types, and histogram emission methods are now consistent across config.go, defs.go, and task_ack_manager.go.

✅ 4 resolved
Bug: Unresolved merge conflict markers in config.go

📄 common/metrics/config.go:55
The file common/metrics/config.go contains unresolved git merge conflict markers (<<<<<<< Updated upstream, =======, >>>>>>> Stashed changes) at lines 55-71. This will cause a compilation failure — the code cannot be built or deployed in this state.

Both sides of the conflict add different metric names to HistogramMigrationMetrics. Based on the changes in defs.go and task_ack_manager.go, it appears both sides should be included (the "Updated upstream" side adds lag/returned/returned_diff metrics, while the "Stashed changes" side adds fetched/lag_raw metrics). The conflict markers need to be removed and both sets of entries retained.

Bug: Histogram paired with wrong timer: TaskLatency vs TaskProcessingLatency

📄 service/history/replication/task_ack_manager.go:136 📄 service/history/replication/task_ack_manager.go:139
The new code at line 139 emits ExponentialTaskProcessingLatency (metric name: task_latency_processing_ns) as the histogram dual-emit for the TaskLatency timer (metric name: task_latency). However, ExponentialTaskProcessingLatency is the established histogram counterpart for TaskProcessingLatency (task_latency_processing), not TaskLatency (task_latency).

These are two semantically different metrics:

  • TaskLatency / task_latency — started via StartTimer at line 136
  • TaskProcessingLatency / task_latency_processing — a different timer used elsewhere

By emitting the histogram under the task_latency_processing_ns name for what is actually the task_latency timer measurement, dashboard queries on task_latency_processing_ns will now include data from a different timer, corrupting the histogram data for the actual task_latency_processing metric.

The fix should either:

  1. Create a new histogram metric ExponentialTaskLatency with name task_latency_ns to pair with TaskLatency, or
  2. Use the existing TaskProcessingLatency timer instead of TaskLatency if that's what was intended.
Bug: Missing replication_tasks_lag entries in HistogramMigrationMetrics

📄 common/metrics/config.go:50
The HistogramMigrationMetrics map in config.go is missing entries for replication_tasks_lag and replication_tasks_lag_ns. These were present in the original PR diff (lines 52-53) but appear to have been lost during merge conflict resolution.

The code at task_ack_manager.go:208-209 emits both ReplicationTasksLag (metric name: replication_tasks_lag) and ExponentialReplicationTasksLag (metric name: replication_tasks_lag_ns). Without these entries in the migration config map, operators cannot independently control whether the timer or histogram is emitted for this metric pair — both EmitTimer() and EmitHistogram() will unconditionally return true for unregistered metrics, bypassing the migration config entirely.

Add the missing entries to the map:

"replication_tasks_lag":    {},
"replication_tasks_lag_ns": {},
Bug: ExponentialHistogram called on metric with only intExponentialBuckets

📄 service/history/replication/task_ack_manager.go:204 📄 common/metrics/defs.go:3567
ExponentialReplicationTasksLag is defined with intExponentialBuckets: Mid1To16k (line 3567 of defs.go), but task_ack_manager.go:204 calls ExponentialHistogram() which accesses def.exponentialBuckets — a nil interface on this metric.

In scope.go, ExponentialHistogram does:

m.scope.Tagged(def.exponentialBuckets.tags()).Histogram(def.metricName.String(), def.exponentialBuckets.buckets()).RecordDuration(value)

When exponentialBuckets is nil (because only intExponentialBuckets was set), calling .tags() on a nil interface causes a nil pointer panic.

This panic is latent — it only triggers when histogram emission is enabled (migration mode "histogram" or "both"), which is the entire purpose of this migration code. With the default config (""), EmitHistogram returns false, so the nil path is not reached. But as soon as operators flip the config to enable histograms, this will crash.

For reference, the same metric is correctly emitted with IntExponentialHistogram in task_processor.go. The lag value computed here is a task-ID difference cast to time.Duration, so it's effectively an integer count — IntExponentialHistogram with an int() cast is the correct approach.
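The failure mode described here, a method call through a nil interface value, can be reproduced in isolation. The types below are stand-ins for the scope.go internals, not the actual Cadence definitions:

```go
package main

import "fmt"

// buckets mimics the shape of def.exponentialBuckets: an interface-typed
// field that stays nil when only intExponentialBuckets is populated.
type buckets interface {
	tags() map[string]string
}

// tryEmit calls tags() through the interface, mirroring what
// ExponentialHistogram does, and reports whether the call panicked.
func tryEmit(b buckets) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	_ = b.tags()
	return false
}

func main() {
	var nilBuckets buckets // exponentialBuckets was never set
	fmt.Println("nil interface panics:", tryEmit(nilBuckets))
}
```

This is why the review points at IntExponentialHistogram for metrics that only define intExponentialBuckets: it reads the populated field instead of the nil one.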

Bug: CacheSize type changed to Gauge but callers still use RecordTimer

📄 common/metrics/defs.go:3478 📄 service/history/replication/task_store.go
The metric type of CacheSize was changed from Timer to Gauge in the definition, but the two call sites in service/history/replication/task_store.go still use scope.RecordTimer(metrics.CacheSize, ...).

Since RecordTimer calls tally.Timer(name).Record(d) regardless of the metricType in the definition, the actual metric will still be emitted as a timer to the backend. This creates a mismatch:

  • The definition says it's a Gauge (implying it should be emitted via UpdateGauge)
  • The code emits it via RecordTimer which always uses tally.Timer()

If the intent is to change this metric to a Gauge (which makes semantic sense — cache size is a point-in-time value), the callers in task_store.go should also be updated to use scope.UpdateGauge(metrics.CacheSize, float64(count)).

If the callers aren't meant to change in this PR, the metricType should remain as Timer for now to maintain consistency.

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: PR description includes all required sections with clear explanations of changes, motivation, testing approach, risks, and release notes

2 rules not applicable. Show all rules by commenting gitar display:verbose.


@zawadzkidiana zawadzkidiana merged commit 624556c into cadence-workflow:master Feb 26, 2026
42 checks passed