
fix(metrics): use _counts histogram for replication tasks lag #7716

Open
zawadzkidiana wants to merge 1 commit into cadence-workflow:master from zawadzkidiana:diana/histogram-task-processor

@zawadzkidiana
Contributor

What changed?
Updated the replication task processor's lag histogram to emit integer counts (_counts) instead of durations (_ns). In cleanupAckedReplicationTasks, the ExponentialReplicationTasksLag emission changes from ExponentialHistogram(..., lag) to IntExponentialHistogram(..., lagCount). The metric definitions and the histogram migration allowlist are renamed from replication_tasks_lag_ns to replication_tasks_lag_counts.
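For readers without the diff open, the call-site change might look roughly like the sketch below. It is not the Cadence code: `histogramScope`, `recordingScope`, `emitLagCount`, and the `maxReadLevel`/`ackLevel` parameters are hypothetical, the method signatures are assumed from the names in the description, and the metric is addressed by a plain string only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// histogramScope is a hypothetical stand-in for the metrics scope. The two
// method names come from the PR description; the signatures are assumed.
type histogramScope interface {
	ExponentialHistogram(name string, d time.Duration)
	IntExponentialHistogram(name string, v int)
}

// recordingScope keeps emitted values in memory so the sketch can be inspected.
type recordingScope struct {
	durations map[string][]time.Duration
	counts    map[string][]int
}

func newRecordingScope() *recordingScope {
	return &recordingScope{
		durations: map[string][]time.Duration{},
		counts:    map[string][]int{},
	}
}

func (s *recordingScope) ExponentialHistogram(name string, d time.Duration) {
	s.durations[name] = append(s.durations[name], d)
}

func (s *recordingScope) IntExponentialHistogram(name string, v int) {
	s.counts[name] = append(s.counts[name], v)
}

// emitLagCount records how many tasks lie between the acknowledged level and
// the maximum read level (both parameter names are illustrative assumptions).
func emitLagCount(scope histogramScope, maxReadLevel, ackLevel int64) int {
	lagCount := int(maxReadLevel - ackLevel)
	// Before (assumed shape): scope.ExponentialHistogram("replication_tasks_lag_ns", time.Duration(lagCount))
	scope.IntExponentialHistogram("replication_tasks_lag_counts", lagCount)
	return lagCount
}

func main() {
	scope := newRecordingScope()
	emitLagCount(scope, 120, 100)
	fmt.Println(scope.counts["replication_tasks_lag_counts"]) // [20]
}
```

The point of the sketch is only that the value's type and the metric name now agree on the unit: an integer number of tasks goes to a `_counts` histogram, and nothing is written to the duration histogram.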

Why?
Per follow-up review, this metric represents queue depth (lag measured in number of tasks), not a time duration, so _counts is the correct histogram semantic.
Previously, the code emitted replication_tasks_lag_ns with duration buckets, which labeled a task count as nanoseconds and could leave dashboards and alerts inconsistent with the actual unit. This change keeps the timer emission for backward compatibility while making the histogram emission unit-correct for migration and analysis.
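A small illustration of the unit mismatch, assuming the task count was carried in a `time.Duration` (the helper name `asDurationLabel` is hypothetical): Go interprets the raw integer as nanoseconds, so a backlog of 1500 tasks reads as a latency of 1.5 microseconds.

```go
package main

import (
	"fmt"
	"time"
)

// asDurationLabel shows how a raw task count reads once it is typed as a
// time.Duration: the integer is interpreted as a number of nanoseconds.
func asDurationLabel(taskCount int64) string {
	return time.Duration(taskCount).String()
}

func main() {
	// 1500 queued tasks, mislabeled as a duration.
	fmt.Println(asDurationLabel(1500)) // 1.5µs
}
```

Anyone reading a `_ns` histogram would reasonably treat its buckets as time, which is why the rename to `_counts` matters even though the underlying number is the same.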

How did you test it?
go test ./service/history/replication/... -count=1
go test ./common/metrics/... -run TestHistogramMigration -count=1
make pr

Potential risks
- Low to moderate risk for metrics consumers.
- No API/IDL or schema changes.
- Timer metric (replication_tasks_lag) is unchanged.
- Histogram metric name changed from _ns to _counts; any dashboards or alerts reading the old histogram name will need to move to replication_tasks_lag_counts.

Release notes
Internal metrics migration update: replication task processor lag histogram now emits task-count based values via replication_tasks_lag_counts (integer histogram), while preserving existing timer emission.

Documentation Changes
N/A for Cadence docs; internal dashboard/alert metric references should switch from replication_tasks_lag_ns to replication_tasks_lag_counts.

Switch replication task cleanup lag histogram emission from duration/ns to integer counts,
and update metric defs plus histogram migration allowlist mapping accordingly.

Signed-off-by: Diana Zawadzki <dzawa@live.de>
@zawadzkidiana zawadzkidiana force-pushed the diana/histogram-task-processor branch from 1dc57ba to 759e0be Compare February 17, 2026 21:17
@gitar-bot

gitar-bot bot commented Feb 17, 2026

Code Review ✅ Approved

Clean and correct semantic fix: histogram emission properly changed from duration-based to count-based with consistent updates across metric definitions, migration allowlist, and emission call site. No issues found.

Rules ❌ No requirements met

Repository Rules

PR Description Quality Standards: Link the relevant GitHub issue in the 'What changed?' section as required by the PR template
