fix(metrics): use _counts histogram for replication tasks lag#7716
Open
zawadzkidiana wants to merge 1 commit into cadence-workflow:master
Switch replication task cleanup lag histogram emission from duration/ns to integer counts, and update metric defs plus histogram migration allowlist mapping accordingly.
Signed-off-by: Diana Zawadzki <dzawa@live.de>
Force-pushed from 1dc57ba to 759e0be.
Code Review: ✅ Approved. Clean and correct semantic fix: histogram emission properly changed from duration-based to count-based with consistent updates across metric definitions, migration allowlist, and emission call site. No issues found.
Repository Rules: ❌ No requirements met.
What changed?
Updated replication task processor lag histogram emission to use integer _counts instead of duration/ns. Specifically, in cleanupAckedReplicationTasks, changed ExponentialReplicationTasksLag emission from ExponentialHistogram(..., lag) to IntExponentialHistogram(..., lagCount), and updated metric definitions + migration allowlist names from replication_tasks_lag_ns to replication_tasks_lag_counts.
Why?
Per follow-up review, this metric represents queue depth/lag in number of tasks, not time duration, so _counts is the correct histogram semantic.
Previously, the code emitted replication_tasks_lag_ns with duration buckets, which could misrepresent the signal and make dashboards/alerts inconsistent with actual units. This change keeps timer emission for backward compatibility while making histogram emission unit-correct for migration and analysis.
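To illustrate the unit mismatch, the sketch below shows what happens when a task count is carried in a time.Duration: Go interprets the integer as nanoseconds, so a backlog of two million tasks reads as two milliseconds. The helper names are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"time"
)

// misreadAsDuration shows how a task count is interpreted when it is
// passed through a duration-typed histogram: the integer becomes nanoseconds.
func misreadAsDuration(taskCount int) time.Duration {
	return time.Duration(taskCount)
}

// asCountLabel renders the same value with its actual unit.
func asCountLabel(taskCount int) string {
	return fmt.Sprintf("%d tasks", taskCount)
}

func main() {
	backlog := 2000000
	fmt.Println("as duration:", misreadAsDuration(backlog))
	fmt.Println("as count:   ", asCountLabel(backlog))
}
```

A dashboard reading the duration form would label the axis in time units even though the underlying signal is queue depth, which is the inconsistency the rename to _counts avoids.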
How did you test it?
go test ./service/history/replication/... -count=1
go test ./common/metrics/... -run TestHistogramMigration -count=1
make pr
Potential risks
Low to moderate metrics-consumer risk.
No API/IDL or schema changes.
Timer metric (replication_tasks_lag) is unchanged.
Histogram metric name changed from _ns to _counts; any dashboards/alerts reading the old histogram name will need to move to replication_tasks_lag_counts.
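For consumers with many saved queries, the rename could be applied mechanically. The following is a rough sketch under the assumption that queries are plain strings containing the metric name; the migrateQuery helper and the query syntax in the example are hypothetical. Because the replacement targets the full old histogram name, references to the unchanged timer replication_tasks_lag are left alone.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldHistogramName = "replication_tasks_lag_ns"
	newHistogramName = "replication_tasks_lag_counts"
)

// migrateQuery rewrites references to the old histogram name.
// The bare timer name "replication_tasks_lag" does not contain the
// "_ns" suffix, so it is not affected.
func migrateQuery(query string) string {
	return strings.ReplaceAll(query, oldHistogramName, newHistogramName)
}

func main() {
	// The query syntax here is illustrative only.
	fmt.Println(migrateQuery("sum(replication_tasks_lag_ns)"))
	fmt.Println(migrateQuery("avg(replication_tasks_lag)"))
}
```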
Release notes
Internal metrics migration update: replication task processor lag histogram now emits task-count based values via replication_tasks_lag_counts (integer histogram), while preserving existing timer emission.
Documentation Changes
N/A for Cadence docs; internal dashboard/alert metric references should switch from replication_tasks_lag_ns to replication_tasks_lag_counts.