Commit 87ecb5a
fix(metrics): use _counts histogram for replication tasks lag (#7716)
**What changed?**
Updated replication task processor lag histogram emission to use integer `_counts` instead of duration/ns. Specifically, in `cleanupAckedReplicationTasks`, changed the `ExponentialReplicationTasksLag` emission from `ExponentialHistogram(..., lag)` to `IntExponentialHistogram(..., lagCount)`, and updated the metric definitions and migration allowlist names from `replication_tasks_lag_ns` to `replication_tasks_lag_counts`.

**Why?**
Per follow-up review, this metric represents queue depth/lag in number of tasks, not a time duration, so `_counts` is the correct histogram semantic. Previously, the code emitted `replication_tasks_lag_ns` with duration buckets, which could misrepresent the signal and make dashboards/alerts inconsistent with the actual units. This change keeps the timer emission for backward compatibility while making the histogram emission unit-correct for migration and analysis.

**How did you test it?**
go test ./service/history/replication/... -count=1
go test ./common/metrics/... -run TestHistogramMigration -count=1
make pr

**Potential risks**
Low to moderate metrics-consumer risk. No API/IDL or schema changes. The timer metric (`replication_tasks_lag`) is unchanged. The histogram metric name changed from `_ns` to `_counts`; any dashboards/alerts reading the old histogram name will need to move to `replication_tasks_lag_counts`.

**Release notes**
Internal metrics migration update: the replication task processor lag histogram now emits task-count based values via `replication_tasks_lag_counts` (integer histogram), while preserving the existing timer emission.

**Documentation Changes**
N/A for Cadence docs; internal dashboard/alert metric references should switch from `replication_tasks_lag_ns` to `replication_tasks_lag_counts`.

Signed-off-by: Diana Zawadzki <dzawa@live.de>
1 parent 63d68a2 commit 87ecb5a

File tree

3 files changed (+5 −5 lines changed)


common/metrics/config.go

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ var HistogramMigrationMetrics = map[string]struct{}{
 	// Replication task processor histograms (PR #7685).
 	// Dual-emitted as timer + histogram.
 	"replication_tasks_lag": {},
-	"replication_tasks_lag_ns": {},
+	"replication_tasks_lag_counts": {},
 	"replication_tasks_applied_latency": {},
 	"replication_tasks_applied_latency_ns": {},

common/metrics/defs.go

Lines changed: 1 addition & 1 deletion
@@ -3562,7 +3562,7 @@ var MetricDefs = map[ServiceIdx]map[MetricIdx]metricDefinition{
 	ReplicationTasksApplied: {metricName: "replication_tasks_applied", metricType: Counter},
 	ReplicationTasksFailed:  {metricName: "replication_tasks_failed", metricType: Counter},
 	ReplicationTasksLag:     {metricName: "replication_tasks_lag", metricType: Timer},
-	ExponentialReplicationTasksLag: {metricName: "replication_tasks_lag_ns", metricType: Histogram, exponentialBuckets: Mid1ms24h},
+	ExponentialReplicationTasksLag: {metricName: "replication_tasks_lag_counts", metricType: Histogram, intExponentialBuckets: Mid1To16k},
 	ReplicationTasksLagRaw: {metricName: "replication_tasks_lag_raw", metricType: Timer},
 	ReplicationTasksDelay:  {metricName: "replication_tasks_delay", metricType: Histogram, buckets: ReplicationTaskDelayBucket},
 	ReplicationTasksFetched: {metricName: "replication_tasks_fetched", metricType: Timer},

service/history/replication/task_processor.go

Lines changed: 3 additions & 3 deletions
@@ -278,12 +278,12 @@ func (p *taskProcessorImpl) cleanupAckedReplicationTasks() error {
 		persistence.HistoryTaskCategoryReplication,
 		p.currentCluster,
 	).GetTaskID()
-	lag := time.Duration(maxReadLevel - minAckLevel)
+	lagCount := int(maxReadLevel - minAckLevel)
 	scope := p.metricsClient.Scope(metrics.ReplicationTaskFetcherScope,
 		metrics.TargetClusterTag(p.currentCluster),
 	)
-	scope.RecordTimer(metrics.ReplicationTasksLag, lag)
-	scope.ExponentialHistogram(metrics.ExponentialReplicationTasksLag, lag)
+	scope.RecordTimer(metrics.ReplicationTasksLag, time.Duration(lagCount))
+	scope.IntExponentialHistogram(metrics.ExponentialReplicationTasksLag, lagCount)
 	for {
 		pageSize := p.config.ReplicatorTaskDeleteBatchSize()
 		resp, err := p.shard.GetExecutionManager().RangeCompleteHistoryTask(
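The new `intExponentialBuckets: Mid1To16k` definition records the lag count into exponentially sized integer buckets. The sketch below illustrates the general idea with power-of-two buckets from 1 to 16384 (an assumption suggested by the name `Mid1To16k`; the actual Cadence bucket layout may differ, and `bucketFor` is a hypothetical helper, not a Cadence API).

```go
package main

import (
	"fmt"
	"math/bits"
)

// bucketFor returns the upper bound of the power-of-two bucket a lag
// count falls into, assuming buckets 1, 2, 4, ..., 16384 with a final
// overflow bucket at 16384. Illustrative only.
func bucketFor(n int) int {
	if n <= 1 {
		return 1
	}
	b := 1 << bits.Len(uint(n-1)) // smallest power of two >= n
	if b > 16384 {
		return 16384 // everything larger lands in the overflow bucket
	}
	return b
}

func main() {
	for _, lag := range []int{0, 3, 500, 20000} {
		fmt.Printf("lag=%d -> bucket<=%d\n", lag, bucketFor(lag))
	}
}
```

Exponential buckets like these give fine resolution for small backlogs while still capping cardinality for large ones, which is the usual motivation for this bucket shape in queue-depth metrics.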
