
feat: enable dual emitting for task attempt and latency related metrics #7743

Merged
neil-xie merged 4 commits into cadence-workflow:master from neil-xie:dual_emitting_task_metrics
Feb 25, 2026

Conversation

neil-xie (Member) commented Feb 24, 2026

What changed?
Start dual-emitting timer and histogram metrics for task latency and task attempt counts.
#7741

Why?
Timer -> Histogram migration

How did you test it?
go test -v ./common/metrics

Potential risks
Metrics storage increase

Release notes

Documentation Changes


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
  • Project issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Signed-off-by: Neil Xie <neil.xie@uber.com>
neil-xie changed the title from "Enable dual emitting for task attempt and latency related metrics" to "feat: enable dual emitting for task attempt and latency related metrics" on Feb 24, 2026
gitar-bot commented Feb 25, 2026

🔍 CI failure analysis for 9616919: Replication simulation test (activeactive_regional_failover) failed with workflow timeout, but this appears unrelated to the metrics instrumentation changes in this PR and is likely a flaky test.

Issue

The replication simulation test TestReplicationSimulation failed in the activeactive_regional_failover scenario:

  • Workflow wf3 timed out during operation 9
  • Status: TIMED_OUT
  • Close time: 2026-02-25 19:52:16 UTC

Root Cause

This failure appears unrelated to the code changes in this PR. The PR only adds metrics instrumentation:

  • Dual-emitting configuration for histogram migration
  • Instruments task execution with RecordTimer and IntExponentialHistogram calls
  • Does not modify replication, workflow execution, or timeout logic

Details

The modified files are:

  • common/metrics/config.go - histogram migration config
  • common/metrics/defs.go - metric definitions
  • common/metrics/scope_test.go - metric tests
  • service/history/task/task.go - adds metrics recording calls only

This is likely a flaky test with timing sensitivities in the cross-cluster replication simulation environment. The metrics instrumentation should not affect:

  • Cross-cluster replication logic
  • Workflow execution or completion behavior
  • Regional failover handling
  • Workflow timeout mechanisms

Code Review ✅ Approved (2 resolved / 2 findings)

Both previous findings (bucket truncation and missing HandleErr dual-emit) have been resolved. The implementation follows the established dual-emit pattern with appropriate bucket configurations and complete config/test coverage.

✅ 2 resolved
Bug: Missing histogram dual-emit in HandleErr for attempt counts

📄 service/history/task/task.go:239
HandleErr() at line 239 records TaskAttemptTimerPerDomain when the attempt count exceeds the critical retry threshold, but does not emit the corresponding ExponentialTaskAttemptCountsPerDomain histogram. In Ack() (lines 376-378), both the timer and histogram are emitted together, so the HandleErr path creates an inconsistency where the histogram will under-count critical retry attempts compared to the timer.

This means operators relying on the new histogram for attempt count distribution will see a different picture than the existing timer during high-retry scenarios, defeating the purpose of dual-emission for migration validation.
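
One way to make this class of inconsistency impossible is to route both call sites through a helper that always emits the timer and histogram together. A hypothetical sketch of that pattern (the actual fix simply adds the histogram call next to the existing timer call in task.go):

```go
package main

import "fmt"

// attemptRecorder pairs the legacy timer observation with its histogram
// twin so a call site cannot emit one without the other. Field names
// stand in for TaskAttemptTimerPerDomain and
// ExponentialTaskAttemptCountsPerDomain respectively.
type attemptRecorder struct {
	timerObs []int64
	histObs  []int64
}

func (r *attemptRecorder) recordAttempts(n int64) {
	r.timerObs = append(r.timerObs, n)
	r.histObs = append(r.histObs, n)
}

// handleErr mirrors the HandleErr path: once the attempt count passes
// the critical retry threshold, both metrics are emitted together.
func handleErr(r *attemptRecorder, attempts, criticalThreshold int64) {
	if attempts >= criticalThreshold {
		r.recordAttempts(attempts)
	}
}

func main() {
	r := &attemptRecorder{}
	handleErr(r, 12, 10) // above threshold: both metrics recorded
	handleErr(r, 3, 10)  // below threshold: neither recorded
	fmt.Println(len(r.timerObs), len(r.histObs))
}
```

With a shared helper, the Ack() and HandleErr() paths stay consistent by construction, which is exactly what migration validation needs.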

Edge Case: ExponentialTaskLatencyPerDomain may truncate long-lived tasks

📄 common/metrics/defs.go:3349
ExponentialTaskLatencyPerDomain uses Low1ms100s buckets (1ms–100s) but measures time.Since(t.initialSubmitTime) — the total end-to-end task latency from initial submission to acknowledgment. For tasks with many retries, this duration can easily exceed 100 seconds, causing all such observations to pile into the overflow bucket and lose histogram resolution.

For comparison, ExponentialReplicationTaskLatency (which similarly measures end-to-end replication task latency) uses Mid1ms24h buckets, and ExponentialTaskQueueLatencyPerDomain (which measures a subset of the same time range) also uses Mid1ms24h.

Consider using Mid1ms24h for ExponentialTaskLatencyPerDomain to match the precedent set by other end-to-end latency histograms and avoid losing precision for long-running tasks.

Rules ❌ No requirements met

Repository Rules

GitHub Issue Linking Requirement: Link issue #7741 using GitHub closing keywords (Closes #7741, Fixes #7741, or Resolves #7741) in the PR description
PR Description Quality Standards: Restructure description to match template sections: What changed?, Why?, How did you test it?, Potential risks, Release notes, Documentation Changes

1 rule not applicable. Show all rules by commenting gitar display:verbose.


@neil-xie neil-xie merged commit 1cf01e4 into cadence-workflow:master Feb 25, 2026
64 of 65 checks passed
@neil-xie neil-xie deleted the dual_emitting_task_metrics branch February 25, 2026 21:25