fix(matching): we returned wrong error if shard is lost by dkrotx · Pull Request #7713 · cadence-workflow/cadence

dkrotx · 2026-02-16T17:18:32Z

What changed?
During refactoring for #7711
Return typed error when shard is lost and matcher uses shard-manager.

Why?
Previously we returned untyped error which could potentially lead to wrong retries from client.
Also, returning untyped error prevented us from same stats (metrics.CadenceErrTaskListNotOwnedByHostPerTaskListCounter) and logs as we emit when matching is not onboarded to shard-manager. Handling this error should work the same way with shard-manager enabled.

How did you test it?
Wrote a new table-test for engine.errIfShardOwnershipLost:
go test ./service/matching/handler -run TestErrIfShardOwnershipLost

Potential risks
Should no risk as the changes only used in frontend -> matching integration (Cadence internal) and auxiliary stats emitted from matching.

Release notes

Documentation Changes

Reviewer Validation

PR Description Quality (check these before reviewing code):

service/matching/handler/engine.go

service/matching/handler/engine_test.go

During refactoring for cadence-workflow#7711 When shard was lost and matcher is on shard-manager, we returned untyped error which could potentially lead to wrong retries from client. Also, returning right error ensures we have the same stats/logs as we would have without shard-manager. Signed-off-by: Jan Kisel <dkrot@uber.com>

gitar-bot · 2026-02-16T17:48:03Z

🔍 CI failure analysis for 37ca410: The Golang integration test with sqlite failed after 679 seconds with no specific test failure output, likely a timeout or teardown issue unrelated to this PR's matching service error handling changes.

Issue

The Golang integration test with sqlite failed after running for 679 seconds (over 11 minutes) with exit code 2. The test suite completed with FAIL github.com/uber/cadence/host 679.067s but did not output which specific test case failed or provide diagnostic information.

Root Cause

Integration Test Failure: Likely Timeout or Flakiness

This integration test failure appears unrelated to the PR changes based on the following analysis:

Test Characteristics:

Ran for 679 seconds (over 11 minutes) before failing
No specific test case failure output or error messages
Logs show normal shutdown sequences with task lists and services stopping gracefully
Final line simply states "FAIL" with no diagnostic context
Coverage report was generated (25.4%), indicating tests ran but something failed during or after execution

Evidence Found in Logs:

Multiple workflow decision task failures with causes like "WORKFLOW_WORKER_UNHANDLED_FAILURE" during test execution
Base64-encoded error details: "Decider Panic" and "non deterministic error"
These are expected test scenarios being exercised by the integration test suite
Final error: "Failed to get replication tasks" with "context canceled" - typical shutdown race condition

Why This Cannot Be Caused By The PR:

Scope Isolation: This PR only modifies service/matching/handler/engine.go and service/matching/handler/engine_test.go
- Changes are limited to error handling in errIfShardOwnershipLost() and stillOwnShard()
- No changes to workflow execution, decision handling, or integration test infrastructure
Error Type Consistency: The PR ensures consistent typed errors (TaskListNotOwnedByHostError) are returned
- This cannot cause decision task panics or non-deterministic errors in workflow execution
- The error handling changes are backward compatible
No Framework Changes: The PR doesn't modify:
- Workflow decision processing logic
- Integration test infrastructure
- Service initialization or shutdown sequences
- Replication or cross-cluster communication
Test Pattern: The failure pattern (long runtime, no specific failure, clean shutdown logs) suggests:
- Test timeout after 11+ minutes
- Flaky teardown race condition
- Infrastructure issue, not code bug

Details

What the Logs Show:

Normal test execution logging for ~11 minutes
Decision task failures that are part of test scenarios (not actual failures)
Clean service shutdown sequences
Final replication error during teardown: "context canceled"
No indication of matching service errors or ownership check failures

What's Missing:

No "--- FAIL: TestXxx" output showing which test failed
No panic stack traces
No assertion failures or error messages
No indication that matching service error handling caused issues

Integration Test Nature:

These tests bootstrap full Cadence services (frontend, history, matching, worker)
They exercise complex workflows with decision tasks, activities, and replication
They are prone to flakiness due to timing, resource constraints, and shutdown races
Long-running nature (11+ minutes) increases likelihood of environmental issues

Code Review ✅ Approved 3 resolved / 3 findings

Clean refactoring that correctly returns typed errors for shard ownership loss across both SD and legacy paths. Code is well-tested with comprehensive table-driven tests covering error and success scenarios.

✅ 3 resolved

✅ Quality: Typo in Fatal log message: "im" should be "in"

📄 service/matching/handler/engine.go:148
Line 148 has a typo in the fatal error message: "failed to lookup self im membership" should be "failed to lookup self in membership". This same typo also exists in the original code at line 1548 ("failed to lookup task list owner" is fine, but the pattern was copied from the old WhoAmI error message). Since this is a Fatal log message that operators would see during startup failures, it should be clear.

✅ Bug: stillOwnShard always returns true in non-SD path, even if not owned

📄 service/matching/handler/engine.go:1550
In stillOwnShard, when using the membership resolver (non-SD) path, the function sets reason and whoOwns when the identity check fails (taskListOwner.Identity() != e.selfAddress.Identity()), but then falls through to return true, nil on line 1555. This means errIfShardOwnershipLost will see ok == true and return nil, silently allowing operations on task lists this host doesn't own.\n\nThis defeats the entire purpose of the ownership guard for the non-SD code path — the very bug this PR is trying to fix (returning wrong errors when a shard is lost). In the non-SD case, the ownership check is now completely non-functional.\n\nThe fix is to return false, nil inside the identity mismatch branch.

✅ Quality: Tests only cover SD path; non-SD membership path is untested

📄 service/matching/handler/engine_test.go:813
All test cases in TestErrIfShardOwnershipLost set executor.EXPECT().IsOnboardedToSD().Return(true).AnyTimes() on line 813, meaning the non-SD (membership resolver) code path is never exercised. This is why the critical bug in the non-SD path (always returning true) went undetected.\n\nPlease add test cases for the non-SD path:\n1. Membership Lookup fails → returns untyped error\n2. Task list owned by a different host → returns TaskListNotOwnedByHostError with the other host's identity\n3. Task list owned by self → returns nil

Rules ✅ All requirements met

Repository Rules

✅ PR Description Quality Standards: [Potential risks] section is filled out with risk assessment, and [Why?] section expanded with details about stats/metrics and logs consistency.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

gitar-bot bot reviewed Feb 16, 2026

View reviewed changes

service/matching/handler/engine.go Show resolved Hide resolved

gitar-bot bot reviewed Feb 16, 2026

View reviewed changes

service/matching/handler/engine_test.go Outdated Show resolved Hide resolved

dkrotx force-pushed the dkrot/refactor-errIfShardOwnershipLost branch from 20fc077 to 37ca410 Compare February 16, 2026 17:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(matching): we returned wrong error if shard is lost#7713

fix(matching): we returned wrong error if shard is lost#7713
dkrotx wants to merge 1 commit intocadence-workflow:masterfrom
dkrotx:dkrot/refactor-errIfShardOwnershipLost

dkrotx commented Feb 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

gitar-bot bot commented Feb 16, 2026 •

edited

Loading

Issue

Root Cause

Details

Repository Rules

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dkrotx commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer Validation

Uh oh!

Uh oh!

Uh oh!

gitar-bot bot commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue

Root Cause

Details

Repository Rules

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dkrotx commented Feb 16, 2026 •

edited

Loading

gitar-bot bot commented Feb 16, 2026 •

edited

Loading