Skip to content

fix(matching): we returned wrong error if shard is lost#7713

Open
dkrotx wants to merge 1 commit intocadence-workflow:masterfrom
dkrotx:dkrot/refactor-errIfShardOwnershipLost
Open

fix(matching): we returned wrong error if shard is lost#7713
dkrotx wants to merge 1 commit intocadence-workflow:masterfrom
dkrotx:dkrot/refactor-errIfShardOwnershipLost

Conversation

@dkrotx
Copy link
Member

@dkrotx dkrotx commented Feb 16, 2026

What changed?
During refactoring for #7711
Return typed error when shard is lost and matcher uses shard-manager.

Why?
Previously we returned untyped error which could potentially lead to wrong retries from client.
Also, returning untyped error prevented us from same stats (metrics.CadenceErrTaskListNotOwnedByHostPerTaskListCounter) and logs as we emit when matching is not onboarded to shard-manager. Handling this error should work the same way with shard-manager enabled.

How did you test it?
Wrote a new table-test for engine.errIfShardOwnershipLost:
go test ./service/matching/handler -run TestErrIfShardOwnershipLost

Potential risks
Should no risk as the changes only used in frontend -> matching integration (Cadence internal) and auxiliary stats emitted from matching.

Release notes

Documentation Changes


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

During refactoring for cadence-workflow#7711
When shard was lost and matcher is on shard-manager, we returned untyped
error which could potentially lead to wrong retries from client.
Also, returning right error ensures we have the same stats/logs as we
would have without shard-manager.

Signed-off-by: Jan Kisel <dkrot@uber.com>
@dkrotx dkrotx force-pushed the dkrot/refactor-errIfShardOwnershipLost branch from 20fc077 to 37ca410 Compare February 16, 2026 17:44
@gitar-bot
Copy link

gitar-bot bot commented Feb 16, 2026

🔍 CI failure analysis for 37ca410: The Golang integration test with sqlite failed after 679 seconds with no specific test failure output, likely a timeout or teardown issue unrelated to this PR's matching service error handling changes.

Issue

The Golang integration test with sqlite failed after running for 679 seconds (over 11 minutes) with exit code 2. The test suite completed with FAIL github.com/uber/cadence/host 679.067s but did not output which specific test case failed or provide diagnostic information.

Root Cause

Integration Test Failure: Likely Timeout or Flakiness

This integration test failure appears unrelated to the PR changes based on the following analysis:

Test Characteristics:

  • Ran for 679 seconds (over 11 minutes) before failing
  • No specific test case failure output or error messages
  • Logs show normal shutdown sequences with task lists and services stopping gracefully
  • Final line simply states "FAIL" with no diagnostic context
  • Coverage report was generated (25.4%), indicating tests ran but something failed during or after execution

Evidence Found in Logs:

  • Multiple workflow decision task failures with causes like "WORKFLOW_WORKER_UNHANDLED_FAILURE" during test execution
  • Base64-encoded error details: "Decider Panic" and "non deterministic error"
  • These are expected test scenarios being exercised by the integration test suite
  • Final error: "Failed to get replication tasks" with "context canceled" - typical shutdown race condition

Why This Cannot Be Caused By The PR:

  1. Scope Isolation: This PR only modifies service/matching/handler/engine.go and service/matching/handler/engine_test.go

    • Changes are limited to error handling in errIfShardOwnershipLost() and stillOwnShard()
    • No changes to workflow execution, decision handling, or integration test infrastructure
  2. Error Type Consistency: The PR ensures consistent typed errors (TaskListNotOwnedByHostError) are returned

    • This cannot cause decision task panics or non-deterministic errors in workflow execution
    • The error handling changes are backward compatible
  3. No Framework Changes: The PR doesn't modify:

    • Workflow decision processing logic
    • Integration test infrastructure
    • Service initialization or shutdown sequences
    • Replication or cross-cluster communication
  4. Test Pattern: The failure pattern (long runtime, no specific failure, clean shutdown logs) suggests:

    • Test timeout after 11+ minutes
    • Flaky teardown race condition
    • Infrastructure issue, not code bug

Details

What the Logs Show:

  • Normal test execution logging for ~11 minutes
  • Decision task failures that are part of test scenarios (not actual failures)
  • Clean service shutdown sequences
  • Final replication error during teardown: "context canceled"
  • No indication of matching service errors or ownership check failures

What's Missing:

  • No "--- FAIL: TestXxx" output showing which test failed
  • No panic stack traces
  • No assertion failures or error messages
  • No indication that matching service error handling caused issues

Integration Test Nature:

  • These tests bootstrap full Cadence services (frontend, history, matching, worker)
  • They exercise complex workflows with decision tasks, activities, and replication
  • They are prone to flakiness due to timing, resource constraints, and shutdown races
  • Long-running nature (11+ minutes) increases likelihood of environmental issues
Code Review ✅ Approved 3 resolved / 3 findings

Clean refactoring that correctly returns typed errors for shard ownership loss across both SD and legacy paths. Code is well-tested with comprehensive table-driven tests covering error and success scenarios.

✅ 3 resolved
Quality: Typo in Fatal log message: "im" should be "in"

📄 service/matching/handler/engine.go:148
Line 148 has a typo in the fatal error message: "failed to lookup self im membership" should be "failed to lookup self in membership". This same typo also exists in the original code at line 1548 ("failed to lookup task list owner" is fine, but the pattern was copied from the old WhoAmI error message). Since this is a Fatal log message that operators would see during startup failures, it should be clear.

Bug: stillOwnShard always returns true in non-SD path, even if not owned

📄 service/matching/handler/engine.go:1550
In stillOwnShard, when using the membership resolver (non-SD) path, the function sets reason and whoOwns when the identity check fails (taskListOwner.Identity() != e.selfAddress.Identity()), but then falls through to return true, nil on line 1555. This means errIfShardOwnershipLost will see ok == true and return nil, silently allowing operations on task lists this host doesn't own.\n\nThis defeats the entire purpose of the ownership guard for the non-SD code path — the very bug this PR is trying to fix (returning wrong errors when a shard is lost). In the non-SD case, the ownership check is now completely non-functional.\n\nThe fix is to return false, nil inside the identity mismatch branch.

Quality: Tests only cover SD path; non-SD membership path is untested

📄 service/matching/handler/engine_test.go:813
All test cases in TestErrIfShardOwnershipLost set executor.EXPECT().IsOnboardedToSD().Return(true).AnyTimes() on line 813, meaning the non-SD (membership resolver) code path is never exercised. This is why the critical bug in the non-SD path (always returning true) went undetected.\n\nPlease add test cases for the non-SD path:\n1. Membership Lookup fails → returns untyped error\n2. Task list owned by a different host → returns TaskListNotOwnedByHostError with the other host's identity\n3. Task list owned by self → returns nil

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: [Potential risks] section is filled out with risk assessment, and [Why?] section expanded with details about stats/metrics and logs consistency.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant