fix(matching): we returned wrong error if shard is lost #7713
dkrotx wants to merge 1 commit into cadence-workflow:master
During refactoring for cadence-workflow#7711. When the shard was lost and the matcher was on shard-manager, we returned an untyped error, which could lead to wrong retries from the client. Returning the right error also ensures we have the same stats/logs as we would have without shard-manager.
Signed-off-by: Jan Kisel <dkrot@uber.com>
20fc077 to 37ca410
🔍 CI failure analysis for 37ca410: The Golang integration test with sqlite failed after 679 seconds with no specific test failure output, likely a timeout or teardown issue unrelated to this PR's matching service error handling changes.
Issue
The Golang integration test with sqlite failed after running for 679 seconds (over 11 minutes) with exit code 2. The test suite completed with …
Root Cause
Integration Test Failure: Likely Timeout or Flakiness
This integration test failure appears unrelated to the PR changes based on the following analysis:
Test Characteristics:
Evidence Found in Logs:
Why This Cannot Be Caused By The PR:
Details
What the Logs Show:
What's Missing:
Integration Test Nature:
Code Review ✅ Approved 3 resolved / 3 findings
Clean refactoring that correctly returns typed errors for shard ownership loss across both SD and legacy paths. Code is well-tested with comprehensive table-driven tests covering error and success scenarios.
✅ 3 resolved
✅ Quality: Typo in Fatal log message: "im" should be "in"
✅ Bug:
What changed?
During refactoring for #7711
Return a typed error when the shard is lost and the matcher uses shard-manager.
Why?
Previously we returned an untyped error, which could lead to wrong retries from the client.
Also, returning an untyped error prevented us from emitting the same stats (metrics.CadenceErrTaskListNotOwnedByHostPerTaskListCounter) and logs that we emit when matching is not onboarded to shard-manager. Handling this error should work the same way with shard-manager enabled.
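The difference between the two kinds of error can be sketched as follows. This is a minimal illustration, not the Cadence code: `notOwnedError`, `lookupUntyped`, `lookupTyped`, and `shouldRedirect` are hypothetical names, and the redirect policy is an assumption. It shows that a caller relying on `errors.As` can only recognize ownership loss when the error carries a type.

```go
package main

import (
	"errors"
	"fmt"
)

// notOwnedError is a hypothetical typed error standing in for the
// "task list is not owned by this host" condition.
type notOwnedError struct {
	TaskList string
	Owner    string
}

func (e *notOwnedError) Error() string {
	return fmt.Sprintf("task list %q is owned by %q", e.TaskList, e.Owner)
}

// lookupUntyped models the old behavior: the cause lives only in the message.
func lookupUntyped() error {
	return errors.New("shard ownership lost")
}

// lookupTyped models the fixed behavior: the cause is a typed error,
// wrapped here to show that errors.As still sees through wrapping.
func lookupTyped() error {
	return fmt.Errorf("poll failed: %w", &notOwnedError{TaskList: "tl-1", Owner: "host-b"})
}

// shouldRedirect is an assumed caller-side policy: re-resolve the owner
// only when the error is recognizably an ownership loss.
func shouldRedirect(err error) bool {
	var notOwned *notOwnedError
	return errors.As(err, &notOwned)
}

func main() {
	fmt.Println(shouldRedirect(lookupUntyped())) // false
	fmt.Println(shouldRedirect(lookupTyped()))   // true
}
```

With the untyped error the caller has nothing to match on except the message text, so it falls back to whatever its generic error handling does.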
How did you test it?
Wrote a new table-driven test for engine.errIfShardOwnershipLost:
go test ./service/matching/handler -run TestErrIfShardOwnershipLost
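For readers unfamiliar with the pattern, a table-driven check of this kind of function has roughly the shape below. The function `errIfOwnershipLost`, its parameters, and the `notOwnedError` type are hypothetical stand-ins, not the signature of the real `engine.errIfShardOwnershipLost`. A real test would live in a `_test.go` file and use `testing.T` with `t.Run` instead of `main`.

```go
package main

import (
	"errors"
	"fmt"
)

// notOwnedError is a hypothetical typed error for lost ownership.
type notOwnedError struct{ Owner, Self string }

func (e *notOwnedError) Error() string {
	return fmt.Sprintf("owned by %q, not %q", e.Owner, e.Self)
}

// errIfOwnershipLost is a hypothetical stand-in: it returns a typed
// error when another host owns the shard, and nil otherwise.
func errIfOwnershipLost(owner, self string) error {
	if owner == self {
		return nil
	}
	return &notOwnedError{Owner: owner, Self: self}
}

// runCases walks the table and returns the names of failing cases.
func runCases() []string {
	cases := []struct {
		name      string
		owner     string
		self      string
		wantTyped bool
	}{
		{"still owner", "host-a", "host-a", false},
		{"ownership moved", "host-b", "host-a", true},
	}
	var failed []string
	for _, tc := range cases {
		err := errIfOwnershipLost(tc.owner, tc.self)
		var notOwned *notOwnedError
		gotTyped := errors.As(err, &notOwned)
		// Expect a typed error exactly when ownership moved, and nil otherwise.
		if gotTyped != tc.wantTyped || (err != nil) != tc.wantTyped {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", len(runCases()))
}
```

Keeping both the error and the success row in one table makes it cheap to add further rows later without duplicating the assertion logic.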
Potential risks
There should be no risk, as the changes are only used in the frontend -> matching integration (Cadence internal) and in auxiliary stats emitted from matching.
Release notes
Documentation Changes
Reviewer Validation
PR Description Quality (check these before reviewing code):
go test invocation)