fix lack of idempotency in SpiceDB writes #148

vroldanbet · 2025-07-25T16:00:45Z

Built on top of #146, this PR only adds the last commit

I started all of this because of flakes that were trivially reproducible in #146, and the discovery that the underlying WriteToSpiceDB workflow activity was non-idempotent.

It was used by both the optimistic and pessimistic locking workflows, and the lack of idempotency meant the workflows couldn't recover by themselves from transient errors. When the tests failed for the wrong reasons, they flaked.

We introduced failpoints as a mechanism to simulate panics similar to hardware or network failures. However, some recent changes meant the failpoints were no longer unique and were causing failures in the wrong places.

When the pessimistic writer wrote an exclusive lock, retrying it due to failure wouldn't work, because the precondition prevented it (make sure no one has a lock on this resource!)
When the optimistic writer wrote with CREATE semantics, a subsequent retry after a successful write due to a failure response would fail because the tuples already exist.

This commit proposes making WriteToSpiceDB truly idempotent by introducing idempotency keys in the SpiceDB schema. All writes will include a relationships that identify the workflow and the hash of the payload as the idempotency key.

The flow is as follows:

perform write
If failure happens, check if the idempotency key was written in the previous request
if exists, assume operation was successful
if it does not, bubble up the error

The cost of the extra ReadRelationships is only paid in the event of a retry due to a failure.

The tests were written with the assumption that the system would bubble up errors after recovery. This goes against the expectations of a durable workflow engine, which embraces idempotency and is expected to retry on errors, rather than have the client retry, unless those are unrecoverable.

This was all by design: the workflow wouldn't be responsible for retrying things, but rather execute compensatory operations after an operation failed. This meant the client had to retry those errors, and it turns out that troubleshooting what happened on a transient error is not that trivial for folks building on top of the spicedb-kubeapi-proxy.

codecov-commenter · 2025-07-25T16:05:45Z

Codecov Report

❌ Patch coverage is 83.47107% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.77%. Comparing base (d3c3219) to head (0e61400).

Files with missing lines	Patch %	Lines
pkg/authz/distributedtx/activity.go	81.39%	11 Missing and 5 partials ⚠️
pkg/authz/distributedtx/workflow.go	88.23%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
+ Coverage   65.67%   65.77%   +0.09%     
==========================================
  Files          25       25              
  Lines        3292     3392     +100     
==========================================
+ Hits         2162     2231      +69     
- Misses        890      913      +23     
- Partials      240      248       +8

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

pkg/authz/distributedtx/activity.go

pkg/authz/distributedtx/workflow.go

I started all of this because of flakes that were trivially to reproduce, and the underlying issue was WriteToSpiceDB was not idempotent. It was used by both the optimistic and pesimistic locking workflows, and the lack of idempotency meant the workflows couldn't recover by themselves from transient errors. When the tests failed for the wrong reasons, they flaked. We introduced `failpoints` as a mechanism to simulate panics similar to hardware or network failures. However, some recent changes meant the failpoints were no longer unique and were causing failures _in the wrong places_. - when the pessimistic wrote an exclusive lock, retrying it due to failure wouldn't work, because the precondition prevented it (_make sure no one has a lock on this resource_!) - when the optimistic wrote with CREATE semantics, a subsequent retry after a successful write due to failure response failure would fail because the tuples already exist. This commit proposes making WriteToSpiceDB truly idempotent by introducing idempotency keys in the SpiceDB schema. All writes will include a relationships that identify the workflow and the hash of the payload as the idempotency key. The flow is as follows: - perform write - if failure happens, check if idempotency key was written in previous request - if exists, assume operation was successful - if it does not, bubble up the error The cost of the extra ReadRelationships is only paid in the even of a retry due to a failure. The tests were written with the assumption that the system would bubble up errors after recovery. This goes against the expectations of a durable workflow engine, which embraces idempotency and is expected to retry on errors, rather than have the client retry, unless those are unrecoverable. This was all by design: the workflow wouldn't be responsible to retry things, but rather execute compensatory operations after an operation failed. This meant the client had to retry those errors, and it turns out troubleshooting what happened on a transient error is not that trivial for folks building on top of the spicedb-kubeapi-proxy.

adds relationship expiration to idempotency keys, so they don't end up accumulating indefinitely in the database

github-actions bot added area/tooling Affects the dev or user toolchain area/core labels Jul 25, 2025

vroldanbet mentioned this pull request Jul 25, 2025

fix: prevent panics in KubeConflict #146

Closed

vroldanbet force-pushed the fix-non-idempotent-lock branch from 6316aa6 to 68b7862 Compare July 25, 2025 16:15

vroldanbet requested a review from ecordell July 25, 2025 16:26

vroldanbet self-assigned this Jul 25, 2025

vroldanbet requested a review from josephschorr July 25, 2025 16:29

vroldanbet changed the base branch from prevent-panic to main July 25, 2025 16:33

vroldanbet force-pushed the fix-non-idempotent-lock branch from 68b7862 to af0f2f6 Compare July 25, 2025 16:41

vroldanbet commented Jul 29, 2025

View reviewed changes

pkg/authz/distributedtx/activity.go Outdated Show resolved Hide resolved

ecordell reviewed Jul 30, 2025

View reviewed changes

pkg/authz/distributedtx/activity.go Outdated Show resolved Hide resolved

pkg/authz/distributedtx/activity.go Show resolved Hide resolved

pkg/authz/distributedtx/workflow.go Outdated Show resolved Hide resolved

miparnisari and others added 5 commits July 31, 2025 17:48

fix: prevent panics in KubeConflict

58c3c84

minor adjustments

0af5060

change test assertion

e148d1e

add run e2e until it fails mage command

27a9395

vroldanbet force-pushed the fix-non-idempotent-lock branch from af0f2f6 to 4b28081 Compare July 31, 2025 16:48

vroldanbet added 3 commits July 31, 2025 18:05

return zedtoken in WriteToSpiceDB activity in case of retries

305f098

PR feedback: missing CloneVT call

0e8f02f

PR feedback: clean up idempotency key tuples with expiration

0e61400

adds relationship expiration to idempotency keys, so they don't end up accumulating indefinitely in the database

github-actions bot added the area/dependencies Affects dependencies label Jul 31, 2025

vroldanbet requested a review from ecordell July 31, 2025 18:14

ecordell approved these changes Jul 31, 2025

View reviewed changes

vroldanbet merged commit eacf9c1 into main Jul 31, 2025
7 checks passed

github-actions bot locked and limited conversation to collaborators Jul 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix lack of idempotency in SpiceDB writes #148

fix lack of idempotency in SpiceDB writes #148

Uh oh!

vroldanbet commented Jul 25, 2025 •

edited

Loading

Uh oh!

codecov-commenter commented Jul 25, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

fix lack of idempotency in SpiceDB writes #148

fix lack of idempotency in SpiceDB writes #148

Uh oh!

Conversation

vroldanbet commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

vroldanbet commented Jul 25, 2025 •

edited

Loading

codecov-commenter commented Jul 25, 2025 •

edited

Loading