Skip to content

Conversation

@vroldanbet
Copy link
Contributor

@vroldanbet vroldanbet commented Jul 25, 2025

Built on top of #146, this PR only adds the last commit

I started all of this because of flakes that were trivially reproducible in #146, and the discovery that the underlying WriteToSpiceDB workflow activity was non-idempotent.

It was used by both the optimistic and pessimistic locking workflows, and the lack of idempotency meant the workflows couldn't recover by themselves from transient errors. When the tests failed for the wrong reasons, they flaked.

We introduced failpoints as a mechanism to simulate panics similar to hardware or network failures. However, some recent changes meant the failpoints were no longer unique and were causing failures in the wrong places.

  • When the pessimistic writer wrote an exclusive lock, retrying it due to failure wouldn't work, because the precondition prevented it (make sure no one has a lock on this resource!)
  • When the optimistic writer wrote with CREATE semantics, a subsequent retry after a successful write due to a failure response would fail because the tuples already exist.

This commit proposes making WriteToSpiceDB truly idempotent by introducing idempotency keys in the SpiceDB schema. All writes will include a relationships that identify the workflow and the hash of the payload as the idempotency key.

The flow is as follows:

  • perform write
  • If failure happens, check if the idempotency key was written in the previous request
  • if exists, assume operation was successful
  • if it does not, bubble up the error

The cost of the extra ReadRelationships is only paid in the event of a retry due to a failure.

The tests were written with the assumption that the system would bubble up errors after recovery. This goes against the expectations of a durable workflow engine, which embraces idempotency and is expected to retry on errors, rather than have the client retry, unless those are unrecoverable.

This was all by design: the workflow wouldn't be responsible for retrying things, but rather execute compensatory operations after an operation failed. This meant the client had to retry those errors, and it turns out that troubleshooting what happened on a transient error is not that trivial for folks building on top of the spicedb-kubeapi-proxy.

@github-actions github-actions bot added area/tooling Affects the dev or user toolchain area/core labels Jul 25, 2025
@codecov-commenter
Copy link

codecov-commenter commented Jul 25, 2025

Codecov Report

❌ Patch coverage is 83.47107% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.77%. Comparing base (d3c3219) to head (0e61400).

Files with missing lines Patch % Lines
pkg/authz/distributedtx/activity.go 81.39% 11 Missing and 5 partials ⚠️
pkg/authz/distributedtx/workflow.go 88.23% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
+ Coverage   65.67%   65.77%   +0.09%     
==========================================
  Files          25       25              
  Lines        3292     3392     +100     
==========================================
+ Hits         2162     2231      +69     
- Misses        890      913      +23     
- Partials      240      248       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vroldanbet vroldanbet force-pushed the fix-non-idempotent-lock branch from 6316aa6 to 68b7862 Compare July 25, 2025 16:15
@vroldanbet vroldanbet requested a review from ecordell July 25, 2025 16:26
@vroldanbet vroldanbet self-assigned this Jul 25, 2025
@vroldanbet vroldanbet requested a review from josephschorr July 25, 2025 16:29
@vroldanbet vroldanbet changed the base branch from prevent-panic to main July 25, 2025 16:33
@vroldanbet vroldanbet force-pushed the fix-non-idempotent-lock branch from 68b7862 to af0f2f6 Compare July 25, 2025 16:41
miparnisari and others added 5 commits July 31, 2025 17:48
I started all of this because of flakes that were trivially
to reproduce, and the underlying issue was WriteToSpiceDB was not idempotent.

It was used by both the optimistic and pesimistic locking workflows, and
the lack of idempotency meant the workflows couldn't recover
by themselves from transient errors. When the tests failed
for the wrong reasons, they flaked.

We introduced `failpoints` as a mechanism to simulate panics
similar to hardware or network failures. However, some recent
changes meant the failpoints were no longer unique and
were causing failures _in the wrong places_.

- when the pessimistic wrote an exclusive lock, retrying it due to
  failure wouldn't work, because the precondition prevented it
  (_make sure no one has a lock on this resource_!)
- when the optimistic wrote with CREATE semantics, a
  subsequent retry after a successful write due to failure
  response failure would fail because the tuples already exist.

This commit proposes making WriteToSpiceDB truly idempotent by
introducing idempotency keys in the SpiceDB schema. All writes
will include a relationships that identify the workflow and the hash
of the payload as the idempotency key.

The flow is as follows:
- perform write
- if failure happens, check if idempotency key was written in previous request
- if exists, assume operation was successful
- if it does not, bubble up the error

The cost of the extra ReadRelationships is only paid in the even of a retry
due to a failure.

The tests were written with the assumption that the system would bubble up
errors after recovery. This goes against the expectations of a durable
workflow engine, which embraces idempotency and is expected to retry
on errors, rather than have the client retry, unless those are unrecoverable.

This was all by design: the workflow wouldn't be responsible to retry things, but rather
execute compensatory operations after an operation failed. This meant
the client had to retry those errors, and it turns out troubleshooting
what happened on a transient error is not that trivial for folks
building on top of the spicedb-kubeapi-proxy.
@vroldanbet vroldanbet force-pushed the fix-non-idempotent-lock branch from af0f2f6 to 4b28081 Compare July 31, 2025 16:48
adds relationship expiration to idempotency keys, so they
don't end up accumulating indefinitely in the database
@github-actions github-actions bot added the area/dependencies Affects dependencies label Jul 31, 2025
@vroldanbet vroldanbet requested a review from ecordell July 31, 2025 18:14
@vroldanbet vroldanbet merged commit eacf9c1 into main Jul 31, 2025
7 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jul 31, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area/core area/dependencies Affects dependencies area/tooling Affects the dev or user toolchain

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants