0.5.2: serialize concurrent entity updates via sidecar flock #27
Merged
mgoldsborough merged 1 commit into main, Apr 17, 2026
Conversation
Production bug: two tool calls targeting the same entity in parallel (e.g. `update_deal` + `move_deal_stage` on the same deal) each read the same pre-state, compute their update, and write sequentially. The final on-disk state usually ends up consistent (last writer wins), but the intermediate tool responses lie: each returns the state it wrote, unaware of the other's overlap. Observed in conv_30f049cdb75d464f on ws_mat: `move_deal_stage` returned `previous_stage="lead"` when the actual prior stage was "proposal".

Fix: `update_entity` and `delete_entity` now acquire an exclusive flock on a sidecar `.lock` file for the duration of the read-modify-write.

- 30s acquisition timeout → `EntityLockTimeout` instead of wedging forever if a writer is stuck alive
- Thread-local reentry tracking → nested `update_entity` on the same entity within one thread doesn't self-deadlock
- OS releases the lock automatically on process death (FD close)
- Windows (no `fcntl`) falls through to a no-op; no worse than 0.5.1

This is a tactical patch. The architectural fix — versioned optimistic concurrency via the existing `version` field, symmetric across Python and TypeScript — is tracked in #26 and will ship as 0.6.0. 0.5.2 unblocks the production race today; 0.6.0 does it properly.
Summary
Tactical patch for the concurrent-update race observed in production. The architectural fix (versioned optimistic concurrency, both SDKs) is tracked in #26 and targets 0.6.0. This PR ships the flock stopgap so the race stops affecting users today.
The bug
From conv_30f049cdb75d464f on ws_mat: an agent parallelized `update_deal({value: 15000})` and `move_deal_stage("negotiation")` on the same deal at `stage: proposal`. Both tools reported `previous_stage: "lead"`, a default value the deal had never actually been at. The final on-disk state was correct, but the intermediate tool responses lied about prior state.

Root cause: `update_entity` does read-modify-write with no concurrency control. Two concurrent writers each read the pre-state, each compute their update, each write. Last write wins on disk; every response describes a reality that didn't exist.
The fix
`update_entity` and `delete_entity` now acquire an exclusive `fcntl.flock` on a sidecar `.lock` file for the duration of the read-modify-write. Writers serialize; responses reflect real prior state.

Hardening:

- 30s acquisition timeout raises `EntityLockTimeout` rather than wedging forever on a stuck-alive writer
- Thread-local reentry tracking: nested `update_entity` calls on the same entity in one thread don't self-deadlock
- The `fcntl` import is guarded; no-op on Windows with a clear warning at module load. Concurrent updates remain unsafe there, but no worse than 0.5.1
Known limitations — by design
This is not the architectural fix. See #26 for the 0.6.0 plan: versioned optimistic concurrency using the existing `version` field, symmetric across both SDKs.

The flock work here can either stay as a performance optimization on top of CAS in 0.6.0, or be removed entirely. Decision deferred to the 0.6.0 design.
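For context, a minimal sketch of what version-based compare-and-swap could look like. This is assumed, not taken from #26; `cas_update`, its retry policy, and the JSON-file storage model are all hypothetical.

```python
import json
import os
import tempfile


def cas_update(path: str, mutate, max_retries: int = 5):
    """Apply mutate(entity) -> entity only if `version` is unchanged."""
    for _ in range(max_retries):
        with open(path) as f:
            entity = json.load(f)
        expected = entity["version"]
        updated = mutate(dict(entity))
        updated["version"] = expected + 1
        # Re-read and compare before writing. A real implementation would
        # make this check-and-write step itself atomic; here it only
        # narrows the race window.
        with open(path) as f:
            if json.load(f)["version"] != expected:
                continue  # lost the race: retry against fresh state
        tmp = tempfile.NamedTemporaryFile(
            "w", dir=os.path.dirname(path) or ".", delete=False
        )
        json.dump(updated, tmp)
        tmp.close()
        os.replace(tmp.name, path)  # atomic rename on POSIX
        return updated
    raise RuntimeError("CAS update failed after retries")
```

Unlike the flock, this detects conflicts instead of serializing writers, so a losing writer re-reads and reports the true prior state on retry.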
Test plan
New `TestConcurrentUpdates` class in `tests/test_entity.py`:

- `test_parallel_updates_do_not_clobber_distinct_fields` — 20 iterations of two threads updating different fields; both writes must land
- `test_lock_times_out_if_never_released` — holder blocks indefinitely, waiter raises `EntityLockTimeout` within its deadline
- `test_lock_is_reentrant_on_same_thread` — nested `_entity_lock` calls return without deadlock

Also:

- `ruff format`/`check` + `ty check` clean
- `python-v0.5.2` to trigger PyPI publish (TS unchanged at 0.5.1)

Related
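For reference, the lost-update behavior these tests guard against can be reproduced in miniature without any lock. This is a standalone sketch, not part of the test suite; the shared dict stands in for the on-disk entity and all names are illustrative.

```python
import threading

entity = {"stage": "proposal", "value": 10000}
write_lock = threading.Lock()  # only guards the dict mutation itself
barrier = threading.Barrier(2)


def update(field, new_value):
    pre = dict(entity)                # read the whole pre-state
    barrier.wait()                    # both writers now hold the same snapshot
    post = {**pre, field: new_value}  # compute the update from that snapshot
    with write_lock:                  # whole-entity write: last writer wins
        entity.clear()
        entity.update(post)


t1 = threading.Thread(target=update, args=("value", 15000))
t2 = threading.Thread(target=update, args=("stage", "negotiation"))
t1.start(); t2.start()
t1.join(); t2.join()

# Exactly one update survives: the final state was computed from a snapshot
# that never saw the other writer's change.
print(entity)
```

Wrapping the read-modify-write in the sidecar lock (or rejecting stale versions under CAS) makes both updates land.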