
0.5.2: serialize concurrent entity updates via sidecar flock#27

Merged
mgoldsborough merged 1 commit into main from fix-update-entity-race
Apr 17, 2026

Conversation

@mgoldsborough
Contributor

Summary

Tactical patch for the concurrent-update race observed in production. The architectural fix (versioned optimistic concurrency, both SDKs) is tracked in #26 and targets 0.6.0. This PR ships the flock stopgap so the race stops affecting users today.

The bug

From conv_30f049cdb75d464f on ws_mat: an agent parallelized update_deal({value: 15000}) and move_deal_stage("negotiation") on the same deal at stage: proposal. Both tools reported previous_stage: "lead" — a default value the deal had never actually been at. The final on-disk state was correct, but the intermediate tool responses lied about prior state.

Root cause: update_entity does read-modify-write with no concurrency control. Two concurrent writers each read the pre-state, each compute their update, each write. Last write wins on disk; every response describes a reality that didn't exist.
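
The lost-update shape is easy to reproduce deterministically. A minimal sketch (the on-disk layout and field names are illustrative stand-ins, not the real store): two writers read the same pre-state, and the second write, computed from a stale read, silently discards the first.

```python
import json
import os
import tempfile

def read(path):
    with open(path) as f:
        return json.load(f)

def write(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

# Hypothetical on-disk deal; field names mirror the production example.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump({"stage": "proposal", "value": 1000}, f)

# Both writers read the same pre-state before either writes:
state_a = read(path)             # update_deal's read
state_b = read(path)             # move_deal_stage's read

state_a["value"] = 15000
write(path, state_a)             # writer A lands

state_b["stage"] = "negotiation"
write(path, state_b)             # writer B, computed from a stale read, clobbers A

final = read(path)
print(final)  # {'stage': 'negotiation', 'value': 1000} - A's update is silently lost
```

In the production conversation the final state happened to come out consistent; the sketch shows the general case, where last-write-wins can also drop a field outright.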

The fix

update_entity and delete_entity now acquire an exclusive fcntl.flock on a sidecar .lock file for the duration of the read-modify-write. Writers serialize; responses reflect real prior state.

Hardening:

  • 30s acquisition timeout → EntityLockTimeout rather than wedging forever behind a writer that is stuck but still alive
  • Thread-local reentry tracking → nested update_entity calls on the same entity in one thread don't self-deadlock
  • Process-death safety — OS releases the lock on FD close, so a crashed holder doesn't permanently wedge others
  • Windows fallback — fcntl import is guarded; the lock is a no-op on Windows with a clear warning at module load. Concurrent updates remain unsafe there, but no worse than 0.5.1.

Known limitations — by design

This is not the architectural fix. See #26 for the 0.6.0 plan: versioned optimistic concurrency using the existing version field, symmetric across both SDKs. That approach is:

  • Portable (works on Windows, NFS, any atomic-rename filesystem)
  • Self-documenting (version increments are observable; stale-read bugs detectable after the fact)
  • Cross-SDK (TypeScript gets identical semantics)
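
For contrast, the 0.6.0 direction can be sketched as a compare-and-swap on the existing version field. This shows the intended semantics only; the function name and on-disk shape are assumptions, and as the comment notes, the check-then-rename window still needs closing in the real design.

```python
import json
import os
import tempfile

class VersionConflict(Exception):
    """The caller's base version is stale; re-read and retry."""

def cas_update(path, expected_version, patch):
    with open(path) as f:
        state = json.load(f)
    if state["version"] != expected_version:
        raise VersionConflict(f"expected v{expected_version}, found v{state['version']}")
    new_state = {**state, **patch, "version": expected_version + 1}
    # Atomic rename: readers never observe a half-written file. The check-then-rename
    # window above is not itself atomic; the real design must close it separately.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(new_state, f)
    os.replace(tmp, path)
    return new_state
```

A writer whose base version is stale gets an explicit `VersionConflict` and retries from a fresh read, so no response ever describes a state that did not exist.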

The flock work here can either remain as a performance optimization layered on top of CAS in 0.6.0, or be removed entirely. That decision is deferred to the 0.6.0 design.

Test plan

  • New TestConcurrentUpdates class in tests/test_entity.py:
    • test_parallel_updates_do_not_clobber_distinct_fields — 20 iterations of two threads updating different fields; both writes must land
    • test_lock_times_out_if_never_released — holder blocks indefinitely, waiter raises EntityLockTimeout within its deadline
    • test_lock_is_reentrant_on_same_thread — nested _entity_lock calls return without deadlock
  • All 414 existing tests still pass
  • ruff format/check + ty check clean
  • After merge: tag python-v0.5.2 to trigger PyPI publish (TS unchanged at 0.5.1)

Related

Production bug: two tool calls targeting the same entity in parallel
(e.g. update_deal + move_deal_stage on the same deal) each read the
same pre-state, compute their update, and write sequentially. The
final on-disk state usually ends up consistent (last writer wins),
but the intermediate tool responses lie — each returns the state it
wrote, unaware of the other's overlap. Observed in conv_30f049cdb75d464f
on ws_mat: move_deal_stage returned previous_stage="lead" when the
actual prior stage was "proposal".

Fix: update_entity and delete_entity now acquire an exclusive flock
on a sidecar .lock file for the duration of read-modify-write.

- 30s acquisition timeout → EntityLockTimeout instead of wedging
  forever if a writer is stuck alive
- Thread-local reentry tracking → nested update_entity on the same
  entity within one thread doesn't self-deadlock
- OS releases the lock automatically on process death (FD close)
- Windows (no fcntl) falls through to a no-op; no worse than 0.5.1

This is a tactical patch. The architectural fix — versioned optimistic
concurrency via the existing `version` field, symmetric across Python
and TypeScript — is tracked in #26 and will ship as 0.6.0. 0.5.2
unblocks the production race today; 0.6.0 does it properly.
@mgoldsborough mgoldsborough merged commit 0c19df3 into main Apr 17, 2026
4 checks passed
@mgoldsborough mgoldsborough deleted the fix-update-entity-race branch April 17, 2026 06:45