
0.5.2: serialize concurrent entity updates via sidecar flock#27

Merged
mgoldsborough merged 1 commit into main from fix-update-entity-race
Apr 17, 2026

Conversation

@mgoldsborough
Contributor

Summary

Tactical patch for the concurrent-update race observed in production. The architectural fix (versioned optimistic concurrency, both SDKs) is tracked in #26 and targets 0.6.0. This PR ships the flock stopgap so the race stops affecting users today.

The bug

From conv_30f049cdb75d464f on ws_mat: an agent parallelized update_deal({value: 15000}) and move_deal_stage("negotiation") on the same deal at stage: proposal. Both tools reported previous_stage: "lead" — a default value the deal had never actually been at. The final on-disk state was correct, but the intermediate tool responses lied about prior state.

Root cause: update_entity does read-modify-write with no concurrency control. Two concurrent writers each read the pre-state, each compute their update, each write. Last write wins on disk; every response describes a reality that didn't exist.
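
The lost-update shape is easy to reproduce deterministically. A minimal sketch (the on-disk layout and field names are illustrative stand-ins, not the real store): two writers read the same pre-state, and the second write, computed from a stale read, silently discards the first.

```python
import json
import os
import tempfile

def read(path):
    with open(path) as f:
        return json.load(f)

def write(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

# Hypothetical on-disk deal; field names mirror the production example.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump({"stage": "proposal", "value": 1000}, f)

# Both writers read the same pre-state before either writes:
state_a = read(path)             # update_deal's read
state_b = read(path)             # move_deal_stage's read

state_a["value"] = 15000
write(path, state_a)             # writer A lands

state_b["stage"] = "negotiation"
write(path, state_b)             # writer B, computed from a stale read, clobbers A

final = read(path)
print(final)  # {'stage': 'negotiation', 'value': 1000} - A's update is silently lost
```

In the production conversation the final state happened to come out consistent; the sketch shows the general case, where last-write-wins can also drop a field outright.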

The fix

update_entity and delete_entity now acquire an exclusive fcntl.flock on a sidecar .lock file for the duration of the read-modify-write. Writers serialize; responses reflect real prior state.

Hardening:

  • 30s acquisition timeout → EntityLockTimeout rather than wedging forever behind a writer that is stuck but still alive
  • Thread-local reentry tracking → nested update_entity calls on the same entity in one thread don't self-deadlock
  • Process-death safety — OS releases the lock on FD close, so a crashed holder doesn't permanently wedge others
  • Windows fallback — fcntl import is guarded; the lock is a no-op on Windows with a clear warning at module load. Concurrent updates remain unsafe there, but no worse than 0.5.1.

Known limitations — by design

This is not the architectural fix. See #26 for the 0.6.0 plan: versioned optimistic concurrency using the existing version field, symmetric across both SDKs. That approach is:

  • Portable (works on Windows, NFS, any atomic-rename filesystem)
  • Self-documenting (version increments are observable; stale-read bugs detectable after the fact)
  • Cross-SDK (TypeScript gets identical semantics)
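
For contrast, the 0.6.0 direction can be sketched as a compare-and-swap on the existing version field. This shows the intended semantics only; the function name and on-disk shape are assumptions, and as the comment notes, the check-then-rename window still needs closing in the real design.

```python
import json
import os
import tempfile

class VersionConflict(Exception):
    """The caller's base version is stale; re-read and retry."""

def cas_update(path, expected_version, patch):
    with open(path) as f:
        state = json.load(f)
    if state["version"] != expected_version:
        raise VersionConflict(f"expected v{expected_version}, found v{state['version']}")
    new_state = {**state, **patch, "version": expected_version + 1}
    # Atomic rename: readers never observe a half-written file. The check-then-rename
    # window above is not itself atomic; the real design must close it separately.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(new_state, f)
    os.replace(tmp, path)
    return new_state
```

A writer whose base version is stale gets an explicit `VersionConflict` and retries from a fresh read, so no response ever describes a state that did not exist.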

The flock work here can either remain as a performance optimization layered on top of CAS in 0.6.0, or be removed entirely. That decision is deferred to the 0.6.0 design.

Test plan

  • New TestConcurrentUpdates class in tests/test_entity.py:
    • test_parallel_updates_do_not_clobber_distinct_fields — 20 iterations of two threads updating different fields; both writes must land
    • test_lock_times_out_if_never_released — holder blocks indefinitely, waiter raises EntityLockTimeout within its deadline
    • test_lock_is_reentrant_on_same_thread — nested _entity_lock calls return without deadlock
  • All 414 existing tests still pass
  • ruff format/check + ty check clean
  • After merge: tag python-v0.5.2 to trigger PyPI publish (TS unchanged at 0.5.1)

Related

Production bug: two tool calls targeting the same entity in parallel
(e.g. update_deal + move_deal_stage on the same deal) each read the
same pre-state, compute their update, and write sequentially. The
final on-disk state usually ends up consistent (last writer wins),
but the intermediate tool responses lie — each returns the state it
wrote, unaware of the other's overlap. Observed in conv_30f049cdb75d464f
on ws_mat: move_deal_stage returned previous_stage="lead" when the
actual prior stage was "proposal".

Fix: update_entity and delete_entity now acquire an exclusive flock
on a sidecar .lock file for the duration of read-modify-write.

- 30s acquisition timeout → EntityLockTimeout instead of wedging
  forever if a writer is stuck alive
- Thread-local reentry tracking → nested update_entity on the same
  entity within one thread doesn't self-deadlock
- OS releases the lock automatically on process death (FD close)
- Windows (no fcntl) falls through to a no-op; no worse than 0.5.1

This is a tactical patch. The architectural fix — versioned optimistic
concurrency via the existing `version` field, symmetric across Python
and TypeScript — is tracked in #26 and will ship as 0.6.0. 0.5.2
unblocks the production race today; 0.6.0 does it properly.
@mgoldsborough mgoldsborough merged commit 0c19df3 into main Apr 17, 2026
4 checks passed
@mgoldsborough mgoldsborough deleted the fix-update-entity-race branch April 17, 2026 06:45