fix: resolve stale lock and same-instance write contention (#622, #623) #626

Merged
rwmjhb merged 1 commit into CortexReach:master from jlin53882:fix/issue-622-623-lock-clean-v2
Apr 15, 2026

@jlin53882
Contributor

Summary

Fixes two related lock issues in memory-lancedb-pro (#622, #623): stale locks and same-instance write contention.

Changes

src/store.ts

  • Added `updateQueue` for intra-instance write serialization
  • Added `lockfilePath` option with explicit lock path
  • Added proactive stale lock cleanup (5-minute threshold)
  • Fixed nested deadlock where `update()` used both `runWithFileLock()` AND `runSerializedUpdate()`
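
The intra-instance serialization in the first bullet can be pictured as a promise chain. The following is a minimal sketch, not the code from src/store.ts: the class name `SerializedWriter` and the method `enqueue` are hypothetical, and only the field name `updateQueue` comes from the PR description.

```typescript
// Hypothetical sketch: serialize async writes within one instance by
// chaining each task onto the tail of a promise queue.
class SerializedWriter {
  // Tail of the queue; settles when the most recently enqueued task settles.
  private updateQueue: Promise<void> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start the task only after every earlier task has settled.
    const result = this.updateQueue.then(task);
    // Swallow the outcome on the queue side so that one failed write
    // does not block the writes queued behind it. The caller still
    // sees the rejection through `result`.
    this.updateQueue = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}
```

In this sketch a task must not await `enqueue` on the same instance: the inner task is queued behind the outer one, so the outer task would wait forever.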

test/lock-recovery.test.mjs

  • Tests for stale lock recovery

test/store-write-queue.test.mjs

  • Tests for queue serialization

Testing

  • 7 tests pass, 1 skip

Closes #622
Closes #623


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f691cff8bd

Comment thread src/store.ts Outdated
Comment on lines +223 to +224
if (ageMs > staleThresholdMs) {
if (stat.isDirectory()) {

P1: Clear legacy lock artifacts immediately

This cleanup gate can leave a legacy .memory-write.lock regular file in place for up to 5 minutes, but the new lockfile.lock(this.config.dbPath, { lockfilePath }) flow expects that path to be created as a lock directory. In upgrade/restart scenarios where the old file was created recently, lock acquisition can fail (ELOCKED/ENOTDIR) and block all writes until the age threshold is crossed. Remove incompatible pre-existing lock artifacts regardless of age (or use the same short stale window as the lock) before acquiring the new lock.
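
The reviewer's suggestion can be expressed as a small decision function. This is a sketch of one possible reading, not the merged code: the names `LockArtifact`, `CleanupDecision`, and `decideLockCleanup` are hypothetical. Only the 5-minute threshold, the age comparison, and the file-versus-directory distinction come from the thread.

```typescript
// Hypothetical sketch of the cleanup rule suggested in the review.
interface LockArtifact {
  isDirectory: boolean; // e.g. the result of stat.isDirectory()
  mtimeMs: number; // last-modified time, in ms since the epoch
}

type CleanupDecision = "remove" | "keep";

const STALE_THRESHOLD_MS = 5 * 60 * 1000; // 5-minute threshold from the PR

function decideLockCleanup(
  artifact: LockArtifact,
  nowMs: number,
  staleThresholdMs: number = STALE_THRESHOLD_MS,
): CleanupDecision {
  // A regular file at the lock path is a legacy artifact: the new flow
  // expects a directory there, so remove it regardless of age.
  if (!artifact.isDirectory) return "remove";
  // A lock directory may belong to a live writer; remove it only once
  // it is older than the stale threshold.
  const ageMs = nowMs - artifact.mtimeMs;
  return ageMs > staleThresholdMs ? "remove" : "keep";
}
```

A caller would stat the lock path, pass the result and `Date.now()` to this function, and delete the path before acquiring the lock whenever the answer is "remove".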


@jlin53882
Contributor Author

CI Failure Analysis

The cli-smoke and storage-and-schema job failures are NOT related to this PR's changes.

Root Cause

These failures are tracked in:

The cli-smoke.mjs mock for store is missing a count() method. PR #582 added a countStore parameter to `retrieveWithRetry()`, but the test mock was never updated.

Verification

I tested on base commit

@rwmjhb
Collaborator

rwmjhb commented Apr 15, 2026

Please resolve the conflict first.

@jlin53882 force-pushed the fix/issue-622-623-lock-clean-v2 branch from f691cff to 18c39e7 on April 15, 2026 09:18
@jlin53882
Contributor Author

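@rwmjhb please take a look.

Status update

Conflicts resolved: the PR has been rebased onto the latest upstream/master, with zero conflicts.

⚠️ packaging-and-workflow CI failure

CI fails in verify-ci-test-manifest.mjs:

Error: unexpected manifest entry: test/hook-dedup-phase1.test.mjs

A preliminary check indicates this is a pre-existing problem introduced when upstream #617 was merged, not something this PR introduced:

I am checking the CI status of upstream #617 and will post an update once that is confirmed.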

@jlin53882
Contributor Author

Potential issue related to PR #617 (Hook Event Deduplication)

Hi @CortexReach/memory-lancedb-pro maintainers,

PR #617 implemented hook event deduplication (a _hookEventDedup Set, max 200 entries, pruned to the newest 100) and registered its new test file in scripts/ci-test-manifest.mjs.
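
A bounded dedup set of this kind can be sketched as follows. This is an illustration only, not the code from PR #617: the class and method names are hypothetical, and it assumes that pruning runs once the set exceeds 200 entries and that "newest" means most recently inserted, relying on the insertion-order iteration of a JavaScript Set.

```typescript
// Hypothetical sketch of a bounded dedup set: at most 200 keys, pruned
// down to the 100 most recently inserted once the cap is exceeded.
class HookEventDedup {
  private seen = new Set<string>();

  constructor(
    private readonly maxEntries = 200,
    private readonly keepEntries = 100,
  ) {}

  // Returns true if the key was already seen (the caller should skip it).
  isDuplicate(key: string): boolean {
    if (this.seen.has(key)) return true;
    this.seen.add(key);
    if (this.seen.size > this.maxEntries) this.prune();
    return false;
  }

  // A Set iterates in insertion order, so the first keys are the oldest.
  private prune(): void {
    const excess = this.seen.size - this.keepEntries;
    const oldest = Array.from(this.seen).slice(0, excess);
    for (const key of oldest) this.seen.delete(key);
  }

  get size(): number {
    return this.seen.size;
  }
}
```

One consequence of this design is that an event whose key has been pruned is treated as new if it arrives again.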

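I noticed that PR #626 adds two new test files:

  • test/lock-recovery.test.mjs
  • test/store-write-queue.test.mjs

Suggestion: confirm that these two test files are also registered in scripts/ci-test-manifest.mjs (a similar packaging-and-workflow CI failure appeared briefly in PR #617 and was fixed before merge).

Also, the hook deduplication logic in PR #617 (index.ts) and the lock recovery in PR #626 (src/store.ts) involve different module-level state. If they interact in any way (for example, whether a dedup skip at the hook layer affects the timing of lock events at the store layer), I suggest confirming that tests cover it.

Thanks!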

@rwmjhb rwmjhb merged commit 2718a8b into CortexReach:master Apr 15, 2026
5 of 7 checks passed
jlin53882 added a commit to jlin53882/memory-lancedb-pro that referenced this pull request Apr 16, 2026
jlin53882 added a commit to jlin53882/memory-lancedb-pro that referenced this pull request Apr 16, 2026
…ortexReach#415)

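Align with upstream/master (including the proactive cleanup from PR CortexReach#626) and keep James's fix for
Issue CortexReach#415:

- Bring in proactive cleanup from PR CortexReach#626 (stale locks older than 5 minutes are removed automatically)
- [Fix CortexReach#415] Conservative retries settings:
  - minTimeout: 1000ms (avoids overly dense retries under high load)
  - maxTimeout: 30000ms (tolerates longer event-loop blocking)
  - stale: 10000ms
- [Fix CortexReach#415] onCompromised flag: when the lock is compromised, do not crash immediately;
  the finally block decides in one place whether to throw the fn() error or compromisedErr
- Add lock-stress-test.mjs: verifies concurrent writes, retry behavior, and stress scenarios
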
PR CortexReach#517: CortexReach/memory-lancedb-pro
Issue CortexReach#415: ECOMPROMISED crash under event-loop pressure
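
The onCompromised handling described in this commit message might look roughly like the sketch below. It rests on assumptions: `AcquireLock` is a hypothetical stand-in for the real lock call, `runLocked` is a made-up name, and the precedence rule (an error from fn() wins over the compromised error) is a guess at how the finally block chooses between the two, not something the commit message states.

```typescript
type Release = () => Promise<void>;
// Hypothetical lock API: invokes onCompromised if the lock is lost while held.
type AcquireLock = (onCompromised: (err: Error) => void) => Promise<Release>;

async function runLocked<T>(
  acquireLock: AcquireLock,
  fn: () => Promise<T>,
): Promise<T> {
  const state: { compromisedErr: Error | null } = { compromisedErr: null };
  // Record the compromise in a flag instead of throwing from the
  // callback, where the error could not be caught by the caller.
  const release = await acquireLock((err) => {
    state.compromisedErr = err;
  });

  let fnFailed = false;
  try {
    return await fn();
  } catch (err) {
    fnFailed = true;
    throw err; // assumption: an error from fn() takes precedence
  } finally {
    if (state.compromisedErr === null) {
      await release();
    } else if (!fnFailed) {
      // fn() succeeded but the lock was lost along the way: report it.
      // Assumption: a compromised lock is not released again.
      throw state.compromisedErr;
    }
  }
}
```

Throwing from the finally block deliberately replaces the successful return value, so a write that completed under a compromised lock is still reported to the caller as an error.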
jlin53882 added a commit to jlin53882/memory-lancedb-pro that referenced this pull request Apr 16, 2026
- PR CortexReach#517: onCompromised callback for graceful ECOMPROMISED handling
- PR CortexReach#517: retries 10, maxTimeout 30000ms, minTimeout 1000ms
- PR CortexReach#626: proactive stale lock cleanup (>5min)
- PR CortexReach#626: updateQueue for intra-instance write serialization

Issue CortexReach#632: lock contention in multi-agent environment