fix: prevent CLI session message loss on concurrent saves#1854
Conversation
UnifiedSessionStore.save() overwrote the session file without reloading disk state first. Two CLI/TUI processes sharing a session could lose chat history when one saved from a stale in-memory copy. - Reload and merge messages under an exclusive file lock before writing - Invalidate in-process cache when the on-disk file is newer - Add regression tests for stale-save and concurrent writer scenarios Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
|
@coderabbitai review |
|
/review |
✅ Action performedReview finished.
|
|
Important Review skippedBot user detected. To trigger a single review, invoke the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthrough
ChangesConcurrent-safe session store with merge conflict handling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings. Review areas:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/praisonai/praisonai/cli/session/unified.py`:
- Around line 356-360: The deletion block around path.exists() / path.unlink()
is susceptible to a race where unlink() raises FileNotFoundError; modify the
code in the session cleanup logic (the branch that currently checks
path.exists() then calls path.unlink()) to either call
path.unlink(missing_ok=True) or wrap path.unlink() in a try/except
FileNotFoundError and ignore it, then acquire self._lock and pop from
self._cache and self._cache_mtimes as before; ensure the same unique symbols are
used (path.unlink, self._lock, self._cache.pop, self._cache_mtimes.pop) so the
change is localized and thread/process-safe.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 857221f2-d06e-486a-aaf3-402b846f416c
📒 Files selected for processing (2)
src/praisonai/praisonai/cli/session/unified.pysrc/praisonai/tests/unit/cli/test_unified_session.py
| if path.exists(): | ||
| path.unlink() | ||
| self._cache.pop(session_id, None) | ||
| with self._lock: | ||
| self._cache.pop(session_id, None) | ||
| self._cache_mtimes.pop(session_id, None) |
There was a problem hiding this comment.
Concurrent delete could raise FileNotFoundError.
Between path.exists() check and path.unlink(), another process could delete the file, causing unlink() to raise. Consider using missing_ok=True (Python 3.8+) for robustness.
Proposed fix
if path.exists():
- path.unlink()
+ path.unlink(missing_ok=True)
with self._lock:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if path.exists(): | |
| path.unlink() | |
| self._cache.pop(session_id, None) | |
| with self._lock: | |
| self._cache.pop(session_id, None) | |
| self._cache_mtimes.pop(session_id, None) | |
| if path.exists(): | |
| path.unlink(missing_ok=True) | |
| with self._lock: | |
| self._cache.pop(session_id, None) | |
| self._cache_mtimes.pop(session_id, None) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/praisonai/praisonai/cli/session/unified.py` around lines 356 - 360, The
deletion block around path.exists() / path.unlink() is susceptible to a race
where unlink() raises FileNotFoundError; modify the code in the session cleanup
logic (the branch that currently checks path.exists() then calls path.unlink())
to either call path.unlink(missing_ok=True) or wrap path.unlink() in a
try/except FileNotFoundError and ignore it, then acquire self._lock and pop from
self._cache and self._cache_mtimes as before; ensure the same unique symbols are
used (path.unlink, self._lock, self._cache.pop, self._cache_mtimes.pop) so the
change is localized and thread/process-safe.
|
@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding. Phase 1: Review per AGENTS.md
Phase 2: FIX Valid Issues Phase 3: Final Verdict |
|
Claude finished @MervinPraison's task —— View job Final Architecture Review ✅Phase 1: Architecture Review per AGENTS.md
Phase 2: Fix Valid Issues
Phase 3: Final Verdict
✅ APPROVED - Architecture Review CompleteArchitecture Compliance (AGENTS.md)
Code Quality Assessment
Issues AddressedFixed (Pushed to branch):
Validated:
SummaryThis PR implements a robust solution for the CLI session message loss bug. The implementation follows PraisonAI architecture guidelines and includes proper concurrent-safe file handling. The single identified race condition has been fixed and pushed to the branch. Final Recommendation: ✅ APPROVE and MERGE |
Use missing_ok=True to handle concurrent file deletion gracefully. Fixes potential race where another process deletes file between exists() check and unlink() call. Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Greptile SummaryThis PR adds read-modify-write semantics to
Confidence Score: 3/5The core message-preservation fix is sound, but two issues in the changed code warrant attention before merging: a read path that now requires write file permissions (silently surfacing as a missing session), and cumulative stats that are silently under-counted when two processes work from the same baseline. The read-modify-write locking strategy and message deduplication logic are correct for their primary goal. However, src/praisonai/praisonai/cli/session/unified.py — specifically the Important Files Changed
Sequence DiagramsequenceDiagram
participant A as Process A
participant F as Session File
participant B as Process B
Note over A,B: Concurrent save scenario (new behaviour)
A->>F: open("r+b") + LOCK_EX
B->>F: open("r+b") [blocks on lock]
A->>F: read existing JSON
A->>A: merge(on_disk, incoming_A)
A->>F: write merged JSON + fsync
A->>F: release lock
Note over A: stat().st_mtime → cache_A
B->>F: acquire LOCK_EX
B->>F: read merged JSON (includes A's messages)
B->>B: merge(on_disk, incoming_B)
B->>F: write merged JSON + fsync
B->>F: release lock
Note over B: stat().st_mtime → cache_B
Note over A,B: load() cache-freshness check
A->>F: stat().st_mtime
alt "mtime <= cached_mtime"
A->>A: return in-memory cache
else "mtime > cached_mtime"
A->>F: open("r+b") + LOCK_EX
A->>F: read latest JSON
A->>F: release lock
A->>A: update cache + mtime
end
Reviews (1): Last reviewed commit: "fix: prevent FileNotFoundError race cond..." | Re-trigger Greptile |
| merged.total_input_tokens = max( | ||
| on_disk.total_input_tokens, incoming.total_input_tokens | ||
| ) | ||
| merged.total_output_tokens = max( | ||
| on_disk.total_output_tokens, incoming.total_output_tokens | ||
| ) | ||
| merged.total_cost = max(on_disk.total_cost, incoming.total_cost) | ||
| merged.request_count = max(on_disk.request_count, incoming.request_count) |
There was a problem hiding this comment.
total_input_tokens, total_output_tokens, total_cost, and request_count are cumulative accumulators, not instantaneous values. Using max() silently discards the contribution from whichever process has the smaller running total. For example, if both processes started from a baseline of 100 tokens and Process A added 50 while Process B added 70, max(150, 170) = 170 drops Process A's 50-token contribution entirely. The correct total would require tracking each process's delta from the shared baseline, which the current data model doesn't support. At minimum this should be noted as a known limitation; if exact accounting matters, consider storing per-process deltas or using an append-only log.
| merged.total_input_tokens = max( | |
| on_disk.total_input_tokens, incoming.total_input_tokens | |
| ) | |
| merged.total_output_tokens = max( | |
| on_disk.total_output_tokens, incoming.total_output_tokens | |
| ) | |
| merged.total_cost = max(on_disk.total_cost, incoming.total_cost) | |
| merged.request_count = max(on_disk.request_count, incoming.request_count) | |
| # NOTE: cumulative counters cannot be merged correctly without knowing the | |
| # shared baseline each process started from; max() avoids a total reset but | |
| # will under-count when both processes accumulate from the same starting | |
| # point. Tracked as a known limitation of the current data model. | |
| merged.total_input_tokens = max( | |
| on_disk.total_input_tokens, incoming.total_input_tokens | |
| ) | |
| merged.total_output_tokens = max( | |
| on_disk.total_output_tokens, incoming.total_output_tokens | |
| ) | |
| merged.total_cost = max(on_disk.total_cost, incoming.total_cost) | |
| merged.request_count = max(on_disk.request_count, incoming.request_count) |
| with open(path, "r+b") as f: | ||
| self._acquire_exclusive_lock(f) | ||
| try: | ||
| data = self._read_json_locked(f) | ||
| finally: | ||
| self._release_exclusive_lock(f) |
There was a problem hiding this comment.
load() now opens the session file with "r+b" (read-write) mode even though it never writes anything. This silently regresses read-only access: if a session file has been chmod'd to 0444 (or any other state where the process has read but not write permission), open(path, "r+b") raises a PermissionError that is caught and turned into None, making a perfectly valid session appear missing. The original code correctly used "rb" for reads. A shared (LOCK_SH) lock is also sufficient here and avoids serialising concurrent readers.
| with open(path, "r+b") as f: | |
| self._acquire_exclusive_lock(f) | |
| try: | |
| data = self._read_json_locked(f) | |
| finally: | |
| self._release_exclusive_lock(f) | |
| with open(path, "rb") as f: | |
| if sys.platform == "win32": | |
| import msvcrt | |
| f.seek(0) | |
| msvcrt.locking(f.fileno(), msvcrt.LK_RLCK, 1) | |
| try: | |
| data = self._read_json_locked(f) | |
| finally: | |
| f.seek(0) | |
| msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1) | |
| elif _HAS_FCNTL: | |
| fcntl.flock(f.fileno(), fcntl.LOCK_SH) | |
| try: | |
| data = self._read_json_locked(f) | |
| finally: | |
| fcntl.flock(f.fileno(), fcntl.LOCK_UN) | |
| else: | |
| data = self._read_json_locked(f) |
| try: | ||
| mtime = path.stat().st_mtime | ||
| except OSError: | ||
| mtime = datetime.now().timestamp() | ||
|
|
||
| with self._lock: | ||
| self._cache[session.session_id] = to_save | ||
| self._cache_mtimes[session.session_id] = mtime |
There was a problem hiding this comment.
mtime race between write and stat
The file lock is released when the with open(...) block exits (line 248), but path.stat().st_mtime is read after that point. Another process could write the file in the gap, giving us their later mtime. We then store (our merged content, their mtime) in the cache. On the next load(), _is_cache_fresh compares current_mtime <= cached_mtime and may return True, serving our stale in-memory copy instead of re-reading from disk. The caller would see incomplete message history until the next invalidating event. Reading st_mtime while still holding the file lock (i.e., before releasing it) would close the window.
Bug and impact
UnifiedSessionStore.save()wrote the full in-memory session to disk without reloading first. When two CLI/TUI processes (or threads) used the same session file—e.g. TUI plus--interactive, or two terminals resuming the same session—the last writer could overwrite newer chat messages from the other process.Severity: Data loss of user chat history.
Root cause
save()acquired a file lock but performed a blind overwrite of the JSON file. Process A could load messages 1–10, process B could append message 11 and save, then process A could save its stale copy and drop message 11.This is the same class of bug fixed for
DefaultSessionStorein #1781/#1790, but it remained in the CLI layer after #1837 added cross-platform locking without read-modify-write.Fix
load()refreshes cache when another process updates the fileValidation
pytest tests/unit/cli/test_unified_session.py— 22 passed (including new stale-save and concurrent-save tests)Summary by CodeRabbit
Bug Fixes
Tests