Skip to content

fix: prevent CLI session message loss on concurrent saves#1854

Merged
MervinPraison merged 2 commits into
mainfrom
cursor/critical-bug-investigation-7b44
Jun 12, 2026
Merged

fix: prevent CLI session message loss on concurrent saves#1854
MervinPraison merged 2 commits into
mainfrom
cursor/critical-bug-investigation-7b44

Conversation

@cursor

@cursor cursor Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Bug and impact

UnifiedSessionStore.save() wrote the full in-memory session to disk without reloading first. When two CLI/TUI processes (or threads) used the same session file—e.g. TUI plus --interactive, or two terminals resuming the same session—the last writer could overwrite newer chat messages from the other process.

Severity: Data loss of user chat history.

Root cause

save() acquired a file lock but performed a blind overwrite of the JSON file. Process A could load messages 1–10, process B could append message 11 and save, then process A could save its stale copy and drop message 11.

This is the same class of bug fixed for DefaultSessionStore in #1781/#1790, but it remained in the CLI layer after #1837 added cross-platform locking without read-modify-write.

Fix

  • Reload session JSON under an exclusive lock before writing
  • Merge messages from disk and incoming session (dedupe by role/content/timestamp)
  • Track file mtimes so load() refreshes cache when another process updates the file

Validation

  • pytest tests/unit/cli/test_unified_session.py — 22 passed (including new stale-save and concurrent-save tests)
Open in Web View Automation 

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced session persistence to reliably handle concurrent updates across multiple processes
    • Improved message deduplication and metadata handling during concurrent session access
    • Prevents message loss when sessions are updated simultaneously from multiple sources
  • Tests

    • Added unit tests for concurrent session operations

UnifiedSessionStore.save() overwrote the session file without reloading
disk state first. Two CLI/TUI processes sharing a session could lose chat
history when one saved from a stale in-memory copy.

- Reload and merge messages under an exclusive file lock before writing
- Invalidate in-process cache when the on-disk file is newer
- Add regression tests for stale-save and concurrent writer scenarios

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
@MervinPraison

Copy link
Copy Markdown
Owner

@coderabbitai review

@MervinPraison

Copy link
Copy Markdown
Owner

/review

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 120438a0-3865-438c-a47e-95d2398e89d3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

UnifiedSessionStore now ensures multi-process safety by maintaining a re-entrant lock and in-memory cache with per-session mtime tracking. When saving, existing on-disk sessions are merged with incoming changes by deduplicating messages and resolving counters. When loading, cache freshness is validated via mtime; stale cache is discarded. Session deletion evicts cache under lock. Two new tests verify stale-copy and parallel-writer scenarios.

Changes

Concurrent-safe session store with merge conflict handling

Layer / File(s) Summary
Threading support and merge conflict resolution
src/praisonai/praisonai/cli/session/unified.py
Adds threading import for synchronization. Introduces re-entrant lock, mtime-tracking cache dict, and _merge_sessions() helper that deduplicates messages by role/content/timestamp and resolves token/cost/request counters via max.
Exclusive-lock and JSON I/O helpers
src/praisonai/praisonai/cli/session/unified.py
Centralizes exclusive-lock acquisition/release and JSON read/write operations with truncation, flushing, and fsync to ensure atomicity and durability.
Concurrent-safe save and load with cache validation
src/praisonai/praisonai/cli/session/unified.py
save() uses exclusive-lock helpers and _merge_sessions() to combine concurrent on-disk and incoming sessions, then updates cache/mtimes under lock. load() validates cache freshness via _is_cache_fresh() mtime comparison; returns cached only if fresh; otherwise reads JSON under lock and refreshes cache.
Thread-safe session deletion
src/praisonai/praisonai/cli/session/unified.py
Updates delete() to evict session and mtime cache entries under the re-entrant lock.
Concurrency behavior tests
src/praisonai/tests/unit/cli/test_unified_session.py
Adds ThreadPoolExecutor import. Two new tests: stale-copy preservation (verifies concurrent saves merge messages without loss) and concurrent writers (verifies parallel writers all save distinct messages).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • MervinPraison/PraisonAI#1837: Both PRs modify UnifiedSessionStore file locking in save/load paths with exclusive locks and centralized lock/I/O helpers.
  • MervinPraison/PraisonAI#1759: Both PRs add mtime-based cache staleness detection to the session read path to avoid returning stale in-memory cached history.
  • MervinPraison/PraisonAI#1745: Both PRs prevent concurrent-write message loss via locked read-modify-write with merge/deduplication semantics in session stores.

Suggested reviewers

  • MervinPraison

Poem

🐰 Locks and merges, caches bright,
Threads no longer race through night,
Messages preserved, not lost in time,
Sessions sync in perfect rhyme! 🔐✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.47% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main fix: preventing message loss during concurrent session saves in the CLI layer.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cursor/critical-bug-investigation-7b44

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@MervinPraison

Copy link
Copy Markdown
Owner

@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings.

Review areas:

  1. Bloat check: Are changes minimal and focused? Any unnecessary code or scope creep?
  2. Security: Any hardcoded secrets, unsafe eval/exec, missing input validation?
  3. Performance: Any module-level heavy imports? Hot-path regressions?
  4. Tests: Are tests included? Do they cover the changes adequately?
  5. Backward compat: Any public API changes without deprecation?
  6. Code quality: DRY violations, naming conventions, error handling?
  7. Address reviewer feedback: If Qodo, Coderabbit, or Gemini flagged valid issues, include them in your review
  8. Suggest specific improvements with code examples where possible

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/praisonai/praisonai/cli/session/unified.py`:
- Around line 356-360: The deletion block around path.exists() / path.unlink()
is susceptible to a race where unlink() raises FileNotFoundError; modify the
code in the session cleanup logic (the branch that currently checks
path.exists() then calls path.unlink()) to either call
path.unlink(missing_ok=True) or wrap path.unlink() in a try/except
FileNotFoundError and ignore it, then acquire self._lock and pop from
self._cache and self._cache_mtimes as before; ensure the same unique symbols are
used (path.unlink, self._lock, self._cache.pop, self._cache_mtimes.pop) so the
change is localized and thread/process-safe.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 857221f2-d06e-486a-aaf3-402b846f416c

📥 Commits

Reviewing files that changed from the base of the PR and between a9f4bd5 and 2d7f67a.

📒 Files selected for processing (2)
  • src/praisonai/praisonai/cli/session/unified.py
  • src/praisonai/tests/unit/cli/test_unified_session.py

Comment on lines 356 to +360
if path.exists():
path.unlink()
self._cache.pop(session_id, None)
with self._lock:
self._cache.pop(session_id, None)
self._cache_mtimes.pop(session_id, None)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Concurrent delete could raise FileNotFoundError.

Between path.exists() check and path.unlink(), another process could delete the file, causing unlink() to raise. Consider using missing_ok=True (Python 3.8+) for robustness.

Proposed fix
         if path.exists():
-            path.unlink()
+            path.unlink(missing_ok=True)
             with self._lock:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if path.exists():
path.unlink()
self._cache.pop(session_id, None)
with self._lock:
self._cache.pop(session_id, None)
self._cache_mtimes.pop(session_id, None)
if path.exists():
path.unlink(missing_ok=True)
with self._lock:
self._cache.pop(session_id, None)
self._cache_mtimes.pop(session_id, None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/praisonai/praisonai/cli/session/unified.py` around lines 356 - 360, The
deletion block around path.exists() / path.unlink() is susceptible to a race
where unlink() raises FileNotFoundError; modify the code in the session cleanup
logic (the branch that currently checks path.exists() then calls path.unlink())
to either call path.unlink(missing_ok=True) or wrap path.unlink() in a
try/except FileNotFoundError and ignore it, then acquire self._lock and pop from
self._cache and self._cache_mtimes as before; ensure the same unique symbols are
used (path.unlink, self._lock, self._cache.pop, self._cache_mtimes.pop) so the
change is localized and thread/process-safe.

@MervinPraison

Copy link
Copy Markdown
Owner

@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding.

Phase 1: Review per AGENTS.md

  1. Protocol-driven: check heavy implementations vs core SDK
  2. Backward compatible: ensure zero feature regressions
  3. Performance: no hot-path regressions

Phase 2: FIX Valid Issues
4. For any VALID bugs or architectural flaws found by Gemini, CodeRabbit, Qodo, Copilot, or any other reviewer: implement the fix
5. Push all code fixes directly to THIS branch (do NOT create a new PR)
6. Comment a summary of exact files modified and what you skipped

Phase 3: Final Verdict
7. If all issues are resolved, approve the PR / close the Issue
8. If blocking issues remain, request changes / leave clear action items

@praisonai-triage-agent

praisonai-triage-agent Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Claude finished @MervinPraison's task —— View job


Final Architecture Review ✅

Phase 1: Architecture Review per AGENTS.md

  • Read AGENTS.md guidelines
  • Review changed files for protocol compliance
  • Check backward compatibility
  • Verify no performance regressions

Phase 2: Fix Valid Issues

  • Fix CodeRabbit's FileNotFoundError race condition
  • Address any other valid architectural issues
  • Run tests to validate changes
  • Push fixes to branch

Phase 3: Final Verdict

  • Provide approval or request changes
  • Document what was fixed and what was skipped

APPROVED - Architecture Review Complete

Architecture Compliance (AGENTS.md)

  • ✅ Protocol-driven: Lives correctly in wrapper layer (praisonai) not core SDK
  • ✅ Performance: No hot-path regressions - file I/O operations properly isolated
  • ✅ Backward compatible: No public API changes, maintains existing interface
  • ✅ Multi-agent safe: Thread-safe design with proper locking mechanisms
  • ✅ Dependencies: No new dependencies introduced

Code Quality Assessment

  • ✅ Concurrency handling: Excellent read-modify-write pattern with exclusive file locks
  • ✅ Error handling: Proper exception handling and logging throughout
  • ✅ Test coverage: Comprehensive tests including concurrent scenarios
  • ✅ Message deduplication: Smart merge logic prevents data loss

Issues Addressed

Fixed (Pushed to branch):

  • FileNotFoundError race condition (unified.py:357) - Added missing_ok=True to path.unlink() to handle concurrent deletion gracefully

Validated:

  • Core functionality works correctly with imports and basic operations
  • Delete method now handles race conditions properly
  • No other architectural concerns identified

Summary

This PR implements a robust solution for the CLI session message loss bug. The implementation follows PraisonAI architecture guidelines and includes proper concurrent-safe file handling. The single identified race condition has been fixed and pushed to the branch.

Final Recommendation: ✅ APPROVE and MERGE

--- | Branch: cursor/critical-bug-investigation-7b44

Use missing_ok=True to handle concurrent file deletion gracefully.
Fixes potential race where another process deletes file between
exists() check and unlink() call.

Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
@MervinPraison MervinPraison marked this pull request as ready for review June 12, 2026 09:55
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@MervinPraison MervinPraison merged commit 7427367 into main Jun 12, 2026
5 of 6 checks passed
@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds read-modify-write semantics to UnifiedSessionStore.save(): it reloads the on-disk session under an exclusive file lock, merges messages by deduplicating on (role, content, timestamp), and uses mtime-based cache invalidation in load() to detect concurrent updates. The approach correctly prevents message loss for the primary case described.

  • Message merge: concurrent writers now each read the latest disk state before writing, so chat history from other processes is preserved; deduplication relies on exact timestamp equality, meaning the same user message sent at different milliseconds is never collapsed.
  • Token/cost stats: _merge_sessions takes max() of cumulative counters (total_input_tokens, total_output_tokens, total_cost, request_count), which under-counts when both processes accumulate against the same baseline — this is a residual correctness gap not addressed by the PR.
  • Read path regression: load() was changed to open files with "r+b" (read-write) rather than "rb" (read-only), silently turning a permission-denied error into a missing-session result for read-only files.

Confidence Score: 3/5

The core message-preservation fix is sound, but two issues in the changed code warrant attention before merging: a read path that now requires write file permissions (silently surfacing as a missing session), and cumulative stats that are silently under-counted when two processes work from the same baseline.

The read-modify-write locking strategy and message deduplication logic are correct for their primary goal. However, load() now opens session files in read-write mode when only a read is needed — a permission-denied failure is swallowed as None, making a valid session appear missing. Additionally, _merge_sessions uses max() for additive counters (total_input_tokens, total_cost, request_count), meaning concurrent use will produce lower-than-actual totals for billing and usage tracking. These two issues affect observable correctness on the changed paths and should be resolved before the PR lands.

src/praisonai/praisonai/cli/session/unified.py — specifically the load() open-mode regression and the stat-after-unlock window in save(); the test file is clean.

Important Files Changed

Filename Overview
src/praisonai/praisonai/cli/session/unified.py Adds read-modify-write under file lock plus mtime-based cache invalidation; two logic issues remain: load() opens files r+b (needs write permission for a read-only operation) and _merge_sessions uses max() for additive token/cost counters, causing under-counting in concurrent scenarios.
src/praisonai/tests/unit/cli/test_unified_session.py Adds two well-designed concurrency tests: a stale-save scenario and an 8-writer thread-pool test; tests cover message preservation but do not exercise the token/cost counting paths under concurrent saves.

Sequence Diagram

sequenceDiagram
    participant A as Process A
    participant F as Session File
    participant B as Process B

    Note over A,B: Concurrent save scenario (new behaviour)

    A->>F: open("r+b") + LOCK_EX
    B->>F: open("r+b") [blocks on lock]
    A->>F: read existing JSON
    A->>A: merge(on_disk, incoming_A)
    A->>F: write merged JSON + fsync
    A->>F: release lock
    Note over A: stat().st_mtime → cache_A
    B->>F: acquire LOCK_EX
    B->>F: read merged JSON (includes A's messages)
    B->>B: merge(on_disk, incoming_B)
    B->>F: write merged JSON + fsync
    B->>F: release lock
    Note over B: stat().st_mtime → cache_B

    Note over A,B: load() cache-freshness check
    A->>F: stat().st_mtime
    alt "mtime <= cached_mtime"
        A->>A: return in-memory cache
    else "mtime > cached_mtime"
        A->>F: open("r+b") + LOCK_EX
        A->>F: read latest JSON
        A->>F: release lock
        A->>A: update cache + mtime
    end
Loading

Reviews (1): Last reviewed commit: "fix: prevent FileNotFoundError race cond..." | Re-trigger Greptile

Comment on lines +162 to +169
merged.total_input_tokens = max(
on_disk.total_input_tokens, incoming.total_input_tokens
)
merged.total_output_tokens = max(
on_disk.total_output_tokens, incoming.total_output_tokens
)
merged.total_cost = max(on_disk.total_cost, incoming.total_cost)
merged.request_count = max(on_disk.request_count, incoming.request_count)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 total_input_tokens, total_output_tokens, total_cost, and request_count are cumulative accumulators, not instantaneous values. Using max() silently discards the contribution from whichever process has the smaller running total. For example, if both processes started from a baseline of 100 tokens and Process A added 50 while Process B added 70, max(150, 170) = 170 drops Process A's 50-token contribution entirely. The correct total would require tracking each process's delta from the shared baseline, which the current data model doesn't support. At minimum this should be noted as a known limitation; if exact accounting matters, consider storing per-process deltas or using an append-only log.

Suggested change
merged.total_input_tokens = max(
on_disk.total_input_tokens, incoming.total_input_tokens
)
merged.total_output_tokens = max(
on_disk.total_output_tokens, incoming.total_output_tokens
)
merged.total_cost = max(on_disk.total_cost, incoming.total_cost)
merged.request_count = max(on_disk.request_count, incoming.request_count)
# NOTE: cumulative counters cannot be merged correctly without knowing the
# shared baseline each process started from; max() avoids a total reset but
# will under-count when both processes accumulate from the same starting
# point. Tracked as a known limitation of the current data model.
merged.total_input_tokens = max(
on_disk.total_input_tokens, incoming.total_input_tokens
)
merged.total_output_tokens = max(
on_disk.total_output_tokens, incoming.total_output_tokens
)
merged.total_cost = max(on_disk.total_cost, incoming.total_cost)
merged.request_count = max(on_disk.request_count, incoming.request_count)

Comment on lines +300 to +305
with open(path, "r+b") as f:
self._acquire_exclusive_lock(f)
try:
data = self._read_json_locked(f)
finally:
self._release_exclusive_lock(f)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 load() now opens the session file with "r+b" (read-write) mode even though it never writes anything. This silently regresses read-only access: if a session file has been chmod'd to 0444 (or any other state where the process has read but not write permission), open(path, "r+b") raises a PermissionError that is caught and turned into None, making a perfectly valid session appear missing. The original code correctly used "rb" for reads. A shared (LOCK_SH) lock is also sufficient here and avoids serialising concurrent readers.

Suggested change
with open(path, "r+b") as f:
self._acquire_exclusive_lock(f)
try:
data = self._read_json_locked(f)
finally:
self._release_exclusive_lock(f)
with open(path, "rb") as f:
if sys.platform == "win32":
import msvcrt
f.seek(0)
msvcrt.locking(f.fileno(), msvcrt.LK_RLCK, 1)
try:
data = self._read_json_locked(f)
finally:
f.seek(0)
msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
elif _HAS_FCNTL:
fcntl.flock(f.fileno(), fcntl.LOCK_SH)
try:
data = self._read_json_locked(f)
finally:
fcntl.flock(f.fileno(), fcntl.LOCK_UN)
else:
data = self._read_json_locked(f)

Comment on lines +250 to +257
try:
mtime = path.stat().st_mtime
except OSError:
mtime = datetime.now().timestamp()

with self._lock:
self._cache[session.session_id] = to_save
self._cache_mtimes[session.session_id] = mtime

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 mtime race between write and stat

The file lock is released when the with open(...) block exits (line 248), but path.stat().st_mtime is read after that point. Another process could write the file in the gap, giving us their later mtime. We then store (our merged content, their mtime) in the cache. On the next load(), _is_cache_fresh compares current_mtime <= cached_mtime and may return True, serving our stale in-memory copy instead of re-reading from disk. The caller would see incomplete message history until the next invalidating event. Reading st_mtime while still holding the file lock (i.e., before releasing it) would close the window.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants