Description
Problem
Full sync for ~1.4k files on TigrisFS-mounted cloud storage takes 52 minutes, making the system unusable for cloud deployments.
Root Cause
The current `scan_directory()` method (line 1133 in `sync_service.py`) uses Python's recursive `aiofiles.os.scandir()`, which makes thousands of network round trips:
- For each directory: a network call to list entries
- For each file: an `entry.stat()` call to get metadata
- With nested directories + 1.4k files, this adds up to thousands of network operations
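For context, the slow path looks roughly like this (a simplified sketch of the recursive-scandir pattern, not the exact code in `sync_service.py`):

```python
import os
from pathlib import Path
from typing import AsyncIterator, Tuple

import aiofiles.os  # async wrappers around os.scandir and friends


async def scan_recursive(directory: Path) -> AsyncIterator[Tuple[str, os.stat_result]]:
    """One network round trip per directory listing, plus one per file."""
    entries = await aiofiles.os.scandir(directory)  # network call: list entries
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            async for item in scan_recursive(Path(entry.path)):
                yield item
        elif entry.is_file(follow_symlinks=False):
            # On TigrisFS each stat() is another network round trip
            yield entry.path, entry.stat()
```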
Evidence from Logfire traces:
- Tenant `0a20eb58-970f-ab05-ff49-25a9cdb2179c` with ~1.4k files
- Full scan took 31.4 seconds for just 379 files (claude-projects)
- Extrapolated to 1.4k files = 52+ minutes
- Meanwhile, incremental scans complete in 200-600ms using `find -newermt`
Solution
Rewrite `scan_directory()` to use a server-side `find` command with `-printf` for all scans (both full and incremental).
Unified Implementation
A single code path using `find`:
```python
async def scan_directory(
    self,
    directory: Path,
    since_timestamp: Optional[float] = None,
) -> AsyncIterator[Tuple[str, os.stat_result]]:
    """Scan directory using the find command (optimized for network filesystems).

    Args:
        directory: Directory to scan
        since_timestamp: Optional - only return files modified after this timestamp

    Yields:
        Tuples of (absolute_file_path, stat_info)
    """
    # Build find command with -printf to get path + mtime + size in one operation.
    # The -newermt test must come before the -printf action: find evaluates its
    # expression left to right, so a -newermt appended after -printf would
    # filter nothing.
    cmd = f'find "{directory}" -type f'
    if since_timestamp is not None:
        since_date = datetime.fromtimestamp(since_timestamp).strftime("%Y-%m-%d %H:%M:%S")
        cmd += f' -newermt "{since_date}"'
    cmd += ' -printf "%p\\t%T@\\t%s\\n"'
    # Execute find, parse results, apply .bmignore, yield (path, stat_info) tuples
```

Key optimization: using `find -printf "%p\t%T@\t%s\n"` returns path, mtime, and size in one network operation, eliminating per-file `stat()` calls.
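A minimal standalone sketch of the execution and parsing step (the `run_find` name and the zero-filled `os.stat_result` fields are illustrative assumptions, not the final implementation; `.bmignore` filtering is left as a hook):

```python
import asyncio
import os
from typing import AsyncIterator, Tuple


async def run_find(cmd: str) -> AsyncIterator[Tuple[str, os.stat_result]]:
    """Run a find -printf command and yield (path, stat_info) tuples."""
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    # Each line looks like: /mount/project/notes/foo.md<TAB>1718000000.123<TAB>2048
    # (assumes no tabs or newlines in file names)
    for line in stdout.decode().splitlines():
        path, mtime, size = line.split("\t")
        # Synthesize a stat result; fields other than st_size and st_mtime are
        # zeroed, and the float mtime is truncated to whole seconds, which is
        # fine if callers only compare mtime/size.
        stat_info = os.stat_result(
            (0, 0, 0, 0, 0, 0, int(size), 0, int(float(mtime)), 0)
        )
        yield path, stat_info
```

With this in place, `scan_directory()` reduces to building the command (as above), delegating to the subprocess runner, and filtering the results.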
Code Consolidation
Remove these methods (no longer needed):
- `_scan_directory_full()` (line 1116)
- `_scan_directory_modified_since()` (line 1058)
- `_quick_count_files()` (line 1022)
Update callers:
- `scan()` method: Use `scan_directory(directory)` for full scans
- `scan()` method: Use `scan_directory(directory, since_timestamp=watermark)` for incremental scans
- File counting: Use a direct `find "{directory}" -type f | wc -l` subprocess (see the sketch below)
Expected Performance
- Full sync: 52 minutes → ~2-3 minutes (using the same `find`-based mechanism as current incremental scans)
- Incremental sync: no change (already fast at 200-600ms)
- Single code path: easier to maintain, test, and debug
Why `find` Over Alternatives (e.g., jwalk, fd-find)
On network filesystems like TigrisFS, network latency is the bottleneck, not traversal speed:
- `find` with `-printf`: one subprocess; the kernel batches operations into ~1 network operation per directory level
- Rust tools (jwalk): still make 1.4k individual `stat()` calls over the network = 1.4k × network_latency
- `find` is ubiquitous: works everywhere, no additional dependencies

The `find` command leverages kernel-level optimizations for network filesystems, making it the right fit for this use case.
Implementation Checklist
- Rewrite `scan_directory()` to use `find` with an optional `-newermt` filter
- Parse `find -printf` output to create `os.stat_result` objects
- Apply `.bmignore` pattern filtering to results (see the sketch after this list)
- Delete obsolete helper methods (`_scan_directory_full`, `_scan_directory_modified_since`, `_quick_count_files`)
- Update `scan()` method to use the unified `scan_directory()`
- Update file count logic in `scan()` to use `find | wc -l`
- Add tests for the new implementation
- Validate with tenant 0a20eb58's projects (~1.4k files)
- Verify `.bmignore` patterns work correctly
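One way the `.bmignore` filtering could look, assuming gitignore-style glob patterns (the `load_ignore_patterns` and `is_ignored` helpers are hypothetical; the real pattern semantics should come from basic-memory's existing `.bmignore` handling):

```python
import fnmatch
from pathlib import Path
from typing import List


def load_ignore_patterns(directory: Path) -> List[str]:
    """Read glob patterns from .bmignore, skipping blanks and comments."""
    ignore_file = directory / ".bmignore"
    if not ignore_file.exists():
        return []
    lines = ignore_file.read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]


def is_ignored(path: str, directory: Path, patterns: List[str]) -> bool:
    """Match the path (relative to the scan root) against each pattern."""
    rel = str(Path(path).relative_to(directory))
    return any(
        fnmatch.fnmatch(rel, pat) or fnmatch.fnmatch(Path(rel).name, pat)
        for pat in patterns
    )
```

Filtering in Python after the single `find` call keeps the network cost unchanged: the patterns are applied to the already-fetched path list.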
Files Modified
`src/basic_memory/sync/sync_service.py`
References
- Current incremental scan already uses `find -newermt` successfully (line 1058)
- Performance proven: 200-600ms for incremental scans vs 31+ seconds for Python scandir
- Logfire traces: tenant `0a20eb58-970f-ab05-ff49-25a9cdb2179c`