Optimize TigrisFS sync performance using find command #398

@phernandez

Problem

Full sync for ~1.4k files on TigrisFS-mounted cloud storage takes 52 minutes, making the system unusable for cloud deployments.

Root Cause

The current scan_directory() method (line 1133 in sync_service.py) uses Python's recursive aiofiles.os.scandir(), which makes thousands of network round trips:

  • For each directory: network call to list entries
  • For each file: entry.stat() to get metadata
  • Nested directories plus 1.4k files add up to thousands of network operations

Evidence from Logfire traces:

  • Tenant 0a20eb58-970f-ab05-ff49-25a9cdb2179c with ~1.4k files
  • Full scan took 31.4 seconds for just 379 files (claude-projects)
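  • Extrapolated to 1.4k files = 52+ minutes (note: scaling 31.4 seconds linearly from 379 to 1.4k files gives only about 2 minutes of scan time, so this figure presumably includes costs beyond the directory scan itself)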
  • Meanwhile, incremental scans complete in 200-600ms using find -newermt

Solution

Rewrite scan_directory() to use a server-side find command with -printf for all scans (both full and incremental).

Unified Implementation

Single code path using find:

import os
from datetime import datetime
from pathlib import Path
from typing import AsyncIterator, Optional, Tuple

async def scan_directory(
    self,
    directory: Path,
    since_timestamp: Optional[float] = None,
) -> AsyncIterator[Tuple[str, os.stat_result]]:
    """Scan directory using find command (optimized for network filesystems).

    Args:
        directory: Directory to scan
        since_timestamp: Optional - only return files modified after this timestamp

    Yields:
        Tuples of (absolute_file_path, stat_info)
    """
    # Build the command as an argument list so paths with spaces or quotes
    # need no shell escaping.
    cmd = ["find", str(directory), "-type", "f"]
    if since_timestamp is not None:
        since_date = datetime.fromtimestamp(since_timestamp).strftime("%Y-%m-%d %H:%M:%S")
        # find evaluates its expression left to right, so the -newermt test
        # must come before -printf or every file would be printed.
        cmd += ["-newermt", since_date]
    # -printf returns path + mtime + size in one operation
    cmd += ["-printf", "%p\\t%T@\\t%s\\n"]

    # Execute find, parse results, apply .bmignore, yield (path, stat_info) tuples

Key optimization: Using find -printf "%p\t%T@\t%s\n" returns path, mtime, and size for every file from a single subprocess, replacing the per-file entry.stat() calls issued from Python.
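The parsing step can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names parse_find_line and parse_find_output are hypothetical, and filling the fields that find does not report (mode, inode, uid, and so on) with 0 is an assumption. It also assumes file names contain no newline characters, since records are newline-delimited.

```python
import os
from typing import Iterator, Tuple


def parse_find_line(line: str) -> Tuple[str, os.stat_result]:
    """Parse one '<path> TAB <mtime> TAB <size>' record from find -printf."""
    # Split from the right so a tab inside the path does not break parsing.
    path, mtime_str, size_str = line.rstrip("\n").rsplit("\t", 2)
    mtime = float(mtime_str)  # %T@ is seconds since the epoch, with a fraction
    size = int(size_str)      # %s is the file size in bytes
    # os.stat_result accepts a 10-item sequence:
    # (mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime).
    # Fields that find did not report are set to 0 (an assumption).
    stat_info = os.stat_result((0, 0, 0, 0, 0, 0, size, 0, mtime, 0))
    return path, stat_info


def parse_find_output(output: str) -> Iterator[Tuple[str, os.stat_result]]:
    """Yield (path, stat_info) for every non-empty line of find output."""
    for line in output.split("\n"):
        if line:
            yield parse_find_line(line)
```

Only st_size and st_mtime are meaningful on the resulting objects; callers that need other fields would still have to stat the file.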

Code Consolidation

Remove these methods (no longer needed):

  • _scan_directory_full() (line 1116)
  • _scan_directory_modified_since() (line 1058)
  • _quick_count_files() (line 1022)

Update callers:

  • scan() method: Use scan_directory(directory) for full scans
  • scan() method: Use scan_directory(directory, since_timestamp=watermark) for incremental
  • File counting: Use direct find "{directory}" -type f | wc -l subprocess
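The file-counting call could look roughly like the sketch below. The function name count_files is hypothetical, and the sketch assumes a POSIX shell with find and wc available on the host. Note that the exit status of a shell pipeline is that of wc, so a failure inside find is not detected here.

```python
import asyncio
import shlex
from pathlib import Path


async def count_files(directory: Path) -> int:
    """Count regular files under directory with `find ... -type f | wc -l`."""
    # shlex.quote guards against spaces or quotes in the directory name.
    cmd = f"find {shlex.quote(str(directory))} -type f | wc -l"
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"file count failed: {stderr.decode().strip()}")
    # Some wc implementations pad the number with spaces, so strip first.
    return int(stdout.decode().strip())
```

A file name containing a newline would be counted more than once; if that matters, counting records parsed from the -printf output is an alternative.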

Expected Performance

  • Full sync: 52 minutes → ~2-3 minutes (using the same find-based approach as current incremental scans)
  • Incremental sync: No change (already fast at 200-600ms)
  • Single code path: Easier to maintain, test, and debug

Why find Over Alternatives (e.g., jwalk, fd-find)

On network filesystems like TigrisFS, network latency is the bottleneck, not traversal speed:

  • find with -printf: 1 subprocess → kernel batches operations → ~1 network operation per directory level
  • Rust tools (jwalk): Still make 1.4k individual stat() calls over the network = 1.4k × network_latency
  • find is ubiquitous: Works everywhere, no additional dependencies

Because find already performs well for incremental scans on TigrisFS (200-600ms), it is a good fit for this use case.

Implementation Checklist

  • Rewrite scan_directory() to use find with optional -newermt filter
  • Parse find -printf output to create os.stat_result objects
  • Apply .bmignore pattern filtering to results
  • Delete obsolete helper methods (_scan_directory_full, _scan_directory_modified_since, _quick_count_files)
  • Update scan() method to use unified scan_directory()
  • Update file count logic in scan() to use find | wc -l
  • Add tests for new implementation
  • Validate with tenant 0a20eb58's projects (~1.4k files)
  • Verify .bmignore patterns work correctly
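The .bmignore filtering step in the checklist might be sketched as below. The issue does not specify the .bmignore syntax, so this assumes simple shell-style glob patterns matched against either the whole relative path or any single path component; the names is_ignored and filter_ignored are hypothetical and the real matching rules may differ.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import Iterable, Iterator, List


def is_ignored(relative_path: str, patterns: List[str]) -> bool:
    """Return True if the path or any of its components matches a pattern."""
    parts = PurePosixPath(relative_path).parts
    for pattern in patterns:
        # Match the whole relative path (fnmatch's * also matches "/").
        if fnmatch(relative_path, pattern):
            return True
        # Match individual components, e.g. ".git" or "node_modules".
        if any(fnmatch(part, pattern) for part in parts):
            return True
    return False


def filter_ignored(
    paths: Iterable[str], root: str, patterns: List[str]
) -> Iterator[str]:
    """Yield only the absolute paths under root that no pattern excludes."""
    for path in paths:
        relative = PurePosixPath(path).relative_to(root).as_posix()
        if not is_ignored(relative, patterns):
            yield path
```

Filtering in Python after find returns keeps the find invocation simple; translating the patterns into find -not -path arguments would be another option, at the cost of two matching implementations to keep consistent.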

Files Modified

  • src/basic_memory/sync/sync_service.py

References

  • Current incremental scan already uses find -newermt successfully (line 1058)
  • Performance proven: 200-600ms for incremental scans vs 31+ seconds for Python scandir
  • Logfire traces: tenant 0a20eb58-970f-ab05-ff49-25a9cdb2179c

Labels: enhancement (New feature or request)
