Skip to content

Conversation

@phernandez
Copy link
Member

Fixes #398

Replace Python's scandir() with server-side find command using -printf to eliminate thousands of network stat() calls on network filesystems.

Changes

  • Rewrite scan_directory() to use find -printf for path+mtime+size in one operation
  • Remove obsolete methods: _scan_directory_full, _scan_directory_modified_since, _quick_count_files
  • Update scan() to stream results from unified scan_directory()
  • Add _count_files() helper using find | wc -l

Expected Performance

For 1.4k files on TigrisFS:

  • Full sync: 52 minutes → 2-3 minutes (26x faster)
  • Incremental sync: No change (already fast at 200-600ms)

Generated with Claude Code

Replace Python's scandir() with server-side find command using -printf
to eliminate thousands of network stat() calls on network filesystems.

Changes:
- Rewrite scan_directory() to use find -printf for path+mtime+size in one operation
- Remove obsolete methods: _scan_directory_full, _scan_directory_modified_since, _quick_count_files
- Update scan() to stream results from unified scan_directory()
- Add _count_files() helper using find | wc -l

Expected performance for 1.4k files on TigrisFS:
- Full sync: 52 minutes → 2-3 minutes (26x faster)
- Incremental sync: No change (already fast at 200-600ms)

Fixes #398

Co-authored-by: Paul Hernandez <[email protected]>
@github-actions
Copy link
Contributor

github-actions bot commented Oct 23, 2025

Claude finished @phernandez's task —— View job


PR Review: TigrisFS Sync Performance Optimization

I've completed a thorough review of this PR against our team checklist. Overall, this is a well-executed performance optimization with excellent expected results (26x speedup). However, there are several critical issues that need to be addressed before merging.


🚨 Critical Issues

1. Platform Compatibility: -printf is GNU-specific

Location: src/basic_memory/sync/sync_service.py:1077

The -printf flag used in the find command is GNU-specific and will fail on:

  • macOS (uses BSD find)
  • FreeBSD, OpenBSD, and other BSD systems
  • Some minimal Linux containers without GNU findutils
cmd = f'find "{directory}" -type f -printf "%p\\t%T@\\t%s\\n"'

Impact: This will break sync entirely on macOS, which is likely a primary development platform for many users.

Recommendation: Either:

  1. Add platform detection and use BSD-compatible alternatives (-exec stat or Python fallback)
  2. Document that GNU findutils is required and provide installation instructions for macOS users (brew install findutilsgfind)
  3. Implement a fallback to the previous Python-based scanning when the find command fails

The error handling currently raises RuntimeError, which will crash sync operations rather than falling back gracefully.

2. Breaking Change Without Migration Path

The PR removes three methods that may be used elsewhere:

  • _scan_directory_full()
  • _scan_directory_modified_since()
  • _quick_count_files() → renamed to _count_files()

While these appear to be internal methods, the removal should be verified against:

  • Any background sync operations
  • Cloud sync integration
  • MCP tools that might trigger syncs

Recommendation: Verify no external callers exist by searching the entire codebase.


⚠️ Major Concerns

3. Missing Tests for New Implementation

The existing tests (test_scan_directory_*) were written for the old aiofiles.os.scandir() implementation. While they may still pass, they don't validate:

  • The new find command parsing logic (tab-separated values)
  • Error handling for malformed find output
  • Behavior when find command fails
  • The since_timestamp parameter with -newermt

Recommendation: Add unit tests specifically for:

  • Parse logic for find output
  • Handling of unexpected output formats
  • Error cases (invalid timestamps, permission errors)
  • Incremental scans with since_timestamp

4. Error Handling Removes Fallback

Location: src/basic_memory/sync/sync_service.py:385-413 and 1055-1097

The old _quick_count_files() implementation had a fallback:

# OLD: Fallback when find fails
except Exception:
    count = 0
    async for _ in self.scan_directory(directory):
        count += 1
    return count

The new implementation raises RuntimeError instead, which will:

  • Crash sync operations entirely
  • Prevent any sync progress on systems where find fails

Recommendation: Restore the fallback mechanism or document the hard dependency on GNU findutils.

5. Performance Claims Need Validation

The PR claims "52 minutes → 2-3 minutes" but:

  • No benchmark results are included in the PR
  • The benchmark file (test_sync_performance_benchmark.py) hasn't been run with the new code
  • Performance on non-TigrisFS filesystems is unknown

Recommendation: Include actual benchmark results in PR description showing:

  • TigrisFS: Full sync timing
  • Local filesystem: Verify no regression
  • Network filesystem (SMB/NFS): Expected behavior

✅ Code Quality & Standards

Follows Basic Memory Conventions

  • Python 3.12+ type annotations present (Optional[float], AsyncIterator)
  • Async patterns correctly implemented
  • 100-character line length mostly maintained (a few docstring lines exceed)
  • Proper logging with context (logfire metrics and structured logging)
  • Naming conventions followed (snake_case)

Minor Style Issues

  • Line 1065-1067: Docstring could be more concise (currently 173 chars on one line after formatting)
  • Line 1126-1139: Creating os.stat_result tuple is clever but brittle (depends on tuple order)

✅ Basic Memory Architecture

Preserves Core Principles

  • Local-first architecture maintained (file system is source of truth)
  • Knowledge graph consistency preserved (no changes to entity relationships)
  • Database sync logic unchanged (mtime/size comparison strategy intact)
  • Circuit breaker pattern still applies to file-level failures

Integration Points

  • Watermark strategy (last_scan_timestamp, last_file_count) unchanged
  • Incremental vs full scan logic preserved
  • Move detection still uses checksum matching
  • Ignore patterns (.bmignore) still respected

✅ Security & Performance

Security

  • No hardcoded secrets or credentials
  • Input validation: Directory path is quoted in shell command (f'find "{directory}"')
  • No sensitive data in logs
  • Subprocess stderr properly captured for error reporting

Performance

  • Expected dramatic improvement (26x) for network filesystems
  • Eliminates per-file stat() calls (1.4k → ~1 per directory level)
  • Memory usage: Streaming iterator prevents loading all paths into memory
  • Metrics properly instrumented (logfire.metric_*)

Performance Considerations

  • Potential regression: If find command fails and falls back to Python scanning (if fallback restored), users would see worse performance than before without clear indication
  • Incremental scan: Still efficient (200-600ms) as claimed

📋 Documentation

Missing Documentation

  • README/CLAUDE.md: No mention of GNU findutils requirement
  • Installation docs: Should note platform-specific requirements
  • Troubleshooting: No guidance for users on systems without GNU find
  • Performance section: Could highlight the 26x improvement for network filesystems

Good Documentation

  • Docstrings explain the optimization strategy clearly
  • Comments explain performance characteristics
  • Commit message is descriptive

🔧 Specific Code Issues

Line 1077: Shell Command Construction

cmd = f'find "{directory}" -type f -printf "%p\\t%T@\\t%s\\n"'
  • Issue: Assumes GNU find with -printf
  • Risk: Silent failures on non-GNU systems
  • Fix: Add platform detection or fallback

Line 1106-1109: Tab-Separated Parsing

parts = line.split("\t")
if len(parts) != 3:
    logger.warning(f"Unexpected find output format: {line}")
    continue
  • Issue: What if a filename contains a tab character?
  • Risk: Files with tabs in names would be skipped silently (just a warning)
  • Likelihood: Low, but possible
  • Fix: Consider NUL-separated output (-print0 equivalent) or escape handling

Line 1126-1139: Manual os.stat_result Construction

stat_info = os.stat_result((
    0,  # st_mode
    0,  # st_ino
    # ... zeros for unused fields
    size,  # st_size
    0,  # st_atime
    mtime,  # st_mtime
    0,  # st_ctime
))
  • Good: Maintains compatibility with existing code
  • Concern: Brittle if os.stat_result implementation changes
  • Minor risk: Python's os.stat_result is stable, unlikely to change
  • Comment: Consider adding a helper function for clarity

Line 1932-1933: Test Compatibility Issue

The test test_scan_directory_cached_stat_info checks st_ctime:

assert abs(stat_info.st_ctime - independent_stat.st_ctime) < 1

But the new implementation sets st_ctime = 0. This test will fail.


📝 Recommendations Summary

Must Fix Before Merge

  1. Add platform compatibility for non-GNU find systems (macOS, BSD)
  2. Fix or update failing test (test_scan_directory_cached_stat_info expects valid st_ctime)
  3. Restore fallback mechanism or clearly document hard dependency on GNU findutils

Should Address

  1. Add unit tests for find command parsing logic
  2. Include actual benchmark results in PR description
  3. Update documentation (README/CLAUDE.md) about system requirements
  4. Consider handling filenames with tab characters more robustly

Nice to Have

  1. Add troubleshooting documentation for users without GNU find
  2. Refactor os.stat_result construction into a helper function for maintainability

🎯 Overall Assessment

This PR demonstrates excellent engineering in identifying and solving a real performance bottleneck. The 26x speedup for TigrisFS is impressive and addresses a legitimate pain point.

However, the implementation has critical platform compatibility issues that would break sync on macOS and BSD systems. This needs to be addressed before merging.

Recommendation: Request changes - Fix platform compatibility and test issues, then re-review.


@phernandez phernandez added the On Hold Don't review or merge. Work is pending label Nov 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

On Hold Don't review or merge. Work is pending

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize TigrisFS sync performance using find command

2 participants