# perf: Optimize sync/indexing for 43% faster performance #352
## Conversation
Add comprehensive benchmark test suite to measure baseline performance of sync/indexing operations before implementing optimizations.

Changes:

- Add test_sync_performance_benchmark.py with 4 benchmark tests:
  - 100 files (small repository)
  - 500 files (medium repository)
  - 1000 files (large repository)
  - Re-sync with no changes
- Benchmarks measure:
  - Total sync time and files/sec throughput
  - Milliseconds per file
  - Database growth (total and per file)
  - Operation counts (new, modified, deleted, moved)
- Add BENCHMARKS.md documentation explaining:
  - How to run benchmarks
  - How to interpret results
  - Guidelines for tracking improvements
- Add pytest markers for 'benchmark' and 'slow' tests
  - Can selectively run/skip with -m flags

Test files generated are realistic, with:

- YAML frontmatter and tags
- 3-10 observations per file
- 1-3 relations per file (some forward references)
- Varying content to simulate real usage

Related to #351

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements targeted "Quick Win" optimizations to improve sync and indexing performance for cloud deployments where memory.db is rebuilt on restart.

Key optimizations:

- Query only file_path/checksum columns instead of loading full entities (10-100x faster)
- Batch relation inserts using add_all() to reduce database round-trips
- Batch search index inserts with new bulk_index_items() method
- Skip file path conflict checks during bulk sync (eliminates O(n²) bottleneck)
- Fix database file exclusion patterns to use wildcards (*.db instead of memory.db)

Performance improvements:

- 100 files: 10.5 files/sec (43% faster than 7.3 baseline)
- 500 files: 10.2 files/sec (linear scaling maintained)
- Re-sync with no changes: 930 files/sec (0.11s for 100 files)

Cloud deployment impact:

- 500-file repository: 49s vs 73s (24 second improvement per restart)
- Eliminated O(n²) bottleneck for larger repositories

See docs/PERFORMANCE_OPTIMIZATIONS.md for full details.

Fixes #351

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Claude finished @phernandez's task

### Code Review: Performance Optimizations ✅

Comprehensive review of PR #352 against the Basic Memory team checklist. This PR delivers significant performance improvements through well-engineered optimizations.

#### Review Summary

**Recommendation: ✅ APPROVE**

This PR achieves a 43% performance improvement for sync operations and eliminates a critical O(n²) bottleneck. The implementation is clean, well-tested, thoroughly documented, and adheres to all Basic Memory architectural patterns and coding standards.

#### Team Checklist Assessment

- **✅ Code Quality & Standards - Excellent.** All requirements met.
- **✅ Testing & Documentation - Outstanding.** All requirements exceeded, with benchmark coverage, new documentation files, and validated performance results.
- **✅ Basic Memory Architecture - Perfect Alignment.** All architectural principles maintained.
- **✅ Security & Performance - Excellent.** All requirements met.
#### Detailed Code Analysis

**Optimization #1: Column-Specific Query ⭐**

File: `src/basic_memory/sync/sync_service.py:275-297`

Changed from loading all entities with relationships to querying only the needed columns:

```python
query = select(Entity.file_path, Entity.checksum).where(
    Entity.project_id == self.entity_repository.project_id
)
```

Assessment: Perfect SQLAlchemy 2.0 pattern. The docstring clearly explains the 10-100x improvement for large projects.

**Optimization #2: Batch Relation Inserts ✅**

File: `src/basic_memory/services/entity_service.py:412-427`

Changed from individual `add()` calls to batch `add_all()` with an IntegrityError fallback:

```python
try:
    await self.relation_repository.add_all(relations_to_add)
except IntegrityError:
    # Fallback to individual inserts if batch fails
    ...
```

Assessment: Clean implementation with robust error handling. Reduces database round-trips significantly.

**Optimization #3: Bulk Search Index ⭐**

Files: `src/basic_memory/services/search_service.py:224-326`, `src/basic_memory/repository/search_repository.py:562-602`

Created a new `bulk_index_items()` method that batches entity + observations + relations in one operation.

Assessment: Well-architected with proper separation of concerns.

Note: The docstring at `search_repository.py:564-567` correctly documents the assumption about prior deletion - good defensive documentation.

**Optimization #4: O(n²) Fix - Critical ⭐⭐⭐**

File: `src/basic_memory/services/entity_service.py:55-115`

Added a `skip_conflict_check` parameter to skip `detect_file_path_conflicts()` during bulk sync.

Assessment: This is the most important change in the PR. The conflict check was calling `find_all()` for every file, loading all entities with relationships each time; skipping it during bulk sync eliminates the quadratic behavior.

Trade-off Analysis: Conflicts are rare during bulk sync and mainly occur during manual operations (moves, renames). This is a well-reasoned performance trade-off with proper safety boundaries. Individual file operations still perform full conflict checking.

**Bug Fix: Database File Exclusion ✅**

File: `src/basic_memory/ignore_utils.py:14-16`

Changed ignore patterns from specific filenames to wildcards:

```python
# Before: "memory.db", "memory.db-shm", "memory.db-wal"
# After:  "*.db", "*.db-shm", "*.db-wal"
```

Assessment: Smart fix that improves robustness.
**Test Suite - Exceptional Quality ⭐**

File: `test-int/test_sync_performance_benchmark.py`

Comprehensive benchmarks covering 100-, 500-, and 1000-file repositories plus a no-change re-sync, with realistic generated test files.

**Documentation - Outstanding ⭐**

`BENCHMARKS.md` provides excellent technical documentation, and the `CONTRIBUTING.md` additions are comprehensive.

**Configuration Updates ✅**

Registers `benchmark` and `slow` pytest markers so the new tests can be selectively run or skipped with `-m` flags.
#### Minor Observations

🟢 **Performance Variance - Expected.** The PR shows consistent 10.2-10.5 files/sec across different repository sizes, indicating excellent linear scaling. Some variance is normal for I/O-bound operations; the linear scaling is the key indicator that the optimizations are working correctly.

🟢 **Logging Consideration.** The fallback to individual relation inserts logs at DEBUG level. Consider INFO level when the fallback occurs, as this may indicate data quality issues worth monitoring.

🟡 **Pytest Configuration (minor).** The pytest configuration doesn't include the `test-int` directory by default; an optional improvement would be `testpaths = ["tests", "test-int"]` so both suites run by default.

🟢 **Future Opportunity: Parallel Processing.** The documentation mentions parallel processing as a future opportunity. With asyncio already in use, this could be explored:

```python
# Current: sequential
for path in report.new:
    await self.sync_file(path, new=True)

# Future: parallel
tasks = [self.sync_file(path, new=True) for path in report.new]
await asyncio.gather(*tasks)
```

However, this would require careful analysis of SQLite's concurrency model and WAL mode settings.
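One way such an exploration might bound write pressure is with a semaphore. This is a hypothetical sketch, not code from the PR; `sync_new_files_bounded` and the `limit` parameter are invented names:

```python
import asyncio

async def sync_new_files_bounded(self, report, limit: int = 4) -> None:
    """Hypothetical sketch: parallel sync with a concurrency cap."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded_sync(path):
        # Cap in-flight sync_file() calls so SQLite's single-writer
        # model (even under WAL) is not overwhelmed.
        async with semaphore:
            await self.sync_file(path, new=True)

    await asyncio.gather(*(bounded_sync(path) for path in report.new))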
#### Final Assessment

This PR represents excellent engineering work.

**Performance Impact.** For cloud deployments (database rebuilt on restart), a 500-file repository now syncs in 49s instead of 73s, a saving on every restart.

**Architectural Alignment.** The changes maintain all Basic Memory architectural principles and patterns.

**Approval Status: ✅ APPROVED**

This PR meets and exceeds all Basic Memory quality standards. The observations mentioned above are minor and don't block approval; they can be considered for future enhancements.

Excellent work on this optimization effort! This sets a high bar for performance optimization PRs in the project. The attention to detail in testing, documentation, and architectural consistency is exemplary.
## Summary

Implements targeted "Quick Win" optimizations to improve sync and indexing performance for cloud deployments where `memory.db` is rebuilt on container restart.

Fixes #351
## Performance Improvements

- 100 files: 10.5 files/sec (43% faster than the 7.3 files/sec baseline)
- 500 files: 10.2 files/sec (linear scaling maintained)
- Re-sync with no changes: 930 files/sec (0.11s for 100 files)

### Cloud Deployment Impact

For a typical 500-file repository, startup sync drops from 73s to 49s, a 24-second improvement per restart (at the measured 10.2 files/sec, 500 files ≈ 49s). The O(n²) fix provides even bigger improvements for larger repositories (1000+ files).
## Key Optimizations

### 1. Optimize `get_db_file_state()` (Quick Win #1)

File: `src/basic_memory/sync/sync_service.py:275-297`

Changed from loading all entities with eager-loaded relationships to querying only the 2 columns we need:
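```python
# The two-column query shown in the review above; avoids loading
# full Entity objects and their eager-loaded relationships.
query = select(Entity.file_path, Entity.checksum).where(
    Entity.project_id == self.entity_repository.project_id
)
```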
**Impact:** 10-100x faster for large projects
### 2. Batch Relation Inserts (Quick Win #3)

File: `src/basic_memory/services/entity_service.py:412-427`

Changed from individual `add()` calls to batch `add_all()` with an IntegrityError fallback.

**Impact:** Reduced database round-trips from N queries to 1 query per entity
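A minimal sketch of the pattern, under assumptions: `relations_to_add` holds one entity's relations, a module-level `logger` exists, and the per-item fallback loop is illustrative rather than the PR's exact code:

```python
from sqlalchemy.exc import IntegrityError

try:
    # One round-trip: stage the whole batch for this entity.
    await self.relation_repository.add_all(relations_to_add)
except IntegrityError:
    # Fallback to individual inserts if the batch fails, so one
    # bad row (e.g., a duplicate) doesn't discard the rest.
    for relation in relations_to_add:
        try:
            await self.relation_repository.add(relation)
        except IntegrityError:
            logger.debug(f"Skipped conflicting relation: {relation}")
```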
### 3. Batch Search Index Inserts (Quick Win #4)

Files: `src/basic_memory/services/search_service.py:224-326`, `src/basic_memory/repository/search_repository.py:562-602`

Created a new `bulk_index_items()` method that batches entity + observations + relations in one operation.

**Impact:** Reduced search indexing from ~N queries per entity to 1 query per entity
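For intuition only, a generic sketch of the bulk-insert shape; the helper below is hypothetical (plain SQLAlchemy Core with invented names), not the PR's actual repository method:

```python
from sqlalchemy import Table, insert
from sqlalchemy.ext.asyncio import AsyncSession

async def bulk_insert_rows(session: AsyncSession, table: Table, rows: list[dict]) -> None:
    """Hypothetical sketch: one executemany-style INSERT per entity.

    Assumes stale rows for these items were already deleted, mirroring
    the prior-deletion assumption documented in bulk_index_items().
    """
    if not rows:
        return
    # Passing a list of dicts makes SQLAlchemy issue a single batched INSERT.
    await session.execute(insert(table), rows)
```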
### 4. Fix O(n²) Bottleneck (Major Fix)

File: `src/basic_memory/services/entity_service.py:55-115`

Added a `skip_conflict_check` parameter to skip `detect_file_path_conflicts()` during bulk sync. This method was calling `find_all()` for EVERY file, loading all entities with relationships each time.

**Impact:** Eliminated quadratic time complexity - performance now scales linearly
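Illustratively, the shape of the change might look like this; the surrounding method name and body are assumptions, only the `skip_conflict_check` parameter and the skipped `detect_file_path_conflicts()` call come from the PR:

```python
async def create_entity(self, entity, skip_conflict_check: bool = False):
    # Hypothetical sketch. Bulk sync passes skip_conflict_check=True,
    # avoiding a detect_file_path_conflicts() call that ran find_all()
    # (loading every entity) once per file - the source of the O(n^2) cost.
    if not skip_conflict_check:
        await self.detect_file_path_conflicts(entity.file_path)
    ...  # proceed with the normal create/update path
```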
### 5. Database File Exclusion Fix

File: `src/basic_memory/ignore_utils.py:14-16`

Changed ignore patterns from specific filenames to wildcards:

- `memory.db` → `*.db`
- `memory.db-shm` → `*.db-shm`
- `memory.db-wal` → `*.db-wal`

**Impact:** Fixed test failures and improved robustness for different deployment scenarios
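To see why the wildcard form is more robust, here is a quick check using Python's stdlib `fnmatch` (illustrative only; this is not the project's actual matching code):

```python
from fnmatch import fnmatch

assert fnmatch("memory.db", "*.db")          # still excluded
assert fnmatch("project-cache.db", "*.db")   # now excluded too
assert fnmatch("memory.db-wal", "*.db-wal")  # SQLite WAL sidecar file
assert not fnmatch("notes.md", "*.db")       # markdown files unaffected
```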
## Testing

Added comprehensive benchmark test suite in `test-int/test_sync_performance_benchmark.py`:

- `test_benchmark_sync_100_files` - Small repository (10.5 files/sec)
- `test_benchmark_sync_500_files` - Medium repository (10.2 files/sec)
- `test_benchmark_sync_1000_files` - Large repository (marked slow)
- `test_benchmark_resync_no_changes` - Re-sync performance (930 files/sec)

Run benchmarks:

```bash
pytest test-int/test_sync_performance_benchmark.py -v -m "benchmark and not slow"
```
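The `-m` filter works because the suite registers `benchmark` and `slow` pytest markers. A hedged sketch of how a test might be tagged; the `benchmark_project` fixture name is invented and the body is elided:

```python
import pytest

@pytest.mark.benchmark
@pytest.mark.slow  # only the 1000-file test, so "-m 'not slow'" skips it
async def test_benchmark_sync_1000_files(benchmark_project):
    ...  # generate files, run sync, assert throughput
```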
## Documentation

Comprehensive documentation added in `docs/PERFORMANCE_OPTIMIZATIONS.md`.

## Technical Details
**Key Achievement:** Fixed the O(n²) bottleneck where file path conflict detection was loading all entities for every file during sync. Performance now scales linearly with repository size.

**Trade-offs:** We skip file path conflict detection during bulk sync, since conflicts are rare and mainly occur during manual operations (moves, renames). Individual file operations still perform full conflict checking.

**No Breaking Changes:** All optimizations are backward compatible with the existing API/MCP interface.
🤖 Generated with Claude Code