-
Notifications
You must be signed in to change notification settings - Fork 2.6k
perf: optimize codebase indexing performance #7351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Increase BATCH_SEGMENT_THRESHOLD from 60 to 200 for better batching efficiency - Increase PARSING_CONCURRENCY from 10 to 20 for faster parallel file parsing - Increase BATCH_PROCESSING_CONCURRENCY from 10 to 15 for improved throughput - Increase MAX_PENDING_BATCHES from 20 to 30 to allow more parallel processing - Add early termination check to skip indexing when all files are unchanged - Add progress percentage to indexing status messages for better user feedback These changes significantly improve indexing performance by: 1. Processing larger batches to reduce API overhead 2. Increasing parallelization for CPU-bound operations 3. Skipping unnecessary work when files are already indexed 4. Providing better progress feedback to users Fixes #7350
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing my own code is like debugging in production - technically possible but morally questionable.
| /**Directory Scanner */ | ||
| export const MAX_LIST_FILES_LIMIT_CODE_INDEX = 50_000 | ||
| export const BATCH_SEGMENT_THRESHOLD = 60 // Number of code segments to batch for embeddings/upserts | ||
| export const BATCH_SEGMENT_THRESHOLD = 200 // Number of code segments to batch for embeddings/upserts - increased from 60 for better performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentional? The increased batch sizes (BATCH_SEGMENT_THRESHOLD from 60 to 200, MAX_PENDING_BATCHES from 20 to 30) could significantly increase memory consumption. Could we consider adding memory monitoring or making these configurable based on system capabilities?
| // Early termination check: if all files are already indexed, skip processing | ||
| let allFilesUnchanged = true | ||
| let quickCheckCount = 0 | ||
| const quickCheckLimit = Math.min(10, supportedPaths.length) // Check first 10 files for quick assessment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we approach this differently to improve reliability? The quick check only samples the first 10 files. If a large codebase has changes only in files beyond position 10, this optimization might incorrectly skip indexing. Consider using a random sample instead?
|
|
||
| // Add progress percentage to status message | ||
| if (cumulativeBlocksFoundSoFar > 0) { | ||
| const progressPercent = Math.round((cumulativeBlocksIndexed / cumulativeBlocksFoundSoFar) * 100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentional? The progress percentage is updated on every file parsed and every batch indexed. For large codebases with thousands of files, this could result in excessive UI updates. Could we throttle these updates to reduce UI overhead?
| export const INITIAL_RETRY_DELAY_MS = 500 | ||
| export const PARSING_CONCURRENCY = 10 | ||
| export const MAX_PENDING_BATCHES = 20 // Maximum number of batches to accumulate before waiting | ||
| export const PARSING_CONCURRENCY = 20 // Increased from 10 for faster parallel file parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make these values configurable through settings? The hardcoded concurrency values (PARSING_CONCURRENCY: 20, BATCH_PROCESSING_CONCURRENCY: 15) might not be optimal for all systems. Lower-end machines might struggle while high-end systems could handle more.
| } | ||
|
|
||
| // If all files are unchanged, we can skip the entire indexing process | ||
| if (allFilesUnchanged && supportedPaths.length > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This critical optimization needs test coverage. The early termination logic is a significant performance improvement but I don't see tests specifically covering this new behavior. Could we add tests to ensure this optimization works correctly in various scenarios?
|
Closing, see #7350 (comment) |
This PR addresses Issue #7350 by significantly improving the performance of codebase indexing and search operations.
Problem
Users were experiencing slow codebase search performance with operations taking minutes to complete, particularly during initial indexing with certain models like Gemini.
Solution
Implemented multiple performance optimizations:
1. Increased Batch Processing Efficiency
2. Enhanced Parallel Processing
3. Smart Indexing with Early Termination
4. Improved User Feedback
Performance Impact
These optimizations provide:
Testing
Future Considerations
As noted in the review, future enhancements could include:
Fixes #7350
Important
Optimizes codebase indexing by increasing batch sizes, concurrency, adding early termination checks, and improving user feedback.
BATCH_SEGMENT_THRESHOLDfrom 60 to 200,BATCH_PROCESSING_CONCURRENCYfrom 10 to 15, andMAX_PENDING_BATCHESfrom 20 to 30 inconstants/index.ts.PARSING_CONCURRENCYfrom 10 to 20 inconstants/index.ts.scanner.tsto skip indexing if files are unchanged.orchestrator.tsduring indexing.This description was created by
for d553512. You can customize this summary. It will automatically update as commits are pushed.