perf: optimize codebase indexing performance #7351
```diff
@@ -16,16 +16,16 @@ export const MAX_FILE_SIZE_BYTES = 1 * 1024 * 1024 // 1MB

 /**Directory Scanner */
 export const MAX_LIST_FILES_LIMIT_CODE_INDEX = 50_000
-export const BATCH_SEGMENT_THRESHOLD = 60 // Number of code segments to batch for embeddings/upserts
+export const BATCH_SEGMENT_THRESHOLD = 200 // Number of code segments to batch for embeddings/upserts - increased from 60 for better performance
 export const MAX_BATCH_RETRIES = 3
 export const INITIAL_RETRY_DELAY_MS = 500
-export const PARSING_CONCURRENCY = 10
-export const MAX_PENDING_BATCHES = 20 // Maximum number of batches to accumulate before waiting
+export const PARSING_CONCURRENCY = 20 // Increased from 10 for faster parallel file parsing
```

> **Contributor (Author):** Could we make these values configurable through settings? The hardcoded concurrency values (`PARSING_CONCURRENCY: 20`, `BATCH_PROCESSING_CONCURRENCY: 15`) might not be optimal for all systems: lower-end machines might struggle, while high-end systems could handle more.
```diff
+export const MAX_PENDING_BATCHES = 30 // Maximum number of batches to accumulate before waiting - increased from 20

 /**OpenAI Embedder */
 export const MAX_BATCH_TOKENS = 100000
 export const MAX_ITEM_TOKENS = 8191
-export const BATCH_PROCESSING_CONCURRENCY = 10
+export const BATCH_PROCESSING_CONCURRENCY = 15 // Increased from 10 for better throughput

 /**Gemini Embedder */
 export const GEMINI_MAX_ITEM_TOKENS = 2048
```
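A possible shape for the configurability suggestion above — a hypothetical sketch, not part of this PR. The `codeIndexing` settings section and both key names are invented for illustration; `vscode.workspace.getConfiguration` and `WorkspaceConfiguration.get` are the standard VS Code APIs for reading user settings.

```typescript
import * as vscode from "vscode"

// Defaults mirror the constants introduced in this PR.
const DEFAULT_PARSING_CONCURRENCY = 20
const DEFAULT_BATCH_PROCESSING_CONCURRENCY = 15

// Read tuning values from user settings, falling back to the PR defaults.
// The "codeIndexing" section and both key names are hypothetical.
export function getIndexingConcurrency(): { parsing: number; batchProcessing: number } {
	const config = vscode.workspace.getConfiguration("codeIndexing")
	return {
		parsing: config.get<number>("parsingConcurrency", DEFAULT_PARSING_CONCURRENCY),
		batchProcessing: config.get<number>("batchProcessingConcurrency", DEFAULT_BATCH_PROCESSING_CONCURRENCY),
	}
}

// Usage at scan time, instead of importing the constants directly:
//   const { parsing, batchProcessing } = getIndexingConcurrency()
//   const parseLimiter = pLimit(parsing)
```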
```diff
@@ -139,11 +139,23 @@ export class CodeIndexOrchestrator {
 		const handleFileParsed = (fileBlockCount: number) => {
 			cumulativeBlocksFoundSoFar += fileBlockCount
 			this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
+
+			// Add progress percentage to status message
+			if (cumulativeBlocksFoundSoFar > 0) {
+				const progressPercent = Math.round((cumulativeBlocksIndexed / cumulativeBlocksFoundSoFar) * 100)
+				this.stateManager.setSystemState("Indexing", `Indexing workspace... (${progressPercent}% complete)`)
+			}
 		}
```

> **Contributor (Author):** Is this intentional? The progress percentage is updated on every file parsed and every batch indexed. For large codebases with thousands of files, this could result in excessive UI updates. Could we throttle these updates to reduce UI overhead?
```diff

 		const handleBlocksIndexed = (indexedCount: number) => {
 			cumulativeBlocksIndexed += indexedCount
 			this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)
+
+			// Add progress percentage to status message
+			if (cumulativeBlocksFoundSoFar > 0) {
+				const progressPercent = Math.round((cumulativeBlocksIndexed / cumulativeBlocksFoundSoFar) * 100)
+				this.stateManager.setSystemState("Indexing", `Indexing workspace... (${progressPercent}% complete)`)
+			}
 		}

 		const result = await this.scanner.scanDirectory(
```
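One way to implement the throttling the reviewer asks for — a minimal sketch, not code from this PR. It rate-limits `setSystemState` to one call per interval; the 500 ms interval and the standalone-function shape are assumptions (in the orchestrator this would live alongside the two handlers above).

```typescript
// Time-based throttle: forward at most one status update per interval so
// per-file and per-batch callbacks don't flood the UI on large codebases.
const PROGRESS_UPDATE_INTERVAL_MS = 500 // hypothetical interval
let lastProgressUpdateAt = 0

function updateProgressThrottled(
	stateManager: { setSystemState(state: string, message: string): void },
	blocksIndexed: number,
	blocksFound: number,
): void {
	const now = Date.now()
	if (blocksFound === 0 || now - lastProgressUpdateAt < PROGRESS_UPDATE_INTERVAL_MS) {
		return // drop this update; another callback will arrive shortly
	}
	lastProgressUpdateAt = now
	const progressPercent = Math.round((blocksIndexed / blocksFound) * 100)
	stateManager.setSystemState("Indexing", `Indexing workspace... (${progressPercent}% complete)`)
}
```

Because intermediate updates are dropped, a final unthrottled update would still be needed when indexing completes, so the status never sticks below 100%.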
```diff
@@ -89,6 +89,74 @@ export class DirectoryScanner implements IDirectoryScanner {
 		let processedCount = 0
 		let skippedCount = 0

+		// Early termination check: if all files are already indexed, skip processing
+		let allFilesUnchanged = true
+		let quickCheckCount = 0
+		const quickCheckLimit = Math.min(10, supportedPaths.length) // Check first 10 files for quick assessment
```

> **Contributor (Author):** Could we approach this differently to improve reliability? The quick check only samples the first 10 files. If a large codebase has changes only in files beyond position 10, this optimization might incorrectly skip indexing. Consider using a random sample instead?
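The random-sample alternative suggested above could look like the following hypothetical sketch, which swaps `supportedPaths.slice(0, quickCheckLimit)` for a partial Fisher-Yates shuffle so a change anywhere in the file list is equally likely to land in the sample.

```typescript
// Draw `sampleSize` paths uniformly at random with a partial Fisher-Yates
// shuffle, instead of always inspecting the first ten entries.
function sampleForQuickCheck(paths: string[], sampleSize: number): string[] {
	const limit = Math.min(sampleSize, paths.length)
	const pool = [...paths] // copy so the caller's ordering is untouched
	for (let i = 0; i < limit; i++) {
		const j = i + Math.floor(Math.random() * (pool.length - i))
		;[pool[i], pool[j]] = [pool[j], pool[i]]
	}
	return pool.slice(0, limit)
}
```

Either way, correctness is guarded by the full verification pass below; the sample only determines how often a changed workspace slips past the quick check and pays for the full pass.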
```diff
+
+		for (const filePath of supportedPaths.slice(0, quickCheckLimit)) {
+			try {
+				const stats = await stat(filePath)
+				if (stats.size <= MAX_FILE_SIZE_BYTES) {
+					const content = await vscode.workspace.fs
+						.readFile(vscode.Uri.file(filePath))
+						.then((buffer) => Buffer.from(buffer).toString("utf-8"))
+					const currentFileHash = createHash("sha256").update(content).digest("hex")
+					const cachedFileHash = this.cacheManager.getHash(filePath)
+
+					if (cachedFileHash !== currentFileHash) {
+						allFilesUnchanged = false
+						break
+					}
+				}
+				quickCheckCount++
+			} catch (error) {
+				// If we can't check a file, assume it might have changed
+				allFilesUnchanged = false
+				break
+			}
+		}
+
+		// If quick check shows all sampled files are unchanged and we checked a reasonable sample,
+		// do a full check to confirm
+		if (allFilesUnchanged && quickCheckCount === quickCheckLimit && supportedPaths.length > quickCheckLimit) {
+			console.log(`[DirectoryScanner] Quick check passed, verifying all ${supportedPaths.length} files...`)
+			for (const filePath of supportedPaths.slice(quickCheckLimit)) {
+				try {
+					const stats = await stat(filePath)
+					if (stats.size <= MAX_FILE_SIZE_BYTES) {
+						const content = await vscode.workspace.fs
+							.readFile(vscode.Uri.file(filePath))
+							.then((buffer) => Buffer.from(buffer).toString("utf-8"))
+						const currentFileHash = createHash("sha256").update(content).digest("hex")
+						const cachedFileHash = this.cacheManager.getHash(filePath)
+
+						if (cachedFileHash !== currentFileHash) {
+							allFilesUnchanged = false
+							break
+						}
+					}
+				} catch (error) {
+					allFilesUnchanged = false
+					break
+				}
+			}
+		}
+
+		// If all files are unchanged, we can skip the entire indexing process
+		if (allFilesUnchanged && supportedPaths.length > 0) {
```

> **Contributor (Author):** This critical optimization needs test coverage. The early termination logic is a significant performance improvement, but I don't see tests specifically covering this new behavior. Could we add tests to ensure this optimization works correctly in various scenarios?
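As a starting point for that coverage, here is a vitest-style sketch. `shouldSkipIndexing` is a hypothetical pure extraction of the decision above, not code from this PR; exercising the real `scanDirectory` path would additionally require filesystem and cache-manager mocks.

```typescript
import { describe, it, expect } from "vitest"

// Hypothetical pure extraction of the early-termination decision:
// skip only when every current file hash matches the cached hash.
function shouldSkipIndexing(currentHashes: Map<string, string>, cachedHashes: Map<string, string>): boolean {
	if (currentHashes.size === 0) return false
	for (const [filePath, hash] of currentHashes) {
		if (cachedHashes.get(filePath) !== hash) return false
	}
	return true
}

describe("early termination", () => {
	it("skips when every file hash matches the cache", () => {
		const hashes = new Map([
			["a.ts", "h1"],
			["b.ts", "h2"],
		])
		expect(shouldSkipIndexing(hashes, new Map(hashes))).toBe(true)
	})

	it("does not skip when any file changed", () => {
		const current = new Map([
			["a.ts", "h1"],
			["b.ts", "h2-changed"],
		])
		const cached = new Map([
			["a.ts", "h1"],
			["b.ts", "h2"],
		])
		expect(shouldSkipIndexing(current, cached)).toBe(false)
	})

	it("does not skip an empty workspace", () => {
		expect(shouldSkipIndexing(new Map(), new Map())).toBe(false)
	})
})
```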
```diff
+			console.log(
+				`[DirectoryScanner] All ${supportedPaths.length} files are already indexed and unchanged. Skipping indexing.`,
+			)
+			return {
+				stats: {
+					processed: 0,
+					skipped: supportedPaths.length,
+				},
+				totalBlockCount: 0,
+			}
+		}
+
 		// Initialize parallel processing tools
 		const parseLimiter = pLimit(PARSING_CONCURRENCY) // Concurrency for file parsing
 		const batchLimiter = pLimit(BATCH_PROCESSING_CONCURRENCY) // Concurrency for batch processing
```
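One related cleanup the diff invites: the quick check and the full verification repeat the same stat/read/hash sequence. A shared helper could factor it out — a sketch using the hypothetical name `isFileUnchanged`, keeping the size guard and error semantics shown in the diff (the imports assume Node's `fs/promises` and `crypto`, matching the calls above).

```typescript
import { createHash } from "crypto"
import { stat } from "fs/promises"
import * as vscode from "vscode"

// Shared change check for both the quick sample and the full verification.
// Oversized files are never indexed, so they count as "unchanged" here,
// and unreadable files are treated as possibly changed - same semantics
// as the two loops in the diff.
async function isFileUnchanged(
	filePath: string,
	getCachedHash: (p: string) => string | undefined,
	maxFileSizeBytes: number,
): Promise<boolean> {
	try {
		const stats = await stat(filePath)
		if (stats.size > maxFileSizeBytes) {
			return true // skipped by indexing anyway
		}
		const buffer = await vscode.workspace.fs.readFile(vscode.Uri.file(filePath))
		const currentHash = createHash("sha256").update(Buffer.from(buffer).toString("utf-8")).digest("hex")
		return getCachedHash(filePath) === currentHash
	} catch {
		return false // if we can't check a file, assume it might have changed
	}
}
```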
> **Review comment:** Is this intentional? The increased batch sizes (`BATCH_SEGMENT_THRESHOLD` from 60 to 200, `MAX_PENDING_BATCHES` from 20 to 30) could significantly increase memory consumption. Could we consider adding memory monitoring or making these configurable based on system capabilities?
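One way to act on this concern — a hypothetical sketch that scales the two constants to total system memory at startup. The RAM thresholds and the scaled values are illustrative guesses, not benchmarked numbers; `os.totalmem()` is the standard Node API.

```typescript
import * as os from "os"

// Pick batch sizing from total system memory rather than hardcoding it.
// Thresholds and values below are illustrative, not measured.
export function batchSettingsForSystem(): { batchSegmentThreshold: number; maxPendingBatches: number } {
	const totalGb = os.totalmem() / 1024 ** 3
	if (totalGb < 8) {
		return { batchSegmentThreshold: 60, maxPendingBatches: 10 } // low-memory machines keep the old batch size
	}
	if (totalGb < 16) {
		return { batchSegmentThreshold: 120, maxPendingBatches: 20 }
	}
	return { batchSegmentThreshold: 200, maxPendingBatches: 30 } // matches this PR's new defaults
}
```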