8 changes: 4 additions & 4 deletions src/services/code-index/constants/index.ts
@@ -16,16 +16,16 @@ export const MAX_FILE_SIZE_BYTES = 1 * 1024 * 1024 // 1MB

/**Directory Scanner */
export const MAX_LIST_FILES_LIMIT_CODE_INDEX = 50_000
-export const BATCH_SEGMENT_THRESHOLD = 60 // Number of code segments to batch for embeddings/upserts
+export const BATCH_SEGMENT_THRESHOLD = 200 // Number of code segments to batch for embeddings/upserts - increased from 60 for better performance
Contributor Author:
Is this intentional? The increased batch sizes (BATCH_SEGMENT_THRESHOLD from 60 to 200, MAX_PENDING_BATCHES from 20 to 30) could significantly increase memory consumption. Could we consider adding memory monitoring or making these configurable based on system capabilities?
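One way to make the thresholds track system capabilities, sketched below. The helper name and the scaling curve are illustrative assumptions, not part of this PR; only the 60 and 200 endpoints come from the constants above.

```typescript
import * as os from "os"

// Hypothetical helper (not in the PR): scale the batch threshold with
// total system memory, clamped to the range the constants span (60-200).
function batchThresholdForSystem(totalMemBytes: number = os.totalmem()): number {
	const gb = totalMemBytes / 1024 ** 3
	// Assumption: ~8 GB machines keep the old default of 60; 32 GB and up get 200.
	const scaled = Math.round(60 + ((gb - 8) / 24) * 140)
	return Math.max(60, Math.min(200, scaled))
}
```

A memory-monitoring approach would go further and re-derive the value at runtime; this static sketch only sizes it once at startup.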

export const MAX_BATCH_RETRIES = 3
export const INITIAL_RETRY_DELAY_MS = 500
-export const PARSING_CONCURRENCY = 10
-export const MAX_PENDING_BATCHES = 20 // Maximum number of batches to accumulate before waiting
+export const PARSING_CONCURRENCY = 20 // Increased from 10 for faster parallel file parsing
Contributor Author:
Could we make these values configurable through settings? The hardcoded concurrency values (PARSING_CONCURRENCY: 20, BATCH_PROCESSING_CONCURRENCY: 15) might not be optimal for all systems. Lower-end machines might struggle while high-end systems could handle more.
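A minimal sketch of the validation such a setting would need, assuming values arrive untyped from configuration. The function name, the cap of 64, and the settings keys mentioned afterwards are assumptions, not existing API.

```typescript
// Hypothetical validator (not in the PR): accept a user-configured
// concurrency value, fall back to the compiled-in default when it is
// missing or malformed, and clamp to a sane range.
function resolveConcurrency(configured: unknown, fallback: number, max = 64): number {
	const valid = typeof configured === "number" && Number.isInteger(configured) && configured > 0
	const n = valid ? (configured as number) : fallback
	return Math.max(1, Math.min(max, n))
}
```

Call sites would read the raw value from workspace settings and pass the current constants as fallbacks, e.g. `resolveConcurrency(cfg.get("parsingConcurrency"), PARSING_CONCURRENCY)`, so low-end machines can dial the values down without a rebuild.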

+export const MAX_PENDING_BATCHES = 30 // Maximum number of batches to accumulate before waiting - increased from 20

/**OpenAI Embedder */
export const MAX_BATCH_TOKENS = 100000
export const MAX_ITEM_TOKENS = 8191
-export const BATCH_PROCESSING_CONCURRENCY = 10
+export const BATCH_PROCESSING_CONCURRENCY = 15 // Increased from 10 for better throughput

/**Gemini Embedder */
export const GEMINI_MAX_ITEM_TOKENS = 2048
12 changes: 12 additions & 0 deletions src/services/code-index/orchestrator.ts
@@ -139,11 +139,23 @@ export class CodeIndexOrchestrator {
const handleFileParsed = (fileBlockCount: number) => {
	cumulativeBlocksFoundSoFar += fileBlockCount
	this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)

	// Add progress percentage to status message
	if (cumulativeBlocksFoundSoFar > 0) {
		const progressPercent = Math.round((cumulativeBlocksIndexed / cumulativeBlocksFoundSoFar) * 100)
Contributor Author:
Is this intentional? The progress percentage is updated on every file parsed and every batch indexed. For large codebases with thousands of files, this could result in excessive UI updates. Could we throttle these updates to reduce UI overhead?
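One way to throttle, sketched below as a pure factory so it is testable in isolation. The factory name, the 250 ms interval, and the injectable clock are assumptions; the percentage formula matches the diff.

```typescript
// Hypothetical throttle (not in the PR): report a percentage only when
// the rounded value changes and at most once per interval. Returns the
// percent to display, or null when the update should be skipped.
function makeProgressThrottle(minIntervalMs = 250, now: () => number = Date.now) {
	let lastPercent = -1
	let lastEmit = -Infinity
	return (indexed: number, found: number): number | null => {
		if (found === 0) return null
		const percent = Math.round((indexed / found) * 100)
		if (percent === lastPercent || now() - lastEmit < minIntervalMs) return null
		lastPercent = percent
		lastEmit = now()
		return percent
	}
}
```

`handleFileParsed` and `handleBlocksIndexed` would then call `setSystemState` only when the returned value is non-null; a production version would likely also let the final 100% update through unconditionally.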

		this.stateManager.setSystemState("Indexing", `Indexing workspace... (${progressPercent}% complete)`)
	}
}

const handleBlocksIndexed = (indexedCount: number) => {
	cumulativeBlocksIndexed += indexedCount
	this.stateManager.reportBlockIndexingProgress(cumulativeBlocksIndexed, cumulativeBlocksFoundSoFar)

	// Add progress percentage to status message
	if (cumulativeBlocksFoundSoFar > 0) {
		const progressPercent = Math.round((cumulativeBlocksIndexed / cumulativeBlocksFoundSoFar) * 100)
		this.stateManager.setSystemState("Indexing", `Indexing workspace... (${progressPercent}% complete)`)
	}
}

const result = await this.scanner.scanDirectory(
68 changes: 68 additions & 0 deletions src/services/code-index/processors/scanner.ts
@@ -89,6 +89,74 @@ export class DirectoryScanner implements IDirectoryScanner {
let processedCount = 0
let skippedCount = 0

// Early termination check: if all files are already indexed, skip processing
let allFilesUnchanged = true
let quickCheckCount = 0
const quickCheckLimit = Math.min(10, supportedPaths.length) // Check first 10 files for quick assessment
Contributor Author:
Could we approach this differently to improve reliability? The quick check only samples the first 10 files. If a large codebase has changes only in files beyond position 10, this optimization might incorrectly skip indexing. Consider using a random sample instead?
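A sketch of the random-sample variant this comment suggests. The helper name and the injectable RNG (for deterministic tests) are assumptions, not part of the PR.

```typescript
// Hypothetical sampler (not in the PR): pick up to k distinct paths
// uniformly at random, so changes late in the path list are as likely
// to be caught by the quick check as changes at the front.
function sampleForQuickCheck<T>(paths: readonly T[], k: number, rand: () => number = Math.random): T[] {
	const pool = [...paths]
	const picked: T[] = []
	while (picked.length < k && pool.length > 0) {
		const i = Math.floor(rand() * pool.length)
		picked.push(pool.splice(i, 1)[0]) // splice guarantees no duplicates
	}
	return picked
}
```

The quick-check loop would iterate over `sampleForQuickCheck(supportedPaths, quickCheckLimit)` instead of `supportedPaths.slice(0, quickCheckLimit)`; the full verification pass below is unaffected.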


for (const filePath of supportedPaths.slice(0, quickCheckLimit)) {
	try {
		const stats = await stat(filePath)
		if (stats.size <= MAX_FILE_SIZE_BYTES) {
			const content = await vscode.workspace.fs
				.readFile(vscode.Uri.file(filePath))
				.then((buffer) => Buffer.from(buffer).toString("utf-8"))
			const currentFileHash = createHash("sha256").update(content).digest("hex")
			const cachedFileHash = this.cacheManager.getHash(filePath)

			if (cachedFileHash !== currentFileHash) {
				allFilesUnchanged = false
				break
			}
		}
		quickCheckCount++
	} catch (error) {
		// If we can't check a file, assume it might have changed
		allFilesUnchanged = false
		break
	}
}

// If quick check shows all sampled files are unchanged and we checked a reasonable sample,
// do a full check to confirm
if (allFilesUnchanged && quickCheckCount === quickCheckLimit && supportedPaths.length > quickCheckLimit) {
	console.log(`[DirectoryScanner] Quick check passed, verifying all ${supportedPaths.length} files...`)
	for (const filePath of supportedPaths.slice(quickCheckLimit)) {
		try {
			const stats = await stat(filePath)
			if (stats.size <= MAX_FILE_SIZE_BYTES) {
				const content = await vscode.workspace.fs
					.readFile(vscode.Uri.file(filePath))
					.then((buffer) => Buffer.from(buffer).toString("utf-8"))
				const currentFileHash = createHash("sha256").update(content).digest("hex")
				const cachedFileHash = this.cacheManager.getHash(filePath)

				if (cachedFileHash !== currentFileHash) {
					allFilesUnchanged = false
					break
				}
			}
		} catch (error) {
			allFilesUnchanged = false
			break
		}
	}
}

// If all files are unchanged, we can skip the entire indexing process
if (allFilesUnchanged && supportedPaths.length > 0) {
Contributor Author:
This critical optimization needs test coverage. The early termination logic is a significant performance improvement but I don't see tests specifically covering this new behavior. Could we add tests to ensure this optimization works correctly in various scenarios?
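The decision itself can be pulled out into a pure function so it is cheap to cover. The extraction below and its names are illustrative, not the scanner's actual API; it only mirrors the rule "non-empty workspace, every hash matches the cache".

```typescript
// Hypothetical extraction (not in the PR): the early-termination rule as
// a pure predicate over (path, hash) pairs and a cache lookup. True means
// the scan can be skipped entirely.
type HashLookup = (path: string) => string | undefined

function shouldSkipIndexing(files: ReadonlyArray<{ path: string; hash: string }>, cached: HashLookup): boolean {
	return files.length > 0 && files.every((f) => cached(f.path) === f.hash)
}
```

Tests would then cover at least the empty workspace, an all-unchanged workspace, a cache miss, and a single change late in the file list.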

	console.log(
		`[DirectoryScanner] All ${supportedPaths.length} files are already indexed and unchanged. Skipping indexing.`,
	)
	return {
		stats: {
			processed: 0,
			skipped: supportedPaths.length,
		},
		totalBlockCount: 0,
	}
}

// Initialize parallel processing tools
const parseLimiter = pLimit(PARSING_CONCURRENCY) // Concurrency for file parsing
const batchLimiter = pLimit(BATCH_PROCESSING_CONCURRENCY) // Concurrency for batch processing