feat: add ingest_directory tool for recursive directory ingestion #44
AlexanderV wants to merge 1 commit into shinpr:main from
Conversation
- Add ingest_directory MCP tool that recursively scans directories for supported files (PDF, DOCX, TXT, MD) and ingests them into the vector database
- Support recursive/non-recursive scanning via 'recursive' parameter
- Return per-file results with success/failure counts
- Validate absolute paths and handle errors gracefully
- Update README.md and SKILL.md with documentation
- Update .gitignore for lancedb/ and models/ directories
175 lines of new logic with no tests, while the other tools have thorough integration tests. Required test cases:
shinpr left a comment
3 items to address before merge — see inline comments and the thread below for details.
```ts
const recursive = args.recursive !== false

// Validate directory path is absolute
if (!args.directoryPath.startsWith('/') && !(/^[a-zA-Z]:[\\/]/.test(args.directoryPath))) {
```
Only checks whether the path is absolute — no BASE_DIR boundary check. collectFiles will scan any directory on the filesystem. Individual files would fail at the parser level, but the scan itself can enumerate files outside the allowed boundary.
DocumentParser.validateFilePath() already does both checks. Reuse it instead:
```ts
this.parser.validateFilePath(args.directoryPath)
```
The underlying logic (isAbsolute + resolve + startsWith) operates on path strings and doesn't distinguish files from directories, so this works for directory paths too.
This also removes the one-off Windows path regex not used anywhere else.
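For reference, a minimal sketch of what that combined check looks like. The names `BASE_DIR` and `validatePath` are illustrative only, not identifiers from this codebase:

```typescript
import * as path from 'path'

// Illustrative boundary check in the style the review describes:
// isAbsolute + resolve + startsWith prefix comparison.
const BASE_DIR = path.resolve('/data/docs')

export function validatePath(p: string): void {
  if (!path.isAbsolute(p)) {
    throw new Error(`Path must be absolute: ${p}`)
  }
  const resolved = path.resolve(p)
  // Compare against BASE_DIR + path.sep so that a sibling such as
  // /data/docs-evil does not pass a naive prefix match
  if (resolved !== BASE_DIR && !resolved.startsWith(BASE_DIR + path.sep)) {
    throw new Error(`Path outside allowed base directory: ${p}`)
  }
}
```

Because it only inspects the path string, the same function accepts both file and directory paths.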
```ts
 * ingest_directory tool handler
 * Recursively finds and ingests all supported files in a directory
 */
async handleIngestDirectory(
```
No upper bound on how many files can be collected and ingested in one call. Each file goes through parse → chunk → embed → insert, taking seconds per file. A directory with thousands of files would run for minutes or hours, likely hitting MCP client timeouts.
Add a limit (100 files is a safe ceiling given the response payload stays well within client token limits) and return McpError when exceeded:
```ts
if (files.length > 100) {
  throw new McpError(
    ErrorCode.InvalidParams,
    `Too many files: ${files.length} (limit: 100). Narrow the directory or split into multiple calls.`
  )
}
```
Also, collectFiles has no maximum recursion depth. This is less likely to be a real problem (OS path length limits and the 100-file cap provide practical bounds), but adding a depth guard is cheap insurance:
```ts
private async collectFiles(dirPath: string, recursive: boolean, depth = 0): Promise<string[]> {
  if (depth > 10) return []
  // ...
  const subFiles = await this.collectFiles(fullPath, recursive, depth + 1)
}
```

```diff
@@ -1,7 +1,8 @@
 // RAGServer implementation with MCP tools
```
The file is already 733 lines, and this PR brings it to ~900. collectFiles has no dependency on RAGServer state, so it could be extracted to a separate module (e.g., server/directory-scanner.ts). Not blocking.
```ts
 * RAG server compliant with MCP Protocol
 *
 * Responsibilities:
 * - MCP tool integration (4 tools)
```
Says "4 tools" but the server now has 7. Update the count.
```gitignore
CLAUDE.md
.claude/
docs/
/models/Xenova/all-MiniLM-L6-v2/*.json
```
models/ already ignores everything under it. The specific pattern is redundant — remove it.
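This is easy to confirm with git check-ignore (the path below is illustrative):

```shell
# In a repo whose .gitignore contains "models/", any nested path already
# matches that directory rule, so the extra *.json pattern adds nothing.
# -v prints which .gitignore line matched.
git check-ignore -v models/Xenova/all-MiniLM-L6-v2/config.json
```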
Summary
Adds a new `ingest_directory` MCP tool that allows users to recursively ingest all supported documents from a directory in a single operation.
Changes
- `recursive` (optional, default: `true`): whether to recurse into subdirectories
Updated files
Tool count
The server now provides 7 MCP tools (previously 6): `ingest_file`, `ingest_directory`, `ingest_data`, `query_documents`, `list_files`, `delete_file`, `status`.