feat: add ingest_directory tool for recursive directory ingestion #44

Open
AlexanderV wants to merge 1 commit into shinpr:main from AlexanderV:feat/ingest-directory-tool
@AlexanderV

Summary

Adds a new `ingest_directory` MCP tool that allows users to recursively ingest all supported documents from a directory in a single operation.

Changes

  • New `ingest_directory` tool: recursively scans a directory for supported file types (PDF, DOCX, TXT, MD) and ingests them into the vector database
  • Parameters:
    • `directoryPath` (required): absolute path to the directory
    • `recursive` (optional, default: `true`): whether to recurse into subdirectories
  • Returns per-file results with success/failure counts, chunk counts, and timestamps
  • Validation: absolute path check, directory readability, non-empty file list
  • Error handling: graceful per-file error collection (one failing file doesn't stop the rest)
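
The per-file error collection described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `FileResult` and `DirectoryResult` shapes, the `ingestAll` name, and the injected `ingestOne` callback are all hypothetical stand-ins.

```typescript
// Hypothetical result shapes; the PR's actual interfaces may differ.
interface FileResult {
  filePath: string
  success: boolean
  chunkCount?: number
  error?: string
  timestamp: string
}

interface DirectoryResult {
  succeeded: number
  failed: number
  results: FileResult[]
}

// `ingestOne` stands in for whatever ingests a single file and resolves to its chunk count.
async function ingestAll(
  files: string[],
  ingestOne: (filePath: string) => Promise<number>
): Promise<DirectoryResult> {
  const results: FileResult[] = []
  for (const filePath of files) {
    const timestamp = new Date().toISOString()
    try {
      const chunkCount = await ingestOne(filePath)
      results.push({ filePath, success: true, chunkCount, timestamp })
    } catch (err) {
      // Record the failure and keep going, so one bad file does not abort the batch.
      const error = err instanceof Error ? err.message : String(err)
      results.push({ filePath, success: false, error, timestamp })
    }
  }
  const succeeded = results.filter((r) => r.success).length
  return { succeeded, failed: results.length - succeeded, results }
}
```

Files are processed sequentially here to keep the sketch simple; whether the real handler ingests in sequence or in parallel is not stated in this description.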

Updated files

| File | Changes |
| --- | --- |
| `src/server/index.ts` | New tool implementation: interfaces, schema registration, handler, file collector |
| `README.md` | Added documentation for the new tool |
| `skills/mcp-local-rag/SKILL.md` | Added tool reference and usage examples |
| `.gitignore` | Added `lancedb/` and `models/` directories |

Tool count

The server now provides 7 MCP tools (previously 6): `ingest_file`, `ingest_directory`, `ingest_data`, `query_documents`, `list_files`, `delete_file`, `status`.

- Add ingest_directory MCP tool that recursively scans directories for supported files (PDF, DOCX, TXT, MD) and ingests them into the vector database
- Support recursive/non-recursive scanning via 'recursive' parameter
- Return per-file results with success/failure counts
- Validate absolute paths and handle errors gracefully
- Update README.md and SKILL.md with documentation
- Update .gitignore for lancedb/ and models/ directories
@shinpr shinpr self-requested a review February 20, 2026 04:05
Owner

shinpr commented Feb 20, 2026

175 lines of new logic with no tests. Other tools have thorough integration tests — see src/__tests__/server/ingest-data.test.ts and src/server/__tests__/rag-server.integration.test.ts for the pattern.

Required test cases:

  • Happy path: recursive scan finds and ingests supported files
  • Happy path: recursive: false scans only top-level
  • Relative path rejection
  • Empty directory (no supported files found)
  • Non-existent directory error
  • Partial failure: one file fails but others succeed, result reflects counts correctly
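
One way to set up these cases is a throwaway directory tree built per test. The helper below is a hypothetical sketch, not taken from the existing test files: `createIngestFixture`, `removeIngestFixture`, and the file names are illustrative, and the corrupt-PDF entry assumes the parser rejects a file with non-PDF bytes.

```typescript
import * as fs from "node:fs"
import * as os from "node:os"
import * as path from "node:path"

// Hypothetical fixture layout covering the cases listed above.
interface IngestFixture {
  root: string            // temp root, removed after the test
  docsDir: string         // directory to pass to ingest_directory
  topLevelFile: string    // found with recursive: true and recursive: false
  nestedFile: string      // found only with recursive: true
  unsupportedFile: string // should be skipped by the scan
  corruptPdf: string      // assumed to fail parsing (partial-failure case)
  emptyDir: string        // no supported files
  missingDir: string      // never created (non-existent directory case)
}

function createIngestFixture(): IngestFixture {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), "ingest-dir-"))
  const docsDir = path.join(root, "docs")
  const nestedDir = path.join(docsDir, "nested")
  const emptyDir = path.join(root, "empty")
  fs.mkdirSync(nestedDir, { recursive: true })
  fs.mkdirSync(emptyDir)

  const topLevelFile = path.join(docsDir, "a.md")
  const nestedFile = path.join(nestedDir, "b.txt")
  const unsupportedFile = path.join(docsDir, "c.bin")
  const corruptPdf = path.join(docsDir, "broken.pdf")
  fs.writeFileSync(topLevelFile, "# Top-level document\n")
  fs.writeFileSync(nestedFile, "Nested document\n")
  fs.writeFileSync(unsupportedFile, "ignored")
  fs.writeFileSync(corruptPdf, "this is not a PDF")

  const missingDir = path.join(root, "does-not-exist")
  return { root, docsDir, topLevelFile, nestedFile, unsupportedFile, corruptPdf, emptyDir, missingDir }
}

function removeIngestFixture(fixture: IngestFixture): void {
  fs.rmSync(fixture.root, { recursive: true, force: true })
}
```

The relative-path case needs no fixture on disk: passing a string such as `docs` should be rejected before any filesystem access.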

Owner

shinpr left a comment

3 items to address before merge — see inline comments and the thread below for details.

const recursive = args.recursive !== false

// Validate directory path is absolute
if (!args.directoryPath.startsWith('/') && !(/^[a-zA-Z]:[\\/]/.test(args.directoryPath))) {
Owner

Only checks whether the path is absolute — no BASE_DIR boundary check. collectFiles will scan any directory on the filesystem. Individual files would fail at the parser level, but the scan itself can enumerate files outside the allowed boundary.

DocumentParser.validateFilePath() already does both checks. Reuse it instead:

this.parser.validateFilePath(args.directoryPath)

The underlying logic (isAbsolute + resolve + startsWith) operates on path strings and doesn't distinguish files from directories, so this works for directory paths too.

This also removes the one-off Windows path regex not used anywhere else.
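
For illustration, the isAbsolute + resolve + startsWith pattern looks roughly like this. It is a standalone sketch, not the actual `DocumentParser.validateFilePath()` implementation; the function name `isInsideBaseDir` and its boolean return are assumptions made for the example.

```typescript
import * as path from "node:path"

// Hypothetical stand-in for the boundary check described above.
function isInsideBaseDir(baseDir: string, candidate: string): boolean {
  // Reject relative paths outright.
  if (!path.isAbsolute(candidate)) return false

  // resolve() normalizes ".." segments, so traversal is handled before comparing.
  const base = path.resolve(baseDir)
  const resolved = path.resolve(candidate)

  // Compare against base + separator so "/data-other" does not match "/data".
  const prefix = base.endsWith(path.sep) ? base : base + path.sep
  return resolved === base || resolved.startsWith(prefix)
}
```

Because the check works purely on path strings, it behaves the same whether the candidate names a file or a directory.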

* ingest_directory tool handler
* Recursively finds and ingests all supported files in a directory
*/
async handleIngestDirectory(
Owner

No upper bound on how many files can be collected and ingested in one call. Each file goes through parse → chunk → embed → insert, taking seconds per file. A directory with thousands of files would run for minutes or hours, likely hitting MCP client timeouts.

Add a limit and throw an McpError when it is exceeded. 100 files is a safe ceiling, since the response payload stays well within client token limits:

if (files.length > 100) {
  throw new McpError(
    ErrorCode.InvalidParams,
    `Too many files: ${files.length} (limit: 100). Narrow the directory or split into multiple calls.`
  )
}

Also, collectFiles has no maximum recursion depth. This is less likely to be a real problem (OS path length limits and the 100-file cap provide practical bounds), but adding a depth guard is cheap insurance:

private async collectFiles(dirPath: string, recursive: boolean, depth = 0): Promise<string[]> {
  if (depth > 10) return []
  // ...
  const subFiles = await this.collectFiles(fullPath, recursive, depth + 1)
}
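
A fuller, standalone version of the depth-guarded collector might look like the sketch below. It is written as a free function rather than a private method, and the extension set and `MAX_DEPTH` constant are assumptions based on the file types named in the PR description and the depth of 10 suggested above.

```typescript
import * as fs from "node:fs"
import * as path from "node:path"

// Assumed values: extensions mirror the PR description, depth limit follows the suggestion above.
const SUPPORTED_EXTENSIONS = new Set([".pdf", ".docx", ".txt", ".md"])
const MAX_DEPTH = 10

async function collectFiles(dirPath: string, recursive: boolean, depth = 0): Promise<string[]> {
  // Depth guard: stop descending once the limit is passed.
  if (depth > MAX_DEPTH) return []

  const entries = await fs.promises.readdir(dirPath, { withFileTypes: true })
  const files: string[] = []
  for (const entry of entries) {
    const fullPath = path.join(dirPath, entry.name)
    if (entry.isDirectory()) {
      // Subdirectories are only visited when recursion is enabled.
      if (recursive) {
        files.push(...(await collectFiles(fullPath, recursive, depth + 1)))
      }
    } else if (entry.isFile() && SUPPORTED_EXTENSIONS.has(path.extname(entry.name).toLowerCase())) {
      files.push(fullPath)
    }
  }
  // Sort for a deterministic order across platforms.
  return files.sort()
}
```

With `withFileTypes: true`, a symbolic link is reported as a link rather than as a directory, so this sketch does not follow symlinked directories. The file-count limit would then be applied to the returned array before any ingestion starts.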

@@ -1,7 +1,8 @@
// RAGServer implementation with MCP tools
Owner

The file is already 733 lines, and this PR brings it to ~900. collectFiles has no dependency on RAGServer state, so extract it to a separate module (e.g., server/directory-scanner.ts). Not blocking.

* RAG server compliant with MCP Protocol
*
* Responsibilities:
* - MCP tool integration (4 tools)
Owner

Says "4 tools" but the server now has 7. Update the count.

CLAUDE.md
.claude/
docs/
/models/Xenova/all-MiniLM-L6-v2/*.json
Owner

models/ already ignores everything under it. The specific pattern is redundant — remove it.

@shinpr shinpr mentioned this pull request Feb 20, 2026
3 tasks