A lightweight, domain-agnostic hybrid search engine for markdown corpora, exposed via the [Model Context Protocol](https://modelcontextprotocol.io/) (MCP). While it can index and serve **any** markdown corpus, it is deeply optimized for serving SDK documentation to AI coding agents. **Beta.**
## How It Works
Docs MCP provides a local, in-memory search engine (powered by [LanceDB](https://lancedb.github.io/lancedb/)) that runs inside a Node.js MCP server. Three core optimizations make it effective for structured documentation:
### Faceted Taxonomy
Metadata keys defined in [`.docs-mcp.json`](#corpus-structure) manifests become enum-injected JSON Schema parameters on the `search_docs` tool. The agent selects from a strict set of valid filter values (e.g. `language: ["typescript", "python", "go"]`). On zero results, the server returns structured hints (e.g. "0 results for 'typescript'. Matches found in: ['python']").
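For example, with a `language` metadata key the generated `search_docs` input schema might look like this (an illustrative shape, not the server's exact output):

```json
{
  "type": "object",
  "properties": {
    "query": { "type": "string", "description": "Search query" },
    "language": { "type": "string", "enum": ["typescript", "python", "go"] }
  },
  "required": ["query"]
}
```

Because invalid filter values are rejected at the schema level, the agent never wastes a round-trip on a hallucinated facet.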
### Vector Collapse
SDK documentation for the same API operation across multiple languages produces near-identical embeddings. Vector collapse deduplicates these at search time, keeping only the highest-scoring variant per taxonomy field.
When the agent explicitly filters by language, collapse is automatically skipped — the filter already restricts to a single variant.
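The collapse step can be sketched as follows (a minimal sketch; field names like `docKey` are illustrative, not the engine's actual schema):

```typescript
type Hit = { docKey: string; language: string; score: number };

// Keep only the best-scoring language variant of each logical document,
// preserving descending score order in the final result list.
function collapse(hits: Hit[]): Hit[] {
  const best = new Map<string, Hit>();
  for (const hit of hits) {
    const current = best.get(hit.docKey);
    if (!current || hit.score > current.score) best.set(hit.docKey, hit);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```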
### Hybrid FTS + Semantic Search
Search combines three ranking signals via [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf):
1. **Full-text search** — multi-field matching on headings (boosted 3x) and content
2. **Phrase proximity** — rewards results where query terms appear close together
3. **Vector similarity** — semantic embedding distance (when an embedding provider is configured)
FTS dominates for exact class names and error codes. Vector similarity lifts conceptual and paraphrased queries. The blend is configurable via RRF weights.
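The fusion itself can be sketched as follows (a minimal sketch; the weights and the `k = 60` constant are illustrative defaults here, not the engine's actual configuration):

```typescript
// Reciprocal Rank Fusion: each signal contributes w / (k + rank) per result,
// so results that rank well across several signals compound their scores.
function rrfFuse(rankings: string[][], weights: number[], k = 60): string[] {
  const scores = new Map<string, number>();
  rankings.forEach((ranked, i) => {
    ranked.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + weights[i] / (k + rank + 1));
    });
  });
  // Sort fused results by descending combined score
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```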
### Hierarchical Context
Ancestor headings (breadcrumbs like `Auth SDK > AcmeAuthClientV2 > Initialization`) are prepended to each chunk's embedding input and returned with search results. This enables the calling agent to explore the corpus structure, navigating from high-level concepts down to specific implementation details.
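The embedding input can be sketched as follows (a minimal sketch; the actual separator and formatting the indexer uses may differ):

```typescript
// Prepend the breadcrumb trail to a chunk before embedding, so the vector
// encodes where the chunk sits in the document hierarchy.
function embeddingInput(breadcrumbs: string[], chunkText: string): string {
  return `${breadcrumbs.join(' > ')}\n\n${chunkText}`;
}
```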
## Benchmarks
On a realistic 28.8MB multi-language SDK corpus (38 eval cases across 9 categories), benchmarked with [`docs-mcp-eval benchmark`](docs/eval.md):
### Summary
| Metric | none | openai/text-embedding-3-large |
| --- | ---: | ---: |
| MRR@5 | 0.1803 | 0.2320 |
| NDCG@5 | 0.2136 | 0.2657 |
| Facet Precision | 0.3158 | 0.3684 |
| Search p50 (ms) | 5.2 | 242.6 |
| Search p95 (ms) | 6.6 | 5914.1 |
| Build Time (ms) | 6989 | 20448 |
| Peak RSS (MB) | 247.6 | 313.6 |
| Index Size (corpus 28.8MB) | 104.9MB | 356.9MB |
**Deeper manifests take exclusive precedence.** A file at `sdks/typescript/auth.md` is governed only by `sdks/typescript/.docs-mcp.json` — the root manifest is ignored for that subtree.
### `.docs-mcp.json`
```jsonc
{
  // Required. Schema version.
  "version": "1",

  // Chunking strategy applied to all files in this directory tree.
  // (remaining keys omitted; see the full schema below)
}
```
Full schema: [`schemas/docs-mcp.schema.json`](schemas/docs-mcp.schema.json)
Individual files can also override their manifest via YAML frontmatter (`mcp_chunking_hint`, `metadata` keys). Frontmatter takes highest precedence. See the [manifest contract](docs/implementation/manifest_contract.md) for full resolution rules.
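For instance, a single guide might pin its own chunking and metadata in frontmatter (the `mcp_chunking_hint` and `metadata` keys come from the manifest contract; the values shown are illustrative):

```markdown
---
mcp_chunking_hint: h3
metadata:
  language: typescript
---
```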
## Architecture
Structured as a Turborepo with four packages:
| Package | Role |
|---|---|
|`@speakeasy-api/docs-mcp-cli`| CLI for validation, manifest bootstrap (`fix`), and deterministic indexing (`build`) |
|`@speakeasy-api/docs-mcp-core`| Core retrieval primitives, AST parsing, and LanceDB queries |
|`@speakeasy-api/docs-mcp-server`| Lean runtime MCP server surface |
|`@speakeasy-api/docs-mcp-eval`| Standalone evaluation and benchmarking harness |
```text
+---------------------------+
|      AI Coding Agent      |
+-------------+-------------+
              |
              | Dynamic Tool Schema (with Enums)
              v
+---------------------------+
| @speakeasy-api/           |
| docs-mcp-server           |
| search_docs, get_doc      |
+-------------+-------------+
              |
              v
+---------------------------+
| @speakeasy-api/           |
| docs-mcp-core             |
| LanceDB Engine            |
| Memory-Mapped IO          |
+-------------+-------------+
              |
              v
     +-----------------+
     | .lancedb/ index |
     +-----------------+
```
## MCP Tools
The tools exposed to the agent are dynamically generated based on your `corpus_description`.
| Tool | What it does |
|---|---|
|`search_docs`| Performs hybrid search. Tool names and descriptions are user-configurable. Parameters are dynamically generated with valid taxonomy injected as JSON Schema `enum`s. Supports stateless cursor pagination. Returns fallback hints on zero results. |
|`get_doc`| Returns a specific chunk, plus `context: N` neighboring chunks for surrounding detail. |
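A typical `search_docs` call from the agent might look like this (arguments are illustrative; the `language` parameter only exists if your taxonomy defines it):

```json
{
  "name": "search_docs",
  "arguments": {
    "query": "refresh token rotation",
    "language": "typescript"
  }
}
```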
Run the deterministic indexer against your corpus. The indexer reads manifests and frontmatter to chunk the docs, generates embeddings, and saves the local `.lancedb` directory. Cache the output directory across CI runs to make builds incremental — only changed chunks are re-embedded.
```yaml
- uses: actions/cache@v4
  with:
    path: ./dist/.lancedb
    # Unique key saves the updated cache after each build
    # (illustrative key naming)
    key: docs-index-${{ github.run_id }}
    restore-keys: docs-index-
```
The `.lancedb` directory is packaged with the MCP server. FTS search is fully local. If the index was built with embeddings, the server calls the embedding API at query time to embed the search query.
```typescript
import { McpDocsServer } from '@speakeasy-api/docs-mcp-server';

// The server reads corpus_description, taxonomy, and embedding config
// from the metadata.json generated alongside the .lancedb index at build time.
const server = new McpDocsServer({
  dbPath: './dist/.lancedb',
});

server.start();
```

## Evaluation
Docs MCP includes a standalone evaluation harness for measuring search quality with transparent, repeatable benchmarks. See the [Evaluation Framework](docs/eval.md) for how to build an eval suite, run benchmarks across embedding providers, and interpret results.