Commit 22a0fee ("chore: improve readme, drop useless arguments")
Parent: 9998fcd

5 files changed: +313 −122 lines

README.md: 179 additions, 52 deletions
# Speakeasy Docs MCP

A lightweight, domain-agnostic hybrid search engine for markdown corpora, exposed via the [Model Context Protocol](https://modelcontextprotocol.io/) (MCP). While it can index and serve **any** markdown corpus, it is deeply optimized for serving SDK documentation to AI coding agents. **Beta.**

## How It Works

Docs MCP provides a local, in-memory search engine (powered by [LanceDB](https://lancedb.github.io/lancedb/)) that runs inside a Node.js MCP server. Four core optimizations make it effective for structured documentation:

### Faceted Taxonomy

Metadata keys defined in [`.docs-mcp.json`](#corpus-structure) manifests become enum-injected JSON Schema parameters on the `search_docs` tool. The agent selects from a strict set of valid filter values (e.g. `language: ["typescript", "python", "go"]`). On zero results, the server returns structured hints (e.g. "0 results for 'typescript'. Matches found in: ['python']").
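
For example, a corpus whose manifests declare a `language` key could yield a tool parameter schema like this (a sketch — the field layout is illustrative, not the server's exact output):

```jsonc
// Hypothetical fragment of the generated search_docs JSON Schema.
{
  "type": "object",
  "properties": {
    "query": { "type": "string" },
    "language": {
      "type": "string",
      // Injected from the indexed corpus taxonomy, so the agent cannot
      // supply a filter value outside this list.
      "enum": ["typescript", "python", "go"]
    }
  },
  "required": ["query"]
}
```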

### Vector Collapse

SDK documentation for the same API operation across multiple languages produces near-identical embeddings. Vector collapse deduplicates these at search time, keeping only the highest-scoring variant per taxonomy field:

```json
{ "taxonomy": { "language": { "vector_collapse": true } } }
```

When the agent explicitly filters by language, collapse is automatically skipped — the filter already restricts to a single variant.
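
A simplified model of the collapse step (the `Hit` shape, `collapseKey`, and function name are assumptions for this sketch, not the package's API):

```typescript
type Hit = {
  chunkId: string;
  score: number;
  metadata: Record<string, string>;
  // Hypothetical grouping key: same logical content across language variants.
  collapseKey: string;
};

// Keep only the highest-scoring variant per group for a collapsed field.
function vectorCollapse(
  hits: Hit[],
  field: string,
  activeFilters: Record<string, string>,
): Hit[] {
  // If the agent already filtered on this field, every hit is a single
  // variant, so collapsing is skipped entirely.
  if (activeFilters[field] !== undefined) return hits;
  const best = new Map<string, Hit>();
  for (const hit of hits) {
    const prev = best.get(hit.collapseKey);
    if (!prev || hit.score > prev.score) best.set(hit.collapseKey, hit);
  }
  // Preserve descending score order among the survivors.
  return [...best.values()].sort((a, b) => b.score - a.score);
}
```

How collapse keys are derived (e.g. from a chunk's breadcrumb path minus the collapsed field) is an implementation detail of the indexer.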

### Hybrid FTS + Semantic Search

Search combines three ranking signals via [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf):

1. **Full-text search** — multi-field matching on headings (boosted 3x) and content
2. **Phrase proximity** — rewards results where query terms appear close together
3. **Vector similarity** — semantic embedding distance (when an embedding provider is configured)

FTS dominates for exact class names and error codes. Vector similarity lifts conceptual and paraphrased queries. The blend is configurable via RRF weights.
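
RRF itself is a small formula: each ranked list contributes `weight / (k + rank)` per document, so documents ranked well by several signals float to the top. A minimal weighted sketch (the constant `k = 60` follows the linked paper; names here are illustrative, not the project's config keys):

```typescript
// Fuse several ranked lists of document ids into one fused score per id.
// score(d) = sum over lists of weight / (k + rank of d in that list).
function rrfFuse(
  rankings: { ids: string[]; weight: number }[],
  k = 60,
): Map<string, number> {
  const scores = new Map<string, number>();
  for (const { ids, weight } of rankings) {
    ids.forEach((id, i) => {
      // i is 0-based, so rank = i + 1.
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + i + 1));
    });
  }
  return scores;
}

// Example: FTS and vector search disagree on order; RRF rewards agreement.
const fused = rrfFuse([
  { ids: ["retry-guide", "auth-md"], weight: 1.0 }, // full-text ranking
  { ids: ["auth-md", "retry-guide"], weight: 1.0 }, // vector ranking
]);
```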

### Hierarchical Context

Ancestor headings (breadcrumbs like `Auth SDK > AcmeAuthClientV2 > Initialization`) are prepended to each chunk's embedding input and returned with search results. This enables the calling agent to explore the corpus structure, navigating from high-level concepts down to specific implementation details.
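
As a sketch, the embedding input is just breadcrumb-prefixed chunk text (the separator and function name are assumptions, not the indexer's API):

```typescript
// Prepend ancestor headings so the vector encodes where the chunk lives,
// not just its local text (e.g. an otherwise-context-free code block).
function embeddingInput(breadcrumbs: string[], chunkText: string): string {
  return `${breadcrumbs.join(" > ")}\n\n${chunkText}`;
}
```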

## Benchmarks

On a realistic 28.8MB multi-language SDK corpus (38 eval cases across 9 categories), benchmarked with [`docs-mcp-eval benchmark`](docs/eval.md):

### Summary

| Metric | none | openai/text-embedding-3-large |
| --- | ---: | ---: |
| MRR@5 | 0.1803 | 0.2320 |
| NDCG@5 | 0.2136 | 0.2657 |
| Facet Precision | 0.3158 | 0.3684 |
| Search p50 (ms) | 5.2 | 242.6 |
| Search p95 (ms) | 6.6 | 5914.1 |
| Build Time (ms) | 6989 | 20448 |
| Peak RSS (MB) | 247.6 | 313.6 |
| Index Size (corpus 28.8MB) | 104.9MB | 356.9MB |
| Embed Cost (est.) | $0 | $0.9825 |
| Query Cost (est.) | $0 | $0.000003 |

### Per-Category Facet Precision

| Category | none | openai/text-embedding-3-large |
| --- | ---: | ---: |
| api-discovery | 0.0000 | 0.0000 |
| cross-service | 0.3333 | 0.3333 |
| distractor | 0.4000 | 0.4000 |
| error-handling | 0.0000 | 0.0000 |
| intent | 0.4000 | 0.4000 |
| lexical | 0.8000 | 0.8000 |
| multi-hop | 0.3333 | 0.3333 |
| paraphrased | 0.1250 | 0.2500 |
| sdk-reference | 0.3333 | 0.6667 |

### Per-Category MRR@5

| Category | none | openai/text-embedding-3-large |
| --- | ---: | ---: |
| api-discovery | 0.0000 | 0.0000 |
| cross-service | 0.1667 | 0.3333 |
| distractor | 0.3000 | 0.3000 |
| error-handling | 0.0000 | 0.0000 |
| intent | 0.0900 | 0.2667 |
| lexical | 0.4800 | 0.5067 |
| multi-hop | 0.3333 | 0.3333 |
| paraphrased | 0.0625 | 0.0938 |
| sdk-reference | 0.1667 | 0.2333 |

**Key takeaways:**
- Embeddings double facet precision on the `paraphrased` and `sdk-reference` categories
- Embeddings triple MRR on `intent` queries (0.09 → 0.27)
- On `lexical`, `distractor`, `cross-service`, and `multi-hop` queries, FTS alone matches embedding performance
- FTS-only search: ~5ms p50 latency, zero embedding cost

## Graceful Fallback

1. **No embeddings** (`--embedding-provider none`): FTS-only search, zero cost, zero API keys. Already effective for exact-match and lexical queries.
2. **With embeddings** (`--embedding-provider openai`): Hybrid search with better recall on conceptual and paraphrased queries. ~$1 one-time embedding cost per 28.8MB corpus.
3. **Runtime degradation**: If the embedding API is unavailable at query time, the server automatically falls back to FTS-only with a one-time warning.

## Corpus Structure

### Folder Layout

Documentation corpora use `.docs-mcp.json` manifests to control chunking and taxonomy. Manifests can be placed at any level of the directory tree:

```
my-docs/
├── .docs-mcp.json            ← root manifest (applies to guides/)
├── guides/
│   ├── retries.md
│   └── pagination.md
└── sdks/
    ├── typescript/
    │   ├── .docs-mcp.json    ← deeper manifest (exclusive precedence)
    │   └── auth.md
    └── python/
        ├── .docs-mcp.json    ← deeper manifest (exclusive precedence)
        └── auth.md
```

**Deeper manifests take exclusive precedence.** A file at `sdks/typescript/auth.md` is governed only by `sdks/typescript/.docs-mcp.json` — the root manifest is ignored for that subtree.
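
The resolution rule can be sketched in a few lines (an illustrative helper, not the indexer's actual API):

```typescript
// Resolve which .docs-mcp.json governs a file: the deepest manifest
// directory that is an ancestor of the file wins outright ("exclusive
// precedence"); shallower manifests are ignored for that subtree.
function governingManifest(
  filePath: string,
  manifestDirs: string[],
): string | undefined {
  return [...manifestDirs]
    .filter((dir) => filePath.startsWith(dir + "/"))
    .sort((a, b) => b.length - a.length)[0];
}
```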

### `.docs-mcp.json`

```jsonc
{
  // Required. Schema version.
  "version": "1",

  // Chunking strategy applied to all files in this directory tree.
  "strategy": {
    "chunk_by": "h2",        // Split at ## headings. Options: h1, h2, h3, file
    "max_chunk_size": 8000,  // Oversized chunks split recursively at finer headings
    "min_chunk_size": 200    // Tiny trailing chunks merge into the preceding chunk
  },

  // Key-value pairs attached to every chunk. Each key becomes a filterable
  // enum parameter on the search_docs tool.
  "metadata": {
    "language": "typescript",
    "scope": "sdk-specific"
  },

  // Per-field search behavior. vector_collapse deduplicates cross-language
  // variants at search time (only active when no filter is set for that field).
  "taxonomy": {
    "language": { "vector_collapse": true }
  },

  // File-pattern overrides. Evaluated top-to-bottom; last match wins.
  // Override metadata merges with root (override keys win).
  // Override strategy replaces the root strategy entirely.
  "overrides": [
    {
      "pattern": "models/**/*.md",
      "strategy": { "chunk_by": "file" }
    }
  ]
}
```

Full schema: [`schemas/docs-mcp.schema.json`](schemas/docs-mcp.schema.json)

Individual files can also override their manifest via YAML frontmatter (`mcp_chunking_hint`, `metadata` keys). Frontmatter takes highest precedence. See the [manifest contract](docs/implementation/manifest_contract.md) for full resolution rules.
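
For instance, a guide could pin its own chunking and tag itself like this (the key names come from the manifest contract; the value shapes shown are illustrative):

```yaml
---
mcp_chunking_hint: h3   # overrides the manifest's chunk_by for this file only
metadata:
  scope: guide          # merged over manifest metadata; frontmatter wins
---
```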

## Architecture

Structured as a Turborepo with four packages:

| Package | Role |
|---|---|
| `@speakeasy-api/docs-mcp-cli` | CLI for validation, manifest bootstrap (`fix`), and deterministic indexing (`build`) |
| `@speakeasy-api/docs-mcp-core` | Core retrieval primitives, AST parsing, chunking, and LanceDB queries |
| `@speakeasy-api/docs-mcp-server` | Lean runtime MCP server surface |
| `@speakeasy-api/docs-mcp-eval` | Standalone evaluation and benchmarking harness |

```text
+---------------------------+
|            ...            |
+---------------------------+
              | Dynamic Tool Schema (with Enums)
              v
+---------------------------+
|      @speakeasy-api/      |
|      docs-mcp-server      |
|   search_docs, get_doc    |
+-------------+-------------+
              |
              v
+---------------------------+
|      @speakeasy-api/      |
|       docs-mcp-core       |
|      LanceDB Engine       |
|     Memory-Mapped IO      |
+-------------+-------------+
              |
              v
     +-----------------+
     | .lancedb/ index |
     +-----------------+
```

## MCP Tools

The tools exposed to the agent are dynamically generated based on your `corpus_description`.

| Tool | What it does |
|---|---|
| `search_docs` | Performs hybrid search. Tool names and descriptions are user-configurable. Parameters are dynamically generated with valid taxonomy injected as JSON Schema `enum`s. Supports stateless cursor pagination. Returns fallback hints on zero results. |
| `get_doc` | Returns a specific chunk, plus `context: N` neighboring chunks for surrounding detail. |
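
For illustration, a hypothetical `search_docs` round-trip might look like this (the field names are assumptions for this sketch, not the server's documented wire format):

```jsonc
// Agent → search_docs, with a filter chosen from the injected enums.
{ "query": "retry with exponential backoff", "language": "typescript" }

// Server → agent (abridged): ranked chunks plus an opaque pagination cursor.
{
  "results": [
    {
      "id": "guides/retries.md#usage",
      "breadcrumbs": ["Guides", "Retries", "Usage"],
      "snippet": "..."
    }
  ],
  "next_cursor": "opaque-token"
}
```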

## Quick Start
## Usage & Deployment
**1. Authoring (Local Dev)**
If you have legacy docs without chunking strategies, use the CLI locally to bootstrap a baseline `.docs-mcp.json`.
```bash
npx @speakeasy-api/docs-mcp-cli fix --docs-dir ./docs
```
**2. Indexing (CI Build Step)**
Run the deterministic indexer against your corpus. The indexer reads manifests and frontmatter to chunk the docs, generates embeddings, and saves the local `.lancedb` directory. Cache the output directory across CI runs to make builds incremental — only changed chunks are re-embedded.
```yaml
- uses: actions/cache@v4
  with:
    path: ./dist/.lancedb
    # Unique key saves the updated cache after each build
    key: docs-mcp-${{ github.run_id }}
    # Prefix match loads the most recent prior cache
    restore-keys: docs-mcp-

- run: npx @speakeasy-api/docs-mcp-cli build --docs-dir ./docs --out ./dist/.lancedb
```
**3. Runtime (MCP Server)**
The `.lancedb` directory is packaged with the MCP server. FTS search is fully local. If the index was built with embeddings, the server calls the embedding API at query time to embed the search query.
```bash
npx @speakeasy-api/docs-mcp-server --index-dir ./dist/.lancedb
```
## Evaluation
Docs MCP includes a standalone evaluation harness for measuring search quality with transparent, repeatable benchmarks. See the [Evaluation Framework](docs/eval.md) for how to build an eval suite, run benchmarks across embedding providers, and interpret results.
## License