Commit 8466c0f

Peter and claude committed
fix: resolve search_examples character fragmentation and ingestion failures
This commit fixes critical issues that prevented search_examples from returning results:

1. Fixed character fragmentation bug in code_examples.py where strings were iterated as individual characters instead of code blocks
2. Completely rewrote rustdoc_parser.py to properly handle actual rustdoc JSON structure with "index" and "paths" sections
3. Added NULL constraint protection in storage_manager.py to prevent database insertion failures
4. Added vector table sync in code_examples.py to populate vec_example_embeddings for search functionality
5. Fixed CrateService.search_examples to properly handle dictionary results and map to CodeExample model

These changes ensure reliable ingestion of Rust crate documentation with functional code example extraction and search capabilities.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
1 parent 1e07220 commit 8466c0f
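
The character fragmentation described in item 1 comes from the fact that a Python `str` is itself iterable. The sketch below reproduces the failure mode and the `isinstance` guard that the commit describes. The helper name `normalize_examples` is hypothetical and does not come from the repository; only the guard itself mirrors the fix quoted in UsefulInformation.json.

```python
from typing import Iterable, Union


def normalize_examples(examples_data: Union[str, Iterable[str], None]) -> list[str]:
    """Return code examples as a list of whole code blocks (hypothetical helper)."""
    if examples_data is None:
        return []
    # A str is iterable, so wrap it before any loop can split it into characters.
    if isinstance(examples_data, str):
        return [examples_data]
    # Drop empty entries so they are never stored as examples.
    return [block for block in examples_data if block]


snippet = "let x = 1;"
fragmented = [ch for ch in snippet]   # the bug: one entry per character
whole = normalize_examples(snippet)   # the fix: one entry per code block
```

With the guard in place, a single string and a list of strings take the same path through the rest of the pipeline.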

File tree

9 files changed: +485 -293 lines changed

Architecture.md

Lines changed: 44 additions & 18 deletions
@@ -2,7 +2,9 @@
 
 ## System Overview
 
-The docsrs-mcp server provides both REST API and Model Context Protocol (MCP) endpoints for querying Rust crate documentation using vector search. It features a service layer architecture with dual MCP implementation support, with Official Python MCP SDK 1.13.1 as the default implementation and legacy FastMCP 2.11.1 support (deprecated) through a --mcp-implementation CLI flag. The system includes comprehensive memory leak mitigation with automatic server restarts, string-first parameter handling for broad client compatibility, and transport-agnostic business services (CrateService and IngestionService) that decouple core functionality from MCP/REST layers. The architecture maintains a comprehensive asynchronous ingestion pipeline with enhanced rustdoc JSON processing, SQLite-based vector storage with intelligent caching, and dedicated code example extraction systems.
+The docsrs-mcp server provides both REST API and Model Context Protocol (MCP) endpoints for querying Rust crate documentation using vector search. It features a service layer architecture with dual MCP implementation support, with Official Python MCP SDK 1.13.1 as the default implementation and legacy FastMCP 2.11.1 support (deprecated) through a --mcp-implementation CLI flag. The system includes comprehensive memory leak mitigation with automatic server restarts, string-first parameter handling for broad client compatibility, and transport-agnostic business services (CrateService and IngestionService) that decouple core functionality from MCP/REST layers.
+
+**Recent Architecture Enhancements**: The architecture maintains a comprehensively enhanced asynchronous ingestion pipeline with a completely rewritten rustdoc JSON parser that properly handles the actual rustdoc structure, robust storage manager with NULL constraint protection, fixed code example extraction with character fragmentation protection and vector sync capabilities, and corrected service layer dictionary handling for seamless code example search functionality. These fixes ensure reliable ingestion of Rust crate documentation with proper code example extraction and searchability through dual vector search systems for both documentation and code examples.
 
 ## High-Level Architecture
 

@@ -84,7 +86,7 @@ graph TB
 graph LR
 subgraph "docsrs_mcp Package"
 subgraph "Service Layer"
-CRATE_SVC[crate_service.py<br/>CrateService class<br/>Search, documentation, versions<br/>Transport-agnostic business logic<br/>_build_module_tree() transformation method]
+CRATE_SVC[crate_service.py<br/>CrateService class<br/>FIXED: search_examples method dictionary handling<br/>Proper mapping to CodeExample model requirements<br/>Search, documentation, versions<br/>Transport-agnostic business logic<br/>_build_module_tree() transformation method]
 INGEST_SVC[ingestion_service.py<br/>IngestionService class<br/>Pipeline management<br/>Pre-ingestion control<br/>Cargo file processing]
 TYPE_NAV_SVC[type_navigation_service.py<br/>TypeNavigationService class<br/>Code intelligence operations<br/>get_item_intelligence(), search_by_safety()<br/>get_error_catalog() methods]
 MCP_RUNNER[mcp_runner.py<br/>MCPServerRunner class<br/>Memory leak mitigation<br/>1000 calls/1GB restart<br/>Process health monitoring]
@@ -124,11 +126,11 @@ graph LR
 EMBED_MGR[embedding_manager.py<br/>ONNX model lifecycle (~170 LOC)<br/>Lazy loading pattern<br/>Memory-aware batch processing<br/>Warmup during startup]
 VER_RESOLVER[version_resolver.py<br/>Version resolution (~445 LOC)<br/>Rustdoc downloading<br/>Compression support (zst, gzip, json)<br/>docs.rs redirects]
 CACHE_MGR[cache_manager.py<br/>LRU cache eviction (~270 LOC)<br/>Priority scoring for popular crates<br/>2GB size limit enforcement<br/>File mtime-based eviction]
-RUSTDOC_PARSER[rustdoc_parser.py<br/>Streaming JSON parsing (~305 LOC)<br/>ijson-based memory efficiency<br/>Module hierarchy extraction<br/>Path validation with fallback]
+RUSTDOC_PARSER[rustdoc_parser.py<br/>Enhanced JSON parsing (~305 LOC)<br/>Complete rewrite for actual rustdoc structure<br/>Extracts from "index" and "paths" sections<br/>Direct code example extraction integration<br/>Generic parameters and trait bounds parsing<br/>Module hierarchy extraction<br/>Path validation with fallback]
 SIG_EXTRACTOR[signature_extractor.py<br/>Metadata extraction (~365 LOC)<br/>Complete item extraction<br/>Macro extraction patterns<br/>Enhanced schema validation]
 INTELLIGENCE_EXTRACTOR[intelligence_extractor.py<br/>Code Intelligence Extraction<br/>Error types, safety info, feature requirements<br/>Pre-compiled regex patterns<br/>Session-based caching mechanism]
-CODE_EXAMPLES[code_examples.py<br/>Code example extraction (~343 LOC)<br/>Language detection via pygments<br/>30% confidence threshold<br/>JSON structure with metadata]
-STORAGE_MGR[storage_manager.py<br/>Batch embedding storage (~296 LOC)<br/>Transaction management<br/>Streaming batch inserts<br/>Memory-aware chunking]
+CODE_EXAMPLES[code_examples.py<br/>Code example extraction (~343 LOC)<br/>FIXED: Character fragmentation bug at lines 234-242<br/>FIXED: Vector sync step for vec_example_embeddings<br/>Language detection via pygments<br/>30% confidence threshold<br/>Batch processing for embeddings sync<br/>JSON structure with metadata]
+STORAGE_MGR[storage_manager.py<br/>Batch embedding storage (~296 LOC)<br/>FIXED: NULL constraint protection for content field<br/>Enhanced robustness with explicit NULL checks<br/>Transaction management<br/>Streaming batch inserts<br/>Memory-aware chunking]
 end
 
 ING[ingest.py<br/>Backward compatibility layer<br/>Re-exports from modular components<br/>Maintains existing API surface]
@@ -232,7 +234,7 @@ The docsrs-mcp server implements a service layer pattern that decouples business
 
 #### Core Services
 
-- **CrateService**: Handles all crate-related operations including search, documentation retrieval, and version management. **Phase 2 Enhancement**: Automatically populates `is_stdlib` and `is_dependency` fields in SearchResult and GetItemDocResponse models using DependencyFilter integration. **Critical Fix Applied**: Implements `_build_module_tree()` helper method that transforms flat database results into hierarchical ModuleTreeNode structures, resolving Pydantic validation errors by properly fulfilling service layer data transformation responsibility.
+- **CrateService**: Handles all crate-related operations including search, documentation retrieval, and version management. **Phase 2 Enhancement**: Automatically populates `is_stdlib` and `is_dependency` fields in SearchResult and GetItemDocResponse models using DependencyFilter integration. **Critical Fix Applied**: Implements `_build_module_tree()` helper method that transforms flat database results into hierarchical ModuleTreeNode structures, resolving Pydantic validation errors by properly fulfilling service layer data transformation responsibility. **Service Layer Fix**: Fixed search_examples method to properly handle dictionary results from search_example_embeddings, correctly mapping fields to CodeExample model requirements.
 - **IngestionService**: Manages the complete ingestion pipeline, pre-ingestion workflows, and cargo file processing
 - **CrossReferenceService**: **Phase 6 Enhancement**: Provides advanced cross-reference operations including import resolution, dependency graph analysis, migration suggestions, and re-export tracing. Implements circuit breaker pattern for resilience, LRU cache with 5-minute TTL for performance, and DFS algorithms for cycle detection in dependency graphs.
 - **Transport Layer Decoupling**: Business logic is independent of whether accessed via MCP or REST
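
The service layer fix above maps dictionary search results onto the CodeExample model. The diff does not show that model or the result keys, so the sketch below uses a stand-in dataclass with hypothetical fields (`code`, `language`, `item_path`, `score`) purely to illustrate explicit key-by-key mapping with defaults. It is not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CodeExample:
    # Hypothetical fields; the real model is not visible in this commit.
    code: str
    language: str
    item_path: str
    score: float


def to_code_example(row: dict[str, Any]) -> CodeExample:
    """Map one dictionary result to the model, supplying defaults for gaps."""
    return CodeExample(
        code=row.get("code") or "",
        language=row.get("language") or "rust",
        item_path=row.get("item_path") or "",
        score=float(row.get("score") or 0.0),
    )


results = [
    {"code": "fn main() {}", "language": "rust", "item_path": "demo::main", "score": 0.91},
    {"code": "let v = vec![1];", "score": None},  # missing and null fields
]
examples = [to_code_example(r) for r in results]
```

Reading keys by name keeps the mapping independent of column order, which positional tuple unpacking would not.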
@@ -4703,8 +4705,12 @@ The ingestion pipeline has been refactored from a monolithic 3609-line `ingest.p
 - File mtime-based eviction strategies
 - Cache health monitoring and cleanup operations
 
-**rustdoc_parser.py (~305 LOC)**
-- Streaming JSON parsing using ijson for memory efficiency
+**rustdoc_parser.py (~305 LOC)** - MAJOR OVERHAUL COMPLETED
+- **FIXED**: Complete rewrite of parse_rustdoc_items_streaming function for actual rustdoc JSON structure
+- **CHANGED**: Replaced ijson streaming parser with regular json.loads for better compatibility
+- **ENHANCED**: Direct extraction from "index" and "paths" sections of rustdoc JSON
+- **ADDED**: Integrated code example extraction using extract_code_examples during parsing
+- **ADDED**: Extraction of generic parameters and trait bounds from inner structure
 - Module hierarchy extraction with parent-child relationships
 - Path validation with fallback generation for robustness
 - Progressive item streaming using generator-based architecture
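
As a rough illustration of reading the "index" and "paths" sections with `json.loads`, the sketch below parses a heavily trimmed rustdoc-style document. The sample data and the `parse_items` helper are assumptions made for this example. The rewritten `parse_rustdoc_items_streaming` is not shown in this diff and may differ substantially.

```python
import json

# Trimmed, hand-written sample in the shape of rustdoc JSON (assumption).
RUSTDOC_JSON = """
{
  "root": "0",
  "index": {
    "0": {"name": "demo", "docs": "Crate root.", "inner": {"module": {"items": ["1"]}}},
    "1": {"name": "add", "docs": "Adds two numbers.", "inner": {"function": {}}}
  },
  "paths": {
    "0": {"crate_id": 0, "path": ["demo"], "kind": "module"},
    "1": {"crate_id": 0, "path": ["demo", "add"], "kind": "function"}
  }
}
"""


def parse_items(raw: str) -> list[dict]:
    """Join each "index" entry with its "paths" entry by item id."""
    doc = json.loads(raw)
    index = doc.get("index", {})
    paths = doc.get("paths", {})
    items = []
    for item_id, item in index.items():
        path_info = paths.get(item_id, {})
        # Fall back to the bare item name when no path entry exists.
        segments = path_info.get("path") or [item.get("name") or "unknown"]
        items.append({
            "id": item_id,
            "path": "::".join(segments),
            "kind": path_info.get("kind", "unknown"),
            "doc": item.get("docs") or "",
        })
    return items


items = parse_items(RUSTDOC_JSON)
```

The fallback on a missing path entry corresponds to the "path validation with fallback generation" bullet, in simplified form.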
@@ -4715,16 +4721,20 @@ The ingestion pipeline has been refactored from a monolithic 3609-line `ingest.p
 - Schema validation with tier-aware different MIN_ITEMS_THRESHOLD values
 - Cross-reference extraction from links fields
 
-**code_examples.py (~343 LOC)** - ENHANCED WITH BUG FIXES
+**code_examples.py (~343 LOC)** - ENHANCED WITH CRITICAL FIXES
 - Code example extraction with structured JSON metadata
 - Language detection via pygments with 30% confidence threshold
 - SHA256-based deduplication using 16-character prefix
-- **FIXED**: Character fragmentation bug in `generate_example_embeddings()` - prevented string iteration that caused character-by-character storage
+- **FIXED**: Character fragmentation bug at lines 234-242 where examples were treated as individual characters
 - **FIXED**: SQL column name mismatch - using 'id' instead of 'item_id' for database alignment
+- **ADDED**: Critical sync step in generate_example_embeddings to populate vec_example_embeddings virtual table
+- **ENHANCED**: Batch processing for efficiency when syncing to vector table
 - Example processing integrated into description_only fallback tier (Tier 4) for comprehensive coverage
 - Regression testing coverage for both stdlib and normal crates
 
-**storage_manager.py (~296 LOC)**
+**storage_manager.py (~296 LOC)** - ROBUSTNESS ENHANCED
+- **FIXED**: Added NULL constraint protection for content field in _store_batch function
+- **ENHANCED**: Explicit NULL checks and fallbacks to ensure content is never None
 - Batch embedding storage with transaction management
 - Streaming batch inserts with memory-aware chunking (size=999)
 - Enhanced transaction management with retry logic
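
The NULL constraint protection can be demonstrated with the standard `sqlite3` module. The fallback logic below follows the snippet recorded in UsefulInformation.json (`chunk.get("doc", "")` followed by an explicit `None` check). The `embeddings` table, its columns, and the `safe_content` helper are hypothetical stand-ins, not the project's schema.

```python
import sqlite3


def safe_content(chunk: dict) -> str:
    """Return text that satisfies a NOT NULL column, never None."""
    content = chunk.get("doc", "")
    if content is None:  # key present but explicitly null in the source data
        content = ""
    return content


conn = sqlite3.connect(":memory:")
# Hypothetical table; the NOT NULL constraint is what the fallback protects.
conn.execute(
    "CREATE TABLE embeddings (item_path TEXT NOT NULL, content TEXT NOT NULL)"
)

chunks = [
    {"item_path": "demo::add", "doc": "Adds two numbers."},
    {"item_path": "demo::sub", "doc": None},  # would violate NOT NULL unguarded
    {"item_path": "demo::mul"},               # key missing entirely
]
conn.executemany(
    "INSERT INTO embeddings (item_path, content) VALUES (?, ?)",
    [(c["item_path"], safe_content(c)) for c in chunks],
)
stored = conn.execute("SELECT content FROM embeddings ORDER BY rowid").fetchall()
```

Note that `dict.get` with a default only covers the missing-key case. The separate `None` check is what handles a key that is present with a null value.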
@@ -4746,6 +4756,21 @@ The main `ingest.py` file now serves as a compatibility layer:
 - **Code Clarity**: LOC per module stays under 500 for better comprehension
 - **Parallel Development**: Multiple developers can work on different modules simultaneously
 
+#### Recent Architectural Fixes and Enhancements
+
+**Critical Issues Resolved**:
+- **Rustdoc Parser Overhaul**: Complete rewrite to handle actual rustdoc JSON structure with proper extraction from "index" and "paths" sections
+- **Character Fragmentation Bug**: Fixed critical bug in code_examples.py where examples were treated as individual characters instead of whole code blocks
+- **Storage Robustness**: Added NULL constraint protection in storage_manager.py to prevent database integrity issues
+- **Vector Search Sync**: Added missing vector table sync step for example embeddings to enable semantic search
+- **Service Layer Fix**: Corrected dictionary handling in CrateService.search_examples method for proper model mapping
+
+**Architecture Impact**:
+- Robust ingestion pipeline handles actual rustdoc JSON format correctly
+- Code example extraction and search functionality fully operational
+- Database integrity maintained through comprehensive NULL handling
+- Dual vector search capability for both documentation and code examples
+
 #### Module Dependencies and Data Flow
 
 The modular architecture follows a clear dependency graph with the orchestrator coordinating all operations:
@@ -4757,16 +4782,17 @@ ingest_orchestrator.py (Main Controller)
 └── embedding_manager.py → storage_manager.py
 ```
 
-**Ingestion Flow Through Modules**:
+**Enhanced Ingestion Flow Through Modules** (Updated with Recent Fixes):
 1. **Orchestrator** receives ingestion request and determines tier strategy
 2. **Version Resolver** downloads rustdoc JSON, consulting **Cache Manager** for existing files
-3. **Rustdoc Parser** streams JSON content using ijson for memory efficiency
-4. **Signature Extractor** processes each item, extracting metadata and cross-references
+3. **Rustdoc Parser** (FIXED) processes actual rustdoc JSON structure from "index" and "paths" sections, extracting code examples directly during parsing
+4. **Signature Extractor** processes each item, extracting metadata, cross-references, and generic parameters
 5. **Intelligence Extractor** analyzes signatures and docs for error types, safety info, and features
-6. **Code Examples** extractor processes documentation for example code blocks (including description_only fallback tier)
-7. **Embedding Manager** generates vectors for processed items in adaptive batches
-8. **Storage Manager** handles transactional database operations with streaming inserts
-9. **Orchestrator** coordinates error handling and tier fallback across all modules
+6. **Code Examples** extractor (FIXED) processes documentation with character fragmentation protection and proper vector sync
+7. **Embedding Manager** generates vectors for both items and examples in adaptive batches
+8. **Storage Manager** (FIXED) handles transactional database operations with NULL constraint protection and streaming inserts
+9. **Vector Sync Step** (NEW) populates vec_example_embeddings virtual table for efficient similarity search
+10. **Orchestrator** coordinates error handling and tier fallback across all modules
 
 ### Ingestion Layer Details
 
UsefulInformation.json

Lines changed: 21 additions & 1 deletion
@@ -1,6 +1,6 @@
 {
 "projectName": "docsrs-mcp",
-"lastUpdated": "2025-09-03",
+"lastUpdated": "2025-09-04",
 "purpose": "Track errors, solutions, and lessons learned during development",
 "categories": {
 "errorSolutions": {
@@ -2759,6 +2759,26 @@
 "codeExample": "# Added to VersionInfo model:\nis_latest: bool = Field(False, description=\"Whether this is the latest version of the crate\")\n\n# Added to ListVersionsResponse model:\nlatest: str | None = Field(None, description=\"Latest version string\")\n\n# Root cause - service returned extra fields:\nresponse_data = {\n 'versions': [...],\n 'latest': '1.2.3', # This field was missing from model\n # ... other fields\n}\n\n# Each VersionInfo contained:\n{\n 'version': '1.0.0',\n 'is_latest': False, # This field was missing from model\n # ... other fields\n}",
 "debuggingTechnique": "Check service layer output fields against Pydantic model definitions when using strict validation to identify missing fields",
 "impact": "Fixed list_versions endpoint validation errors and ensured proper response structure for version listing functionality"
+},
+{
+"error": "search_examples Character Fragmentation Bug - Empty results despite successful ingestion with multiple cascading failures",
+"rootCause": "Critical character fragmentation bug in code_examples.py:234-242 where string examples_data was iterated over individual characters instead of being treated as complete code blocks. This caused code examples to be stored as single characters rather than complete code blocks. Additional issues: Rustdoc parser wasn't correctly handling JSON structure, NULL constraint failures prevented data storage, and vector table sync was missing.",
+"solution": "Fixed character fragmentation with type checking: if isinstance(examples_data, str): examples_data = [examples_data]. Rewrote rustdoc parser to use json.loads instead of ijson for proper JSON handling. Added NULL constraint protection in storage_manager.py with content fallbacks. Added vector table sync to vec_example_embeddings. Updated service layer to handle dictionary results correctly.",
+"context": "search_examples feature returning empty results even after successful ingestion due to multiple cascading issues in ingestion, parsing, and storage layers",
+"lesson": "Always validate data types before iteration - strings can be mistaken for iterables. Virtual tables in SQLite (vec0) require manual sync after data insertion. NULL constraints need explicit handling with fallbacks.",
+"pattern": "Use isinstance() checks before iteration to prevent character fragmentation. Always sync virtual tables after embedding insertion. Provide fallback values for required database fields.",
+"dateEncountered": "2025-09-04",
+"relatedFiles": ["src/docsrs_mcp/ingestion/code_examples.py", "src/docsrs_mcp/ingestion/rustdoc_parser.py", "src/docsrs_mcp/storage/storage_manager.py", "src/docsrs_mcp/services/crate_service.py"],
+"codeExample": "# CRITICAL BUG FIX: Handle string input - wrap in list to prevent character iteration\nif isinstance(examples_data, str):\n examples_data = [examples_data]\n\n# NULL constraint protection:\ncontent = chunk.get(\"doc\", \"\")\nif content is None:\n content = \"\"\n\n# Vector table sync:\nawait db.executemany(\n \"INSERT INTO vec_example_embeddings(rowid, example_embedding) VALUES (?, ?)\",\n vec_data\n)",
+"debuggingTechnique": "Test with MCP client: uv run python test_mcp_string_k.py. Manual ingestion: asyncio.run(ingest_crate('serde', 'latest')). Check for character-level storage vs complete code blocks in search results.",
+"testingConfirmed": [
+"Fixed character fragmentation - no more single characters as code examples",
+"Search_examples now returns proper code blocks with 5 examples for 'serialize struct'",
+"Vector search working correctly with 42 synced entries",
+"NULL constraint failures resolved with proper fallbacks"
+],
+"preventionStrategy": "Always use isinstance() checks before iterating over data that could be string or list. Implement comprehensive integration tests that verify end-to-end functionality including data storage format.",
+"additionalBugFixed": "Rustdoc JSON parser rewritten to correctly extract from 'index' and 'paths' sections, service layer updated to handle dictionary results from embedding search"
 }
 ]
 },

src/docsrs_mcp/ingestion/code_examples.py

Lines changed: 30 additions & 0 deletions
@@ -334,6 +334,36 @@ async def generate_example_embeddings(
         )
 
     await db.commit()
+
+    # Sync to vec_example_embeddings virtual table
+    logger.info("Syncing example embeddings to vector index...")
+
+    # Populate the vector table from example_embeddings
+    cursor = await db.execute("""
+        SELECT id, embedding FROM example_embeddings
+    """)
+
+    vec_data = []
+    async for row in cursor:
+        rowid, embedding_blob = row
+        vec_data.append((rowid, embedding_blob))
+
+        # Process in batches for efficiency
+        if len(vec_data) >= 100:
+            await db.executemany(
+                "INSERT INTO vec_example_embeddings(rowid, example_embedding) VALUES (?, ?)",
+                vec_data
+            )
+            vec_data = []
+
+    # Insert remaining data
+    if vec_data:
+        await db.executemany(
+            "INSERT INTO vec_example_embeddings(rowid, example_embedding) VALUES (?, ?)",
+            vec_data
+        )
+
+    await db.commit()
 
     logger.info(
         f"Successfully generated embeddings for {len(all_examples)} examples"
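
The batching pattern in this hunk (accumulate rows, flush every 100, then flush the remainder) can be tried in isolation with the synchronous `sqlite3` module. In the sketch below a plain table stands in for the vec0 virtual table, since the vector extension is assumed not to be loaded here, and the `batched` helper is hypothetical. It is an illustration of the flush logic only, not a replacement for the committed code.

```python
import sqlite3
from typing import Iterable, Iterator


def batched(rows: Iterable[tuple], size: int) -> Iterator[list[tuple]]:
    """Yield lists of at most `size` rows, including the final partial batch."""
    batch: list[tuple] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_embeddings (id INTEGER PRIMARY KEY, embedding BLOB)")
# Plain table standing in for the vec0 virtual table (assumption for this sketch).
conn.execute("CREATE TABLE vec_example_embeddings (example_embedding BLOB)")
conn.executemany(
    "INSERT INTO example_embeddings (id, embedding) VALUES (?, ?)",
    [(i, bytes([i % 256]) * 4) for i in range(1, 251)],
)

# Read all source rows first, then write them out in batches of 100.
source_rows = conn.execute("SELECT id, embedding FROM example_embeddings").fetchall()
batch_sizes = []
for batch in batched(source_rows, 100):
    conn.executemany(
        "INSERT INTO vec_example_embeddings(rowid, example_embedding) VALUES (?, ?)",
        batch,
    )
    batch_sizes.append(len(batch))
conn.commit()
synced = conn.execute("SELECT COUNT(*) FROM vec_example_embeddings").fetchone()[0]
```

Pulling the flush logic into a generator removes the duplicated `executemany` call for the trailing partial batch. Whether rerunning the sync against an already populated vector table is safe depends on how that table handles existing rowids, which this sketch does not address.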
