fix: resolve search_examples character fragmentation and ingestion failures
This commit fixes critical issues that prevented search_examples from returning results:
1. Fixed character fragmentation bug in code_examples.py where strings were
iterated as individual characters instead of code blocks
2. Completely rewrote rustdoc_parser.py to properly handle actual rustdoc
JSON structure with "index" and "paths" sections
3. Added NULL constraint protection in storage_manager.py to prevent
database insertion failures
4. Added vector table sync in code_examples.py to populate
vec_example_embeddings for search functionality
5. Fixed CrateService.search_examples to properly handle dictionary
results and map to CodeExample model
These changes ensure reliable ingestion of Rust crate documentation with
functional code example extraction and search capabilities.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
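The character fragmentation fix in item 1 can be illustrated with a minimal sketch. The `isinstance` guard mirrors the one recorded in this commit's UsefulInformation.json entry. The function name `normalize_examples`, the `None` handling, and the blank-entry filter are assumptions made for illustration and are not taken from code_examples.py.

```python
from typing import Iterable, List, Union


def normalize_examples(examples_data: Union[str, Iterable[str], None]) -> List[str]:
    """Return code examples as a list of whole code blocks (hypothetical helper)."""
    if examples_data is None:
        return []
    # A bare string is itself iterable, so looping over it yields characters.
    # Wrapping it in a list keeps the code block intact.
    if isinstance(examples_data, str):
        examples_data = [examples_data]
    # Dropping blank entries is an extra assumption of this sketch.
    return [block for block in examples_data if block and block.strip()]


# What unguarded iteration over a string produces: one entry per character.
fragmented = list("fn main() {}")
# The guarded result: a single complete code block.
whole = normalize_examples("fn main() {}")
```

Here `fragmented` holds twelve one-character strings, which is the storage pattern the commit describes, while `whole` holds the single original block.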
Architecture.md
44 additions & 18 deletions
@@ -2,7 +2,9 @@
2
2
3
3
## System Overview
4
4
5
-
The docsrs-mcp server provides both REST API and Model Context Protocol (MCP) endpoints for querying Rust crate documentation using vector search. It features a service layer architecture with dual MCP implementation support, with Official Python MCP SDK 1.13.1 as the default implementation and legacy FastMCP 2.11.1 support (deprecated) through a --mcp-implementation CLI flag. The system includes comprehensive memory leak mitigation with automatic server restarts, string-first parameter handling for broad client compatibility, and transport-agnostic business services (CrateService and IngestionService) that decouple core functionality from MCP/REST layers. The architecture maintains a comprehensive asynchronous ingestion pipeline with enhanced rustdoc JSON processing, SQLite-based vector storage with intelligent caching, and dedicated code example extraction systems.
5
+
The docsrs-mcp server provides both REST API and Model Context Protocol (MCP) endpoints for querying Rust crate documentation using vector search. It features a service layer architecture with dual MCP implementation support, with Official Python MCP SDK 1.13.1 as the default implementation and legacy FastMCP 2.11.1 support (deprecated) through a --mcp-implementation CLI flag. The system includes comprehensive memory leak mitigation with automatic server restarts, string-first parameter handling for broad client compatibility, and transport-agnostic business services (CrateService and IngestionService) that decouple core functionality from MCP/REST layers.
6
+
7
+
**Recent Architecture Enhancements**: The architecture maintains a comprehensively enhanced asynchronous ingestion pipeline with a completely rewritten rustdoc JSON parser that properly handles the actual rustdoc structure, robust storage manager with NULL constraint protection, fixed code example extraction with character fragmentation protection and vector sync capabilities, and corrected service layer dictionary handling for seamless code example search functionality. These fixes ensure reliable ingestion of Rust crate documentation with proper code example extraction and searchability through dual vector search systems for both documentation and code examples.
6
8
7
9
## High-Level Architecture
8
10
@@ -84,7 +86,7 @@ graph TB
84
86
graph LR
85
87
subgraph "docsrs_mcp Package"
86
88
subgraph "Service Layer"
87
-
CRATE_SVC[crate_service.py<br/>CrateService class<br/>Search, documentation, versions<br/>Transport-agnostic business logic<br/>_build_module_tree() transformation method]
89
+
CRATE_SVC[crate_service.py<br/>CrateService class<br/>FIXED: search_examples method dictionary handling<br/>Proper mapping to CodeExample model requirements<br/>Search, documentation, versions<br/>Transport-agnostic business logic<br/>_build_module_tree() transformation method]
RUSTDOC_PARSER[rustdoc_parser.py<br/>Enhanced JSON parsing (~305 LOC)<br/>Complete rewrite for actual rustdoc structure<br/>Extracts from "index" and "paths" sections<br/>Direct code example extraction integration<br/>Generic parameters and trait bounds parsing<br/>Module hierarchy extraction<br/>Path validation with fallback]
CODE_EXAMPLES[code_examples.py<br/>Code example extraction (~343 LOC)<br/>Language detection via pygments<br/>30% confidence threshold<br/>JSON structure with metadata]
CODE_EXAMPLES[code_examples.py<br/>Code example extraction (~343 LOC)<br/>FIXED: Character fragmentation bug at lines 234-242<br/>FIXED: Vector sync step for vec_example_embeddings<br/>Language detection via pygments<br/>30% confidence threshold<br/>Batch processing for embeddings sync<br/>JSON structure with metadata]
133
+
STORAGE_MGR[storage_manager.py<br/>Batch embedding storage (~296 LOC)<br/>FIXED: NULL constraint protection for content field<br/>Enhanced robustness with explicit NULL checks<br/>Transaction management<br/>Streaming batch inserts<br/>Memory-aware chunking]
132
134
end
133
135
134
136
ING[ingest.py<br/>Backward compatibility layer<br/>Re-exports from modular components<br/>Maintains existing API surface]
@@ -232,7 +234,7 @@ The docsrs-mcp server implements a service layer pattern that decouples business
232
234
233
235
#### Core Services
234
236
235
-
-**CrateService**: Handles all crate-related operations including search, documentation retrieval, and version management. **Phase 2 Enhancement**: Automatically populates `is_stdlib` and `is_dependency` fields in SearchResult and GetItemDocResponse models using DependencyFilter integration. **Critical Fix Applied**: Implements `_build_module_tree()` helper method that transforms flat database results into hierarchical ModuleTreeNode structures, resolving Pydantic validation errors by properly fulfilling service layer data transformation responsibility.
237
+
-**CrateService**: Handles all crate-related operations including search, documentation retrieval, and version management. **Phase 2 Enhancement**: Automatically populates `is_stdlib` and `is_dependency` fields in SearchResult and GetItemDocResponse models using DependencyFilter integration. **Critical Fix Applied**: Implements `_build_module_tree()` helper method that transforms flat database results into hierarchical ModuleTreeNode structures, resolving Pydantic validation errors by properly fulfilling service layer data transformation responsibility. **Service Layer Fix**: Fixed search_examples method to properly handle dictionary results from search_example_embeddings, correctly mapping fields to CodeExample model requirements.
236
238
-**IngestionService**: Manages the complete ingestion pipeline, pre-ingestion workflows, and cargo file processing
237
239
-**CrossReferenceService**: **Phase 6 Enhancement**: Provides advanced cross-reference operations including import resolution, dependency graph analysis, migration suggestions, and re-export tracing. Implements circuit breaker pattern for resilience, LRU cache with 5-minute TTL for performance, and DFS algorithms for cycle detection in dependency graphs.
238
240
-**Transport Layer Decoupling**: Business logic is independent of whether accessed via MCP or REST
@@ -4703,8 +4705,12 @@ The ingestion pipeline has been refactored from a monolithic 3609-line `ingest.p
4703
4705
- File mtime-based eviction strategies
4704
4706
- Cache health monitoring and cleanup operations
4705
4707
4706
-
**rustdoc_parser.py (~305 LOC)**
4707
-
- Streaming JSON parsing using ijson for memory efficiency
4708
+
**rustdoc_parser.py (~305 LOC)** - MAJOR OVERHAUL COMPLETED
4709
+
-**FIXED**: Complete rewrite of parse_rustdoc_items_streaming function for actual rustdoc JSON structure
4710
+
-**CHANGED**: Replaced ijson streaming parser with regular json.loads for better compatibility
4711
+
-**ENHANCED**: Direct extraction from "index" and "paths" sections of rustdoc JSON
4712
+
-**ADDED**: Integrated code example extraction using extract_code_examples during parsing
4713
+
-**ADDED**: Extraction of generic parameters and trait bounds from inner structure
4708
4714
- Module hierarchy extraction with parent-child relationships
4709
4715
- Path validation with fallback generation for robustness
4710
4716
- Progressive item streaming using generator-based architecture
@@ -4715,16 +4721,20 @@ The ingestion pipeline has been refactored from a monolithic 3609-line `ingest.p
4715
4721
- Schema validation with tier-aware different MIN_ITEMS_THRESHOLD values
4716
4722
- Cross-reference extraction from links fields
4717
4723
4718
-
**code_examples.py (~343 LOC)** - ENHANCED WITH BUG FIXES
4724
+
**code_examples.py (~343 LOC)** - ENHANCED WITH CRITICAL FIXES
4719
4725
- Code example extraction with structured JSON metadata
4720
4726
- Language detection via pygments with 30% confidence threshold
4721
4727
- SHA256-based deduplication using 16-character prefix
4722
-
-**FIXED**: Character fragmentation bug in `generate_example_embeddings()` - prevented string iteration that caused character-by-character storage
4728
+
-**FIXED**: Character fragmentation bug at lines 234-242 where examples were treated as individual characters
4723
4729
-**FIXED**: SQL column name mismatch - using 'id' instead of 'item_id' for database alignment
4730
+
-**ADDED**: Critical sync step in generate_example_embeddings to populate vec_example_embeddings virtual table
4731
+
-**ENHANCED**: Batch processing for efficiency when syncing to vector table
4724
4732
- Example processing integrated into description_only fallback tier (Tier 4) for comprehensive coverage
4725
4733
- Regression testing coverage for both stdlib and normal crates
-**FIXED**: Added NULL constraint protection for content field in _store_batch function
4737
+
-**ENHANCED**: Explicit NULL checks and fallbacks to ensure content is never None
4728
4738
- Batch embedding storage with transaction management
4729
4739
- Streaming batch inserts with memory-aware chunking (size=999)
4730
4740
- Enhanced transaction management with retry logic
@@ -4746,6 +4756,21 @@ The main `ingest.py` file now serves as a compatibility layer:
4746
4756
-**Code Clarity**: LOC per module stays under 500 for better comprehension
4747
4757
-**Parallel Development**: Multiple developers can work on different modules simultaneously
4748
4758
4759
+
#### Recent Architectural Fixes and Enhancements
4760
+
4761
+
**Critical Issues Resolved**:
4762
+
-**Rustdoc Parser Overhaul**: Complete rewrite to handle actual rustdoc JSON structure with proper extraction from "index" and "paths" sections
4763
+
-**Character Fragmentation Bug**: Fixed critical bug in code_examples.py where examples were treated as individual characters instead of whole code blocks
4764
+
-**Storage Robustness**: Added NULL constraint protection in storage_manager.py to prevent database integrity issues
4765
+
-**Vector Search Sync**: Added missing vector table sync step for example embeddings to enable semantic search
4766
+
-**Service Layer Fix**: Corrected dictionary handling in CrateService.search_examples method for proper model mapping
4767
+
4768
+
**Architecture Impact**:
4769
+
- Robust ingestion pipeline handles actual rustdoc JSON format correctly
4770
+
- Code example extraction and search functionality fully operational
4771
+
- Database integrity maintained through comprehensive NULL handling
4772
+
- Dual vector search capability for both documentation and code examples
4773
+
4749
4774
#### Module Dependencies and Data Flow
4750
4775
4751
4776
The modular architecture follows a clear dependency graph with the orchestrator coordinating all operations:
3.**Rustdoc Parser** streams JSON content using ijson for memory efficiency
4764
-
4.**Signature Extractor** processes each item, extracting metadata and cross-references
4788
+
3.**Rustdoc Parser** (FIXED) processes actual rustdoc JSON structure from "index" and "paths" sections, extracting code examples directly during parsing
4789
+
4.**Signature Extractor** processes each item, extracting metadata, cross-references, and generic parameters
4765
4790
5.**Intelligence Extractor** analyzes signatures and docs for error types, safety info, and features
4766
-
6.**Code Examples** extractor processes documentation for example code blocks (including description_only fallback tier)
4767
-
7.**Embedding Manager** generates vectors for processed items in adaptive batches
4768
-
8.**Storage Manager** handles transactional database operations with streaming inserts
4769
-
9.**Orchestrator** coordinates error handling and tier fallback across all modules
4791
+
6.**Code Examples** extractor (FIXED) processes documentation with character fragmentation protection and proper vector sync
4792
+
7.**Embedding Manager** generates vectors for both items and examples in adaptive batches
4793
+
8.**Storage Manager** (FIXED) handles transactional database operations with NULL constraint protection and streaming inserts
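The parser change described in this diff (json.loads in place of ijson, reading the "index" and "paths" sections) could look roughly like the sketch below. It is not the contents of rustdoc_parser.py. The function name `iter_rustdoc_items`, the shape of the yielded dictionaries, and the sample data are hypothetical. The sketch assumes that "index" maps item ids to entries carrying `name` and `docs`, and that "paths" maps the same ids to entries carrying `path` and `kind`.

```python
import json
from typing import Any, Dict, Iterator


def iter_rustdoc_items(raw_json: str) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per entry in the rustdoc "index" section."""
    doc = json.loads(raw_json)  # whole-document parse, no streaming
    index = doc.get("index", {})
    paths = doc.get("paths", {})
    for item_id, item in index.items():
        path_info = paths.get(item_id, {})
        segments = path_info.get("path") or []
        name = item.get("name")
        # Path validation with fallback: prefer the "paths" entry,
        # otherwise fall back to the bare item name.
        if segments:
            item_path = "::".join(segments)
        elif name:
            item_path = name
        else:
            continue  # nothing usable to identify this item
        yield {
            "id": item_id,
            "path": item_path,
            "kind": path_info.get("kind", "unknown"),
            "docs": item.get("docs") or "",  # never propagate None downstream
        }


# Made-up sample data, only shaped like rustdoc JSON.
SAMPLE = json.dumps({
    "index": {
        "0:1": {"name": "Serialize", "docs": "A data structure that can be serialized."},
        "0:2": {"name": "helper", "docs": None},
    },
    "paths": {
        "0:1": {"crate_id": 0, "path": ["serde", "Serialize"], "kind": "trait"},
    },
})

items = list(iter_rustdoc_items(SAMPLE))
```

The generator keeps the progressive, item-by-item interface mentioned in the module notes even though the document is loaded in one call. The trade-off is higher peak memory on large crates in exchange for simpler handling of the two sections.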
UsefulInformation.json
21 additions & 1 deletion
@@ -1,6 +1,6 @@
1
1
{
2
2
"projectName": "docsrs-mcp",
3
-
"lastUpdated": "2025-09-03",
3
+
"lastUpdated": "2025-09-04",
4
4
"purpose": "Track errors, solutions, and lessons learned during development",
5
5
"categories": {
6
6
"errorSolutions": {
@@ -2759,6 +2759,26 @@
2759
2759
"codeExample": "# Added to VersionInfo model:\nis_latest: bool = Field(False, description=\"Whether this is the latest version of the crate\")\n\n# Added to ListVersionsResponse model:\nlatest: str | None = Field(None, description=\"Latest version string\")\n\n# Root cause - service returned extra fields:\nresponse_data = {\n 'versions': [...],\n 'latest': '1.2.3', # This field was missing from model\n # ... other fields\n}\n\n# Each VersionInfo contained:\n{\n 'version': '1.0.0',\n 'is_latest': False, # This field was missing from model\n # ... other fields\n}",
2760
2760
"debuggingTechnique": "Check service layer output fields against Pydantic model definitions when using strict validation to identify missing fields",
2761
2761
"impact": "Fixed list_versions endpoint validation errors and ensured proper response structure for version listing functionality"
2762
+
},
2763
+
{
2764
+
"error": "search_examples Character Fragmentation Bug - Empty results despite successful ingestion with multiple cascading failures",
2765
+
"rootCause": "Critical character fragmentation bug in code_examples.py:234-242 where string examples_data was iterated over individual characters instead of being treated as complete code blocks. This caused code examples to be stored as single characters rather than complete code blocks. Additional issues: Rustdoc parser wasn't correctly handling JSON structure, NULL constraint failures prevented data storage, and vector table sync was missing.",
2766
+
"solution": "Fixed character fragmentation with type checking: if isinstance(examples_data, str): examples_data = [examples_data]. Rewrote rustdoc parser to use json.loads instead of ijson for proper JSON handling. Added NULL constraint protection in storage_manager.py with content fallbacks. Added vector table sync to vec_example_embeddings. Updated service layer to handle dictionary results correctly.",
2767
+
"context": "search_examples feature returning empty results even after successful ingestion due to multiple cascading issues in ingestion, parsing, and storage layers",
2768
+
"lesson": "Always validate data types before iteration - strings can be mistaken for iterables. Virtual tables in SQLite (vec0) require manual sync after data insertion. NULL constraints need explicit handling with fallbacks.",
2769
+
"pattern": "Use isinstance() checks before iteration to prevent character fragmentation. Always sync virtual tables after embedding insertion. Provide fallback values for required database fields.",
"codeExample": "# CRITICAL BUG FIX: Handle string input - wrap in list to prevent character iteration\nif isinstance(examples_data, str):\n examples_data = [examples_data]\n\n# NULL constraint protection:\ncontent = chunk.get(\"doc\", \"\")\nif content is None:\n content = \"\"\n\n# Vector table sync:\nawait db.executemany(\n\"INSERT INTO vec_example_embeddings(rowid, example_embedding) VALUES (?, ?)\",\n vec_data\n)",
2773
+
"debuggingTechnique": "Test with MCP client: uv run python test_mcp_string_k.py. Manual ingestion: asyncio.run(ingest_crate('serde', 'latest')). Check for character-level storage vs complete code blocks in search results.",
2774
+
"testingConfirmed": [
2775
+
"Fixed character fragmentation - no more single characters as code examples",
2776
+
"Search_examples now returns proper code blocks with 5 examples for 'serialize struct'",
2777
+
"Vector search working correctly with 42 synced entries",
2778
+
"NULL constraint failures resolved with proper fallbacks"
2779
+
],
2780
+
"preventionStrategy": "Always use isinstance() checks before iterating over data that could be string or list. Implement comprehensive integration tests that verify end-to-end functionality including data storage format.",
2781
+
"additionalBugFixed": "Rustdoc JSON parser rewritten to correctly extract from 'index' and 'paths' sections, service layer updated to handle dictionary results from embedding search"
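The NULL constraint protection recorded in the entry above can be reproduced in isolation with the standard-library sqlite3 module. The fallback on `chunk.get("doc", "")` follows the snippet in the entry. The table name `embeddings`, its two columns, and the function name `store_batch` are assumptions for this sketch, not the project's schema. The vec_example_embeddings sync is left out because it needs the vector extension to be loaded.

```python
import sqlite3
from typing import Any, Dict, List


def store_batch(conn: sqlite3.Connection, chunks: List[Dict[str, Any]]) -> int:
    """Insert chunks, substituting "" for missing or None content (hypothetical helper)."""
    rows = []
    for chunk in chunks:
        content = chunk.get("doc", "")
        # The key may exist with an explicit None value, which .get() does not catch.
        if content is None:
            content = ""
        rows.append((chunk["item_path"], content))
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany(
            "INSERT INTO embeddings (item_path, content) VALUES (?, ?)", rows
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE embeddings (item_path TEXT NOT NULL, content TEXT NOT NULL)"
)
stored = store_batch(conn, [
    {"item_path": "serde::Serialize", "doc": "Trait docs"},
    {"item_path": "serde::de", "doc": None},  # explicit None
    {"item_path": "serde::ser"},              # key missing entirely
])
contents = [
    row[0] for row in conn.execute("SELECT content FROM embeddings ORDER BY rowid")
]
```

Without the `is None` check, the second chunk would violate the NOT NULL constraint and the whole batch would roll back, which matches the insertion failures the commit message describes.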