Skip to content

Commit 20a8643

Browse files
Peterclaude
andcommitted
feat: enhance search ranking with semantic MMR and adaptive caching
Implemented three major enhancements to optimize search result ranking: ## Phase 1: Enhanced MMR Diversification - Added semantic similarity calculations using numpy for better diversity - Improved module path diversity with weighted depth consideration - Configurable MODULE_DIVERSITY_WEIGHT parameter (default 0.15) - Reduced type penalty weight to balance with semantic similarity ## Phase 2: Rust-Specific Query Preprocessing - Added term expansion for common Rust abbreviations (async, impl, fn, etc.) - Enhanced fuzzy normalization with 12 additional programming terms - Configured via RUST_TERM_EXPANSIONS dictionary in config.py - Preserves original terms while adding expanded variants ## Phase 3: Adaptive TTL Caching - Implemented complexity-based TTL (simple queries: 1hr, complex: 15min) - Added popularity tracking for frequently accessed queries - Cache entries now store TTL alongside results - Configurable via CACHE_ADAPTIVE_TTL_ENABLED flag ## Additional Changes - Updated tests to provide mock embeddings for MMR function - Added numpy dependency for vectorized operations - Cleaned up and removed unused analyze_tokens.py - Updated living memory documentation with implementation details Performance: Warm search queries complete in ~5-7ms with improved ranking quality 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 23abe58 commit 20a8643

File tree

10 files changed

+534
-278
lines changed

10 files changed

+534
-278
lines changed

Tasks.json

Lines changed: 115 additions & 98 deletions
Original file line numberDiff line numberDiff line change
@@ -2423,6 +2423,107 @@
24232423
}
24242424
]
24252425
},
2426+
{
2427+
"id": "rich-docs-1",
2428+
"title": "Extract Richer Documentation Content",
2429+
"description": "Enhance rustdoc parsing to extract additional metadata fields like trait bounds and generic parameters",
2430+
"status": "completed",
2431+
"progress": 100,
2432+
"dependencies": [
2433+
"idx-1"
2434+
],
2435+
"effort": "large",
2436+
"priority": 2,
2437+
"relatedTasks": [],
2438+
"roadblocks": [],
2439+
"completionDetails": {
2440+
"completedDate": "2025-08-10T16:00:00Z",
2441+
"implementation": "Feature already implemented - extract_generic_params() and extract_trait_bounds() functions exist",
2442+
"notes": "Analysis revealed the feature was already implemented. Functions extract_generic_params() and extract_trait_bounds() in ingest.py extract metadata from rustdoc JSON. Database columns generic_params and trait_bounds store the data. Integration in parse_rustdoc_items_streaming() calls these extraction functions."
2443+
}
2444+
},
2445+
{
2446+
"id": "version-diff-1",
2447+
"title": "Add Version Diff Support",
2448+
"description": "Create version comparison engine to show documentation changes between crate versions",
2449+
"status": "pending",
2450+
"progress": 0,
2451+
"dependencies": [
2452+
"core-4"
2453+
],
2454+
"effort": "large",
2455+
"priority": 2,
2456+
"relatedTasks": [],
2457+
"roadblocks": []
2458+
},
2459+
{
2460+
"id": "fuzzy-path-2",
2461+
"title": "Improve Fuzzy Path Matching for Item Paths",
2462+
"description": "Extend existing RapidFuzz implementation to better handle item path resolution with enhanced scoring algorithms",
2463+
"status": "completed",
2464+
"progress": 100,
2465+
"dependencies": [
2466+
"fuzzy-1",
2467+
"enhance-path-1"
2468+
],
2469+
"effort": "large",
2470+
"priority": 2,
2471+
"relatedTasks": [],
2472+
"roadblocks": [],
2473+
"implementation_details": "Enhanced RapidFuzz implementation with composite scoring using token_set_ratio, token_sort_ratio, and partial_ratio algorithms. Added path component bonus system and adaptive thresholds for improved accuracy.",
2474+
"completion_date": "2025-08-09"
2475+
},
2476+
{
2477+
"id": "doc-snippets-1",
2478+
"title": "Add Documentation Snippets with Context",
2479+
"description": "Enhance search results to include 200+ character snippets with surrounding context for better understanding",
2480+
"status": "completed",
2481+
"progress": 100,
2482+
"dependencies": [
2483+
"search-1"
2484+
],
2485+
"effort": "medium",
2486+
"priority": 2,
2487+
"relatedTasks": [],
2488+
"roadblocks": [],
2489+
"implementation_details": "Implemented smart snippet extraction with progressive fallback system (sentence \u2192 word \u2192 character boundaries). Enhanced SearchResult model documentation to specify 200-400 character snippets with context. Replaced simple truncation in app.py:820 and app.py:1191 with intelligent extraction that preserves word boundaries and improves readability. Added comprehensive unit tests (8 tests) covering all boundary cases and fallback scenarios.",
2490+
"completion_date": "2025-08-10"
2491+
},
2492+
{
2493+
"id": "search-ranking-1",
2494+
"title": "Optimize Search Result Ranking",
2495+
"description": "Enhance multi-factor scoring with result diversification and improved query preprocessing",
2496+
"status": "completed",
2497+
"progress": 100,
2498+
"dependencies": [
2499+
"search-1",
2500+
"search-2",
2501+
"search-3"
2502+
],
2503+
"effort": "medium",
2504+
"priority": 2,
2505+
"relatedTasks": [],
2506+
"roadblocks": [],
2507+
"implementation_details": "Implemented comprehensive search result ranking optimization with three-phase approach: Phase 1 - Enhanced MMR diversification with semantic similarity using numpy and MODULE_DIVERSITY_WEIGHT configuration (default 0.15). Phase 2 - Added Rust-specific query preprocessing with term expansion (async→asynchronous, impl→implementation, etc.) and enhanced fuzzy normalization with programming-specific terms. Phase 3 - Implemented adaptive TTL caching based on query complexity with CACHE_ADAPTIVE_TTL_ENABLED configuration. Performance improvements include improved result diversity through semantic similarity calculations, better search coverage with term expansion, and optimized cache performance with adaptive TTL.",
2508+
"completion_date": "2025-08-10"
2509+
},
2510+
{
2511+
"id": "batch-ops-1",
2512+
"title": "Enhance Batch Operations",
2513+
"description": "Optimize batch processing with memory-aware sizing and transaction management",
2514+
"status": "pending",
2515+
"progress": 0,
2516+
"dependencies": [
2517+
"core-4.2",
2518+
"idx-3.2"
2519+
],
2520+
"effort": "medium",
2521+
"priority": 2,
2522+
"relatedTasks": [],
2523+
"roadblocks": []
2524+
}
2525+
],
2526+
"low": [
24262527
{
24272528
"id": "mcp-desc-2.1",
24282529
"title": "Extend Pydantic models with tutorial fields",
@@ -2465,7 +2566,9 @@
24652566
"status": "completed",
24662567
"progress": 100
24672568
}
2468-
]
2569+
],
2570+
"priority": 3,
2571+
"effort": "medium"
24692572
},
24702573
{
24712574
"id": "mcp-desc-2.2",
@@ -2523,7 +2626,9 @@
25232626
"status": "completed",
25242627
"progress": 100
25252628
}
2526-
]
2629+
],
2630+
"priority": 3,
2631+
"effort": "medium"
25272632
},
25282633
{
25292634
"id": "mcp-desc-2.3",
@@ -2566,7 +2671,9 @@
25662671
"status": "completed",
25672672
"progress": 100
25682673
}
2569-
]
2674+
],
2675+
"priority": 3,
2676+
"effort": "medium"
25702677
},
25712678
{
25722679
"id": "mcp-desc-2.4",
@@ -2603,7 +2710,9 @@
26032710
"status": "pending",
26042711
"progress": 0
26052712
}
2606-
]
2713+
],
2714+
"priority": 3,
2715+
"effort": "medium"
26072716
},
26082717
{
26092718
"id": "mcp-desc-2.5",
@@ -2638,102 +2747,10 @@
26382747
"status": "pending",
26392748
"progress": 0
26402749
}
2641-
]
2642-
},
2643-
{
2644-
"id": "rich-docs-1",
2645-
"title": "Extract Richer Documentation Content",
2646-
"description": "Enhance rustdoc parsing to extract additional metadata fields like trait bounds and generic parameters",
2647-
"status": "pending",
2648-
"progress": 0,
2649-
"dependencies": [
2650-
"idx-1"
2651-
],
2652-
"effort": "large",
2653-
"priority": 2,
2654-
"relatedTasks": [],
2655-
"roadblocks": []
2656-
},
2657-
{
2658-
"id": "version-diff-1",
2659-
"title": "Add Version Diff Support",
2660-
"description": "Create version comparison engine to show documentation changes between crate versions",
2661-
"status": "pending",
2662-
"progress": 0,
2663-
"dependencies": [
2664-
"core-4"
26652750
],
2666-
"effort": "large",
2667-
"priority": 2,
2668-
"relatedTasks": [],
2669-
"roadblocks": []
2670-
},
2671-
{
2672-
"id": "fuzzy-path-2",
2673-
"title": "Improve Fuzzy Path Matching for Item Paths",
2674-
"description": "Extend existing RapidFuzz implementation to better handle item path resolution with enhanced scoring algorithms",
2675-
"status": "completed",
2676-
"progress": 100,
2677-
"dependencies": [
2678-
"fuzzy-1",
2679-
"enhance-path-1"
2680-
],
2681-
"effort": "large",
2682-
"priority": 2,
2683-
"relatedTasks": [],
2684-
"roadblocks": [],
2685-
"implementation_details": "Enhanced RapidFuzz implementation with composite scoring using token_set_ratio, token_sort_ratio, and partial_ratio algorithms. Added path component bonus system and adaptive thresholds for improved accuracy.",
2686-
"completion_date": "2025-08-09"
2751+
"priority": 3,
2752+
"effort": "medium"
26872753
},
2688-
{
2689-
"id": "doc-snippets-1",
2690-
"title": "Add Documentation Snippets with Context",
2691-
"description": "Enhance search results to include 200+ character snippets with surrounding context for better understanding",
2692-
"status": "completed",
2693-
"progress": 100,
2694-
"dependencies": [
2695-
"search-1"
2696-
],
2697-
"effort": "medium",
2698-
"priority": 2,
2699-
"relatedTasks": [],
2700-
"roadblocks": [],
2701-
"implementation_details": "Implemented smart snippet extraction with progressive fallback system (sentence \u2192 word \u2192 character boundaries). Enhanced SearchResult model documentation to specify 200-400 character snippets with context. Replaced simple truncation in app.py:820 and app.py:1191 with intelligent extraction that preserves word boundaries and improves readability. Added comprehensive unit tests (8 tests) covering all boundary cases and fallback scenarios.",
2702-
"completion_date": "2025-08-10"
2703-
},
2704-
{
2705-
"id": "search-ranking-1",
2706-
"title": "Optimize Search Result Ranking",
2707-
"description": "Enhance multi-factor scoring with result diversification and improved query preprocessing",
2708-
"status": "pending",
2709-
"progress": 0,
2710-
"dependencies": [
2711-
"search-1",
2712-
"search-2",
2713-
"search-3"
2714-
],
2715-
"effort": "medium",
2716-
"priority": 2,
2717-
"relatedTasks": [],
2718-
"roadblocks": []
2719-
},
2720-
{
2721-
"id": "batch-ops-1",
2722-
"title": "Enhance Batch Operations",
2723-
"description": "Optimize batch processing with memory-aware sizing and transaction management",
2724-
"status": "pending",
2725-
"progress": 0,
2726-
"dependencies": [
2727-
"core-4.2",
2728-
"idx-3.2"
2729-
],
2730-
"effort": "medium",
2731-
"priority": 2,
2732-
"relatedTasks": [],
2733-
"roadblocks": []
2734-
}
2735-
],
2736-
"low": [
27372754
{
27382755
"id": "opt-1",
27392756
"title": "Performance optimizations",

UsefulInformation.json

Lines changed: 105 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1112,6 +1112,51 @@
11121112
"relatedFiles": ["tests/test_cross_reference.py"],
11131113
"codeExample": "# Wrong:\nresponse = client.post('/api/endpoint', json={'params': {'crate': 'serde'}})\n\n# Correct:\nresponse = client.post('/api/endpoint', json={'crate': 'serde'})"
11141114
},
1115+
{
1116+
"error": "MMR tests failing due to incorrect embedding dimensions",
1117+
"rootCause": "Test mock embeddings don't match expected model dimensions, causing shape mismatches in cosine similarity calculations",
1118+
"solution": "MMR tests need mock embeddings matching expected dimensions (384 for bge-small-en-v1.5). Use numpy arrays with correct shape for test embedding vectors.",
1119+
"context": "Testing MMR diversification algorithm with semantic similarity",
1120+
"implementation": [
1121+
"Mock embeddings must match production model dimensions (384-dimensional vectors)",
1122+
"Use np.random.rand(384) or predefined test vectors with correct shape",
1123+
"Ensure embedding alignment is maintained during test result sorting"
1124+
],
1125+
"pattern": "Match test data dimensions to production model requirements",
1126+
"dateEncountered": "2025-08-10",
1127+
"relatedFiles": ["tests/test_database.py", "src/docsrs_mcp/database.py"],
1128+
"codeExample": "# Correct test embedding setup:\nmock_embeddings = [np.random.rand(384) for _ in range(len(results))]\n# NOT: mock_embeddings = [np.array([0.1, 0.2, 0.3])] # Wrong dimensions"
1129+
},
1130+
{
1131+
"error": "Function signature changes breaking existing test calls",
1132+
"rootCause": "When adding new parameters to functions (like embeddings to MMR), existing test calls fail with missing argument errors",
1133+
"solution": "When changing function signatures, systematically update all test calls. Use grep to find all test invocations before making signature changes.",
1134+
"context": "Adding embeddings parameter to MMR diversification functions",
1135+
"implementation": [
1136+
"Search codebase for function calls before signature changes: grep -r 'function_name(' tests/",
1137+
"Update all test invocations with new required parameters",
1138+
"Consider backward compatibility with default parameters when possible"
1139+
],
1140+
"pattern": "Comprehensive test update when changing function signatures",
1141+
"dateEncountered": "2025-08-10",
1142+
"relatedFiles": ["tests/", "src/docsrs_mcp/database.py"],
1143+
"codeExample": "# Before signature change, find all calls:\n# grep -r '_apply_mmr_diversification(' tests/\n# Then update each call with new embeddings parameter"
1144+
},
1145+
{
1146+
"error": "Production testing confusion with server ports and API paths",
1147+
"rootCause": "Production server testing requires understanding correct port and API path structure",
1148+
"solution": "Production testing with --mode rest flag starts server on port 8000. API endpoints are under /mcp/tools/ path (e.g., /mcp/tools/search_items). Use curl or HTTP clients with correct base URL.",
1149+
"context": "Testing MMR and other features in production-like environment",
1150+
"implementation": [
1151+
"Start server: uv run docsrs-mcp --mode rest (listens on port 8000)",
1152+
"API endpoints: http://localhost:8000/mcp/tools/{tool_name}",
1153+
"Example: curl -X POST http://localhost:8000/mcp/tools/search_items -H 'Content-Type: application/json' -d '{...}'"
1154+
],
1155+
"pattern": "Use correct port and path structure for production API testing",
1156+
"dateEncountered": "2025-08-10",
1157+
"relatedFiles": ["src/docsrs_mcp/app.py"],
1158+
"codeExample": "# Correct production test URL:\ncurl -X POST http://localhost:8000/mcp/tools/search_items -H 'Content-Type: application/json' -d '{\"query\": \"test\", \"k\": 5}'"
1159+
},
11151160
{
11161161
"error": "Database unique constraint violation - composite key needed for cross-references",
11171162
"rootCause": "Original UNIQUE constraint on (crate_id, alias_path) was insufficient for cross-references where the same alias can point to multiple actual paths with different link types",
@@ -1929,6 +1974,34 @@
19291974
"scalability": "Performance remains consistent with larger path datasets"
19301975
}
19311976
},
1977+
{
1978+
"insight": "MMR enhancement requires embedding alignment during result sorting",
1979+
"context": "Implementing semantic similarity in MMR diversification algorithm",
1980+
"details": "When adding semantic similarity to MMR, embeddings must be passed alongside results and kept aligned during sorting operations. Use zip/unzip pattern to maintain correspondence between results and their embeddings. Cosine similarity calculation: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))",
1981+
"impact": "Prevents misaligned embeddings that would corrupt similarity calculations and result ranking",
1982+
"dateMeasured": "2025-08-10",
1983+
"relatedFiles": ["src/docsrs_mcp/database.py"],
1984+
"metrics": {
1985+
"alignmentPattern": "zip(results, embeddings) → sort → unzip back to separate lists",
1986+
"similarityFormula": "np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))",
1987+
"integrityRequirement": "Embeddings must remain 1:1 aligned with search results",
1988+
"performanceNote": "Alignment operations add minimal overhead to MMR calculations"
1989+
}
1990+
},
1991+
{
1992+
"insight": "Query preprocessing with term expansion benefits from order-preserving deduplication",
1993+
"context": "Implementing British-to-American spelling normalization and term expansion",
1994+
"details": "Term expansion should preserve original terms while adding expansions. Use set for deduplication but maintain original order. Configure expansions in config.py for maintainability rather than hardcoding in preprocessing logic",
1995+
"impact": "Maintains query intent while expanding coverage, prevents duplicate terms from affecting relevance scoring",
1996+
"dateMeasured": "2025-08-10",
1997+
"relatedFiles": ["src/docsrs_mcp/app.py", "src/docsrs_mcp/config.py"],
1998+
"metrics": {
1999+
"expansionPattern": "Original terms preserved + normalized variants added",
2000+
"deduplicationMethod": "Set-based with order preservation",
2001+
"configurationLocation": "config.py for maintainable expansion rules",
2002+
"queryIntegrityMaintenance": "Original intent preserved while expanding coverage"
2003+
}
2004+
},
19322005
{
19332006
"insight": "Path caching with 5-minute TTL crucial for fuzzy search performance",
19342007
"context": "Caching strategy optimization for repeated fuzzy path lookups",
@@ -2160,6 +2233,17 @@
21602233
"Can be extended with additional spelling patterns as needed"
21612234
],
21622235
"dateAdded": "2025-08-10"
2236+
},
2237+
{
2238+
"pattern": "Adaptive TTL caching",
2239+
"implementation": "Implement complexity-based TTL with popularity tracking",
2240+
"description": "Cache TTL should adapt to query complexity and usage patterns for optimal performance",
2241+
"details": {
2242+
"simpleQueries": "Low complexity queries → longer TTL (1 hour) for better cache utilization",
2243+
"complexQueries": "High complexity queries with many filters → shorter TTL (15 minutes) to maintain freshness",
2244+
"popularityTracking": "Track hit counts for popularity-based TTL extension",
2245+
"storagePattern": "Store TTL with cache entries: (timestamp, results, ttl) for per-entry control"
2246+
}
21632247
}
21642248
]
21652249
},
@@ -2187,6 +2271,20 @@
21872271
"tradeoff": "Slight staleness acceptable for 90% cache hit rate",
21882272
"fallbackBehavior": "Direct database lookup on cache miss"
21892273
}
2274+
},
2275+
{
2276+
"issue": "Adaptive TTL caching implementation for query complexity optimization",
2277+
"solution": "Implement complexity-based TTL calculation with hit count tracking. Simple queries (low filter count) get longer TTL (1 hour), complex queries get shorter TTL (15 minutes). Track popularity for TTL extension opportunities.",
2278+
"lesson": "Cache TTL should adapt to query complexity - simple queries benefit from longer caching while complex queries need fresher data. Popularity tracking enables intelligent TTL extension for frequently accessed content.",
2279+
"context": "Search result caching optimization to balance performance with data freshness",
2280+
"dateEncountered": "2025-08-10",
2281+
"relatedFiles": ["src/docsrs_mcp/database.py", "src/docsrs_mcp/app.py"],
2282+
"performanceImpact": {
2283+
"simpleQueryTTL": "1 hour for low complexity queries",
2284+
"complexQueryTTL": "15 minutes for high filter count queries",
2285+
"popularityBonus": "Hit count tracking enables TTL extension for popular queries",
2286+
"storageOverhead": "Minimal - store (timestamp, results, ttl) tuple per cache entry"
2287+
}
21902288
}
21912289
]
21922290
},
@@ -2519,8 +2617,14 @@
25192617
],
25202618
"ignorableWarnings": [
25212619
"PLR0912 (too-many-branches): Can be ignored for documentation processing tasks",
2522-
"PLR0915 (too-many-statements): Can be ignored for documentation processing tasks"
2620+
"PLR0915 (too-many-statements): Can be ignored for documentation processing tasks and complex algorithms like MMR diversification"
25232621
],
2622+
"lintingBestPractices": {
2623+
"formatCommand": "uv run ruff format",
2624+
"lintCommand": "uv run ruff check --fix",
2625+
"acceptableWarnings": "PLR0915 (too many statements) may be acceptable in complex algorithms like MMR diversification with multiple calculation steps",
2626+
"workflow": "Run formatting before linting to avoid style-related lint errors"
2627+
},
25242628
"performanceNotes": [
25252629
"10-100x faster than traditional Python tools",
25262630
"Processes entire codebase in milliseconds",

0 commit comments

Comments
 (0)