Hi INCF Team,
I've been exploring the knowledge-space-agent codebase to understand the current RAG pipeline for the upcoming 2026 cycle. I noticed the recent updates to the documentation and decided to dive into the core logic.
I identified two opportunities to significantly improve the agent's search relevance and stability, and I'd be happy to open a PR for them if they align with the roadmap:
1. Robust JSON Parsing (Error Handling)
Current State: The LLM calls in agents.py rely on json.loads(resp.text).
Issue: Even at low temperatures, models like Gemini Flash often wrap their output in Markdown code fences (```json ... ```) or add conversational filler. This causes JSONDecodeError exceptions that crash the pipeline.
Proposed Fix: Implement a clean_and_parse_json helper utility that strips Markdown fences and handles malformed output gracefully before parsing (rough sketch below).
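
As a rough sketch only (the helper name comes from the proposal above; the exact fallback behavior and return type are my assumptions, not the current codebase API), something along these lines would cover the common failure modes:

```python
import json
import re
from typing import Any, Optional


def clean_and_parse_json(raw: str) -> Optional[Any]:
    """Strip Markdown fences/filler from an LLM response and parse JSON.

    Returns None instead of raising when no valid JSON can be recovered,
    so callers can retry or fall back gracefully. (Illustrative sketch,
    not the existing knowledge-space-agent implementation.)
    """
    text = raw.strip()

    # Remove a leading ```json / ``` fence and a trailing ``` fence.
    text = re.sub(r"^```(?:json)?\s*", "", text)
    text = re.sub(r"\s*```$", "", text)

    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Fall back to the first {...} or [...] block embedded in filler text.
    match = re.search(r"(\{.*\}|\[.*\])", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass

    return None
```

In agents.py, json.loads(resp.text) would then become clean_and_parse_json(resp.text) plus an explicit None check, so a single malformed response no longer takes down the pipeline.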
2. Search Ranking Upgrade (Reciprocal Rank Fusion)
Current State: fuse_results uses a linear weighted sum: vector_score * 0.6 + keyword_score * 0.4.
Issue: Keyword scores (BM25/Elastic) are often unbounded (e.g., 10.0+), while Vector scores are normalized (0.0–1.0). In practice, this allows keyword matches to overpower semantic vector matches, negating the benefit of the hybrid approach.
Proposed Fix: Switch to Reciprocal Rank Fusion (RRF), which scores documents by their rank position (1 / (k + rank)) rather than by raw scores, making the fusion scale-invariant and keeping a stable balance between semantic and keyword results (sketch below).
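
A minimal sketch of how fuse_results could be reworked, assuming both retrievers can return ranked lists of document IDs (the function name rrf_fuse and the input shape are illustrative, not the current signature):

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked result lists with Reciprocal Rank Fusion.

    Each document earns 1 / (k + rank) per list it appears in (rank is
    1-based), so only positions matter and unbounded BM25 scores can no
    longer drown out normalized vector scores.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)

    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical usage with the two existing retrieval paths:
vector_hits = ["doc_a", "doc_b", "doc_c"]   # from the vector index
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # from BM25/Elastic
fused = rrf_fuse([vector_hits, keyword_hits])
```

The default k = 60 is the constant commonly used in the RRF literature; it mainly dampens the influence of documents that appear only far down one of the lists.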
I have a working implementation plan for both changes. Would you be open to a PR refactoring these components?
Best,
Somsubhra Nandi