Architecture Enhancements: Hybrid Search Ranking (RRF) & Robust JSON Parsing #13

@Somsubhra-Nandi

Hi INCF Team,

I've been exploring the knowledge-space-agent codebase to understand the current RAG pipeline for the upcoming 2026 cycle. I noticed the recent updates to the documentation and decided to dive into the core logic.

I identified two opportunities to significantly improve the agent's search relevance and stability. I would love to open a PR for these if aligned with the roadmap:

1. Robust JSON Parsing (Error Handling)

Current State: The LLM calls in agents.py rely on json.loads(resp.text).
Issue: Even at low temperatures, models like Gemini Flash often wrap their output in Markdown code fences (e.g., ```json ... ```) or add conversational filler. json.loads then raises JSONDecodeError, which currently crashes the pipeline.
Proposed Fix: Implement a clean_and_parse_json helper utility that strips Markdown and handles malformed output gracefully before parsing.
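
To make the proposal concrete, here is a minimal sketch of what such a helper could look like. It is not taken from the repository: the signature, the `default` parameter, and the fallback order are my assumptions, and the real call sites in agents.py may need different error handling.

```python
import json
import re
from typing import Any, Optional

# Matches a Markdown code fence, optionally tagged "json", and captures its body.
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def clean_and_parse_json(text: str, default: Optional[Any] = None) -> Any:
    """Best-effort parse of LLM output expected to contain one JSON value.

    Hypothetical helper: returns `default` instead of raising when no
    JSON value can be recovered.
    """
    candidate = text.strip()

    # 1. If the model wrapped its answer in a code fence, keep only the body.
    match = _FENCE_RE.search(candidate)
    if match:
        candidate = match.group(1)

    # 2. Happy path: the remaining text is valid JSON.
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass

    # 3. Conversational filler: decode from the first '{' or '[' that starts
    #    a valid JSON value, ignoring any text that follows it.
    decoder = json.JSONDecoder()
    for index, char in enumerate(candidate):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(candidate, index)
                return value
            except json.JSONDecodeError:
                continue

    # 4. Nothing parseable: let the caller decide how to recover.
    return default
```

Returning a default rather than raising keeps the pipeline alive, but callers would still need to check for it (for example, to retry the LLM call), so raising a custom exception may be the better choice depending on how agents.py handles failures.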

2. Search Ranking Upgrade (Reciprocal Rank Fusion)

Current State: fuse_results uses a linear weighted sum: vector_score * 0.6 + keyword_score * 0.4.
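Issue: Keyword scores (BM25/Elastic) are often unbounded (e.g., 10.0+), while vector scores are normalized (0.0–1.0). For example, a keyword score of 10.0 contributes 4.0 to the weighted sum, while a perfect vector score contributes at most 0.6. In practice, this allows keyword matches to overpower semantic vector matches, negating the benefit of the hybrid approach.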
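Proposed Fix: Switch to Reciprocal Rank Fusion (RRF). RRF scores each document by its position in each result list, summing 1 / (k + rank) across lists (k is a smoothing constant, commonly 60), rather than by raw scores. Because only ranks matter, the differing score scales can no longer skew the balance between semantic and keyword results.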
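
As a rough illustration of the idea, the sketch below fuses ranked lists of document IDs with RRF. The function name `rrf_fuse`, its signature, and the default `k = 60` are assumptions for illustration; the actual inputs of fuse_results are not shown here and may carry scores or metadata that would need to be mapped to ranked ID lists first.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf_fuse(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,
) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Hypothetical helper. Each inner sequence is ordered best-first and is
    assumed to contain each document ID at most once.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)

    # Highest fused score first; raw retriever scores are never consulted.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Example: "a" and "c" appear in both lists, so they outrank "b" and "d".
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = rrf_fuse([vector_hits, keyword_hits])
```

One trade-off worth noting: plain RRF treats both retrievers equally, so the existing 0.6 / 0.4 preference would be lost unless per-list weights are added to the sum.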

I have a working implementation plan for both changes. Would you be open to a PR refactoring these components?

Best,
Somsubhra Nandi
