Hello maintainers,
I’ve been exploring the KnowledgeSpace ingestion and retrieval pipeline to understand how dataset descriptions and metadata are converted into searchable chunks and embeddings.
Based on this review, I drafted a short proposal focusing on small, incremental improvements to:
- chunk quality validation,
- metadata grounding in embeddings, and
- safer deduplication logic.
The goal is to improve retrieval precision and grounding without changing models, infrastructure, or the overall system architecture.
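To make the scope concrete, here is a minimal sketch of the kind of chunk-quality gate and safer deduplication I have in mind. All names and thresholds here are hypothetical, not taken from the KnowledgeSpace codebase; actual values would follow whatever the pipeline already uses.

```python
import hashlib
import re

# Hypothetical thresholds -- real values would be tuned against the corpus.
MIN_CHARS = 40
MAX_CHARS = 2000

def is_valid_chunk(text: str) -> bool:
    """Quality gate: length bounds plus a minimum ratio of word characters,
    so near-empty or markup-debris chunks never reach the embedder."""
    stripped = text.strip()
    if not (MIN_CHARS <= len(stripped) <= MAX_CHARS):
        return False
    word_chars = len(re.findall(r"\w", stripped))
    return word_chars / len(stripped) > 0.5

def dedup_key(text: str) -> str:
    """Normalize whitespace and case before hashing, so trivially
    reformatted copies of the same description collapse to one key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def filter_chunks(chunks):
    """Keep valid chunks, dropping normalized duplicates (first one wins)."""
    seen, kept = set(), []
    for chunk in chunks:
        if not is_valid_chunk(chunk):
            continue
        key = dedup_key(chunk)
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept
```

The point of hashing a normalized form rather than the raw text is that it deduplicates whitespace and casing variants without risking false merges of genuinely different descriptions; anything fuzzier than this would be a separate, opt-in step.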
Before starting implementation, I wanted to check whether this direction aligns with current priorities and whether such changes would be welcome as a contribution.
I’ve attached a brief proposal document and am happy to adjust the scope based on feedback.
Thank you for your time.
Nitin_Krishna_KnowledgeSpace_Prior_Contribution_Proposal.pdf