Hello maintainers,
I’ve been exploring the KnowledgeSpace ingestion and retrieval pipeline to understand how dataset descriptions and metadata are converted into searchable chunks and embeddings.
Based on this review, I drafted a short proposal focusing on small, incremental improvements to:
- chunk quality validation,
- metadata grounding in embeddings, and
- safer deduplication logic.
The goal is to improve retrieval precision and grounding without changing models, infrastructure, or the overall system architecture.
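To make the scope concrete, here is a minimal sketch of the kind of chunk-quality gate and safer deduplication I have in mind. All names and thresholds here are hypothetical, not taken from the KnowledgeSpace codebase; actual values would follow whatever the pipeline already uses.

```python
import hashlib
import re

# Hypothetical thresholds -- real values would be tuned against the corpus.
MIN_CHARS = 40
MAX_CHARS = 2000

def is_valid_chunk(text: str) -> bool:
    """Quality gate: length bounds plus a minimum ratio of word characters,
    so near-empty or markup-debris chunks never reach the embedder."""
    stripped = text.strip()
    if not (MIN_CHARS <= len(stripped) <= MAX_CHARS):
        return False
    word_chars = len(re.findall(r"\w", stripped))
    return word_chars / len(stripped) > 0.5

def dedup_key(text: str) -> str:
    """Normalize whitespace and case before hashing, so trivially
    reformatted copies of the same description collapse to one key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def filter_chunks(chunks):
    """Keep valid chunks, dropping normalized duplicates (first one wins)."""
    seen, kept = set(), []
    for chunk in chunks:
        if not is_valid_chunk(chunk):
            continue
        key = dedup_key(chunk)
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept
```

The point of hashing a normalized form rather than the raw text is that it deduplicates whitespace and casing variants without risking false merges of genuinely different descriptions; anything fuzzier than this would be a separate, opt-in step.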
Before starting implementation, I wanted to check whether this direction aligns with current priorities and whether such changes would be welcome as a contribution.
I’ve attached a brief proposal document and am happy to adjust the scope based on feedback.
Thank you for your time.
Nitin_Krishna_KnowledgeSpace_Prior_Contribution_Proposal.pdf