
Feature Request: Metadata-Only Chunk Retrieval in Vector Store Search #5114

@jakub-walaszczyk

Description


🚀 Describe the new functionality needed

Note

This implementation proposal was drafted with the assistance of Claude Code Opus 4.6.
The requirement comes from the field: custom retrievers let external users use llama-stack more flexibly.
Users know their needs best and can build new features faster on top of llama-stack.

Important

This is purely an implementation proposal; it may lack important context or overlook some architectural decisions.
The details of the proposal may change, but the spirit of the changes remains the same.

Currently, both the internal query_chunks API and the OpenAI-compatible POST /v1/vector_stores/{vector_store_id}/search endpoint require a query parameter (text or multimodal content used for embedding similarity search). While metadata filters can be applied alongside a query to narrow results, there is no way to retrieve chunks based on metadata alone, without providing an embedding query.

This feature request proposes making the query parameter optional in:

  • QueryChunksRequest (llama_stack_api/vector_io/models.py): change query: InterleavedContent to query: InterleavedContent | None = None
  • OpenAISearchVectorStoreRequest (llama_stack_api/vector_io/models.py): change query: str | list[str] to query: str | list[str] | None = None (both changes are sketched below)
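
A minimal sketch of the two model changes, assuming Pydantic-style request models; the InterleavedContent alias and the sibling fields shown here are placeholders, not the actual contents of models.py:

from pydantic import BaseModel

# Placeholder for the real InterleavedContent type (actual import path differs).
InterleavedContent = str | list[str]

class QueryChunksRequest(BaseModel):
    vector_store_id: str                     # illustrative sibling field
    query: InterleavedContent | None = None  # was: InterleavedContent (required)
    params: dict | None = None

class OpenAISearchVectorStoreRequest(BaseModel):
    query: str | list[str] | None = None     # was: str | list[str] (required)
    filters: dict | None = None
    max_num_results: int = 10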

When query is None (or omitted) and filters are provided, the system should:

  1. Skip embedding generation and similarity search entirely; no vector distance computation is performed.
  2. Return all chunks matching the metadata filter, up to max_num_results (or max_chunks), ordered by insertion order or document/chunk ID (a deterministic, stable ordering).
  3. Set score to 1.0 (or a sentinel value) for all returned results, since no relevance scoring is applicable.
  4. Disallow the combination of query=None and filters=None; at least one must be provided. Return a 400 Bad Request / InvalidParameterError if both are absent. (This validation is sketched after the list.)
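
A sketch of the router-level branch implied by rules 1-4; metadata_only_scan is a hypothetical provider method, and the response construction assumes the existing QueryChunksResponse shape:

# Router sketch; method signature and helper names are assumptions.
async def query_chunks(self, vector_store_id, query=None, params=None):
    filters = (params or {}).get("filters")
    if query is None and filters is None:
        # Rule 4: at least one of query or filters must be provided.
        raise InvalidParameterError(
            "At least one of 'query' or 'filters' must be provided."
        )
    if query is None:
        # Rules 1-3: no embedding, deterministic ordering, sentinel score.
        chunks = await self.provider.metadata_only_scan(vector_store_id, filters)
        return QueryChunksResponse(chunks=chunks, scores=[1.0] * len(chunks))
    return await self.provider.query_chunks(vector_store_id, query, params)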

Affected Components

| Layer | File | Change |
| --- | --- | --- |
| API models | src/llama_stack_api/vector_io/models.py | Make query optional in request models |
| Protocol | src/llama_stack_api/vector_io/api.py | Update docstrings to document metadata-only retrieval |
| Router | src/llama_stack/core/routers/vector_io.py | Skip query rewriting when query is None; validate that at least one of query or filters is present |
| Provider mixin | src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py | Branch on query is None to perform a metadata-only scan instead of query_chunks with embedding (sketched below) |
| Inline providers | providers/inline/vector_io/faiss, sqlite_vec, milvus, etc. | Implement the metadata-only retrieval path in query_chunks |
| Remote providers | providers/remote/vector_io/chroma, pgvector, qdrant, weaviate, etc. | Implement the metadata-only retrieval path or raise NotImplementedError with a clear message |
| Filter utilities | src/llama_stack/providers/utils/vector_io/filters.py | No change needed; filter parsing is independent of query |
| FastAPI routes | src/llama_stack_api/vector_io/fastapi_routes.py | No change needed; the request model change propagates automatically |
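
A sketch of the provider-mixin branch named above; _load_chunks_matching, _stable_order, and _build_search_response are hypothetical helper names, not existing mixin methods:

# openai_vector_store_mixin sketch; helper names are assumptions.
async def openai_search_vector_store(
    self, vector_store_id, query=None, filters=None, max_num_results=10, **kwargs
):
    if query is None:
        # Metadata-only scan: skip embedding generation and similarity search.
        chunks = await self._load_chunks_matching(vector_store_id, filters)
        chunks = self._stable_order(chunks)  # see "Ordering of Results" below
        # No relevance scoring applies; use the sentinel score 1.0.
        return self._build_search_response(chunks[:max_num_results], score=1.0)
    # Existing embedding + similarity-search path continues unchanged.
    ...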

Example Usage

# Retrieve all chunks tagged with topic="transformer-architectures"
# No similarity query: pure metadata retrieval
response = client.vector_stores.search(
    vector_store_id="vs_abc123",
    filters={"type": "eq", "key": "topic", "value": "transformer-architectures"},
    max_num_results=50,
    # query is omitted entirely
)

# Compound filter: all chunks from a specific document created after a date
response = client.vector_stores.search(
    vector_store_id="vs_abc123",
    filters={
        "type": "and",
        "filters": [
            {"type": "eq", "key": "document_id", "value": "doc_789"},
            {"type": "gte", "key": "chunk_index", "value": 5},
            {"type": "lte", "key": "chunk_index", "value": 15},
        ],
    },
    max_num_results=100,
)

💡 Why is this needed? What if we don't build it?

Enabling Custom Retrieval Strategies Outside Llama Stack

Many production RAG systems go beyond a single similarity search call. Advanced retrieval strategies, such as window retrieval (expanding a match to include its neighboring chunks), parent-document retrieval (fetching the full parent after matching a child chunk), or chunk augmentation (enriching retrieved chunks with surrounding context), require the ability to fetch specific chunks by metadata after an initial similarity search has identified relevant regions.

Today, to implement window retrieval with Llama Stack, a developer must:

  1. Perform a similarity search via query_chunks; this returns scored chunks, each carrying metadata (e.g., document_id, chunk_index).
  2. For each matched chunk, retrieve its neighbors (e.g., chunk_index - 1, chunk_index + 1) to build a wider context window.

Step 2 is currently impossible through the Llama Stack API because retrieving chunks by document_id + chunk_index requires a metadata-only query, and the API mandates a query string. The workaround of passing an empty string or dummy query forces an unnecessary embedding computation and pollutes results with irrelevant similarity scores, defeating the purpose.
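
With an optional query, window retrieval reduces to two plain API calls. A sketch, reusing the metadata keys above; the hit.attributes access is an assumption about the response shape:

WINDOW = 2  # neighbors on each side; illustrative

# Step 1: similarity search finds the relevant chunks.
hits = client.vector_stores.search(
    vector_store_id="vs_abc123",
    query="how do rotary position embeddings work?",
    max_num_results=5,
)

# Step 2: for each hit, fetch its neighbors by metadata alone (no query).
for hit in hits.data:
    neighbors = client.vector_stores.search(
        vector_store_id="vs_abc123",
        filters={
            "type": "and",
            "filters": [
                {"type": "eq", "key": "document_id",
                 "value": hit.attributes["document_id"]},
                {"type": "gte", "key": "chunk_index",
                 "value": hit.attributes["chunk_index"] - WINDOW},
                {"type": "lte", "key": "chunk_index",
                 "value": hit.attributes["chunk_index"] + WINDOW},
            ],
        },
        max_num_results=2 * WINDOW + 1,
    )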

Concrete Use Cases Blocked Today

  • Window/contextual retrieval: After similarity search identifies chunk N of a document, fetch chunks N-2 through N+2 to provide broader context to the LLM. This is a widely adopted technique (e.g., LlamaIndex's SentenceWindowNodeParser, LangChain's ParentDocumentRetriever).
  • Custom rerankers and fusion strategies: External reranking pipelines that first retrieve a broad candidate set by metadata, then apply custom scoring logic outside Llama Stack.
  • Batch export / audit: Retrieve all chunks belonging to a specific data source, user, or label for inspection, export, or compliance without needing a semantic query (see the example after this list).
  • Deduplication and data management: Find all chunks with a specific hash, source URL, or ingestion batch ID for cleanup or update operations.
  • Hybrid orchestration: Multi-stage retrieval where the first stage is a fast metadata pre-filter (e.g., "all chunks from documents tagged 'legal' created this quarter") and the second stage is a targeted similarity search within that subset, orchestrated by application code outside Llama Stack.
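
For instance, the batch export / audit case becomes a single metadata-only call (the source_id key is illustrative):

# Export all chunks from one ingestion source for audit (key name illustrative).
exported = client.vector_stores.search(
    vector_store_id="vs_abc123",
    filters={"type": "eq", "key": "source_id", "value": "s3://bucket/report.pdf"},
    max_num_results=1000,
)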

What Happens If We Don't Build It

  • Developers must bypass the API: Teams needing advanced retrieval are forced to access the underlying vector database directly (Faiss index, SQLite, Milvus, etc.), breaking the abstraction that Llama Stack provides. This couples application code to a specific provider and eliminates the portability benefit.
  • Workaround tax: Passing a dummy query to satisfy the required parameter wastes compute on embedding generation and distance computation, introduces confusing similarity scores into results, and may return irrelevant chunks that pollute the actual metadata-filtered results (especially in vector-only mode where scores drive ranking).
  • Incomplete platform story: Llama Stack positions itself as a complete AI application development framework. Without metadata-only retrieval, it cannot support the retrieval patterns that production RAG systems commonly rely on, pushing advanced users toward alternative frameworks or direct database access.

Other thoughts

Alignment with OpenAI Vector Store API

OpenAI's own Vector Store Search API currently also requires a query. However, the underlying architecture (Responses API with file_search tool) supports metadata filtering as a first-class concept. As the ecosystem evolves toward more composable retrieval, making query optional positions Llama Stack ahead of the curve and gives developers capabilities they cannot yet get from the OpenAI API directly.

Search Mode Interaction

When query is None, the search_mode parameter (vector, keyword, hybrid) becomes meaningless. The implementation should either:

  • Ignore search_mode silently when query is absent, or
  • Return a validation error if search_mode is explicitly set alongside a None query.

The former is recommended for ergonomic reasons: it avoids forcing callers to conditionally omit the parameter.
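
In the metadata-only branch sketched earlier, the recommended behavior is a one-line guard:

if query is None:
    # search_mode has no effect without a query; drop it silently so callers
    # never need to conditionally omit the parameter.
    search_mode = None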

Provider Support Matrix

Not all providers support metadata filtering today. The feature request should be implemented progressively:

  • Phase 1: Providers that already support filtering (faiss, sqlite_vec, milvus) add the metadata-only retrieval path.
  • Phase 2: Remote providers (chroma, pgvector, qdrant, weaviate) add support as their backends natively support metadata queries.
  • Providers that do not yet support filtering should raise a clear NotImplementedError("Metadata-only retrieval is not supported by this provider"), as sketched below.
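
For such a provider, the guard could sit at the top of query_chunks (sketch; method signature assumed):

async def query_chunks(self, vector_store_id, query=None, params=None):
    if query is None:
        raise NotImplementedError(
            "Metadata-only retrieval is not supported by this provider"
        )
    ...  # existing similarity-search path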

Ordering of Results

Without similarity scores to rank by, the result ordering must be well-defined. Suggested default: order by (document_id, chunk_index) ascending, matching the natural document structure. This is especially important for window retrieval, where the caller expects chunks to come back in document order.
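
Concretely, the default could be a tuple sort with fallbacks for chunks missing either key (sketch; the metadata access pattern is illustrative):

# Deterministic default ordering for metadata-only results.
chunks.sort(key=lambda c: (
    c.metadata.get("document_id", ""),   # chunks missing the key sort first
    c.metadata.get("chunk_index", -1),
))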

Pagination Consideration

Metadata-only queries are more likely to return large result sets (compared to similarity search, which is inherently top-K). The current response model includes has_more and next_page fields (both currently hardcoded to False/None). This feature would benefit from, and could motivate, implementing proper cursor-based pagination in a follow-up.
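
One speculative shape for such a cursor, encoding the last-seen (document_id, chunk_index) sort key as an opaque token; nothing like this exists in llama-stack today:

import base64
import json

def encode_cursor(document_id: str, chunk_index: int) -> str:
    # Opaque continuation token over the (document_id, chunk_index) sort key.
    payload = json.dumps([document_id, chunk_index]).encode()
    return base64.urlsafe_b64encode(payload).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    document_id, chunk_index = json.loads(base64.urlsafe_b64decode(cursor))
    return document_id, chunk_index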
