🚀 Describe the new functionality needed
Note
This implementation proposal was drafted with assistance from Claude Code Opus 4.6.
The requirement comes from the field: custom retrievers allow external users to use llama-stack more flexibly. Users know their needs best and can build new features faster on top of llama-stack.
Important
This is purely an implementation proposal; it may be missing important context or not consider all architectural decisions. The implementation details may change, but the spirit of the changes remains the same.
Currently, both the internal `query_chunks` API and the OpenAI-compatible `POST /v1/vector_stores/{vector_store_id}/search` endpoint require a `query` parameter (text or multimodal content used for embedding similarity search). While metadata filters can be applied alongside a query to narrow results, there is no way to retrieve chunks based solely on metadata, without providing an embedding query.
This feature request proposes making the `query` parameter optional in:
- `QueryChunksRequest` (`llama_stack_api/vector_io/models.py`): change `query: InterleavedContent` to `query: InterleavedContent | None = None`
- `OpenAISearchVectorStoreRequest` (`llama_stack_api/vector_io/models.py`): change `query: str | list[str]` to `query: str | list[str] | None = None`
When `query` is `None` (or omitted) and `filters` are provided, the system should:
- Skip embedding generation and similarity search entirely; no vector distance computation is performed.
- Return all chunks matching the metadata filter, up to `max_num_results` (or `max_chunks`), ordered by insertion order or document/chunk ID (a deterministic, stable ordering).
- Set `score` to `1.0` (or a sentinel value) for all returned results, since no relevance scoring is applicable.
- Disallow the combination of `query=None` and `filters=None`; at least one must be provided. Return a `400 Bad Request` / `InvalidParameterError` if both are absent.
Affected Components
| Layer | File | Change |
|---|---|---|
| API models | `src/llama_stack_api/vector_io/models.py` | Make `query` optional in request models |
| Protocol | `src/llama_stack_api/vector_io/api.py` | Update docstrings to document metadata-only retrieval |
| Router | `src/llama_stack/core/routers/vector_io.py` | Skip query rewriting when `query` is `None`; validate that at least one of `query` or `filters` is present |
| Provider mixin | `src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py` | Branch on `query is None` to perform a metadata-only scan instead of `query_chunks` with an embedding |
| Inline providers | `providers/inline/vector_io/faiss`, `sqlite_vec`, `milvus`, etc. | Implement metadata-only retrieval path in `query_chunks` |
| Remote providers | `providers/remote/vector_io/chroma`, `pgvector`, `qdrant`, `weaviate`, etc. | Implement metadata-only retrieval path or raise `NotImplementedError` with a clear message |
| Filter utilities | `src/llama_stack/providers/utils/vector_io/filters.py` | No change needed; filter parsing is independent of `query` |
| FastAPI routes | `src/llama_stack_api/vector_io/fastapi_routes.py` | No change needed; request model change propagates automatically |
Example Usage
```python
# Retrieve all chunks tagged with topic="transformer-architectures"
# No similarity query; pure metadata retrieval
response = client.vector_stores.search(
    vector_store_id="vs_abc123",
    filters={"type": "eq", "key": "topic", "value": "transformer-architectures"},
    max_num_results=50,
    # query is omitted entirely
)

# Compound filter: a contiguous range of chunks (5-15) from a specific document
response = client.vector_stores.search(
    vector_store_id="vs_abc123",
    filters={
        "type": "and",
        "filters": [
            {"type": "eq", "key": "document_id", "value": "doc_789"},
            {"type": "gte", "key": "chunk_index", "value": 5},
            {"type": "lte", "key": "chunk_index", "value": 15},
        ],
    },
    max_num_results=100,
)
```

💡 Why is this needed? What if we don't build it?
Enabling Custom Retrieval Strategies Outside Llama Stack
Many production RAG systems go beyond a single similarity search call. Advanced retrieval strategies, such as window retrieval (expanding a match to include its neighboring chunks), parent-document retrieval (fetching the full parent after matching a child chunk), or chunk augmentation (enriching retrieved chunks with surrounding context), require the ability to fetch specific chunks by metadata after an initial similarity search has identified relevant regions.
Today, to implement window retrieval with Llama Stack, a developer must:
1. Perform a similarity search via `query_chunks`; this returns scored chunks, each carrying metadata (e.g., `document_id`, `chunk_index`).
2. For each matched chunk, retrieve its neighbors (e.g., `chunk_index - 1`, `chunk_index + 1`) to build a wider context window.

Step 2 is currently impossible through the Llama Stack API because retrieving chunks by `document_id` + `chunk_index` requires a metadata-only query, and the API mandates a query string. The workaround of passing an empty string or dummy query forces an unnecessary embedding computation and pollutes results with irrelevant similarity scores, defeating the purpose.
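With the proposed change, step 2 reduces to building a compound metadata filter around each match. A minimal sketch (the helper name is hypothetical; the filter schema follows the examples in this issue):

```python
def window_filter(document_id: str, chunk_index: int, radius: int = 1) -> dict:
    """Build a compound filter selecting chunks N-radius .. N+radius
    of a single document, for use with a metadata-only search."""
    return {
        "type": "and",
        "filters": [
            {"type": "eq", "key": "document_id", "value": document_id},
            {"type": "gte", "key": "chunk_index", "value": chunk_index - radius},
            {"type": "lte", "key": "chunk_index", "value": chunk_index + radius},
        ],
    }
```

The caller would pass the returned dict as `filters` (with `query` omitted) for each chunk matched in step 1.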
Concrete Use Cases Blocked Today
- Window/contextual retrieval: After similarity search identifies chunk N of a document, fetch chunks N-2 through N+2 to provide broader context to the LLM. This is a widely adopted technique (e.g., LlamaIndex's `SentenceWindowNodeParser`, LangChain's `ParentDocumentRetriever`).
- Custom rerankers and fusion strategies: External reranking pipelines that first retrieve a broad candidate set by metadata, then apply custom scoring logic outside Llama Stack.
- Batch export / audit: Retrieve all chunks belonging to a specific data source, user, or label for inspection, export, or compliance without needing a semantic query.
- Deduplication and data management: Find all chunks with a specific hash, source URL, or ingestion batch ID for cleanup or update operations.
- Hybrid orchestration: Multi-stage retrieval where the first stage is a fast metadata pre-filter (e.g., "all chunks from documents tagged 'legal' created this quarter") and the second stage is a targeted similarity search within that subset, orchestrated by application code outside Llama Stack.
What Happens If We Don't Build It
- Developers must bypass the API: Teams needing advanced retrieval are forced to access the underlying vector database directly (Faiss index, SQLite, Milvus, etc.), breaking the abstraction that Llama Stack provides. This couples application code to a specific provider and eliminates the portability benefit.
- Workaround tax: Passing a dummy query to satisfy the required parameter wastes compute on embedding generation and distance computation, introduces confusing similarity scores into results, and may return irrelevant chunks that pollute the actual metadata-filtered results (especially in vector-only mode where scores drive ranking).
- Incomplete platform story: Llama Stack positions itself as a complete AI application development framework. Without metadata-only retrieval, it cannot support the retrieval patterns that production RAG systems commonly rely on, pushing advanced users toward alternative frameworks or direct database access.
Other thoughts
Alignment with OpenAI Vector Store API
OpenAI's own Vector Store Search API currently also requires a query. However, the underlying architecture (Responses API with file_search tool) supports metadata filtering as a first-class concept. As the ecosystem evolves toward more composable retrieval, making query optional positions Llama Stack ahead of the curve and gives developers capabilities they cannot yet get from the OpenAI API directly.
Search Mode Interaction
When `query` is `None`, the `search_mode` parameter (`vector`, `keyword`, `hybrid`) becomes meaningless. The implementation should either:
- Ignore `search_mode` silently when `query` is absent, or
- Return a validation error if `search_mode` is explicitly set alongside a `None` query.

The former is recommended for ergonomic reasons: it avoids forcing callers to conditionally omit the parameter.
Provider Support Matrix
Not all providers support metadata filtering today. The feature request should be implemented progressively:
- Phase 1: Providers that already support filtering (`faiss`, `sqlite_vec`, `milvus`) add the metadata-only retrieval path.
- Phase 2: Remote providers (`chroma`, `pgvector`, `qdrant`, `weaviate`) add support as their backends natively support metadata queries.
- Providers that do not yet support filtering should raise a clear `NotImplementedError("Metadata-only retrieval is not supported by this provider")`.
Ordering of Results
Without similarity scores to rank by, the result ordering must be well-defined. Suggested default: order by `(document_id, chunk_index)` ascending, matching the natural document structure. This is especially important for window retrieval, where the caller expects chunks to come back in document order.
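A sketch of that deterministic ordering (assuming each chunk carries `document_id` and `chunk_index` in its metadata, as in the examples above; the helper name is hypothetical):

```python
def order_chunks(chunks: list[dict]) -> list[dict]:
    """Stable default ordering for metadata-only results:
    ascending (document_id, chunk_index), matching document structure."""
    return sorted(
        chunks,
        key=lambda c: (c["metadata"]["document_id"], c["metadata"]["chunk_index"]),
    )
```

Python's `sorted` is stable, so chunks that tie on both keys keep their original relative order, preserving determinism across calls.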
Pagination Consideration
Metadata-only queries are more likely to return large result sets (compared to similarity search, which is inherently top-K). The current response model includes `has_more` and `next_page` fields (both currently hardcoded to `False`/`None`). This feature would benefit from, and could motivate, implementing proper cursor-based pagination in a follow-up.