🚀 Describe the new functionality needed
Note
This implementation proposal was drafted with assistance from Claude Code Opus 4.6.
The requirement comes from the field: custom retrievers allow external users to use llama-stack more flexibly. Users know their needs best and can build new features faster on top of llama-stack.
Important
This is purely an implementation proposal; it may be missing important context or not consider all architectural decisions. The implementation details may change, but the spirit of the changes remains the same.
Currently, both the internal `query_chunks` API and the OpenAI-compatible `POST /v1/vector_stores/{vector_store_id}/search` endpoint require a `query` parameter (text or multimodal content used for embedding similarity search). While metadata filters can be applied alongside a query to narrow results, there is no way to retrieve chunks based solely on metadata, without providing an embedding query.
This feature request proposes making the `query` parameter optional in:
- `QueryChunksRequest` (`llama_stack_api/vector_io/models.py`): change `query: InterleavedContent` to `query: InterleavedContent | None = None`
- `OpenAISearchVectorStoreRequest` (`llama_stack_api/vector_io/models.py`): change `query: str | list[str]` to `query: str | list[str] | None = None`
When `query` is `None` (or omitted) and `filters` are provided, the system should:
- Skip embedding generation and similarity search entirely; no vector distance computation is performed.
- Return all chunks matching the metadata filter, up to `max_num_results` (or `max_chunks`), ordered by insertion order or document/chunk ID (a deterministic, stable ordering).
- Set `score` to `1.0` (or a sentinel value) for all returned results, since no relevance scoring is applicable.
- Disallow the combination of `query=None` and `filters=None`; at least one must be provided. Return a `400 Bad Request` / `InvalidParameterError` if both are absent.
Affected Components
| Layer | File | Change |
|---|---|---|
| API models | `src/llama_stack_api/vector_io/models.py` | Make `query` optional in request models |
| Protocol | `src/llama_stack_api/vector_io/api.py` | Update docstrings to document metadata-only retrieval |
| Router | `src/llama_stack/core/routers/vector_io.py` | Skip query rewriting when `query` is `None`; validate that at least one of `query` or `filters` is present |
| Provider mixin | `src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py` | Branch on `query is None` to perform a metadata-only scan instead of `query_chunks` with an embedding |
| Inline providers | `providers/inline/vector_io/faiss`, `sqlite_vec`, `milvus`, etc. | Implement metadata-only retrieval path in `query_chunks` |
| Remote providers | `providers/remote/vector_io/chroma`, `pgvector`, `qdrant`, `weaviate`, etc. | Implement metadata-only retrieval path or raise `NotImplementedError` with a clear message |
| Filter utilities | `src/llama_stack/providers/utils/vector_io/filters.py` | No change needed; filter parsing is independent of `query` |
| FastAPI routes | `src/llama_stack_api/vector_io/fastapi_routes.py` | No change needed; request model change propagates automatically |
Example Usage
```python
# Retrieve all chunks tagged with topic="transformer-architectures"
# No similarity query; pure metadata retrieval
response = client.vector_stores.search(
    vector_store_id="vs_abc123",
    filters={"type": "eq", "key": "topic", "value": "transformer-architectures"},
    max_num_results=50,
    # query is omitted entirely
)

# Compound filter: a contiguous range of chunks (5-15) from a specific document
response = client.vector_stores.search(
    vector_store_id="vs_abc123",
    filters={
        "type": "and",
        "filters": [
            {"type": "eq", "key": "document_id", "value": "doc_789"},
            {"type": "gte", "key": "chunk_index", "value": 5},
            {"type": "lte", "key": "chunk_index", "value": 15},
        ],
    },
    max_num_results=100,
)
```

💡 Why is this needed? What if we don't build it?
Enabling Custom Retrieval Strategies Outside Llama Stack
Many production RAG systems go beyond a single similarity search call. Advanced retrieval strategies, such as window retrieval (expanding a match to include its neighboring chunks), parent-document retrieval (fetching the full parent after matching a child chunk), or chunk augmentation (enriching retrieved chunks with surrounding context), require the ability to fetch specific chunks by metadata after an initial similarity search has identified relevant regions.
Today, to implement window retrieval with Llama Stack, a developer must:
1. Perform a similarity search via `query_chunks`; this returns scored chunks, each carrying metadata (e.g., `document_id`, `chunk_index`).
2. For each matched chunk, retrieve its neighbors (e.g., `chunk_index - 1`, `chunk_index + 1`) to build a wider context window.

Step 2 is currently impossible through the Llama Stack API because retrieving chunks by `document_id` + `chunk_index` requires a metadata-only query, and the API mandates a query string. The workaround of passing an empty string or dummy query forces an unnecessary embedding computation and pollutes results with irrelevant similarity scores, defeating the purpose.
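With the proposed change, step 2 reduces to building a compound metadata filter around each match. A minimal sketch (the helper name is hypothetical; the filter schema follows the examples in this issue):

```python
def window_filter(document_id: str, chunk_index: int, radius: int = 1) -> dict:
    """Build a compound filter selecting chunks N-radius .. N+radius
    of a single document, for use with a metadata-only search."""
    return {
        "type": "and",
        "filters": [
            {"type": "eq", "key": "document_id", "value": document_id},
            {"type": "gte", "key": "chunk_index", "value": chunk_index - radius},
            {"type": "lte", "key": "chunk_index", "value": chunk_index + radius},
        ],
    }
```

The caller would pass the returned dict as `filters` (with `query` omitted) for each chunk matched in step 1.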
Concrete Use Cases Blocked Today
- Window/contextual retrieval: After similarity search identifies chunk N of a document, fetch chunks N-2 through N+2 to provide broader context to the LLM. This is a widely adopted technique (e.g., LlamaIndex's `SentenceWindowNodeParser`, LangChain's `ParentDocumentRetriever`).
- Custom rerankers and fusion strategies: External reranking pipelines that first retrieve a broad candidate set by metadata, then apply custom scoring logic outside Llama Stack.
- Batch export / audit: Retrieve all chunks belonging to a specific data source, user, or label for inspection, export, or compliance without needing a semantic query.
- Deduplication and data management: Find all chunks with a specific hash, source URL, or ingestion batch ID for cleanup or update operations.
- Hybrid orchestration: Multi-stage retrieval where the first stage is a fast metadata pre-filter (e.g., "all chunks from documents tagged 'legal' created this quarter") and the second stage is a targeted similarity search within that subset, orchestrated by application code outside Llama Stack.
What Happens If We Don't Build It
- Developers must bypass the API: Teams needing advanced retrieval are forced to access the underlying vector database directly (Faiss index, SQLite, Milvus, etc.), breaking the abstraction that Llama Stack provides. This couples application code to a specific provider and eliminates the portability benefit.
- Workaround tax: Passing a dummy query to satisfy the required parameter wastes compute on embedding generation and distance computation, introduces confusing similarity scores into results, and may return irrelevant chunks that pollute the actual metadata-filtered results (especially in vector-only mode where scores drive ranking).
- Incomplete platform story: Llama Stack positions itself as a complete AI application development framework. Without metadata-only retrieval, it cannot support the retrieval patterns that production RAG systems commonly rely on, pushing advanced users toward alternative frameworks or direct database access.
Other thoughts
Alignment with OpenAI Vector Store API
OpenAI's own Vector Store Search API currently also requires a query. However, the underlying architecture (Responses API with file_search tool) supports metadata filtering as a first-class concept. As the ecosystem evolves toward more composable retrieval, making query optional positions Llama Stack ahead of the curve and gives developers capabilities they cannot yet get from the OpenAI API directly.
Search Mode Interaction
When `query` is `None`, the `search_mode` parameter (`vector`, `keyword`, `hybrid`) becomes meaningless. The implementation should either:
- Ignore `search_mode` silently when `query` is absent, or
- Return a validation error if `search_mode` is explicitly set alongside a `None` query.

The former is recommended for ergonomic reasons: it avoids forcing callers to conditionally omit the parameter.
Provider Support Matrix
Not all providers support metadata filtering today. The feature request should be implemented progressively:
- Phase 1: Providers that already support filtering (`faiss`, `sqlite_vec`, `milvus`) add the metadata-only retrieval path.
- Phase 2: Remote providers (`chroma`, `pgvector`, `qdrant`, `weaviate`) add support as their backends natively support metadata queries.
- Providers that do not yet support filtering should raise a clear `NotImplementedError("Metadata-only retrieval is not supported by this provider")`.
Ordering of Results
Without similarity scores to rank by, the result ordering must be well-defined. Suggested default: order by `(document_id, chunk_index)` ascending, matching the natural document structure. This is especially important for window retrieval, where the caller expects chunks to come back in document order.
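A sketch of that deterministic ordering (assuming each chunk carries `document_id` and `chunk_index` in its metadata, as in the examples above; the helper name is hypothetical):

```python
def order_chunks(chunks: list[dict]) -> list[dict]:
    """Stable default ordering for metadata-only results:
    ascending (document_id, chunk_index), matching document structure."""
    return sorted(
        chunks,
        key=lambda c: (c["metadata"]["document_id"], c["metadata"]["chunk_index"]),
    )
```

Python's `sorted` is stable, so chunks that tie on both keys keep their original relative order, preserving determinism across calls.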
Pagination Consideration
Metadata-only queries are more likely to return large result sets (compared to similarity search, which is inherently top-K). The current response model includes `has_more` and `next_page` fields (both currently hardcoded to `False`/`None`). This feature would benefit from, and could motivate, implementing proper cursor-based pagination in a follow-up.