Description
System Info
llama-stack 0.6.0
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
The current contextual retrieval implementation (VectorStoreChunkingStrategyContextual) in llama-stack has two main issues that make it unsuitable for real-world use:
- Files API upload is mandatory (no external storage reference). Contextual retrieval is only available via vector_stores.files.create, which requires a file_id from the Files API upload endpoint. There is no way to use a document that already exists in external storage (e.g. S3) without uploading it through the llama-stack server, which duplicates input documents and adds the burden of keeping them in sync.
- One LLM call per chunk (no batching). For each chunk, a separate chat_completion call is made. With default chunking (~700-token chunks, 400-token overlap), a 50-page document yields ~200 chunks and therefore ~200 LLM calls; 500 such documents yield ~100,000 sequential requests. Concurrency is limited by a semaphore (default 3). There is no multi-chunk batching, so round-trip count and latency scale linearly with chunk count. Retry logic (up to 3 retries with exponential backoff per chunk) can make rate-limited runs much slower.
Combined impact: Indexing a 500-document corpus (50 pages each) requires uploading ~2.5 GB through the HTTP server (even if data is already in S3), ~100k LLM calls at low concurrency (hours of runtime), and there is no progress visibility or resumability.
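To make the scaling concrete, here is a back-of-the-envelope sketch of the chunk and call counts. The ~1,200 tokens/page figure is an assumption for a dense page, not a number from llama-stack; the chunk size and overlap match the defaults cited above.

```python
import math

def estimate_chunks(total_tokens: int, chunk_tokens: int = 700, overlap_tokens: int = 400) -> int:
    """Estimate chunk count for a sliding window: each new chunk
    advances by (chunk_tokens - overlap_tokens) tokens."""
    stride = chunk_tokens - overlap_tokens  # 300 tokens per step
    if total_tokens <= chunk_tokens:
        return 1
    return 1 + math.ceil((total_tokens - chunk_tokens) / stride)

# Assumption: ~1,200 tokens per page for a 50-page document.
doc_tokens = 50 * 1200                        # 60,000 tokens
chunks_per_doc = estimate_chunks(doc_tokens)  # ~199 chunks
total_calls = 500 * chunks_per_doc            # ~99,500 LLM calls, one per chunk
```

This reproduces the ~200 chunks per document and ~100k total LLM calls quoted above.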
Steps to reproduce
Problem 1 (no external reference):
- Put documents in S3 (or another object store) that llama-stack can access.
- Try to attach those documents to a vector store using contextual retrieval without uploading them via the Files API (e.g. by providing an S3 URI or presigned URL as the document source).
Problem 2 (one LLM call per chunk):
- Create a vector store with contextual chunking (e.g. VectorStoreChunkingStrategyContextual with default or custom chunk settings).
- Attach a multi-page document (e.g. 50+ pages) via the Files API.
- Inspect server logs or add logging; observe one chat_completion (or equivalent) request per chunk.
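A minimal model of the per-chunk loop described above (function names here are hypothetical; the real loop lives in openai_vector_store_mixin.py): a semaphore of 3 bounds concurrency while exactly one completion request is issued per chunk.

```python
import asyncio

async def contextualize_chunks(chunks, llm_call, max_concurrency: int = 3):
    """Model of the current behavior: one LLM request per chunk,
    bounded by a small semaphore (default 3 in llama-stack)."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(chunk):
        async with sem:
            return await llm_call(chunk)  # one chat_completion per chunk

    return await asyncio.gather(*(one(c) for c in chunks))

# Counting requests for 10 chunks shows the 1:1 call-to-chunk ratio.
calls = 0

async def fake_llm(chunk):
    global calls
    calls += 1
    return f"context for {chunk}"

results = asyncio.run(contextualize_chunks([f"chunk-{i}" for i in range(10)], fake_llm))
```

With 100,000 chunks and a concurrency cap of 3, the request count (and wall-clock time) grows linearly with chunk count, which is the behavior this issue reports.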
Actual results
- Problem 1: Contextual retrieval is only usable with documents uploaded through the Files API.
- Problem 2: Each chunk triggers exactly one LLM request. For ~200 chunks per 50-page doc and 500 docs, that is ~100,000 LLM calls.
Workarounds
Problem 1 & 2: Using vector_io.insert_chunks with pre-computed embeddings avoids the Files API and allows a custom batching implementation (users must implement contextualization and chunking themselves).
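As a sketch of this workaround, documents can be chunked and contextualized client-side in batches, then pushed through vector_io.insert_chunks. The batching helper below is a hypothetical illustration; the embedding and insert calls are shown only as comments because their exact shape depends on the client in use.

```python
def batch(items, size):
    """Group pre-computed chunks into batches so one LLM request can
    contextualize `size` chunks at a time instead of one per request."""
    return [items[i:i + size] for i in range(0, len(items), size)]

chunks = [f"chunk-{i}" for i in range(200)]  # e.g. one 50-page document
batches = batch(chunks, 20)                  # 10 LLM calls instead of 200

# For each batch: one LLM call returning a context string per chunk,
# then embed locally and insert directly, bypassing the Files API, e.g.:
#   client.vector_io.insert_chunks(vector_db_id=..., chunks=[...])
```

This trades server-side convenience for control: the caller owns chunking, contextualization, and embeddings, but avoids both the mandatory upload and the 1:1 call-to-chunk ratio.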
Code references
- src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py: _execute_contextual_chunk_transformation (~line 1572), openai_attach_file_to_vector_store (~line 983); contextual chunk loop ~lines 1619-1697 (one openai_chat_completion per chunk, semaphore, retries).
- Files API upload (full file read into the server):
  - src/llama_stack/providers/inline/files/localfs/files.py (~line 111)
  - src/llama_stack/providers/remote/files/s3/files.py (~line 236)
  - src/llama_stack/providers/remote/files/openai/files.py (~line 146)
- src/llama_stack_api/vector_io/models.py: VectorStoreChunkingStrategyContextualConfig (~line 371), InsertChunksRequest (~line 733).
Error logs
N/A
Expected behavior
- Problem 1: Support for external document references that can be used as the source for contextual retrieval without uploading files through the llama-stack server.
- Problem 2: Batch multiple chunks per LLM call (e.g. using structured output) and/or expose an optimized path (e.g. insert_chunks with a contextual strategy) so call count and runtime don't scale one-to-one with chunk count.
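A hedged sketch of what batching with structured output could look like: ask the model once for a JSON array of context strings, one per chunk. The prompt wording and response shape here are assumptions for illustration, not a proposed llama-stack API.

```python
import json

def build_batched_prompt(doc_summary: str, chunks: list[str]) -> str:
    """One prompt covering many chunks; the model is asked to return a
    JSON array with exactly one context string per chunk."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return (
        f"Document summary:\n{doc_summary}\n\n"
        f"For each of the {len(chunks)} chunks below, write a short context "
        "situating it within the document. Respond with a JSON array of "
        f"{len(chunks)} strings, in order.\n\n{numbered}"
    )

def parse_contexts(response_text: str, expected: int) -> list[str]:
    """Validate that the model returned one context per chunk."""
    contexts = json.loads(response_text)
    if len(contexts) != expected:
        raise ValueError("model returned wrong number of contexts")
    return contexts

# Simulated model response for a 3-chunk batch:
reply = '["intro section", "methods section", "results section"]'
contexts = parse_contexts(reply, 3)
```

With batches of 20 chunks, the 50-page example above would drop from ~200 calls to ~10 per document, at the cost of a length-validation step on the model's structured response.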