Description
System Info
llama-stack 0.6.0
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
The current contextual retrieval implementation (VectorStoreChunkingStrategyContextual) in llama-stack has two main issues that make it unsuitable for real-world use:
- Files API upload is mandatory (no external storage reference). Contextual retrieval is only available via vector_stores.files.create, which requires a file_id from the Files API upload endpoint. There is no way to use a document that already exists in external storage (e.g. S3) without uploading it through the llama-stack server, which duplicates input documents and adds the burden of keeping them in sync.
- One LLM call per chunk (no batching). For each chunk, a separate chat_completion call is made. With default chunking (~700-token chunks, 400-token overlap), a 50-page document yields ~200 chunks and therefore ~200 LLM calls; 500 such documents yield ~100,000 sequential requests. Concurrency is limited by a semaphore (default 3). There is no multi-chunk batching, so round-trip count and latency scale linearly with chunk count. Retry logic (up to 3 retries with exponential backoff per chunk) can make rate-limited runs much slower.
Combined impact: Indexing a 500-document corpus (50 pages each) requires uploading ~2.5 GB through the HTTP server (even if data is already in S3), ~100k LLM calls at low concurrency (hours of runtime), and there is no progress visibility or resumability.
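To make the scaling concrete, here is a back-of-the-envelope sketch of the chunk and call counts. The ~1,200 tokens/page figure is an assumption for a dense page, not a number from llama-stack; the chunk size and overlap match the defaults cited above.

```python
import math

def estimate_chunks(total_tokens: int, chunk_tokens: int = 700, overlap_tokens: int = 400) -> int:
    """Estimate chunk count for a sliding window: each new chunk
    advances by (chunk_tokens - overlap_tokens) tokens."""
    stride = chunk_tokens - overlap_tokens  # 300 tokens per step
    if total_tokens <= chunk_tokens:
        return 1
    return 1 + math.ceil((total_tokens - chunk_tokens) / stride)

# Assumption: ~1,200 tokens per page for a 50-page document.
doc_tokens = 50 * 1200                        # 60,000 tokens
chunks_per_doc = estimate_chunks(doc_tokens)  # ~199 chunks
total_calls = 500 * chunks_per_doc            # ~99,500 LLM calls, one per chunk
```

This reproduces the ~200 chunks per document and ~100k total LLM calls quoted above.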
Steps to reproduce
Problem 1 (no external reference):
- Put documents in S3 (or another object store) that llama-stack can access.
- Try to attach those documents to a vector store using contextual retrieval without uploading them via the Files API (e.g. by providing an S3 URI or presigned URL as the document source).
Problem 2 (one LLM call per chunk):
- Create a vector store with contextual chunking (e.g. VectorStoreChunkingStrategyContextual with default or custom chunk settings).
- Attach a multi-page document (e.g. 50+ pages) via the Files API.
- Inspect server logs or add logging; observe one chat_completion (or equivalent) request per chunk.
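A minimal model of the per-chunk loop described above (function names here are hypothetical; the real loop lives in openai_vector_store_mixin.py): a semaphore of 3 bounds concurrency while exactly one completion request is issued per chunk.

```python
import asyncio

async def contextualize_chunks(chunks, llm_call, max_concurrency: int = 3):
    """Model of the current behavior: one LLM request per chunk,
    bounded by a small semaphore (default 3 in llama-stack)."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(chunk):
        async with sem:
            return await llm_call(chunk)  # one chat_completion per chunk

    return await asyncio.gather(*(one(c) for c in chunks))

# Counting requests for 10 chunks shows the 1:1 call-to-chunk ratio.
calls = 0

async def fake_llm(chunk):
    global calls
    calls += 1
    return f"context for {chunk}"

results = asyncio.run(contextualize_chunks([f"chunk-{i}" for i in range(10)], fake_llm))
```

With 100,000 chunks and a concurrency cap of 3, the request count (and wall-clock time) grows linearly with chunk count, which is the behavior this issue reports.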
Actual results
- Problem 1: Contextual retrieval is only usable with documents uploaded through the Files API.
- Problem 2: Each chunk triggers exactly one LLM request. For ~200 chunks per 50-page doc and 500 docs, that is ~100,000 LLM calls.
Workarounds
Problem 1 & 2: Using vector_io.insert_chunks with pre-computed embeddings avoids the Files API and allows a custom batching implementation (users must implement contextualization and chunking themselves).
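As a sketch of this workaround, documents can be chunked and contextualized client-side in batches, then pushed through vector_io.insert_chunks. The batching helper below is a hypothetical illustration; the embedding and insert calls are shown only as comments because their exact shape depends on the client in use.

```python
def batch(items, size):
    """Group pre-computed chunks into batches so one LLM request can
    contextualize `size` chunks at a time instead of one per request."""
    return [items[i:i + size] for i in range(0, len(items), size)]

chunks = [f"chunk-{i}" for i in range(200)]  # e.g. one 50-page document
batches = batch(chunks, 20)                  # 10 LLM calls instead of 200

# For each batch: one LLM call returning a context string per chunk,
# then embed locally and insert directly, bypassing the Files API, e.g.:
#   client.vector_io.insert_chunks(vector_db_id=..., chunks=[...])
```

This trades server-side convenience for control: the caller owns chunking, contextualization, and embeddings, but avoids both the mandatory upload and the 1:1 call-to-chunk ratio.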
Code references
- src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py: _execute_contextual_chunk_transformation (~line 1572), openai_attach_file_to_vector_store (~line 983); contextual chunk loop ~lines 1619-1697 (one openai_chat_completion per chunk, semaphore, retries).
- Files API upload (full file read into the server):
  - src/llama_stack/providers/inline/files/localfs/files.py (~line 111)
  - src/llama_stack/providers/remote/files/s3/files.py (~line 236)
  - src/llama_stack/providers/remote/files/openai/files.py (~line 146)
- src/llama_stack_api/vector_io/models.py: VectorStoreChunkingStrategyContextualConfig (~line 371), InsertChunksRequest (~line 733).
Error logs
N/A
Expected behavior
- Problem 1: Support for external document references that can be used as the source for contextual retrieval without uploading files through the llama-stack server.
- Problem 2: Batch multiple chunks per LLM call (e.g. using structured output) and/or expose an optimized path (e.g. insert_chunks with a contextual strategy) so call count and runtime don't scale one-to-one with chunk count.
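A hedged sketch of what batching with structured output could look like: ask the model once for a JSON array of context strings, one per chunk. The prompt wording and response shape here are assumptions for illustration, not a proposed llama-stack API.

```python
import json

def build_batched_prompt(doc_summary: str, chunks: list[str]) -> str:
    """One prompt covering many chunks; the model is asked to return a
    JSON array with exactly one context string per chunk."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    return (
        f"Document summary:\n{doc_summary}\n\n"
        f"For each of the {len(chunks)} chunks below, write a short context "
        "situating it within the document. Respond with a JSON array of "
        f"{len(chunks)} strings, in order.\n\n{numbered}"
    )

def parse_contexts(response_text: str, expected: int) -> list[str]:
    """Validate that the model returned one context per chunk."""
    contexts = json.loads(response_text)
    if len(contexts) != expected:
        raise ValueError("model returned wrong number of contexts")
    return contexts

# Simulated model response for a 3-chunk batch:
reply = '["intro section", "methods section", "results section"]'
contexts = parse_contexts(reply, 3)
```

With batches of 20 chunks, the 50-page example above would drop from ~200 calls to ~10 per document, at the cost of a length-validation step on the model's structured response.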