Contextual Retrieval: scalability and usability issues #5194

@LukaszCmielowski

Description

System Info

llama-stack 0.6.0

Information

  • The official example scripts
  • My own modified scripts

πŸ› Describe the bug

The current contextual retrieval implementation (VectorStoreChunkingStrategyContextual) in llama-stack has two main issues that make it unsuitable for real-world use:

  1. Files API upload is mandatory; no external storage reference. Contextual retrieval is only available via vector_stores.files.create, which requires a file_id from the Files API upload endpoint. There is no way to use a document that already exists in external storage (e.g. S3) without uploading it through the llama-stack server, which duplicates input documents and adds the burden of keeping the copies in sync.

  2. One LLM call per chunk; no batching. For each chunk, a separate chat_completion call is made. With default chunking (~700-token chunks, 400-token overlap), a 50-page document yields ~200 chunks and therefore ~200 LLM calls; 500 such documents yield ~100,000 sequential requests. Concurrency is limited by a semaphore (default 3). There is no multi-chunk batching, so round-trip count and latency scale linearly with chunk count, and the retry logic (up to 3 retries with exponential backoff per chunk) can make rate-limited runs much slower.

Combined impact: Indexing a 500-document corpus (50 pages each) requires uploading ~2.5 GB through the HTTP server (even if the data is already in S3) and ~100k LLM calls at low concurrency (hours of runtime), with no progress visibility or resumability.
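A back-of-envelope check of the figures above (the per-page token count is an assumption, not a value from the code):

```python
import math

# Default chunking: ~700-token chunks with 400-token overlap,
# i.e. ~300 tokens of new text per chunk.
chunk_tokens, overlap = 700, 400
stride = chunk_tokens - overlap                 # 300

tokens_per_page = 1200                          # rough assumption for a dense page
pages, docs = 50, 500

doc_tokens = pages * tokens_per_page            # 60,000 tokens per document
chunks_per_doc = math.ceil((doc_tokens - overlap) / stride)
total_llm_calls = chunks_per_doc * docs         # one chat_completion per chunk

print(chunks_per_doc, total_llm_calls)          # ~200 chunks, ~100k calls
```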


Steps to reproduce

Problem 1 (no external reference):

  1. Put documents in S3 (or another object store) that llama-stack can access.
  2. Try to attach those documents to a vector store using contextual retrieval without uploading them via the Files API (e.g. by providing an S3 URI or presigned URL as the document source).

Problem 2 (one LLM call per chunk):

  1. Create a vector store with contextual chunking (e.g. VectorStoreChunkingStrategyContextual with default or custom chunk settings).
  2. Attach a multi-page document (e.g. 50+ pages) via the Files API.
  3. Inspect server logs or add logging; observe one chat_completion (or equivalent) request per chunk.

Actual results

  • Problem 1: Contextual retrieval is only usable with documents uploaded through the Files API.
  • Problem 2: Each chunk triggers exactly one LLM request. For ~200 chunks per 50-page doc and 500 docs, that is ~100,000 LLM calls.

Workarounds

Problem 1 & 2: Using vector_io.insert_chunks with pre-computed embeddings avoids the Files API and allows a custom batching implementation, but users must implement chunking and contextualization themselves.
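The shape of that workaround can be sketched as follows. `chunk_text`, `contextualize`, and the stub LLM below are illustrative helpers, not llama-stack APIs; in practice the enriched chunks (plus embeddings) would then be passed to vector_io.insert_chunks.

```python
# Sketch: chunk + contextualize client-side, batching several chunks per LLM
# call instead of one call per chunk. `call_llm` is a placeholder for your own
# inference call, not a llama-stack API.
from typing import Callable

def chunk_text(text: str, size: int = 700, overlap: int = 400) -> list[str]:
    """Naive character-based chunking (token-based in practice)."""
    stride = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), stride)]

def contextualize(
    doc: str,
    chunks: list[str],
    call_llm: Callable[[str], list[str]],
    batch_size: int = 10,
) -> list[str]:
    """One LLM call per batch of chunks instead of one call per chunk."""
    out: list[str] = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        prompt = f"Document:\n{doc}\n\nWrite one situating context per chunk:\n" + \
            "\n".join(f"[{j}] {c}" for j, c in enumerate(batch))
        contexts = call_llm(prompt)        # expected: one context string per chunk
        out.extend(f"{ctx}\n{c}" for ctx, c in zip(contexts, batch))
    return out

# Stub LLM for illustration: returns one dummy context per chunk in the batch.
calls = 0
def fake_llm(prompt: str) -> list[str]:
    global calls
    calls += 1
    n = prompt.count("[")                  # crude: one "[j]" marker per chunk
    return [f"context-{k}" for k in range(n)]

doc = "x" * 5000
chunks = chunk_text(doc)
enriched = contextualize(doc, chunks, fake_llm, batch_size=10)
print(len(chunks), calls)                  # far fewer LLM calls than chunks
```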


Code references

  • src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py: _execute_contextual_chunk_transformation (~line 1572), openai_attach_file_to_vector_store (~line 983); contextual chunk loop ~1619–1697 (one openai_chat_completion per chunk, semaphore, retries).
  • Files API upload (full read into server):
    • src/llama_stack/providers/inline/files/localfs/files.py (~line 111)
    • src/llama_stack/providers/remote/files/s3/files.py (~line 236)
    • src/llama_stack/providers/remote/files/openai/files.py (~line 146)
  • src/llama_stack_api/vector_io/models.py: VectorStoreChunkingStrategyContextualConfig (~line 371), InsertChunksRequest (~line 733).

Error logs

N/A

Expected behavior

  • Problem 1: Support for external document references that can be used as the source for contextual retrieval without uploading files through the llama-stack server.
  • Problem 2: Batch multiple chunks per LLM call (e.g. via structured output) and/or expose an optimized path (e.g. insert_chunks with a contextual strategy) so call count and runtime don't scale one-to-one with chunk count.
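One possible shape for the batched, structured-output path (an illustration of the idea, not a proposed API; `fake_completion` stands in for the real chat_completion call): ask the model for a JSON array with exactly one context per chunk, and validate the count before use.

```python
import json
import math

def batched_context_prompt(doc_excerpt: str, batch: list[str]) -> str:
    numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(batch))
    return (
        f"Document excerpt:\n{doc_excerpt}\n\n"
        f"Return a JSON array of {len(batch)} strings, one situating "
        f"context per numbered chunk:\n{numbered}"
    )

def fake_completion(prompt: str) -> str:
    # Pretend the model complied with the structured-output instruction.
    n = int(prompt.split("JSON array of ")[1].split(" ")[0])
    return json.dumps([f"ctx {i}" for i in range(n)])

def contexts_for(doc: str, chunks: list[str], batch_size: int = 20) -> list[str]:
    out: list[str] = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        reply = json.loads(fake_completion(batched_context_prompt(doc, batch)))
        # Reject malformed replies instead of silently misaligning contexts.
        assert len(reply) == len(batch), "model must return one context per chunk"
        out.extend(reply)
    return out

chunks = [f"chunk {i}" for i in range(200)]
ctxs = contexts_for("doc", chunks, batch_size=20)
print(len(ctxs), math.ceil(len(chunks) / 20))   # 200 contexts from 10 calls
```

With a batch size of 20, the ~200 chunks of a 50-page document would need ~10 calls instead of ~200.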
