Vector store registrations lost after server/pod restart — orphaned collections in backend #5209

@jakub-walaszczyk

Description

System Info

  • Llama Stack version: latest (main branch)
  • Deployment: Kubernetes (containerized), but reproducible in any restart scenario
  • Vector IO provider: remote::milvus (production Milvus) and inline::milvus (Milvus Lite with PVC)
  • Storage backend: Default SQLite KVStore (kv_sqlite)
  • Python: 3.12

Information

  • The official example scripts
  • My own modified scripts

πŸ› Describe the bug

When a Llama Stack server (or pod) restarts, all vector store registrations are lost even though the underlying vector database (e.g., Milvus) retains its collections and data. This creates orphaned collections in the vector database that cannot be accessed, queried, or deleted through Llama Stack anymore.

Root Cause

Llama Stack uses a three-layer persistence architecture for vector stores:

  1. Distribution Registry — server-level metadata stored in a KVStore backend (default: SQLite at ~/.llama/distributions/<distro>/kvstore.db). This is where vector store registrations live.
  2. Provider-level metadata — each vector_io provider (Milvus, Faiss, ChromaDB, Qdrant) stores its own metadata in the same KVStore (under provider-specific namespace prefixes like vector_stores:milvus:v3::).
  3. Actual vector data — stored in the vector database backend itself (Milvus collections, Faiss indices, etc.).

The default storage configuration uses local SQLite:

# From distributions/starter/config.yaml
storage:
  backends:
    kv_default:
      type: kv_sqlite
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/kvstore.db
  stores:
    metadata:
      namespace: registry
      backend: kv_default

In a containerized (Kubernetes) deployment:

  • Users typically set up PVCs or external services for the vector database (Milvus), so vector data persists.
  • However, the Llama Stack KVStore (kvstore.db) resides on the container's ephemeral filesystem by default.
  • On pod restart, the SQLite file is lost, and with it all vector store registrations.
  • The Milvus collections still exist with all their data, but Llama Stack has no knowledge of them.

Consequences

After a restart with lost metadata:

  • GET /v1/vector_stores returns an empty list
  • Existing Milvus collections with data are inaccessible through Llama Stack
  • Users cannot delete orphaned collections via Llama Stack (since the registration is gone)
  • Users cannot re-register existing collections (no reconciliation/discovery mechanism)
  • The only way to clean up is to directly access Milvus and drop collections manually, bypassing Llama Stack entirely

Missing reconciliation mechanism

Even when the KVStore IS properly persisted (e.g., using kv_postgres or a PVC-backed SQLite path), there is no mechanism to reconcile Llama Stack's registry with the actual backend state. If the metadata store is corrupted, migrated, or if collections were created outside of Llama Stack, those collections become invisible orphans. The providers' initialize() methods only load from their own KVStore — they never scan the backend for existing collections.
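A reconciliation pass of the kind proposed here could look roughly like the following sketch. This is not Llama Stack code: the registry is modeled as a plain dict and the backend as a list of collection names, purely to illustrate the discover-and-re-register logic a provider's initialize() could run (a real implementation would call the backend's collection-listing API, e.g. Milvus, and rebuild metadata from collection properties).

```python
def reconcile(registry: dict[str, dict], backend_collections: list[str]) -> list[str]:
    """Re-register backend collections missing from the registry.

    Returns the identifiers that had to be recovered.
    """
    recovered = []
    for name in backend_collections:
        if name not in registry:
            # Rebuild a minimal registration; flag it so operators can audit it.
            registry[name] = {"identifier": name, "recovered": True}
            recovered.append(name)
    return recovered


registry = {"vs_a": {"identifier": "vs_a"}}
print(reconcile(registry, ["vs_a", "vs_b"]))  # ['vs_b']
```

The "recovered" flag matters in practice: re-registered entries may be missing fields (embedding model, dimension) that only the original registration knew, so they should be surfaced rather than silently merged.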

Secondary bug: Milvus provider initialization order

In src/llama_stack/providers/remote/vector_io/milvus/milvus.py, the MilvusVectorIOAdapter.initialize() method loads cached vector stores from KVStore before creating the MilvusClient:

async def initialize(self) -> None:
    self.kvstore = await kvstore_impl(self.config.persistence)
    # ... loads vector stores from kvstore here ...
    for vector_store_data in stored_vector_stores:
        vector_store = VectorStore.model_validate_json(vector_store_data)
        index = VectorStoreWithIndex(
            vector_store,
            index=MilvusIndex(
                client=self.client,  # <-- self.client is None at this point!
                ...
            ),
            ...
        )
        self.cache[vector_store.identifier] = index

    # MilvusClient is created AFTER loading cached stores
    if isinstance(self.config, RemoteMilvusVectorIOConfig):
        self.client = MilvusClient(...)  # <-- too late, cached indexes already have client=None
    else:
        self.client = MilvusClient(uri=uri)

Because self.client is still None when the cached indexes are constructed, each restored MilvusIndex stores None as its client; assigning the real MilvusClient to self.client afterwards rebinds the adapter attribute but does not update the already-built index objects. Any subsequent operation on these cached indexes fails. This means that even with a properly persisted KVStore, restored vector stores would be broken after restart.
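The underlying pitfall is ordinary Python name binding, reduced here to a minimal standalone example (nothing below is Llama Stack code; `Index` stands in for MilvusIndex):

```python
class Index:
    def __init__(self, client):
        # Stores whatever object `client` refers to at construction time.
        self.client = client


client = None
idx = Index(client)   # idx.client is bound to None
client = object()     # rebinding the name later does NOT update idx.client
print(idx.client)     # None
```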

Affected code paths

| Layer | File | What it stores | How |
| --- | --- | --- | --- |
| Distribution Registry | src/llama_stack/core/store/registry.py | Vector store registration metadata | KVStore key: distributions:registry:v10::vector_store:{id} |
| Milvus Provider | src/llama_stack/providers/remote/vector_io/milvus/milvus.py | Provider-level vector store metadata | KVStore key: vector_stores:milvus:v3::{id} |
| OpenAI Mixin | src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py | OpenAI-compatible store metadata | KVStore key: openai_vector_stores:milvus:v3::{id} |
| Milvus Backend | external Milvus / Milvus Lite | Actual embeddings and chunks | Milvus collections |

All three metadata layers use the same kv_default backend (SQLite by default). The Milvus backend is the only layer that persists independently.

Reproduction steps

  1. Deploy Llama Stack with Milvus provider in Kubernetes (or locally)
  2. Create a vector store via the API:
    curl -X POST http://localhost:8321/v1/vector_stores \
      -H "Content-Type: application/json" \
      -d '{"name": "my-store", "embedding_model": "all-MiniLM-L6-v2", "embedding_dimension": 384}'
  3. Insert data into the vector store
  4. Verify the vector store exists: GET /v1/vector_stores — returns the store
  5. Restart the Llama Stack server (or delete/recreate the pod)
  6. List vector stores again: GET /v1/vector_stores — returns empty list
  7. The Milvus collection still exists with all data (verifiable via Milvus SDK directly)
  8. Cannot access, query, or delete the collection through Llama Stack

Error logs

After restart, attempting to access a previously created vector store:

ValueError: vector_store `vs_<uuid>` not served by provider: `milvus`.
Make sure there is an VectorIO provider serving this vector_store.

Listing vector stores returns empty:

{
  "data": [],
  "has_more": false,
  "first_id": null,
  "last_id": null
}

Meanwhile, connecting to Milvus directly shows the collection still exists with all data intact.

Expected behavior

  1. Minimum viable fix: Vector store registrations should survive server restarts. Documentation should clearly state that the KVStore backend (kvstore.db) must be on persistent storage in containerized deployments, or users should use kv_postgres/kv_redis backends.

  2. Recommended fix: On provider initialization, implement a reconciliation mechanism that discovers existing collections in the backend and re-registers them in the Llama Stack registry if they are missing. This would handle:

    • Restarts with lost metadata
    • Metadata corruption
    • Migration between metadata backends
    • Collections created outside of Llama Stack
  3. Bug fix: Fix the Milvus provider initialization order — create the MilvusClient before loading cached vector stores from the KVStore, so restored MilvusIndex objects get a valid client reference.
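The ordering fix in point 3 can be illustrated with a toy model (FakeClient, ToyAdapter, and the attribute names are illustrative, not the actual Milvus adapter API):

```python
class FakeClient:
    """Stand-in for MilvusClient."""


class ToyAdapter:
    def __init__(self):
        self.client = None
        self.cache = {}

    def initialize(self, stored_store_ids):
        # Create the client FIRST, so restored indexes capture a live reference
        # instead of the None that self.client held before.
        self.client = FakeClient()
        for store_id in stored_store_ids:
            self.cache[store_id] = {"client": self.client}


adapter = ToyAdapter()
adapter.initialize(["vs_a", "vs_b"])
print(all(entry["client"] is adapter.client for entry in adapter.cache.values()))  # True
```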

Workaround

Use kv_postgres as the KVStore backend (requires a PostgreSQL instance), which persists independently of pod lifecycle:

storage:
  backends:
    kv_default:
      type: kv_postgres
      host: ${env.POSTGRES_HOST:=localhost}
      port: ${env.POSTGRES_PORT:=5432}
      db: ${env.POSTGRES_DB:=llamastack}
      user: ${env.POSTGRES_USER:=llamastack}
      password: ${env.POSTGRES_PASSWORD:=llamastack}

Or mount ~/.llama/distributions/<distro>/ on a PVC, but note the secondary initialization-order bug would still cause issues with the Milvus provider.
