Vector store registrations lost after server/pod restart — orphaned collections in backend #5209

@jakub-walaszczyk

Description

System Info

  • Llama Stack version: latest (main branch)
  • Deployment: Kubernetes (containerized), but reproducible in any restart scenario
  • Vector IO provider: remote::milvus (production Milvus) and inline::milvus (Milvus Lite with PVC)
  • Storage backend: Default SQLite KVStore (kv_sqlite)
  • Python: 3.12

Information

  • The official example scripts
  • My own modified scripts

πŸ› Describe the bug

When a Llama Stack server (or pod) restarts, all vector store registrations are lost even though the underlying vector database (e.g., Milvus) retains its collections and data. This creates orphaned collections in the vector database that cannot be accessed, queried, or deleted through Llama Stack anymore.

Root Cause

Llama Stack uses a three-layer persistence architecture for vector stores:

  1. Distribution Registry — server-level metadata stored in a KVStore backend (default: SQLite at ~/.llama/distributions/<distro>/kvstore.db). This is where vector store registrations live.
  2. Provider-level metadata — each vector_io provider (Milvus, Faiss, ChromaDB, Qdrant) stores its own metadata in the same KVStore (under provider-specific namespace prefixes like vector_stores:milvus:v3::).
  3. Actual vector data — stored in the vector database backend itself (Milvus collections, Faiss indices, etc.).

The default storage configuration uses local SQLite:

# From distributions/starter/config.yaml
storage:
  backends:
    kv_default:
      type: kv_sqlite
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/kvstore.db
  stores:
    metadata:
      namespace: registry
      backend: kv_default

In a containerized (Kubernetes) deployment:

  • Users typically set up PVCs or external services for the vector database (Milvus), so vector data persists.
  • However, the Llama Stack KVStore (kvstore.db) resides on the container's ephemeral filesystem by default.
  • On pod restart, the SQLite file is lost, and with it all vector store registrations.
  • The Milvus collections still exist with all their data, but Llama Stack has no knowledge of them.

Consequences

After a restart with lost metadata:

  • GET /v1/vector_stores returns an empty list
  • Existing Milvus collections with data are inaccessible through Llama Stack
  • Users cannot delete orphaned collections via Llama Stack (since the registration is gone)
  • Users cannot re-register existing collections (no reconciliation/discovery mechanism)
  • The only way to clean up is to directly access Milvus and drop collections manually, bypassing Llama Stack entirely

Missing reconciliation mechanism

Even when the KVStore IS properly persisted (e.g., using kv_postgres or a PVC-backed SQLite path), there is no mechanism to reconcile Llama Stack's registry with the actual backend state. If the metadata store is corrupted, migrated, or if collections were created outside of Llama Stack, those collections become invisible orphans. The providers' initialize() methods only load from their own KVStore — they never scan the backend for existing collections.
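A reconciliation pass of the kind proposed here could look roughly like the following sketch. This is not Llama Stack code: the registry is modeled as a plain dict and the backend as a list of collection names, purely to illustrate the discover-and-re-register logic a provider's initialize() could run (a real implementation would call the backend's collection-listing API, e.g. Milvus, and rebuild metadata from collection properties).

```python
def reconcile(registry: dict[str, dict], backend_collections: list[str]) -> list[str]:
    """Re-register backend collections missing from the registry.

    Returns the identifiers that had to be recovered.
    """
    recovered = []
    for name in backend_collections:
        if name not in registry:
            # Rebuild a minimal registration; flag it so operators can audit it.
            registry[name] = {"identifier": name, "recovered": True}
            recovered.append(name)
    return recovered


registry = {"vs_a": {"identifier": "vs_a"}}
print(reconcile(registry, ["vs_a", "vs_b"]))  # ['vs_b']
```

The "recovered" flag matters in practice: re-registered entries may be missing fields (embedding model, dimension) that only the original registration knew, so they should be surfaced rather than silently merged.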

Secondary bug: Milvus provider initialization order

In src/llama_stack/providers/remote/vector_io/milvus/milvus.py, the MilvusVectorIOAdapter.initialize() method loads cached vector stores from KVStore before creating the MilvusClient:

async def initialize(self) -> None:
    self.kvstore = await kvstore_impl(self.config.persistence)
    # ... loads vector stores from kvstore here ...
    for vector_store_data in stored_vector_stores:
        vector_store = VectorStore.model_validate_json(vector_store_data)
        index = VectorStoreWithIndex(
            vector_store,
            index=MilvusIndex(
                client=self.client,  # <-- self.client is None at this point!
                ...
            ),
            ...
        )
        self.cache[vector_store.identifier] = index

    # MilvusClient is created AFTER loading cached stores
    if isinstance(self.config, RemoteMilvusVectorIOConfig):
        self.client = MilvusClient(...)  # <-- too late, cached indexes already have client=None
    else:
        self.client = MilvusClient(uri=uri)

Because self.client is still None when the cached indexes are constructed, each restored MilvusIndex stores None as its client; assigning the real MilvusClient to self.client afterwards rebinds the adapter attribute but does not update the already-built index objects. Any subsequent operation on these cached indexes fails. This means that even with a properly persisted KVStore, restored vector stores would be broken after restart.
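The underlying pitfall is ordinary Python name binding, reduced here to a minimal standalone example (nothing below is Llama Stack code; `Index` stands in for MilvusIndex):

```python
class Index:
    def __init__(self, client):
        # Stores whatever object `client` refers to at construction time.
        self.client = client


client = None
idx = Index(client)   # idx.client is bound to None
client = object()     # rebinding the name later does NOT update idx.client
print(idx.client)     # None
```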

Affected code paths

| Layer | File | What it stores | How |
| --- | --- | --- | --- |
| Distribution Registry | src/llama_stack/core/store/registry.py | Vector store registration metadata | KVStore key: distributions:registry:v10::vector_store:{id} |
| Milvus Provider | src/llama_stack/providers/remote/vector_io/milvus/milvus.py | Provider-level vector store metadata | KVStore key: vector_stores:milvus:v3::{id} |
| OpenAI Mixin | src/llama_stack/providers/utils/memory/openai_vector_store_mixin.py | OpenAI-compatible store metadata | KVStore key: openai_vector_stores:milvus:v3::{id} |
| Milvus Backend | external Milvus / Milvus Lite | Actual embeddings and chunks | Milvus collections |

All three metadata layers use the same kv_default backend (SQLite by default). The Milvus backend is the only layer that persists independently.

Reproduction steps

  1. Deploy Llama Stack with Milvus provider in Kubernetes (or locally)
  2. Create a vector store via the API:
    curl -X POST http://localhost:8321/v1/vector_stores \
      -H "Content-Type: application/json" \
      -d '{"name": "my-store", "embedding_model": "all-MiniLM-L6-v2", "embedding_dimension": 384}'
  3. Insert data into the vector store
  4. Verify the vector store exists: GET /v1/vector_stores — returns the store
  5. Restart the Llama Stack server (or delete/recreate the pod)
  6. List vector stores again: GET /v1/vector_stores — returns empty list
  7. The Milvus collection still exists with all data (verifiable via Milvus SDK directly)
  8. Cannot access, query, or delete the collection through Llama Stack

Error logs

After restart, attempting to access a previously created vector store:

ValueError: vector_store `vs_<uuid>` not served by provider: `milvus`.
Make sure there is an VectorIO provider serving this vector_store.

Listing vector stores returns empty:

{
  "data": [],
  "has_more": false,
  "first_id": null,
  "last_id": null
}

Meanwhile, connecting to Milvus directly shows the collection still exists with all data intact.

Expected behavior

  1. Minimum viable fix: Vector store registrations should survive server restarts. Documentation should clearly state that the KVStore backend (kvstore.db) must be on persistent storage in containerized deployments, or users should use kv_postgres/kv_redis backends.

  2. Recommended fix: On provider initialization, implement a reconciliation mechanism that discovers existing collections in the backend and re-registers them in the Llama Stack registry if they are missing. This would handle:

    • Restarts with lost metadata
    • Metadata corruption
    • Migration between metadata backends
    • Collections created outside of Llama Stack
  3. Bug fix: Fix the Milvus provider initialization order — create the MilvusClient before loading cached vector stores from the KVStore, so restored MilvusIndex objects get a valid client reference.
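The ordering fix in point 3 can be illustrated with a toy model (FakeClient, ToyAdapter, and the attribute names are illustrative, not the actual Milvus adapter API):

```python
class FakeClient:
    """Stand-in for MilvusClient."""


class ToyAdapter:
    def __init__(self):
        self.client = None
        self.cache = {}

    def initialize(self, stored_store_ids):
        # Create the client FIRST, so restored indexes capture a live reference
        # instead of the None that self.client held before.
        self.client = FakeClient()
        for store_id in stored_store_ids:
            self.cache[store_id] = {"client": self.client}


adapter = ToyAdapter()
adapter.initialize(["vs_a", "vs_b"])
print(all(entry["client"] is adapter.client for entry in adapter.cache.values()))  # True
```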

Workaround

Use kv_postgres as the KVStore backend (requires a PostgreSQL instance), which persists independently of pod lifecycle:

storage:
  backends:
    kv_default:
      type: kv_postgres
      host: ${env.POSTGRES_HOST:=localhost}
      port: ${env.POSTGRES_PORT:=5432}
      db: ${env.POSTGRES_DB:=llamastack}
      user: ${env.POSTGRES_USER:=llamastack}
      password: ${env.POSTGRES_PASSWORD:=llamastack}

Or mount ~/.llama/distributions/<distro>/ on a PVC, but note the secondary initialization-order bug would still cause issues with the Milvus provider.
