Description
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: 2.6 branch (image: 2.6-20260323-ffead09b-amd64)
- Deployment mode: cluster
- MQ type: pulsar
- SDK version: pymilvus (bundled in chaos test image)
- OS: Linux (K8s)
- CPU/Memory: CI environment
- Others: Chaos test (chaos-mesh pod-failure injection on etcd-followers)
K8s Pod List
NAME READY STATUS RESTARTS AGE
etcd-followers-pod-failure-22986-0 1/1 Running 0 32m
etcd-followers-pod-failure-22986-1 1/1 Running 2 (8m24s ago) 32m
etcd-followers-pod-failure-22986-2 1/1 Running 0 32m
etcd-followers-pod-failure-22986-milvus-datanode-5cfdc88f64tgbc 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-datanode-5cfdc88f6v27nc 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-mixcoord-5875d97b8vqvwn 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-proxy-5bc6d97946-69twm 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-querynode-5768847c6vq24 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-querynode-5768847cfd9lb 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-querynode-5768847clr276 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-streamingnode-794cjrhhl 1/1 Running 3 (31m ago) 32m
etcd-followers-pod-failure-22986-milvus-streamingnode-794clv8lz 1/1 Running 3 (31m ago) 32m
Current Behavior
In the chaos release nightly cron test (chaos-test-for-release-cron), when InsertChecker and AddFieldChecker run concurrently, insert operations fail completely with Op.insert succ rate 0, total: 0.
Client-side error (what the test sees):
Error in InsertChecker.run_task: <MilvusException: (code=1, message=Unexpected error, message=<Cannot invoke RPC on closed channel!>)>
This error repeats continuously — the insert checker never recovers during the entire test run.
Server-side error (found in Milvus proxy logs via Loki):
After the gRPC channel eventually recovers, inserts are rejected by the proxy due to schema mismatch caused by concurrent add_field operations:
[WARN] [proxy/impl.go:2605] ["Failed to execute insert task in task scheduler: collection schema mismatch[collection=Checker__fu3YcGKb]"]
[INFO] [proxy/task_insert.go:151] ["collection schema mismatch"]
[collectionName=Checker__fu3YcGKb]
[requestSchemaTs=465128194619211786]
[collectionSchemaTs=465128214332441585]
[len(FieldsData)=39] [NumRows=3000]
The proxy performs a strict schemaTimestamp equality check (task_insert.go:149-152). Once add_field updates the schema, every insert request carrying the old timestamp is rejected. pymilvus's retry_on_schema_mismatch handler invalidates the client cache and retries, but the retry re-sends the same payload (39 fields) while the new schema expects 40 fields, so it fails again.
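The failure loop can be modeled in a few lines of plain Python. This is an illustrative sketch, not the actual proxy or pymilvus code; all function and variable names here are hypothetical:

```python
# Hypothetical model of the strict check in task_insert.go:149-152 and the
# pymilvus retry_on_schema_mismatch behavior described above.

class SchemaMismatch(Exception):
    pass

def proxy_validate(request_ts, request_fields, schema_ts, schema_fields):
    # Strict timestamp equality: any difference is rejected, even if the
    # payload would still be compatible with the new schema.
    if request_ts != schema_ts:
        raise SchemaMismatch("schemaTs mismatch")
    # With a matching timestamp, the payload must carry every schema field.
    if len(request_fields) != len(schema_fields):
        raise SchemaMismatch("field count mismatch")

def insert_with_retry(payload_fields, client_cached_ts, server):
    # Models the client: on mismatch the schema cache (timestamp) is
    # refreshed, but the already-built payload is NOT rebuilt.
    ts = client_cached_ts
    for _attempt in range(2):  # initial attempt + one retry
        try:
            proxy_validate(ts, payload_fields, server["ts"], server["fields"])
            return "ok"
        except SchemaMismatch:
            ts = server["ts"]  # cache invalidated and refreshed
    return "failed"

# add_field bumped the collection from 39 fields (ts=100) to 40 (ts=200).
server = {"ts": 200, "fields": list(range(40))}
old_payload = list(range(39))  # data built against the pre-add_field schema
print(insert_with_retry(old_payload, 100, server))  # -> "failed"
```

The first attempt fails the timestamp check; the retry passes it (the cache was refreshed) but then fails the field-count check, which matches the `len(FieldsData)=39` vs. 40-field schema seen in the logs.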
This results in nearly all chaos test builds failing. Only ~9 out of 100 recent builds passed.
Jenkins reference: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/22986/pipeline
Expected Behavior
Since add_field requires the new field to be nullable (enforced by internal/proxy/task.go:672-673), insert requests with the old schema (missing the new nullable field) should still succeed. The server should accept compatible data rather than strictly rejecting on timestamp mismatch.
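One possible relaxed check, sketched in Python under the assumption that timestamp mismatches are tolerated whenever the fields the request is missing are all nullable (names are illustrative, not the actual server code):

```python
# Hypothetical compatibility check: since add_field only adds nullable
# fields (enforced at internal/proxy/task.go:672-673), an old-schema insert
# that merely lacks the new field can be accepted and null-filled.

def compatible_insert_check(request_ts, request_field_names, schema_ts, schema_fields):
    if request_ts == schema_ts:
        return True  # schema unchanged, fast path
    missing = [f for f in schema_fields if f["name"] not in request_field_names]
    # Accept iff every field the request omits can legally be null.
    return all(f["nullable"] for f in missing)

# 39 original fields plus one nullable field added by add_field.
schema = [{"name": f"f{i}", "nullable": False} for i in range(39)]
schema.append({"name": "f39", "nullable": True})
old_request = {f"f{i}" for i in range(39)}
print(compatible_insert_check(100, old_request, 200, schema))  # -> True
```

Under this check the old-timestamp insert above would be accepted, while an insert that omits a non-nullable field would still be rejected.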
Steps To Reproduce
- Deploy Milvus cluster (2.6 branch)
- Create a collection with dynamic field enabled and multiple field types
- Start concurrent insert operations on the collection
- Concurrently call add_field to add a new nullable field to the same collection
- Observe that all insert operations after add_field fail
Milvus Log
Proxy logs — schema mismatch rejection:
[2026/03/24 03:35:46.620 +00:00] [INFO] [proxy/task_insert.go:151] ["collection schema mismatch"]
[collectionName=Checker__fu3YcGKb]
[requestSchemaTs=465128194619211786]
[collectionSchemaTs=465128214332441585]
[error="collection schema mismatch[collection=Checker__fu3YcGKb]"]
Same issue affects upsert:
[2026/03/24 03:35:43.658 +00:00] [INFO] [proxy/task_upsert.go:1049] ["collection schema mismatch"]
[collectionName=Checker__fu3YcGKb]
[requestSchemaTs=465128194619211786]
[collectionSchemaTs=465128214332441585]
Client-side error (repeated throughout test):
[2026-03-24 03:18:40 - ERROR] Error in InsertChecker.run_task:
<MilvusException: (code=1, message=Unexpected error, message=<Cannot invoke RPC on closed channel!>)>
Anything else?
- This issue is persistent across multiple Milvus 2.6 nightly images (f1494f66, 855d7e2a, c83686f3, fd5c01c6, ffead09b) over the past week
- Other concurrent operations (search, query, hybrid_search, delete, flush) all succeed with rate ~1.0
- Only insert and upsert are affected
- The few successful builds (~9%) appear to be timing-dependent