[Bug]: Insert fails during concurrent operations with add_field — closed channel and schema mismatch #48522

@zhuwenxing

Is there an existing issue for this?

  • I have searched the existing issues

Environment

  • Milvus version: 2.6 branch (image: 2.6-20260323-ffead09b-amd64)
  • Deployment mode: cluster
  • MQ type: pulsar
  • SDK version: pymilvus (bundled in chaos test image)
  • OS: Linux (K8s)
  • CPU/Memory: CI environment
  • Others: Chaos test (chaos-mesh pod-failure injection on etcd-followers)

K8s Pod List

NAME                                                                 READY   STATUS    RESTARTS            AGE
etcd-followers-pod-failure-22986-0                                   1/1     Running   0                   32m
etcd-followers-pod-failure-22986-1                                   1/1     Running   2 (8m24s ago)       32m
etcd-followers-pod-failure-22986-2                                   1/1     Running   0                   32m
etcd-followers-pod-failure-22986-milvus-datanode-5cfdc88f64tgbc      1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-datanode-5cfdc88f6v27nc      1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-mixcoord-5875d97b8vqvwn      1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-proxy-5bc6d97946-69twm       1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-querynode-5768847c6vq24      1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-querynode-5768847cfd9lb      1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-querynode-5768847clr276      1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-streamingnode-794cjrhhl      1/1     Running   3 (31m ago)         32m
etcd-followers-pod-failure-22986-milvus-streamingnode-794clv8lz      1/1     Running   3 (31m ago)         32m

Current Behavior

In the chaos release nightly cron test (chaos-test-for-release-cron), when InsertChecker and AddFieldChecker run concurrently, every insert operation fails. The test reports Op.insert succ rate 0, total: 0.

Client-side error (what the test sees):

Error in InsertChecker.run_task: <MilvusException: (code=1, message=Unexpected error, message=<Cannot invoke RPC on closed channel!>)>

This error repeats for the entire test run; the insert checker never recovers.

Server-side error (found in Milvus proxy logs via Loki):

After the gRPC channel eventually recovers, inserts are rejected by the proxy due to schema mismatch caused by concurrent add_field operations:

[WARN] [proxy/impl.go:2605] ["Failed to execute insert task in task scheduler: collection schema mismatch[collection=Checker__fu3YcGKb]"]
[INFO] [proxy/task_insert.go:151] ["collection schema mismatch"]
  [collectionName=Checker__fu3YcGKb]
  [requestSchemaTs=465128194619211786]
  [collectionSchemaTs=465128214332441585]
  [len(FieldsData)=39] [NumRows=3000]

The proxy performs a strict schemaTimestamp equality check (task_insert.go:149-152). Once add_field updates the schema, every insert request that carries the old timestamp is rejected. pymilvus retry_on_schema_mismatch invalidates the cache and retries, but the retry resends the same data (39 fields) while the new schema expects 40 fields, so it fails again.
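The failure sequence described above can be sketched as a small Python model. This is not the proxy's Go code or pymilvus internals: every class and function name here is hypothetical, and the second-stage field-count check is an assumption standing in for whatever rejects the stale 39-field payload after the cache refresh.

```python
from dataclasses import dataclass


class SchemaMismatch(Exception):
    """Stands in for the 'collection schema mismatch' error (hypothetical)."""


@dataclass
class Collection:
    schema_ts: int
    field_names: list


@dataclass
class Client:
    cached_schema_ts: int
    payload_fields: list  # columns prepared before add_field ran


def proxy_insert(coll, request_schema_ts, payload_fields):
    # Strict equality on the schema timestamp, as the report describes.
    if request_schema_ts != coll.schema_ts:
        raise SchemaMismatch("timestamp")
    # Assumption: the payload must then cover every field in the schema.
    if len(payload_fields) != len(coll.field_names):
        raise SchemaMismatch("field count")
    return len(payload_fields)


def insert_with_retry(client, coll, attempts=2):
    """Return the list of errors seen; an empty list means the first try worked."""
    errors = []
    for _ in range(attempts):
        try:
            proxy_insert(coll, client.cached_schema_ts, client.payload_fields)
            return errors
        except SchemaMismatch as exc:
            errors.append(str(exc))
            # The retry refreshes the cached timestamp but not the payload.
            client.cached_schema_ts = coll.schema_ts
    return errors
```

With 39 payload fields against a 40-field schema, the first attempt fails on the timestamp and the retry fails on the field count, which matches the "fails again" behavior reported above.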

As a result, nearly all chaos test builds fail: only ~9 out of 100 recent builds passed.

Jenkins reference: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/22986/pipeline

Expected Behavior

Since add_field requires the new field to be nullable (enforced by internal/proxy/task.go:672-673), insert requests with the old schema (missing the new nullable field) should still succeed. The server should accept compatible data rather than strictly rejecting on timestamp mismatch.
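One way to picture the expected behavior is a compatibility check that replaces the strict timestamp comparison. The Python sketch below is only an illustration of the idea, not a proposed patch: the schema is reduced to a hypothetical mapping of field name to a nullable flag, and accept_insert is an invented helper.

```python
def accept_insert(schema, row):
    """Pad a row written against an older schema, if that is safe.

    schema: dict mapping field name -> True if the field is nullable.
    row:    dict mapping field name -> value, possibly missing newer fields.
    """
    missing = [name for name in schema if name not in row]
    # A missing non-nullable field cannot be filled in, so reject the row.
    blocked = [name for name in missing if not schema[name]]
    if blocked:
        raise ValueError(f"missing non-nullable fields: {blocked}")
    # Fields added by add_field are nullable, so they can default to None.
    padded = dict(row)
    for name in missing:
        padded[name] = None
    return padded
```

Under this model, a row that lacks only the newly added nullable field is accepted and padded with None, while a row missing a required field is still rejected.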

Steps To Reproduce

  1. Deploy Milvus cluster (2.6 branch)
  2. Create a collection with dynamic field enabled and multiple field types
  3. Start concurrent insert operations on the collection
  4. Concurrently call add_field to add a new nullable field to the same collection
  5. Observe that all insert operations after add_field fail
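The steps above can be sketched with pymilvus roughly as follows. This is an untested outline rather than the chaos test itself: the collection name, field names, dimensions, and timings are made up, the schema is much smaller than the 39-field one in the logs, and it assumes a pymilvus version that provides MilvusClient.add_collection_field and a reachable cluster.

```python
import random
import threading
import time


def make_rows(start_id, count, dim=8):
    """Build rows against the original two-field schema."""
    return [
        {"id": start_id + i, "vec": [random.random() for _ in range(dim)]}
        for i in range(count)
    ]


def run_repro(uri, collection="add_field_repro", rounds=20):
    # Imported lazily so the helper above works without pymilvus installed.
    from pymilvus import DataType, MilvusClient

    client = MilvusClient(uri=uri)
    if client.has_collection(collection):
        client.drop_collection(collection)

    schema = client.create_schema(auto_id=False, enable_dynamic_field=True)
    schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
    schema.add_field(field_name="vec", datatype=DataType.FLOAT_VECTOR, dim=8)
    client.create_collection(collection_name=collection, schema=schema)

    failures = []

    def inserter():
        for r in range(rounds):
            try:
                client.insert(collection_name=collection, data=make_rows(r * 100, 100))
            except Exception as exc:  # record every failure for inspection
                failures.append((r, repr(exc)))
            time.sleep(0.1)

    worker = threading.Thread(target=inserter)
    worker.start()
    time.sleep(0.5)  # let a few inserts succeed before changing the schema
    client.add_collection_field(
        collection_name=collection,
        field_name="extra",
        data_type=DataType.INT64,
        nullable=True,
    )
    worker.join()
    return failures
```

Calling run_repro("http://localhost:19530") against a test cluster (the address is a placeholder) returns the list of failed rounds. If the issue reproduces, the failures should begin shortly after the add_collection_field call.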

Milvus Log

Proxy logs — schema mismatch rejection:

[2026/03/24 03:35:46.620 +00:00] [INFO] [proxy/task_insert.go:151] ["collection schema mismatch"]
  [collectionName=Checker__fu3YcGKb]
  [requestSchemaTs=465128194619211786]
  [collectionSchemaTs=465128214332441585]
  [error="collection schema mismatch[collection=Checker__fu3YcGKb]"]

Same issue affects upsert:

[2026/03/24 03:35:43.658 +00:00] [INFO] [proxy/task_upsert.go:1049] ["collection schema mismatch"]
  [collectionName=Checker__fu3YcGKb]
  [requestSchemaTs=465128194619211786]
  [collectionSchemaTs=465128214332441585]

Client-side error (repeated throughout test):

[2026-03-24 03:18:40 - ERROR] Error in InsertChecker.run_task:
  <MilvusException: (code=1, message=Unexpected error, message=<Cannot invoke RPC on closed channel!>)>

Anything else?

  • This issue is persistent across multiple Milvus 2.6 nightly images (f1494f66, 855d7e2a, c83686f3, fd5c01c6, ffead09b) over the past week
  • Other concurrent operations (search, query, hybrid_search, delete, flush) all succeed with rate ~1.0
  • Only insert and upsert are affected
  • The few successful builds (~9%) appear to be timing-dependent

Labels

  • kind/bug: Issues or changes related to a bug
  • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
  • severity/critical: Critical, lead to crash, data missing, wrong result, function totally doesn't work.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
