Summary
In a CDC (Change Data Capture) replication setup, the primary and secondary Milvus clusters may have different hardware resources (a different number of QueryNodes, a different resource group topology). Currently, when a LoadCollection is replicated from the primary to the secondary via CDC, the secondary blindly applies the same replica_number and resource_groups as the primary. This can cause load failures (not enough nodes) or resource waste (over-provisioning) on the secondary.
Proposal
Allow the secondary cluster to use its own cluster-level replica configuration (ClusterLevelLoadReplicaNumber / ClusterLevelLoadResourceGroups) instead of the primary's replica config when processing replicated AlterLoadConfigMessage.
Approach
- Proto change: Add a `use_local_replica_config` bool field to `AlterLoadConfigMessageHeader`
- ReplicateService: Set the flag to `true` on every replicated `AlterLoadConfigMessage` on the secondary
- DDL callback: In `LoadCollectionJob.Execute()`, when the flag is `true`, read the local cluster-level config and substitute it for the primary's replica config. Fall back to the primary's config if the local config is not set.
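The substitution and fallback rule in the DDL callback step could be sketched roughly as follows. This is a minimal illustration, not the actual Milvus code: `ReplicaConfig` and `resolveReplicaConfig` are hypothetical names, and treating the local config as "not set" when the replica number is non-positive or the resource group list is empty is an assumption.

```go
package main

import "fmt"

// ReplicaConfig is a hypothetical stand-in for the replica settings carried
// by a replicated AlterLoadConfigMessage; it is not the real Milvus type.
type ReplicaConfig struct {
	ReplicaNumber  int
	ResourceGroups []string
}

// resolveReplicaConfig returns the config the secondary should apply.
// Assumption: the local cluster-level config counts as "not set" when the
// replica number is <= 0 or no resource groups are listed.
func resolveReplicaConfig(useLocal bool, primary, local ReplicaConfig) ReplicaConfig {
	if !useLocal {
		// Flag absent: keep the primary's config unchanged.
		return primary
	}
	if local.ReplicaNumber <= 0 || len(local.ResourceGroups) == 0 {
		// Local config not set: fall back to the primary's config.
		return primary
	}
	return local
}

func main() {
	primary := ReplicaConfig{ReplicaNumber: 3, ResourceGroups: []string{"rg1", "rg2", "rg3"}}
	local := ReplicaConfig{ReplicaNumber: 1, ResourceGroups: []string{"__default_resource_group"}}

	fmt.Println(resolveReplicaConfig(true, primary, local))           // local config wins
	fmt.Println(resolveReplicaConfig(true, primary, ReplicaConfig{})) // falls back to primary
	fmt.Println(resolveReplicaConfig(false, primary, local))          // flag off, primary kept
}
```

Keeping the decision in one pure function makes the fallback behavior easy to unit test independently of the job machinery.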
Key Design Decisions
- No new config parameters - reuses existing `ClusterLevelLoadReplicaNumber` / `ClusterLevelLoadResourceGroups`
- Backward compatible - when local config is not set (default), falls back to primary's config
- No changes to primary cluster behavior - only the secondary side is affected
- Covers TransferReplica too - also uses `AlterLoadConfigMessage`, preventing primary from rearranging secondary's resource groups
Data Flow
Primary: LoadCollection(replica_number=3, rgs=["rg1","rg2","rg3"])
→ WAL: AlterLoadConfigMessage(replicas=[{rg1},{rg2},{rg3}])
→ CDC replicates to secondary
Secondary ReplicateService:
→ Sets header.use_local_replica_config = true
Secondary DDL callback (LoadCollectionJob.Execute()):
→ Sees use_local_replica_config == true
→ Reads local ClusterLevelLoadReplicaNumber = 1
→ Reads local ClusterLevelLoadResourceGroups = ["__default_resource_group"]
→ Rebuilds replicas = [{__default_resource_group}]
→ Spawns 1 replica in default RG
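The "rebuilds replicas" step above might look something like the sketch below. `LoadReplica` and `rebuildReplicas` are hypothetical names, and the placement rule (a single resource group hosts every replica, otherwise one replica per listed group) is an assumption made for illustration rather than confirmed Milvus behavior.

```go
package main

import "fmt"

// LoadReplica is a hypothetical per-replica entry, standing in for the
// replica list carried by AlterLoadConfigMessage.
type LoadReplica struct {
	ResourceGroup string
}

// rebuildReplicas builds the secondary's replica list from its local
// cluster-level settings.
// Assumption: with one resource group every replica is placed in it;
// otherwise the number of groups must equal the replica number.
func rebuildReplicas(replicaNumber int, rgs []string) ([]LoadReplica, error) {
	if replicaNumber <= 0 || len(rgs) == 0 {
		return nil, fmt.Errorf("local replica config is not set")
	}
	if len(rgs) != 1 && len(rgs) != replicaNumber {
		return nil, fmt.Errorf("cannot place %d replicas on %d resource groups", replicaNumber, len(rgs))
	}
	replicas := make([]LoadReplica, 0, replicaNumber)
	for i := 0; i < replicaNumber; i++ {
		rg := rgs[0]
		if len(rgs) > 1 {
			rg = rgs[i]
		}
		replicas = append(replicas, LoadReplica{ResourceGroup: rg})
	}
	return replicas, nil
}

func main() {
	// Mirrors the data flow: the primary asked for 3 replicas across 3 groups,
	// but the secondary's local config is 1 replica in the default group.
	replicas, err := rebuildReplicas(1, []string{"__default_resource_group"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(replicas), replicas[0].ResourceGroup)
}
```

Returning an error for a mismatched group count lets the caller decide whether to fail the load or fall back to the primary's layout.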
Affected Files
| File | Change |
|---|---|
| `pkg/proto/messages.proto` | Add `use_local_replica_config` field |
| `internal/distributed/streaming/replicate_service.go` | Set flag on replicated AlterLoadConfig |
| `internal/querycoordv2/job/job_load.go` | Read local config when flag is set |
| Tests | Unit tests for both components |