enhance: support different replica numbers on secondary CDC cluster #47779

@bigsheeper

Summary

In a CDC (Change Data Capture) replication setup, the primary and secondary Milvus clusters may have different hardware resources (a different number of QueryNodes or a different resource group topology). Currently, when a LoadCollection is replicated from the primary to the secondary via CDC, the secondary blindly applies the primary's replica_number and resource_groups. This can cause load failures (not enough nodes) or wasted resources (over-provisioning) on the secondary.

Proposal

Allow the secondary cluster to use its own cluster-level replica configuration (ClusterLevelLoadReplicaNumber / ClusterLevelLoadResourceGroups) instead of the primary's replica config when processing a replicated AlterLoadConfigMessage.

Approach

  1. Proto change: Add a bool field use_local_replica_config to AlterLoadConfigMessageHeader
  2. ReplicateService: Set the flag to true on every replicated AlterLoadConfigMessage on the secondary
  3. DDL callback: In LoadCollectionJob.Execute(), when the flag is true, read the local cluster-level config and substitute it for the primary's replica config. Fall back to the primary's config if the local config is not set.
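
A minimal sketch of the substitution in step 3, for illustration only. ReplicaConfig and resolveReplicaConfig are hypothetical names, not the actual Milvus types, and the sketch assumes the local config counts as "not set" when the replica number is not positive or the resource group list is empty.

```go
package main

import "fmt"

// ReplicaConfig is a hypothetical stand-in for the replica settings
// carried by an AlterLoadConfigMessage (not the real Milvus type).
type ReplicaConfig struct {
	ReplicaNumber  int32
	ResourceGroups []string
}

// resolveReplicaConfig picks the config the secondary should load with.
// Assumption: a non-positive replica number or an empty resource group
// list means the local cluster-level config is "not set".
func resolveReplicaConfig(useLocal bool, fromPrimary, local ReplicaConfig) ReplicaConfig {
	if !useLocal {
		// Flag not set: behave exactly as before.
		return fromPrimary
	}
	if local.ReplicaNumber <= 0 || len(local.ResourceGroups) == 0 {
		// Local config not set: fall back to the primary's config.
		return fromPrimary
	}
	return local
}

func main() {
	primary := ReplicaConfig{ReplicaNumber: 3, ResourceGroups: []string{"rg1", "rg2", "rg3"}}
	local := ReplicaConfig{ReplicaNumber: 1, ResourceGroups: []string{"__default_resource_group"}}

	// Local config set: the secondary uses its own values.
	fmt.Printf("%+v\n", resolveReplicaConfig(true, primary, local))
	// Local config unset: the primary's values are kept.
	fmt.Printf("%+v\n", resolveReplicaConfig(true, primary, ReplicaConfig{}))
}
```

Treating the two local settings as a unit keeps the fallback simple. Whether the real implementation falls back per field is not specified here.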

Key Design Decisions

  • No new config parameters - reuses existing ClusterLevelLoadReplicaNumber / ClusterLevelLoadResourceGroups
  • Backward compatible - when local config is not set (default), falls back to primary's config
  • No changes to primary cluster behavior - only the secondary side is affected
  • Covers TransferReplica too - it also uses AlterLoadConfigMessage, so the flag prevents the primary from rearranging the secondary's resource groups

Data Flow

Primary: LoadCollection(replica_number=3, rgs=["rg1","rg2","rg3"])
  → WAL: AlterLoadConfigMessage(replicas=[{rg1},{rg2},{rg3}])
  → CDC replicates to secondary

Secondary ReplicateService:
  → Sets header.use_local_replica_config = true

Secondary DDL callback (LoadCollectionJob.Execute()):
  → Sees use_local_replica_config == true
  → Reads local ClusterLevelLoadReplicaNumber = 1
  → Reads local ClusterLevelLoadResourceGroups = ["__default_resource_group"]
  → Rebuilds replicas = [{__default_resource_group}]
  → Spawns 1 replica in default RG
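
The "Rebuilds replicas" step above could look roughly like the sketch below. buildReplicas is a hypothetical helper, and the placement rule it uses (one resource group receives all replicas, otherwise one replica per listed group) is an assumption for illustration, not confirmed by this proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// buildReplicas maps each resource group to the number of replicas to
// spawn in it. Hypothetical helper; the placement rule is an assumption.
func buildReplicas(replicaNumber int, resourceGroups []string) (map[string]int, error) {
	if replicaNumber <= 0 {
		return nil, errors.New("replica number must be positive")
	}
	switch {
	case len(resourceGroups) == 0:
		return nil, errors.New("no resource groups given")
	case len(resourceGroups) == 1:
		// A single resource group hosts every replica.
		return map[string]int{resourceGroups[0]: replicaNumber}, nil
	case len(resourceGroups) == replicaNumber:
		// One replica per listed resource group (duplicates accumulate).
		out := make(map[string]int, len(resourceGroups))
		for _, rg := range resourceGroups {
			out[rg]++
		}
		return out, nil
	default:
		return nil, fmt.Errorf("cannot place %d replicas in %d resource groups",
			replicaNumber, len(resourceGroups))
	}
}

func main() {
	// Secondary side of the data flow: 1 replica in the default resource group.
	replicas, err := buildReplicas(1, []string{"__default_resource_group"})
	if err != nil {
		fmt.Println("load would fail:", err)
		return
	}
	fmt.Println(replicas)
}
```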

Affected Files

| File | Change |
| --- | --- |
| pkg/proto/messages.proto | Add use_local_replica_config field |
| internal/distributed/streaming/replicate_service.go | Set flag on replicated AlterLoadConfig |
| internal/querycoordv2/job/job_load.go | Read local config when flag is set |
| Tests | Unit tests for both components |
