Describe the bug
We run Valkey on K8s as a StatefulSet. The StatefulSet has 3 pods, each with a Valkey container and a Sentinel container for managing HA. One of these pods is the master and the other 2 are replicas. Pod IPs are ephemeral, so as pods cycle they come up with a new IP address, which Sentinel detects as a new replica while marking the old one down. This causes stale replicas in the output of sentinel replicas master_name.
To mitigate this we configure K8s Services, one for each pod. An init container on each pod looks up the ClusterIP of its respective Service and configures the pod with replica-announce-ip for Valkey and announce-ip for Sentinel. This keeps the announced replica IPs stable as pods are cycled.
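For context, the init-container logic described above can be sketched roughly as below. All names here (the announce Service naming scheme, config paths, ports, and the fallback IP) are illustrative assumptions for this sketch, not taken verbatim from our manifests:

```shell
#!/bin/sh
# Hypothetical init-container sketch: derive the per-pod announce Service
# name from the StatefulSet pod hostname, resolve its ClusterIP, and write
# announce settings for Valkey and Sentinel.
HOSTNAME="${HOSTNAME:-redis-test-0-server-1}"
ORDINAL="${HOSTNAME##*-}"                      # statefulset ordinal, e.g. "1"
ANNOUNCE_SVC="redis-test-0-announce-${ORDINAL}"

# In-cluster this resolves the Service's ClusterIP via DNS; outside a
# cluster fall back to a placeholder so the sketch stays runnable.
ANNOUNCE_IP="$(getent hosts "${ANNOUNCE_SVC}" 2>/dev/null | awk '{print $1; exit}')"
ANNOUNCE_IP="${ANNOUNCE_IP:-100.70.100.164}"

CONF_DIR="${CONF_DIR:-/tmp/conf}"
mkdir -p "${CONF_DIR}"

# Valkey announce settings, included from the main valkey.conf
cat > "${CONF_DIR}/announce.conf" <<EOF
replica-announce-ip ${ANNOUNCE_IP}
replica-announce-port 6379
EOF

# Sentinel announce settings, included from sentinel.conf
cat > "${CONF_DIR}/sentinel-announce.conf" <<EOF
sentinel announce-ip ${ANNOUNCE_IP}
sentinel announce-port 26379
EOF

echo "${ANNOUNCE_SVC} -> ${ANNOUNCE_IP}"
```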
We recently migrated some of our workloads to Valkey 8.1.1 and are trialling the dual-channel-replication-enabled yes config. With this enabled, during a failover the additional channel of type=rdb-channel appears under the pod IP. If Sentinel happens to poll for replicas during this window, it saves the pod IP as a distinct replica. Once the replica fully resyncs, it uses the configured announce IP address.
This duplication is bad for Valkey backed by Sentinel: during a failover Sentinel elects a new master from its list of replicas, and may then instruct the new master to replicate from itself, which it cannot do.
Pod IPs
NAME READY STATUS RESTARTS AGE IP
redis-test-0-server-0 5/5 Running 0 8m12s 100.104.75.103
redis-test-0-server-1 5/5 Running 0 7m21s 100.111.250.31
redis-test-0-server-2 5/5 Running 0 12m 100.99.64.48
Service Cluster IPs
NAME TYPE CLUSTER-IP
redis-test-0-announce-0 ClusterIP 100.67.128.152
redis-test-0-announce-1 ClusterIP 100.70.100.164
redis-test-0-announce-2 ClusterIP 100.71.105.52
In this scenario, redis-test-0-server-1 has been promoted to a master after a failover. The output of info replication is:
# Replication
role:master
connected_slaves:2
slave0:ip=100.71.105.52,port=6379,state=online,offset=7448294713549,lag=1,type=replica
slave1:ip=100.104.75.103,port=6379,state=wait_bgsave,offset=0,lag=0,type=rdb-channel
- slave0 is the ClusterIP for redis-test-0-announce-2, this maps to pod redis-test-0-server-2.
- slave1 is the Pod IP for redis-test-0-server-0
When slave1 resyncs, the output of info replication is:
# Replication
role:master
connected_slaves:2
slave0:ip=100.71.105.52,port=6379,state=online,offset=7448295098696,lag=1,type=replica
slave1:ip=100.67.128.152,port=6379,state=online,offset=7448295102193,lag=1,type=replica
- slave0 is unchanged from what it was previously
- slave1 is now the ClusterIP for redis-test-0-announce-0, this maps to pod redis-test-0-server-0.
When executing sentinel replicas master_name | grep name -A1, there are 4 replicas.
/data $ redis-cli -p 26379 sentinel replicas master_name | grep name -A1
name
100.99.64.48:6379
--
name
100.104.75.103:6379
--
name
100.67.128.152:6379
--
name
100.71.105.52:6379
In order they are:
- pod IP for redis-test-0-server-2
- pod IP for redis-test-0-server-0
- ClusterIP for redis-test-0-announce-2 (redis-test-0-server-2)
- ClusterIP for redis-test-0-announce-0 (redis-test-0-server-0)
Issuing sentinel reset master_name fixes this list, but it is not an appropriate solution given that pods can cycle for any reason and would resync with the master from a new pod IP.
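For reference, the manual workaround amounts to issuing the reset against every Sentinel so each drops its stale entries and rediscovers the current replicas. The host list and master name below are placeholder assumptions for this sketch:

```shell
#!/bin/sh
# Hypothetical workaround sketch: run SENTINEL RESET on each Sentinel.
# SENTINELS and MASTER_NAME are illustrative values, not from the report.
SENTINELS="${SENTINELS:-127.0.0.1:26379}"
MASTER_NAME="${MASTER_NAME:-master_name}"

for addr in ${SENTINELS}; do
  host="${addr%:*}"
  port="${addr#*:}"
  echo "resetting ${MASTER_NAME} on ${host}:${port}"
  # Skip the actual call when redis-cli is unavailable (e.g. dry run).
  if command -v redis-cli >/dev/null 2>&1; then
    redis-cli -h "${host}" -p "${port}" sentinel reset "${MASTER_NAME}" || true
  fi
done
```

Note that SENTINEL RESET clears all state Sentinel holds for the master (replicas and other sentinels) and forces rediscovery, so there is a short window where Sentinel's view is incomplete.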
To reproduce
- Establish any environment with announce IPs, backed by Sentinel
- Enable dual-channel-replication-enabled yes
- Execute a failover
- Retrieve replicas from Sentinel via sentinel replicas master_name
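The repro steps can be sketched as a small script. The Sentinel host/port and master name are assumptions; without a reachable Sentinel the script only echoes the commands it would run:

```shell
#!/bin/sh
# Hedged repro sketch: trigger a failover, then list the replicas Sentinel
# tracks. With dual-channel-replication-enabled yes, stale pod-IP entries
# can appear in the replica list.
SENTINEL_HOST="${SENTINEL_HOST:-127.0.0.1}"
SENTINEL_PORT="${SENTINEL_PORT:-26379}"
MASTER_NAME="${MASTER_NAME:-master_name}"

run() {
  echo "+ $*"
  if command -v redis-cli >/dev/null 2>&1; then
    redis-cli -h "${SENTINEL_HOST}" -p "${SENTINEL_PORT}" "$@" || true
  fi
}

# 1. Trigger a failover on the monitored master.
run sentinel failover "${MASTER_NAME}"
# 2. Give the promoted master time to start (re)syncing its replicas.
sleep 5
# 3. List the replicas Sentinel now tracks.
run sentinel replicas "${MASTER_NAME}"
```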
Expected behavior
Stale/duplicate replicas are not persisted when a failover happens, or when pods are cycled.
Additional information
N/A