[BUG] Redis cluster: new primary CrashLoopBackOff after a vertical scaling or parameter modification causes a restart #2363

@ronghuaihai

Description

Describe the bug

  • When a vertical scaling or parameter-modification OpsRequest is issued against a Redis cluster, it triggers a restart. After a switchover promotes the updated replica to primary, the new secondary (the original primary) writes the new cluster config into its nodes.conf file. At the same time, KubeBlocks restarts the new secondary (the original primary) to apply the parameter modification or vertical scaling.
  • Because these two actions happen concurrently, the nodes.conf file can be corrupted mid-write. The new secondary (the original primary) pod is then unable to start, and the vertical scaling or parameter modification operation fails. A minimal illustration of the race follows this list.
  • In our testing the probability of this happening is extremely low, but it does occur occasionally.
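
A minimal sketch of the race, assuming the cluster config is rewritten in place rather than via an atomic rename (paths, timings, and the sample line are illustrative only):

# Illustration only: if the rewrite is not atomic, a kill landing between the
# truncate and the write leaves a partial nodes.conf, which fails Redis's
# "corrupted cluster config file" check on the next start.
write_config() {
  : > /tmp/nodes.conf   # truncate, as an in-place rewrite would
  sleep 0.1             # window in which the file is incomplete
  echo "0123456789abcdef0123456789abcdef01234567 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-16383" >> /tmp/nodes.conf
}
write_config &          # the node saving its new cluster view after switchover
sleep 0.05              # let the truncate happen
kill -9 $!              # the restart arrives mid-write
wc -c /tmp/nodes.conf   # 0 bytes: a truncated, unparsable config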

To Reproduce
Steps to reproduce the behavior:

  1. Issue vertical scaling:
    kubectl apply -f verticalscale.yaml
    The content of verticalscale.yaml is as follows:
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  name: redis-verticalscaling
  namespace: redis-2zce1yoxjt
spec:
  clusterName: redis-sharding
  type: VerticalScaling
  verticalScaling:
  - componentName: shard
    requests:
      cpu: '1'
      memory: 2Gi
    limits:
      cpu: '1'
      memory: 2Gi
  2. The OpsRequest status is as follows:
kubectl get ops -n redis-2zce1yoxjt
NAME                                         TYPE              CLUSTER            STATUS    PROGRESS   AGE
redis-2zce1yoxjt-20251219175600-memory-ops   VerticalScaling   redis-2zce1yoxjt   Failed    6/6        15m

The new secondary pod redis-2zce1yoxjt-shard-rz9-0 (its role was primary before the switchover) is in CrashLoopBackOff:

[screenshot: pod list showing redis-2zce1yoxjt-shard-rz9-0 in CrashLoopBackOff]
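
The pod state can also be confirmed from the command line (namespace and pod name taken from the reproduction above):

kubectl get pods -n redis-2zce1yoxjt | grep shard-rz9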
  3. The tail of running.log reports the following error:
1:C 19 Dec 2025 10:15:34.392 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 19 Dec 2025 10:15:34.394 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 19 Dec 2025 10:15:34.396 # Configuration loaded
1:M 19 Dec 2025 10:15:34.399 * monotonic clock: POSIX clock_gettime
1:M 19 Dec 2025 10:15:34.403 # Unrecoverable error: corrupted cluster config file.
  4. The nodes.conf file that caused the error was backed up as nodes.conf.20251219.bak; its content is as follows:
[screenshot: contents of the corrupted nodes.conf.20251219.bak]

To work around this issue, we modified portions of the code.

  5. The solution we attempted

We modified redis-cluster-server-start-dme.sh.
5-1. Add a corrupted-nodes.conf check to is_rebuild_instance:

is_rebuild_instance() {
  # Check if nodes.conf was flagged as corrupted on a previous start
  if [[ -f /data/nodeconfcorrupted.flag ]]; then
    echo "Rebuild instance detected: nodes.conf is corrupted"
    return 0
  fi

  # Early return if rebuild flag doesn't exist
  [[ ! -f /data/rebuild.flag ]] && return 1

......

5-2. Add a function to remove nodeconfcorrupted.flag:

remove_nodeconfcorrupted_flag() {
  if [ -f /data/nodeconfcorrupted.flag ]; then
    rm -f /data/nodeconfcorrupted.flag
    echo "remove nodeconfcorrupted.flag file succeeded!"
  fi
} 

5-3. For Redis 7, the following logic remains unchanged:

  if is_rebuild_instance; then
    echo "Current instance is a rebuild-instance, forget node id in the cluster firstly."
    node_id=$(get_cluster_id_with_retry "$primary_node_endpoint" "$primary_node_port" "$current_pod_fqdn")
    if [ -z ${REDIS_DEFAULT_PASSWORD} ]; then
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id}
    else
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id} -a ${REDIS_DEFAULT_PASSWORD}
    fi
  fi  
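
For context, a minimal sketch of what get_cluster_id_with_retry presumably does (hypothetical; the real helper lives in the KubeBlocks Redis addon scripts, and the names, retry count, and delay here are assumptions):

# Hypothetical sketch, not the actual KubeBlocks implementation: look up this
# pod's node ID in the primary's CLUSTER NODES output, retrying while the
# cluster view converges.
get_cluster_id_with_retry() {
  local endpoint="$1" port="$2" self="$3" node_id
  for _ in 1 2 3 4 5; do
    node_id=$(redis-cli -h "$endpoint" -p "$port" \
      ${REDIS_DEFAULT_PASSWORD:+-a "$REDIS_DEFAULT_PASSWORD"} \
      cluster nodes | grep "$self" | awk '{print $1}')
    [ -n "$node_id" ] && { echo "$node_id"; return 0; }
    sleep 2
  done
  return 1
}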

However, for Redis 5 and 6, get_cluster_id_with_retry cannot retrieve the cluster ID via the pod IP once nodes.conf has been rebuilt, so when the nodes.conf file is corrupted we obtain the cluster ID by another route: grepping the most recent nodes.conf backup. The Redis 5/6 variant is as follows:

  if is_rebuild_instance; then
    echo "Current instance is a rebuild-instance, forget node id in the cluster firstly."
    #node_id=$(get_cluster_id_with_retry "$primary_node_endpoint" "$primary_node_port" "$CURRENT_POD_IP")
    if [ -f /data/nodeconfcorrupted.flag ]; then
      # recover the old node ID from the most recent nodes.conf backup
      node_id=$(grep "$CURRENT_POD_IP" "$(ls -t /data/nodes.conf.*.bak | head -1)" | awk '{print $1}')
    else
      node_id=$(get_cluster_id_with_retry "$primary_node_endpoint" "$primary_node_port" "$CURRENT_POD_IP")
    fi
    if [ -z ${REDIS_DEFAULT_PASSWORD} ]; then
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id}
    else
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id} -a ${REDIS_DEFAULT_PASSWORD}
    fi
  fi  
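
For reference, each nodes.conf line starts with the 40-character node ID followed by ip:port@cport, which is why grepping the newest backup for the pod IP and printing field 1 recovers the old ID. A sample line (values are illustrative):

# 07c37dfeb235213a872192d90877d0cd55635b91 10.244.1.15:6379@16379 master - 0 1703066400000 3 connected 10923-16383
grep "$CURRENT_POD_IP" "$(ls -t /data/nodes.conf.*.bak | head -1)" | awk '{print $1}'
# -> 07c37dfeb235213a872192d90877d0cd55635b91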

5-4. Add a call to remove_nodeconfcorrupted_flag in the function scale_redis_cluster_replica:

  if is_rebuild_instance; then
    echo "replicate the node $current_pod_fqdn to the primary node $primary_node_endpoint_with_port successfully in rebuild-instance, remove rebuild.flag file..."
    remove_rebuild_instance_flag
    remove_nodeconfcorrupted_flag    
  fi

5-5. Add a function to redis-cluster-server-start-dme.sh that detects the corruption, backs up the broken file, and sets the flag:

check_and_backup_nodes_conf() {
    local LOG_FILE="/data/running.log"
    local NODES_CONF="/data/nodes.conf"
    local TIMESTAMP
    local MATCH_COUNT

    TIMESTAMP=$(date +"%Y%m%d%H%M%S")

    [ ! -f "$LOG_FILE" ] && return 0

    MATCH_COUNT=$(tail -n 5 "$LOG_FILE" | grep -c "corrupted cluster config file")

    # If a backup has already been made, skip further processing
    ls ${NODES_CONF}.${TIMESTAMP}.bak >/dev/null 2>&1 && return 0

    if [ "$MATCH_COUNT" -ge 1 ] && [ -f "$NODES_CONF" ]; then
        mv "$NODES_CONF" "${NODES_CONF}.${TIMESTAMP}.bak"
        echo "[WARN] corrupted cluster config detected, backup nodes.conf"
        touch /data/nodeconfcorrupted.flag
        echo "[WARN] corrupted cluster config detected, touch /data/nodeconfcorrupted.flag" 
        exit 0
    fi
}
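
A quick way to exercise this logic outside the pod (a sketch assuming a writable scratch /data and that the function has been sourced from the script):

# Simulate a previous crash: running.log ends with the corruption error
mkdir -p /data
echo "1:M 19 Dec 2025 10:15:34.403 # Unrecoverable error: corrupted cluster config file." >> /data/running.log
echo "garbage" > /data/nodes.conf
( check_and_backup_nodes_conf )   # subshell, because the function exits on detection
ls /data/nodes.conf.*.bak /data/nodeconfcorrupted.flag   # both should now exist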

5-6. Run check_and_backup_nodes_conf in the background at startup:

load_redis_cluster_common_utils
parse_redis_cluster_shard_announce_addr
build_redis_conf
# TODO: move to memberJoin action in the future
scale_redis_cluster_replica &
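# detect corruption recorded in running.log by a previous failed start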
check_and_backup_nodes_conf &
start_redis_server

The final outcome is as follows:
When nodes.conf is corrupted, the first startup of the new secondary instance is guaranteed to fail. After that failure, a nodeconfcorrupted.flag file is created, the pod is identified as a rebuild pod, the old cluster ID is forgotten, the node rejoins the cluster, and the nodeconfcorrupted.flag file is removed. The second startup attempt by Kubernetes then succeeds.
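
After the second, successful start, the recovery can be verified inside the pod (namespace and pod name from the reproduction above); the backup should remain and the flag should be gone:

kubectl exec -n redis-2zce1yoxjt redis-2zce1yoxjt-shard-rz9-0 -- ls /data
# expect nodes.conf and nodes.conf.<timestamp>.bak, but no nodeconfcorrupted.flag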
