[BUG] Redis cluster: new primary CrashLoopBackOff after a vertical scaling or parameter modification causes a restart #2363

@ronghuaihai

Description

Describe the bug

  • When a vertical scaling or parameter-modification OpsRequest is issued against a Redis cluster, it triggers a restart. After a switchover promotes the updated replica to primary, the new secondary (the original primary) writes the new cluster config into its nodes.conf file. At the same time, KubeBlocks restarts the new secondary (the original primary) to apply the parameter modification or vertical scaling.
  • Because these two actions happen concurrently, the nodes.conf file can be corrupted mid-write. The new secondary (the original primary) pod is then unable to start, and the vertical scaling or parameter modification operation fails. A minimal illustration of the race follows this list.
  • In our testing the probability of this happening is extremely low, but it does occur occasionally.
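
A minimal sketch of the race, assuming the cluster config is rewritten in place rather than via an atomic rename (paths, timings, and the sample line are illustrative only):

# Illustration only: if the rewrite is not atomic, a kill landing between the
# truncate and the write leaves a partial nodes.conf, which fails Redis's
# "corrupted cluster config file" check on the next start.
write_config() {
  : > /tmp/nodes.conf   # truncate, as an in-place rewrite would
  sleep 0.1             # window in which the file is incomplete
  echo "0123456789abcdef0123456789abcdef01234567 10.0.0.1:6379@16379 myself,master - 0 0 1 connected 0-16383" >> /tmp/nodes.conf
}
write_config &          # the node saving its new cluster view after switchover
sleep 0.05              # let the truncate happen
kill -9 $!              # the restart arrives mid-write
wc -c /tmp/nodes.conf   # 0 bytes: a truncated, unparsable config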

To Reproduce
Steps to reproduce the behavior:

  1. Issue vertical scaling:
    kubectl apply -f verticalscale.yaml
    The content of verticalscale.yaml is as follows:
apiVersion: operations.kubeblocks.io/v1alpha1
kind: OpsRequest
metadata:
  name: redis-verticalscaling
  namespace: redis-2zce1yoxjt
spec:
  clusterName: redis-sharding
  type: VerticalScaling
  verticalScaling:
  - componentName: shard
    requests:
      cpu: '1'
      memory: 2Gi
    limits:
      cpu: '1'
      memory: 2Gi
  2. The OpsRequest status is as follows:
kubectl get ops -n redis-2zce1yoxjt
NAME                                         TYPE              CLUSTER            STATUS    PROGRESS   AGE
redis-2zce1yoxjt-20251219175600-memory-ops   VerticalScaling   redis-2zce1yoxjt   Failed    6/6        15m

The new secondary pod redis-2zce1yoxjt-shard-rz9-0 (its role was primary before the switchover) is in CrashLoopBackOff:

[screenshot: pod list showing redis-2zce1yoxjt-shard-rz9-0 in CrashLoopBackOff]
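
The pod state can also be confirmed from the command line (namespace and pod name taken from the reproduction above):

kubectl get pods -n redis-2zce1yoxjt | grep shard-rz9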
  3. The tail of running.log reports the following error:
1:C 19 Dec 2025 10:15:34.392 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 19 Dec 2025 10:15:34.394 # Redis version=7.0.12, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 19 Dec 2025 10:15:34.396 # Configuration loaded
1:M 19 Dec 2025 10:15:34.399 * monotonic clock: POSIX clock_gettime
1:M 19 Dec 2025 10:15:34.403 # Unrecoverable error: corrupted cluster config file.
  4. The nodes.conf file that caused the error was backed up as nodes.conf.20251219.bak; its content is as follows:
[screenshot: contents of the corrupted nodes.conf.20251219.bak]

To work around this issue, we modified portions of the code.

  5. The solution we attempted

We modified redis-cluster-server-start-dme.sh.
5-1. Add a corrupted-nodes.conf check to is_rebuild_instance:

is_rebuild_instance() {
  # Check if nodes.conf was flagged as corrupted on a previous start
  if [[ -f /data/nodeconfcorrupted.flag ]]; then
    echo "Rebuild instance detected: nodes.conf is corrupted"
    return 0
  fi

  # Early return if rebuild flag doesn't exist
  [[ ! -f /data/rebuild.flag ]] && return 1

......

5-2. Add a function to remove nodeconfcorrupted.flag:

remove_nodeconfcorrupted_flag() {
  if [ -f /data/nodeconfcorrupted.flag ]; then
    rm -f /data/nodeconfcorrupted.flag
    echo "remove nodeconfcorrupted.flag file succeeded!"
  fi
} 

5-3. For Redis 7, the following logic remains unchanged:

  if is_rebuild_instance; then
    echo "Current instance is a rebuild-instance, forget node id in the cluster firstly."
    node_id=$(get_cluster_id_with_retry "$primary_node_endpoint" "$primary_node_port" "$current_pod_fqdn")
    if [ -z ${REDIS_DEFAULT_PASSWORD} ]; then
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id}
    else
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id} -a ${REDIS_DEFAULT_PASSWORD}
    fi
  fi  
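
For context, a minimal sketch of what get_cluster_id_with_retry presumably does (hypothetical; the real helper lives in the KubeBlocks Redis addon scripts, and the names, retry count, and delay here are assumptions):

# Hypothetical sketch, not the actual KubeBlocks implementation: look up this
# pod's node ID in the primary's CLUSTER NODES output, retrying while the
# cluster view converges.
get_cluster_id_with_retry() {
  local endpoint="$1" port="$2" self="$3" node_id
  for _ in 1 2 3 4 5; do
    node_id=$(redis-cli -h "$endpoint" -p "$port" \
      ${REDIS_DEFAULT_PASSWORD:+-a "$REDIS_DEFAULT_PASSWORD"} \
      cluster nodes | grep "$self" | awk '{print $1}')
    [ -n "$node_id" ] && { echo "$node_id"; return 0; }
    sleep 2
  done
  return 1
}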

However, for Redis 5 and 6, get_cluster_id_with_retry cannot retrieve the cluster ID via the pod IP once nodes.conf has been rebuilt, so when the nodes.conf file is corrupted we obtain the cluster ID by another route: grepping the most recent nodes.conf backup. The Redis 5/6 variant is as follows:

  if is_rebuild_instance; then
    echo "Current instance is a rebuild-instance, forget node id in the cluster firstly."
    #node_id=$(get_cluster_id_with_retry "$primary_node_endpoint" "$primary_node_port" "$CURRENT_POD_IP")
    if [ -f /data/nodeconfcorrupted.flag ]; then
      # recover the old node ID from the most recent nodes.conf backup
      node_id=$(grep "$CURRENT_POD_IP" "$(ls -t /data/nodes.conf.*.bak | head -1)" | awk '{print $1}')
    else
      node_id=$(get_cluster_id_with_retry "$primary_node_endpoint" "$primary_node_port" "$CURRENT_POD_IP")
    fi
    if [ -z ${REDIS_DEFAULT_PASSWORD} ]; then
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id}
    else
      redis-cli -p $service_port --cluster call $primary_node_endpoint_with_port cluster forget ${node_id} -a ${REDIS_DEFAULT_PASSWORD}
    fi
  fi  
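
For reference, each nodes.conf line starts with the 40-character node ID followed by ip:port@cport, which is why grepping the newest backup for the pod IP and printing field 1 recovers the old ID. A sample line (values are illustrative):

# 07c37dfeb235213a872192d90877d0cd55635b91 10.244.1.15:6379@16379 master - 0 1703066400000 3 connected 10923-16383
grep "$CURRENT_POD_IP" "$(ls -t /data/nodes.conf.*.bak | head -1)" | awk '{print $1}'
# -> 07c37dfeb235213a872192d90877d0cd55635b91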

5-4. Add a call to remove_nodeconfcorrupted_flag in the function scale_redis_cluster_replica:

  if is_rebuild_instance; then
    echo "replicate the node $current_pod_fqdn to the primary node $primary_node_endpoint_with_port successfully in rebuild-instance, remove rebuild.flag file..."
    remove_rebuild_instance_flag
    remove_nodeconfcorrupted_flag    
  fi

5-5. Add a function to redis-cluster-server-start-dme.sh that detects the corruption, backs up the broken file, and sets the flag:

check_and_backup_nodes_conf() {
    local LOG_FILE="/data/running.log"
    local NODES_CONF="/data/nodes.conf"
    local TIMESTAMP
    local MATCH_COUNT

    TIMESTAMP=$(date +"%Y%m%d%H%M%S")

    [ ! -f "$LOG_FILE" ] && return 0

    MATCH_COUNT=$(tail -n 5 "$LOG_FILE" | grep -c "corrupted cluster config file")

    # If a backup has already been made, skip further processing
    ls ${NODES_CONF}.${TIMESTAMP}.bak >/dev/null 2>&1 && return 0

    if [ "$MATCH_COUNT" -ge 1 ] && [ -f "$NODES_CONF" ]; then
        mv "$NODES_CONF" "${NODES_CONF}.${TIMESTAMP}.bak"
        echo "[WARN] corrupted cluster config detected, backup nodes.conf"
        touch /data/nodeconfcorrupted.flag
        echo "[WARN] corrupted cluster config detected, touch /data/nodeconfcorrupted.flag" 
        exit 0
    fi
}
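
A quick way to exercise this logic outside the pod (a sketch assuming a writable scratch /data and that the function has been sourced from the script):

# Simulate a previous crash: running.log ends with the corruption error
mkdir -p /data
echo "1:M 19 Dec 2025 10:15:34.403 # Unrecoverable error: corrupted cluster config file." >> /data/running.log
echo "garbage" > /data/nodes.conf
( check_and_backup_nodes_conf )   # subshell, because the function exits on detection
ls /data/nodes.conf.*.bak /data/nodeconfcorrupted.flag   # both should now exist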

5-6. Run check_and_backup_nodes_conf in the background at startup:

load_redis_cluster_common_utils
parse_redis_cluster_shard_announce_addr
build_redis_conf
# TODO: move to memberJoin action in the future
scale_redis_cluster_replica &
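# detect corruption recorded in running.log by a previous failed start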
check_and_backup_nodes_conf &
start_redis_server

The final outcome is as follows:
When nodes.conf is corrupted, the first startup of the new secondary instance is guaranteed to fail. After that failure, a nodeconfcorrupted.flag file is created, the pod is identified as a rebuild pod, the old cluster ID is forgotten, the node rejoins the cluster, and the nodeconfcorrupted.flag file is removed. The second startup attempt by Kubernetes then succeeds.
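
After the second, successful start, the recovery can be verified inside the pod (namespace and pod name from the reproduction above); the backup should remain and the flag should be gone:

kubectl exec -n redis-2zce1yoxjt redis-2zce1yoxjt-shard-rz9-0 -- ls /data
# expect nodes.conf and nodes.conf.<timestamp>.bak, but no nodeconfcorrupted.flag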
