fix: prevent data loss during cross-version rolling upgrade #66

Merged
chideat merged 2 commits into main from fix/cross-version-rolling-upgrade-data-loss
Mar 14, 2026
Conversation

@chideat (Owner) commented Mar 14, 2026

Summary

  • During a rolling upgrade (e.g. Valkey 7.2 → 9.0), a node demoted to replica may attempt a fullsync from the new-version master. If the RDB format is incompatible, the load fails after memory has already been cleared, leaving an empty in-memory dataset.
  • A subsequent SHUTDOWN (default: save) then overwrites the valid dump.rdb with empty data — causing data loss if the node later becomes master before restarting.
  • Fix: re-check Dbsize right before SHUTDOWN. If 0, use SHUTDOWN NOSAVE to preserve the existing dump.rdb. If > 0, use normal SHUTDOWN.
  • Applied to both the failover/sentinel and cluster shutdown paths.
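The decision rule above is small enough to capture in a standalone sketch. This is illustrative only — `shutdownArgs` is a hypothetical name, not the helper introduced in this PR — but it shows the Dbsize-gated choice between the two shutdown modes:

```go
package main

import "fmt"

// shutdownArgs returns the SHUTDOWN command arguments based on the dbsize
// observed immediately before shutdown. A dbsize of 0 may mean memory was
// cleared by a failed cross-version fullsync, so NOSAVE preserves the
// existing dump.rdb instead of overwriting it with an empty snapshot.
func shutdownArgs(dbsize int64) []string {
	if dbsize == 0 {
		return []string{"SHUTDOWN", "NOSAVE"}
	}
	return []string{"SHUTDOWN"}
}

func main() {
	fmt.Println(shutdownArgs(0))  // empty DB: SHUTDOWN NOSAVE
	fmt.Println(shutdownArgs(42)) // non-empty DB: plain SHUTDOWN
}
```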

Known edge case (acknowledged acceptable)

If all keys were intentionally deleted (FLUSHDB/FLUSHALL), Dbsize == 0 and the node will use NOSAVE, preserving the pre-flush dump.rdb. The node would reload old data on restart. This is a rare operational scenario.

Test plan

  • go test ./cmd/helper/commands/failover/... — all existing + new tests pass
  • go test ./cmd/helper/commands/cluster/... — all existing + new tests pass
  • go build ./cmd/helper/... — compiles cleanly
  • Manual: 2-node failover cluster, upgrade replica first then trigger master upgrade — verify dump.rdb on master is non-empty after rolling update

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 14, 2026 12:34

Copilot AI (Contributor) left a comment

Pull request overview

Prevents potential data loss during cross-version rolling upgrades by avoiding an empty snapshot overwrite: re-checks Dbsize immediately before shutdown and uses SHUTDOWN NOSAVE when the in-memory dataset is empty, applied to both failover/sentinel and cluster shutdown paths.

Changes:

  • Add pre-shutdown Dbsize re-check in failover shutdown flow; use SHUTDOWN NOSAVE when Dbsize == 0.
  • Add the same Dbsize-gated shutdown behavior to the cluster shutdown flow.
  • Add unit tests (failover + cluster) using miniredis to validate Info().Dbsize on an empty instance and document the decision rule.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • cmd/helper/commands/failover/shutdown.go: adds Dbsize re-check and conditionally uses SHUTDOWN NOSAVE to protect dump.rdb.
  • cmd/helper/commands/failover/shutdown_test.go: adds tests around empty-DB Dbsize and the NOSAVE decision (needs tightening).
  • cmd/helper/commands/cluster/shutdown.go: adds Dbsize re-check and conditionally uses SHUTDOWN NOSAVE in cluster shutdown.
  • cmd/helper/commands/cluster/shutdown_test.go: adds cluster-focused tests around empty-DB Dbsize and the NOSAVE decision rule.


Comment on lines +27 to +57
// Test_shutdownNosaveDecision verifies the shutdown save-mode decision logic:
// Dbsize == 0 (e.g. failed cross-version fullsync) → SHUTDOWN NOSAVE to preserve dump.rdb.
// Dbsize > 0 → normal SHUTDOWN.
func Test_shutdownNosaveDecision(t *testing.T) {
	tests := []struct {
		name       string
		dbsize     int64
		wantNosave bool
	}{
		{
			name:       "empty database - use SHUTDOWN NOSAVE",
			dbsize:     0,
			wantNosave: true,
		},
		{
			name:       "non-empty database - use normal SHUTDOWN",
			dbsize:     2,
			wantNosave: false,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			gotNosave := tt.dbsize == 0
			if gotNosave != tt.wantNosave {
				t.Errorf("nosave decision for dbsize=%d: got %v, want %v", tt.dbsize, gotNosave, tt.wantNosave)
			}
		})
	}
}

Comment on lines +365 to +378
	// Re-check dbsize: if 0, memory may have been cleared by a failed fullsync
	// (e.g., cross-version RDB incompatibility). Use NOSAVE to preserve existing dump.rdb.
	latestInfo, _ := getValkeyInfo(ctx, valkeyClient, logger)
	if latestInfo != nil && latestInfo.Dbsize == 0 {
		logger.Info("dbsize is 0, using SHUTDOWN NOSAVE to preserve existing dump.rdb")
		if _, err := valkeyClient.DoWithTimeout(ctx, time.Second*300, "SHUTDOWN", "NOSAVE"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
	} else {
		// NOTE: the 300s timeout gives the shutdown snapshot its best chance to complete;
		// if the dataset is very large, the snapshot may still not finish in time.
		if _, err := valkeyClient.DoWithTimeout(ctx, time.Second*300, "SHUTDOWN"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
Comment on lines 206 to 218
	// Re-check dbsize: if 0, memory may have been cleared by a failed fullsync
	// (e.g., cross-version RDB incompatibility). Use NOSAVE to preserve existing dump.rdb.
	latestInfo, _ := valkeyClient.Info(ctx)
	if latestInfo != nil && latestInfo.Dbsize == 0 {
		logger.Info("dbsize is 0, using SHUTDOWN NOSAVE to preserve existing dump.rdb")
		if _, err = valkeyClient.Do(ctx, "SHUTDOWN", "NOSAVE"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
	} else {
		if _, err = valkeyClient.Do(ctx, "SHUTDOWN"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
	}

	if info.Dbsize != 0 {
		t.Errorf("expected Dbsize=0 for empty DB, got %d", info.Dbsize)
	}
Comment on lines +83 to +111
// Test_shutdownNosaveDecision verifies the shutdown save-mode decision logic:
// Dbsize == 0 (e.g. failed cross-version fullsync) → SHUTDOWN NOSAVE to preserve dump.rdb.
// Dbsize > 0 → normal SHUTDOWN.
func Test_shutdownNosaveDecision(t *testing.T) {
	tests := []struct {
		name       string
		dbsize     int64
		wantNosave bool
	}{
		{
			name:       "empty database - use SHUTDOWN NOSAVE",
			dbsize:     0,
			wantNosave: true,
		},
		{
			name:       "non-empty database - use normal SHUTDOWN",
			dbsize:     2,
			wantNosave: false,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			gotNosave := tt.dbsize == 0
			if gotNosave != tt.wantNosave {
				t.Errorf("nosave decision for dbsize=%d: got %v, want %v", tt.dbsize, gotNosave, tt.wantNosave)
			}
		})
	}

@chideat chideat force-pushed the fix/cross-version-rolling-upgrade-data-loss branch from 3c497ef to 6ccaf41 on March 14, 2026 12:43
chideat and others added 2 commits March 14, 2026 22:38
… cross-version rolling upgrade

During a rolling upgrade (e.g. 7.2 → 9.0), a node demoted to replica may attempt
a fullsync from the new-version master. If the RDB format is incompatible, the load
fails after memory has already been cleared, leaving an empty in-memory state.
A subsequent SHUTDOWN (with save) would then overwrite the valid dump.rdb with empty data.

Fix: re-check Dbsize right before SHUTDOWN. If 0, use SHUTDOWN NOSAVE to preserve the
existing dump.rdb (which holds valid pre-upgrade data). If >0, use normal SHUTDOWN.

Applies to both failover/sentinel and cluster shutdown paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… failure

Extract shutdownNode helper in cluster and failover packages, switch
from Info().Dbsize to DBSIZE command for testability with miniredis,
and replace no-op tests with real tests that exercise both branches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@chideat chideat force-pushed the fix/cross-version-rolling-upgrade-data-loss branch from a0cff09 to 6a6fb99 on March 14, 2026 14:38
@chideat chideat merged commit 7f9290d into main Mar 14, 2026
3 checks passed
@chideat chideat deleted the fix/cross-version-rolling-upgrade-data-loss branch March 15, 2026 10:47