fix: prevent data loss during cross-version rolling upgrade #66

Merged
chideat merged 2 commits into main from fix/cross-version-rolling-upgrade-data-loss
Mar 14, 2026
Conversation

@chideat (Owner) commented Mar 14, 2026

Summary

  • During a rolling upgrade (e.g. Valkey 7.2 → 9.0), a node demoted to replica may attempt a fullsync from the new-version master. If the RDB format is incompatible, the load fails after memory has already been cleared, leaving an empty in-memory dataset.
  • A subsequent SHUTDOWN (default: save) then overwrites the valid dump.rdb with empty data — causing data loss if the node later becomes master before restarting.
  • Fix: re-check Dbsize right before SHUTDOWN. If 0, use SHUTDOWN NOSAVE to preserve the existing dump.rdb. If > 0, use normal SHUTDOWN.
  • Applied to both the failover/sentinel and cluster shutdown paths.
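The decision rule above is small enough to capture in a standalone sketch. This is illustrative only — `shutdownArgs` is a hypothetical name, not the helper introduced in this PR — but it shows the Dbsize-gated choice between the two shutdown modes:

```go
package main

import "fmt"

// shutdownArgs returns the SHUTDOWN command arguments based on the dbsize
// observed immediately before shutdown. A dbsize of 0 may mean memory was
// cleared by a failed cross-version fullsync, so NOSAVE preserves the
// existing dump.rdb instead of overwriting it with an empty snapshot.
func shutdownArgs(dbsize int64) []string {
	if dbsize == 0 {
		return []string{"SHUTDOWN", "NOSAVE"}
	}
	return []string{"SHUTDOWN"}
}

func main() {
	fmt.Println(shutdownArgs(0))  // empty DB: SHUTDOWN NOSAVE
	fmt.Println(shutdownArgs(42)) // non-empty DB: plain SHUTDOWN
}
```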

Known edge case (acknowledged acceptable)

If all keys were intentionally deleted (FLUSHDB/FLUSHALL), Dbsize == 0 and the node will use NOSAVE, preserving the pre-flush dump.rdb. The node would reload old data on restart. This is a rare operational scenario.

Test plan

  • go test ./cmd/helper/commands/failover/... — all existing + new tests pass
  • go test ./cmd/helper/commands/cluster/... — all existing + new tests pass
  • go build ./cmd/helper/... — compiles cleanly
  • Manual: 2-node failover cluster, upgrade replica first then trigger master upgrade — verify dump.rdb on master is non-empty after rolling update

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 14, 2026 12:34

Copilot AI (Contributor) left a comment

Pull request overview

Prevents potential data loss during cross-version rolling upgrades by avoiding an empty snapshot overwrite: re-checks Dbsize immediately before shutdown and uses SHUTDOWN NOSAVE when the in-memory dataset is empty, applied to both failover/sentinel and cluster shutdown paths.

Changes:

  • Add pre-shutdown Dbsize re-check in failover shutdown flow; use SHUTDOWN NOSAVE when Dbsize == 0.
  • Add the same Dbsize-gated shutdown behavior to the cluster shutdown flow.
  • Add unit tests (failover + cluster) using miniredis to validate Info().Dbsize on an empty instance and document the decision rule.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • cmd/helper/commands/failover/shutdown.go: adds Dbsize re-check and conditionally uses SHUTDOWN NOSAVE to protect dump.rdb.
  • cmd/helper/commands/failover/shutdown_test.go: adds tests around empty-DB Dbsize and the NOSAVE decision (needs tightening).
  • cmd/helper/commands/cluster/shutdown.go: adds Dbsize re-check and conditionally uses SHUTDOWN NOSAVE in cluster shutdown.
  • cmd/helper/commands/cluster/shutdown_test.go: adds cluster-focused tests around empty-DB Dbsize and the NOSAVE decision rule.


Comment on lines +27 to +57
// Test_shutdownNosaveDecision verifies the shutdown save-mode decision logic:
// Dbsize == 0 (e.g. failed cross-version fullsync) → SHUTDOWN NOSAVE to preserve dump.rdb.
// Dbsize > 0 → normal SHUTDOWN.
func Test_shutdownNosaveDecision(t *testing.T) {
	tests := []struct {
		name       string
		dbsize     int64
		wantNosave bool
	}{
		{
			name:       "empty database - use SHUTDOWN NOSAVE",
			dbsize:     0,
			wantNosave: true,
		},
		{
			name:       "non-empty database - use normal SHUTDOWN",
			dbsize:     2,
			wantNosave: false,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			gotNosave := tt.dbsize == 0
			if gotNosave != tt.wantNosave {
				t.Errorf("nosave decision for dbsize=%d: got %v, want %v", tt.dbsize, gotNosave, tt.wantNosave)
			}
		})
	}
}

Comment on lines +365 to +378
	// Re-check dbsize: if 0, memory may have been cleared by a failed fullsync
	// (e.g., cross-version RDB incompatibility). Use NOSAVE to preserve existing dump.rdb.
	latestInfo, _ := getValkeyInfo(ctx, valkeyClient, logger)
	if latestInfo != nil && latestInfo.Dbsize == 0 {
		logger.Info("dbsize is 0, using SHUTDOWN NOSAVE to preserve existing dump.rdb")
		if _, err := valkeyClient.DoWithTimeout(ctx, time.Second*300, "SHUTDOWN", "NOSAVE"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
	} else {
		// NOTE: the 300s timeout gives the shutdown snapshot its best chance to complete;
		// if the dataset is very large, the snapshot may still not finish in time.
		if _, err := valkeyClient.DoWithTimeout(ctx, time.Second*300, "SHUTDOWN"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
Comment on lines 206 to 218
	// Re-check dbsize: if 0, memory may have been cleared by a failed fullsync
	// (e.g., cross-version RDB incompatibility). Use NOSAVE to preserve existing dump.rdb.
	latestInfo, _ := valkeyClient.Info(ctx)
	if latestInfo != nil && latestInfo.Dbsize == 0 {
		logger.Info("dbsize is 0, using SHUTDOWN NOSAVE to preserve existing dump.rdb")
		if _, err = valkeyClient.Do(ctx, "SHUTDOWN", "NOSAVE"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
	} else {
		if _, err = valkeyClient.Do(ctx, "SHUTDOWN"); err != nil && !errors.Is(err, io.EOF) {
			logger.Error(err, "graceful shutdown failed")
		}
	}

	if info.Dbsize != 0 {
		t.Errorf("expected Dbsize=0 for empty DB, got %d", info.Dbsize)
	}
Comment on lines +83 to +111
// Test_shutdownNosaveDecision verifies the shutdown save-mode decision logic:
// Dbsize == 0 (e.g. failed cross-version fullsync) → SHUTDOWN NOSAVE to preserve dump.rdb.
// Dbsize > 0 → normal SHUTDOWN.
func Test_shutdownNosaveDecision(t *testing.T) {
	tests := []struct {
		name       string
		dbsize     int64
		wantNosave bool
	}{
		{
			name:       "empty database - use SHUTDOWN NOSAVE",
			dbsize:     0,
			wantNosave: true,
		},
		{
			name:       "non-empty database - use normal SHUTDOWN",
			dbsize:     2,
			wantNosave: false,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			gotNosave := tt.dbsize == 0
			if gotNosave != tt.wantNosave {
				t.Errorf("nosave decision for dbsize=%d: got %v, want %v", tt.dbsize, gotNosave, tt.wantNosave)
			}
		})
	}

@chideat chideat force-pushed the fix/cross-version-rolling-upgrade-data-loss branch from 3c497ef to 6ccaf41 on March 14, 2026 12:43
chideat and others added 2 commits March 14, 2026 22:38
… cross-version rolling upgrade

During a rolling upgrade (e.g. 7.2 → 9.0), a node demoted to replica may attempt
a fullsync from the new-version master. If the RDB format is incompatible, the load
fails after memory has already been cleared, leaving an empty in-memory state.
A subsequent SHUTDOWN (with save) would then overwrite the valid dump.rdb with empty data.

Fix: re-check Dbsize right before SHUTDOWN. If 0, use SHUTDOWN NOSAVE to preserve the
existing dump.rdb (which holds valid pre-upgrade data). If >0, use normal SHUTDOWN.

Applies to both failover/sentinel and cluster shutdown paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… failure

Extract shutdownNode helper in cluster and failover packages, switch
from Info().Dbsize to DBSIZE command for testability with miniredis,
and replace no-op tests with real tests that exercise both branches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@chideat chideat force-pushed the fix/cross-version-rolling-upgrade-data-loss branch from a0cff09 to 6a6fb99 on March 14, 2026 14:38
@chideat chideat merged commit 7f9290d into main Mar 14, 2026
3 checks passed
@chideat chideat deleted the fix/cross-version-rolling-upgrade-data-loss branch March 15, 2026 10:47