fix: prevent data loss during cross-version rolling upgrade#66
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Prevents potential data loss during cross-version rolling upgrades by avoiding an empty snapshot overwrite: re-checks Dbsize immediately before shutdown and uses SHUTDOWN NOSAVE when the in-memory dataset is empty, applied to both failover/sentinel and cluster shutdown paths.
Changes:
- Add pre-shutdown
Dbsizere-check in failover shutdown flow; useSHUTDOWN NOSAVEwhenDbsize == 0. - Add the same
Dbsize-gated shutdown behavior to the cluster shutdown flow. - Add unit tests (failover + cluster) using
miniredisto validateInfo().Dbsizeon an empty instance and document the decision rule.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| cmd/helper/commands/failover/shutdown.go | Adds Dbsize re-check and conditionally uses SHUTDOWN NOSAVE to protect dump.rdb. |
| cmd/helper/commands/failover/shutdown_test.go | Adds tests around empty DB Dbsize and the NOSAVE decision (needs tightening). |
| cmd/helper/commands/cluster/shutdown.go | Adds Dbsize re-check and conditionally uses SHUTDOWN NOSAVE in cluster shutdown. |
| cmd/helper/commands/cluster/shutdown_test.go | Adds cluster-focused tests around empty DB Dbsize and the NOSAVE decision rule. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Comment on lines
+27
to
+57
| // Test_shutdownNosaveDecision verifies the shutdown save-mode decision logic: | ||
| // Dbsize == 0 (e.g. failed cross-version fullsync) → SHUTDOWN NOSAVE to preserve dump.rdb. | ||
| // Dbsize > 0 → normal SHUTDOWN. | ||
| func Test_shutdownNosaveDecision(t *testing.T) { | ||
| tests := []struct { | ||
| name string | ||
| dbsize int64 | ||
| wantNosave bool | ||
| }{ | ||
| { | ||
| name: "empty database - use SHUTDOWN NOSAVE", | ||
| dbsize: 0, | ||
| wantNosave: true, | ||
| }, | ||
| { | ||
| name: "non-empty database - use normal SHUTDOWN", | ||
| dbsize: 2, | ||
| wantNosave: false, | ||
| }, | ||
| } | ||
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| gotNosave := tt.dbsize == 0 | ||
| if gotNosave != tt.wantNosave { | ||
| t.Errorf("nosave decision for dbsize=%d: got %v, want %v", tt.dbsize, gotNosave, tt.wantNosave) | ||
| } | ||
| }) | ||
| } | ||
| } | ||
|
|
Comment on lines
+365
to
+378
| // Re-check dbsize: if 0, memory may have been cleared by a failed fullsync | ||
| // (e.g., cross-version RDB incompatibility). Use NOSAVE to preserve existing dump.rdb. | ||
| latestInfo, _ := getValkeyInfo(ctx, valkeyClient, logger) | ||
| if latestInfo != nil && latestInfo.Dbsize == 0 { | ||
| logger.Info("dbsize is 0, using SHUTDOWN NOSAVE to preserve existing dump.rdb") | ||
| if _, err := valkeyClient.DoWithTimeout(ctx, time.Second*300, "SHUTDOWN", "NOSAVE"); err != nil && !errors.Is(err, io.EOF) { | ||
| logger.Error(err, "graceful shutdown failed") | ||
| } | ||
| } else { | ||
| // NOTE: here set timeout to 300s, which will try best to do a shutdown snapshot | ||
| // if the data is too large, this snapshot may not be completed | ||
| if _, err := valkeyClient.DoWithTimeout(ctx, time.Second*300, "SHUTDOWN"); err != nil && !errors.Is(err, io.EOF) { | ||
| logger.Error(err, "graceful shutdown failed") | ||
| } |
Comment on lines
206
to
218
| // Re-check dbsize: if 0, memory may have been cleared by a failed fullsync | ||
| // (e.g., cross-version RDB incompatibility). Use NOSAVE to preserve existing dump.rdb. | ||
| latestInfo, _ := valkeyClient.Info(ctx) | ||
| if latestInfo != nil && latestInfo.Dbsize == 0 { | ||
| logger.Info("dbsize is 0, using SHUTDOWN NOSAVE to preserve existing dump.rdb") | ||
| if _, err = valkeyClient.Do(ctx, "SHUTDOWN", "NOSAVE"); err != nil && !errors.Is(err, io.EOF) { | ||
| logger.Error(err, "graceful shutdown failed") | ||
| } | ||
| } else { | ||
| if _, err = valkeyClient.Do(ctx, "SHUTDOWN"); err != nil && !errors.Is(err, io.EOF) { | ||
| logger.Error(err, "graceful shutdown failed") | ||
| } | ||
| } |
| if info.Dbsize != 0 { | ||
| t.Errorf("expected Dbsize=0 for empty DB, got %d", info.Dbsize) | ||
| } | ||
| if info.Dbsize != 0 { |
Comment on lines
+83
to
+111
| // Test_shutdownNosaveDecision verifies the shutdown save-mode decision logic: | ||
| // Dbsize == 0 (e.g. failed cross-version fullsync) → SHUTDOWN NOSAVE to preserve dump.rdb. | ||
| // Dbsize > 0 → normal SHUTDOWN. | ||
| func Test_shutdownNosaveDecision(t *testing.T) { | ||
| tests := []struct { | ||
| name string | ||
| dbsize int64 | ||
| wantNosave bool | ||
| }{ | ||
| { | ||
| name: "empty database - use SHUTDOWN NOSAVE", | ||
| dbsize: 0, | ||
| wantNosave: true, | ||
| }, | ||
| { | ||
| name: "non-empty database - use normal SHUTDOWN", | ||
| dbsize: 2, | ||
| wantNosave: false, | ||
| }, | ||
| } | ||
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| gotNosave := tt.dbsize == 0 | ||
| if gotNosave != tt.wantNosave { | ||
| t.Errorf("nosave decision for dbsize=%d: got %v, want %v", tt.dbsize, gotNosave, tt.wantNosave) | ||
| } | ||
| }) | ||
| } |
3c497ef to
6ccaf41
Compare
… cross-version rolling upgrade During a rolling upgrade (e.g. 7.2 → 9.0), a node demoted to replica may attempt a fullsync from the new-version master. If the RDB format is incompatible, the load fails after memory has already been cleared, leaving an empty in-memory state. A subsequent SHUTDOWN (with save) would then overwrite the valid dump.rdb with empty data. Fix: re-check Dbsize right before SHUTDOWN. If 0, use SHUTDOWN NOSAVE to preserve the existing dump.rdb (which holds valid pre-upgrade data). If >0, use normal SHUTDOWN. Applies to both failover/sentinel and cluster shutdown paths. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… failure Extract shutdownNode helper in cluster and failover packages, switch from Info().Dbsize to DBSIZE command for testability with miniredis, and replace no-op tests with real tests that exercise both branches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
a0cff09 to
6a6fb99
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SHUTDOWN(default: save) then overwrites the validdump.rdbwith empty data — causing data loss if the node later becomes master before restarting.Dbsizeright beforeSHUTDOWN. If0, useSHUTDOWN NOSAVEto preserve the existingdump.rdb. If> 0, use normalSHUTDOWN.Known edge case (acknowledged acceptable)
If all keys were intentionally deleted (
FLUSHDB/FLUSHALL),Dbsize == 0and the node will useNOSAVE, preserving the pre-flushdump.rdb. The node would reload old data on restart. This is a rare operational scenario.Test plan
go test ./cmd/helper/commands/failover/...— all existing + new tests passgo test ./cmd/helper/commands/cluster/...— all existing + new tests passgo build ./cmd/helper/...— compiles cleanlydump.rdbon master is non-empty after rolling update🤖 Generated with Claude Code