Skip to content

feat: [cp2.6] implement data salvage for force failover#48527

Open
bigsheeper wants to merge 9 commits intomilvus-io:2.6from
bigsheeper:cp-2.6-47599
Open

feat: [cp2.6] implement data salvage for force failover#48527
bigsheeper wants to merge 9 commits intomilvus-io:2.6from
bigsheeper:cp-2.6-47599

Conversation

@bigsheeper
Copy link
Contributor

@bigsheeper bigsheeper commented Mar 25, 2026

Cherry-pick from master

pr: #47599

design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260205-data_salvage_for_force_failover.md

Summary

Cherry-picked from master PR #47599 (open - preemptive backport)

Note: Original PR is still open. This CP PR should be updated if the original PR changes.

Verification

  • File count matches original PR (46 files in original, 42 in CP due to 2.6 compatibility pruning of master-only methods)
  • Code changes verified
  • No conflict markers

@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bigsheeper
To complete the pull request process, please assign xiaofan-luan after the PR has been reviewed.
You can assign the PR to them by writing /assign @xiaofan-luan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot added area/dependency Pull requests that update a dependency file size/XXL Denotes a PR that changes 1000+ lines. area/test sig/testing test/integration integration test labels Mar 25, 2026
@mergify mergify bot added the dco-passed DCO check passed. label Mar 25, 2026
@mergify
Copy link
Contributor

mergify bot commented Mar 25, 2026

@bigsheeper This is a feature PR (feat:). Please provide a design document.

How to resolve:
Link a design doc in the PR description:

design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/your_design.md

Design documents location: https://github.com/milvus-io/milvus-design-docs/tree/main/design_docs

@mergify mergify bot added do-not-merge/missing-design-doc kind/feature Issues related to feature request from users labels Mar 25, 2026
@sre-ci-robot
Copy link
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-buildenv // for ci-v2/build-env (build milvus-env builder images)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm
  • /ci-rerun-e2e-default // for ci-v2/e2e-default
  • /ci-rerun-ciloop // for ci-v2/ciloop (build + unit tests in one pipeline)
  • /ci-rerun-gosdk // for ci-v2/go-sdk (Go SDK E2E tests, AMD)
  • /ci-rerun-gosdk-arm // for ci-v2/go-sdk-arm (Go SDK E2E tests, ARM)

If you have any questions or requests, please contact @zhikunyao.

@sre-ci-robot sre-ci-robot added do-not-merge/need-merge-master-first any pr merge to release branch need to merge master first do-not-merge/need-milestone generate by v2-label-manager labels Mar 25, 2026
@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

[WARNING] Milestone not set

You can set milestone by commenting:
/set-milestone
Example:
/set-milestone 2.5.0

Use /refresh-label to update related check and label manually

@bigsheeper bigsheeper added this to the 2.6.14 milestone Mar 25, 2026
@bigsheeper
Copy link
Contributor Author

/refresh-label

@sre-ci-robot sre-ci-robot removed the do-not-merge/need-milestone generate by v2-label-manager label Mar 25, 2026
@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Refresh-Label

  • Title: feat: [cp2.6] implement data salvage for force failover
  • Target: 2.6
  • Labels: kind/feature, area/dependency, sig/testing, size/XXL, area/test, dco-passed, test/integration, do-not-merge/need-merge-master-first, do-not-merge/need-milestone

[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

@codecov
Copy link

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 76.06178% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.25%. Comparing base (cb08db0) to head (c9c42b4).
⚠️ Report is 734 commits behind head on 2.6.

Files with missing lines Patch % Lines
...ternal/streamingcoord/server/service/assignment.go 26.92% 15 Missing and 4 partials ⚠️
...treamingnode/client/handler/handler_client_impl.go 48.14% 11 Missing and 3 partials ⚠️
...nternal/streamingnode/server/wal/adaptor/opener.go 33.33% 3 Missing and 3 partials ⚠️
...coord/server/broadcaster/ack_callback_scheduler.go 0.00% 4 Missing and 1 partial ⚠️
internal/metastore/kv/streamingnode/kv_catalog.go 88.00% 2 Missing and 1 partial ⚠️
...al/streamingnode/server/wal/adaptor/wal_adaptor.go 57.14% 2 Missing and 1 partial ⚠️
...ingnode/server/wal/interceptors/txn/txn_manager.go 25.00% 2 Missing and 1 partial ⚠️
...de/server/wal/recovery/recovery_background_task.go 57.14% 2 Missing and 1 partial ⚠️
internal/distributed/proxy/service.go 0.00% 2 Missing ⚠️
internal/distributed/streaming/test_streaming.go 0.00% 2 Missing ⚠️
... and 1 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##              2.6   #48527      +/-   ##
==========================================
+ Coverage   76.99%   77.25%   +0.26%     
==========================================
  Files        1700     1960     +260     
  Lines      262533   309812   +47279     
==========================================
+ Hits       202142   239359   +37217     
- Misses      53550    62883    +9333     
- Partials     6841     7570     +729     
Components Coverage Δ
Client 79.48% <78.30%> (+1.33%) ⬆️
Core 83.39% <93.03%> (+1.18%) ⬆️
Go 75.86% <52.12%> (+0.47%) ⬆️
Files with missing lines Coverage Δ
...nternal/distributed/streaming/replicate_service.go 93.75% <100.00%> (+59.72%) ⬆️
internal/distributed/streaming/streaming.go 60.00% <ø> (+6.66%) ⬆️
internal/metastore/catalog.go 100.00% <ø> (ø)
internal/proxy/impl.go 72.30% <100.00%> (-1.60%) ⬇️
...nal/streamingnode/client/handler/handler_client.go 94.02% <ø> (ø)
...reamingnode/server/wal/interceptors/interceptor.go 0.00% <ø> (ø)
...al/interceptors/replicate/replicate_interceptor.go 81.57% <ø> (ø)
...rver/wal/interceptors/replicate/replicates/impl.go 85.54% <100.00%> (ø)
...r/wal/interceptors/replicate/replicates/manager.go 100.00% <ø> (ø)
...gnode/server/wal/recovery/recovery_storage_impl.go 89.49% <100.00%> (+6.89%) ⬆️
... and 11 more

... and 1253 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

1 similar comment
@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

@bigsheeper
Copy link
Contributor Author

/ci-rerun-build

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Mar 25, 2026
@bigsheeper
Copy link
Contributor Author

/ci-rerun-all

@sre-ci-robot sre-ci-robot removed the low-code-coverage add test-label from zhikun, diff coverage > 80% label Mar 25, 2026
@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

3 similar comments
@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

@sre-ci-robot sre-ci-robot added low-code-coverage add test-label from zhikun, diff coverage > 80% and removed low-code-coverage add test-label from zhikun, diff coverage > 80% labels Mar 26, 2026
@bigsheeper
Copy link
Contributor Author

/ci-rerun-ut-go

@sre-ci-robot sre-ci-robot added low-code-coverage add test-label from zhikun, diff coverage > 80% and removed low-code-coverage add test-label from zhikun, diff coverage > 80% labels Mar 26, 2026
@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

bigsheeper and others added 9 commits March 26, 2026 20:16
Cherry-pick from master PR milvus-io#47599

Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
…Append mock

- Remove trailing newline in internal/proxy/impl.go (gofmt)
- Remove mockBroadcastService.EXPECT().Append() call in TestForcePromoteUpdateConfigError:
  the Broadcast interface in 2.6 only has Ack(), not Append(), so the mock
  setup was invalid. The test validates early at request validation before
  reaching any broadcast, so the mock is unnecessary.

Signed-off-by: $(git config user.name) <$(git config user.email)>
Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
Add unit test coverage for salvage checkpoint functionality introduced in PR milvus-io#47599:

- kv_catalog_test.go: TestCatalogSalvageCheckpoint (save/get success, errors,
  multiple clusters) and extend TestBuildPrefixAndKey with salvage key builders
- replicate_service_test.go: TestReplicateServiceGetSalvageCheckpoint
  (success, error, closed_lifetime)
- manager_test.go: TestSalvageCheckpointLoadedFromEtcd and
  TestSalvageCheckpointMultipleForcePromotes
- salvage_checkpoint_test.go (new): TestUpdateCheckpointForcePromote and
  TestConsumeDirtySnapshotWithSalvageCheckpoint covering the force-promote
  path in recovery_storage_impl

Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
Covers all code paths in the DumpMessages RPC handler:
- Node health check (unhealthy → error)
- Parameter validation (missing pchannel, nil/empty start_message_id)
- Context cancellation (returns ctx.Err())
- Scanner error/done paths
- Start/end timetick filtering
- System message filtering (RollbackTxn filtered, data messages passed)
- Stream send error propagation
- Channel closed (returns nil)

Uses mock WAL via streaming.SetWALForTest with RunAndReturn to inject
messages synchronously into the buffered msgCh channel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
…eCheckpoint

Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
…effects

Tests using SetWALForTest(mockWAL)/defer SetWALForTest(nil) were
inadvertently resetting the streaming singleton to nil after tests in
proxy_test.go had set it to noopWALAccesser via SetupNoopWALForTest().
This caused TestDeleteTask_Execute/delete_produce_failed to panic on
streaming.WAL().AppendMessages() when the singleton was nil.

Fix: save and restore the previous WAL value in each test instead of
unconditionally resetting to nil.

Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
… no checkpoints exist

When salvageCheckpoints map is empty, make() returns an empty non-nil
slice, causing require.Nil assertion in TestWAL to fail. Return nil
explicitly when there are no salvage checkpoints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
@sre-ci-robot
Copy link
Contributor

[INFO] PR Label Summary by Default
[FAILED] PR #47599 not merged

Use /refresh-label to update related check and label manually

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Mar 26, 2026
@mergify mergify bot added the ci-passed label Mar 26, 2026
@sre-ci-robot sre-ci-robot removed the low-code-coverage add test-label from zhikun, diff coverage > 80% label Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/dependency Pull requests that update a dependency file area/test ci-passed dco-passed DCO check passed. do-not-merge/need-merge-master-first any pr merge to release branch need to merge master first kind/feature Issues related to feature request from users sig/testing size/XXL Denotes a PR that changes 1000+ lines. test/integration integration test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants