
[CORE-6913] cloud/scrub: fix false positive for replaced segments#29655

Open
oleiman wants to merge 2 commits into redpanda-data:dev from oleiman:ci/core-6913/internal-scrub

Conversation


@oleiman oleiman commented Feb 19, 2026

The scrub false-positive filter in process_anomalies() only checked whether a segment with the same offset range existed in the manifest. A compacted reupload produces a replacement segment at the same offset range but with a different name (different size). When GC deleted the old segment from cloud storage while the scrubber was still referencing a stale manifest, the filter kept the anomaly because the offset range still matched—even though the current segment at that range was a different (replacement) object that existed in cloud storage.

Compare generate_remote_segment_name() for the manifest entry and the reported-missing segment so that replacements with the same offset range but different identity are correctly recognized as false positives.

Fixes CORE-6913.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

Improvements

  • Improves false-positive detection in the cloud storage scrubber, filtering out cases where the scrubber's manifest is stale with respect to a compacted reupload.

@oleiman oleiman self-assigned this Feb 19, 2026

oleiman commented Feb 19, 2026

/ci-repeat 3
release
dt-repeat=10
tests/rptest/scale_tests/shard_placement_scale_test.py::ShardPlacementScaleTest.test_node_add

@oleiman oleiman force-pushed the ci/core-6913/internal-scrub branch from 559ba36 to f523899 Compare February 19, 2026 22:23

oleiman commented Feb 19, 2026

/ci-repeat 3
release
dt-repeat=10
tests/rptest/scale_tests/shard_placement_scale_test.py::ShardPlacementScaleTest.test_node_add


vbotbuildovich commented Feb 19, 2026

Retry command for Build#80815

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/scale_tests/shard_placement_scale_test.py::ShardPlacementScaleTest.test_node_add


oleiman commented Feb 19, 2026

/cdt
rp_version=build
dt-repeat=10
tests/rptest/scale_tests/shard_placement_scale_test.py::ShardPlacementScaleTest.test_node_add

@oleiman oleiman changed the title cloud/scrub: fix false positive for replaced segments [CORE-6913] cloud/scrub: fix false positive for replaced segments Feb 19, 2026
@oleiman oleiman marked this pull request as ready for review February 20, 2026 02:16
Copilot AI review requested due to automatic review settings February 20, 2026 02:16

Copilot AI left a comment


Pull request overview

This PR fixes a false positive issue in the cloud storage scrubber where compacted reuploads were incorrectly flagged as anomalies. The scrubber's manifest could become stale while GC deleted old segments, causing false positives when replacement segments existed at the same offset range but with different names/sizes.

Changes:

  • Enhanced anomaly filtering to compare segment identities (names) in addition to offset ranges
  • Applied the fix to both missing segment and segment metadata anomaly filtering paths


vbotbuildovich commented Feb 20, 2026

CI test results

test results on build#80836

  • DataMigrationsApiTest.test_concurrent_migrations_with_data_integrity ({"transfer_leadership": true}, integration): FLAKY, 9/11 passed. Test passes after retries; no significant increase in flaky rate (baseline=0.0857, p0=0.5919, reject_threshold=0.0100; adj_baseline=0.2357, p1=0.2777, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/80836#019c78e1-bada-44d2-a426-3e0f72d1235e Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_concurrent_migrations_with_data_integrity

test results on build#80905

  • ScalingUpTest.test_moves_with_local_retention ({"use_topic_property": false}, integration): FLAKY, 10/11 passed. Test passes after retries; no significant increase in flaky rate (baseline=0.0106, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/80905#019c88e0-8da4-4b14-ab3c-a015ce30d22a Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_moves_with_local_retention
  • SimpleEndToEndTest.test_relaxed_acks ({"write_caching": false}, integration): FLAKY, 10/11 passed. Test passes after retries; no significant increase in flaky rate (baseline=0.0026, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/80905#019c88e0-8da7-4551-9996-22ece10cf941 Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SimpleEndToEndTest&test_method=test_relaxed_acks

test results on build#80914

  • QuotaManagementUpgradeTest.test_upgrade (null, integration): FLAKY, 10/11 passed. Test passes after retries; no significant increase in flaky rate (baseline=0.0271, p0=1.0000, reject_threshold=0.0100; adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000). Job: https://buildkite.com/redpanda/redpanda/builds/80914#019c8950-5fab-45cd-b0b8-d3e64ea3f3f4 Test history: https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=QuotaManagementUpgradeTest&test_method=test_upgrade

@oleiman oleiman requested review from Lazin, andrwng and dotnwat February 23, 2026 00:31
The scrub false-positive filter in process_anomalies() only checked
whether a segment with the same offset range existed in the manifest.
A compacted reupload produces a replacement segment at the same
offset range but with a different name (different size). When GC
deleted the old segment from cloud storage while the scrubber was
still referencing a stale manifest, the filter kept the anomaly
because the offset range still matched—even though the current
segment at that range was a different (replacement) object that
existed in cloud storage.

Compare generate_remote_segment_name() for the manifest entry and
the reported-missing segment so that replacements with the same
offset range but different identity are correctly recognized as
false positives.

Fixes CORE-6913.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ci/core-6913/internal-scrub branch from f523899 to 221cceb Compare February 23, 2026 04:49
Test for race between scrubber and compacted segment reupload:
1. Create manifest with 3 segments, remove the middle one
   from cloud storage so the detector reports it missing
2. Replace it in the manifest with a compacted version at
   the same offset range but different size_bytes
3. Assert generate_remote_segment_name() differs for the
   original vs compacted segment (v2/v3 names encode size)
4. Call process_anomalies() and assert the anomaly is
   filtered out as a false positive
5. Verify no anomalies remain after filtering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ci/core-6913/internal-scrub branch from 221cceb to 1fd5cb6 Compare February 23, 2026 06:51