[scrubber] phase1: add scrub manager #395

Open
JacksonYao287 wants to merge 1 commit into eBay:main from JacksonYao287:add-scrub-manager
JacksonYao287 commented Mar 8, 2026

This PR implements the framework and basic logic of the scrubber, including:
1. thread model
2. scrubber RPC
3. local scrub: deep and shallow scrub for PG, shard, and blob

Later PRs will:
1. add an HTTP interface to trigger PG scrub
2. add more UTs to cover more cases
3. make optimizations

codecov-commenter commented Mar 8, 2026

Codecov Report

❌ Patch coverage is 71.19816% with 375 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.45%. Comparing base (1746bcc) to head (86ce003).
⚠️ Report is 164 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lib/homestore_backend/scrub_manager.cpp | 69.33% | 207 Missing and 88 partials ⚠️ |
| src/lib/homestore_backend/hs_pg_manager.cpp | 59.74% | 42 Missing and 20 partials ⚠️ |
| src/lib/homestore_backend/scrub_manager.hpp | 90.29% | 10 Missing ⚠️ |
| src/lib/homestore_backend/MPMCPriorityQueue.hpp | 94.28% | 1 Missing and 1 partial ⚠️ |
| src/lib/homestore_backend/hs_homeobject.cpp | 81.81% | 0 Missing and 2 partials ⚠️ |
| src/lib/homestore_backend/hs_shard_manager.cpp | 86.66% | 2 Missing ⚠️ |
| ...ib/homestore_backend/replication_state_machine.cpp | 87.50% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
- Coverage   63.15%   57.45%   -5.71%     
==========================================
  Files          32       39       +7     
  Lines        1900     6499    +4599     
  Branches      204      850     +646     
==========================================
+ Hits         1200     3734    +2534     
- Misses        600     2358    +1758     
- Partials      100      407     +307     

JacksonYao287 force-pushed the add-scrub-manager branch 3 times, most recently from 577567f to e6e4ff8 on March 11, 2026 08:37

Add comprehensive scrub infrastructure to detect data corruption and
inconsistencies across replicas in HomeObject. This is phase 1 of the
scrubber implementation.

- Implements deep and shallow scrubbing for PG metadata, shards, and blobs
- Supports periodic and manual scrub triggering modes
- Uses priority queue (MPMCPriorityQueue) for scrub task scheduling
- Persists scrub metadata using superblocks to track last scrub times
- Coordinates scrub operations across all replicas in a PG
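
The scheduling idea in the list above can be sketched as follows. This is only an illustration: it uses a mutex-protected `std::priority_queue` as a stand-in, not the lock-free `MPMCPriorityQueue` from this PR, whose interface is not shown here. The names `ScrubTask`, `ScrubTaskQueue`, and `LowerPriority`, and the ordering rule (manual before periodic, then oldest last-scrub time first), are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical task descriptor; the PR's real task type is not shown in this description.
struct ScrubTask {
    uint16_t pg_id;
    bool deep;               // deep or shallow scrub
    bool manual;             // manually triggered (true) or periodic (false)
    uint64_t last_scrub_ts;  // last scrub time, e.g. as persisted in a superblock
};

// Assumed ordering: returns true when `a` should be served AFTER `b`.
struct LowerPriority {
    bool operator()(const ScrubTask& a, const ScrubTask& b) const {
        if (a.manual != b.manual) return !a.manual;  // periodic ranks below manual
        return a.last_scrub_ts > b.last_scrub_ts;    // recently scrubbed ranks below stale
    }
};

// Mutex-based stand-in for a multi-producer, multi-consumer priority queue.
class ScrubTaskQueue {
public:
    void push(const ScrubTask& task) {
        std::lock_guard<std::mutex> lock(mtx_);
        queue_.push(task);
    }

    // Pops the highest-priority task into `out`; returns false if the queue is empty.
    bool try_pop(ScrubTask& out) {
        std::lock_guard<std::mutex> lock(mtx_);
        if (queue_.empty()) return false;
        out = queue_.top();
        queue_.pop();
        return true;
    }

private:
    std::mutex mtx_;
    std::priority_queue<ScrubTask, std::vector<ScrubTask>, LowerPriority> queue_;
};
```

With this ordering, a manually triggered task is popped before any periodic one, and among periodic tasks the PG that has gone longest without a scrub comes first.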

1. **Deep Scrub**: Full data integrity verification
   - PG metadata validation
   - Shard existence and consistency checks
   - Blob hash verification (reads data and computes checksums)
   - Detects corrupted, missing, and inconsistent data across replicas

2. **Shallow Scrub**: Lightweight metadata-only verification
   - Shard existence checks
   - Blob index validation (no data reads)
   - Faster execution for routine checks
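
To make the difference between the two scrub types concrete, here is a minimal sketch under stated assumptions. `Replica`, `ShallowScrubMap`, `DeepScrubMap`, and both functions are hypothetical names, the replica is a plain in-memory map, and FNV-1a stands in for whatever checksum HomeObject actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// FNV-1a 64-bit hash, used here only as a stand-in checksum.
inline uint64_t fnv1a(const std::string& data) {
    uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hypothetical in-memory replica: blob id -> payload, plus a counter of data reads.
struct Replica {
    std::map<uint64_t, std::string> blobs;
    std::size_t data_reads = 0;
};

using ShallowScrubMap = std::set<uint64_t>;         // blob ids present in the index
using DeepScrubMap = std::map<uint64_t, uint64_t>;  // blob id -> checksum of its data

// Shallow scrub: walks the index only and never touches blob data.
inline ShallowScrubMap shallow_scrub(const Replica& r) {
    ShallowScrubMap out;
    for (const auto& kv : r.blobs) out.insert(kv.first);
    return out;
}

// Deep scrub: reads every blob and computes a checksum over its data.
inline DeepScrubMap deep_scrub(Replica& r) {
    DeepScrubMap out;
    for (const auto& kv : r.blobs) {
        ++r.data_reads;  // models the data read that shallow scrub avoids
        out[kv.first] = fnv1a(kv.second);
    }
    return out;
}
```

Two replicas holding the same blob ids produce identical shallow maps even when one payload is silently altered; only the deep maps expose the differing checksum.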

- FlatBuffer-based serialization for scrub requests and responses
- Leader sends scrub requests to all replicas
- Followers return scrub maps with their local state
- Retry logic with configurable timeouts for reliability
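
The retry behavior mentioned above might look roughly like the sketch below. `RetryPolicy` and `send_with_retry` are hypothetical names, and the transport is abstracted as a callable that returns whether a reply arrived within the given timeout; the PR's actual RPC and FlatBuffer types are not reproduced here.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical retry settings; the PR describes the timeouts as configurable.
struct RetryPolicy {
    int max_attempts;
    std::chrono::milliseconds timeout;  // per-attempt wait for the follower's reply
};

// Calls `send_once` until it reports success or the attempts run out.
// Returns the number of attempts used on success, or 0 if every attempt failed.
inline int send_with_retry(const RetryPolicy& policy,
                           const std::function<bool(std::chrono::milliseconds)>& send_once) {
    for (int attempt = 1; attempt <= policy.max_attempts; ++attempt) {
        if (send_once(policy.timeout)) return attempt;
    }
    return 0;
}
```

A leader would call this once per replica and treat a return value of 0 as "no scrub map received from this peer".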

- **ShallowScrubReport**: Tracks missing shards and blobs per peer
- **DeepScrubReport**: Extends shallow report with:
  - Corrupted blobs/shards with error details
  - Inconsistent blobs (different hashes across replicas)
  - Corrupted PG metadata
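
One possible shape for these reports is sketched below, along with a comparison step that fills them from per-peer deep scrub maps. The field names, the `PeerId`/`BlobId`/`ShardId` aliases, and `compare_blobs` are assumptions for illustration and may differ from the definitions in `scrub_manager.hpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

using PeerId = std::string;  // illustrative; the real peer id type may differ
using BlobId = uint64_t;
using ShardId = uint64_t;

// Tracks what each peer is missing.
struct ShallowScrubReport {
    std::map<PeerId, std::set<ShardId>> missing_shards;
    std::map<PeerId, std::set<BlobId>> missing_blobs;
};

// Extends the shallow report with corruption and inconsistency details.
struct DeepScrubReport : ShallowScrubReport {
    std::map<PeerId, std::map<BlobId, std::string>> corrupted_blobs;    // error detail per blob
    std::map<PeerId, std::map<ShardId, std::string>> corrupted_shards;  // error detail per shard
    std::set<BlobId> inconsistent_blobs;  // hashes differ across replicas
    std::set<PeerId> corrupted_pg_meta;
};

using DeepScrubMap = std::map<BlobId, uint64_t>;  // blob id -> hash reported by one peer

// Compares the maps returned by all peers and records missing and inconsistent blobs.
inline DeepScrubReport compare_blobs(const std::map<PeerId, DeepScrubMap>& maps) {
    DeepScrubReport report;
    std::map<BlobId, std::set<uint64_t>> hashes;  // every distinct hash seen per blob
    for (const auto& peer : maps) {
        for (const auto& entry : peer.second) hashes[entry.first].insert(entry.second);
    }
    for (const auto& blob : hashes) {
        if (blob.second.size() > 1) report.inconsistent_blobs.insert(blob.first);
        for (const auto& peer : maps) {
            if (peer.second.count(blob.first) == 0) {
                report.missing_blobs[peer.first].insert(blob.first);
            }
        }
    }
    return report;
}
```

Deriving `DeepScrubReport` from `ShallowScrubReport` mirrors the "extends" wording above, so code that only needs the missing-item view can accept either report type.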

- Scrubs data in configurable ranges to avoid timeouts
- Shard range: 2M shards per request
- Blob range: Based on HDD IOPS for deep scrub, 2M for shallow
- Early cancellation support for graceful shutdown
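
Range-based scrubbing with early cancellation can be sketched as a loop over fixed-size id ranges that checks a cancel flag before each request. The function name `scrub_in_ranges` and its parameters are hypothetical; the range size is left as an argument rather than hard-coding the values listed above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>

// Splits the inclusive id interval [first, last] into ranges of at most `range_size`
// ids and invokes `scrub_range(start, end)` for each. The cancel flag is checked
// before every range so a shutdown can stop the scrub between requests.
// Returns the number of ranges that were actually scrubbed.
inline uint64_t scrub_in_ranges(uint64_t first, uint64_t last, uint64_t range_size,
                                const std::atomic<bool>& cancelled,
                                const std::function<void(uint64_t, uint64_t)>& scrub_range) {
    assert(range_size > 0 && first <= last);
    uint64_t done = 0;
    uint64_t start = first;
    while (!cancelled.load()) {
        // Written this way to avoid overflow when `last` is near the uint64_t maximum.
        uint64_t end = (last - start < range_size) ? last : start + range_size - 1;
        scrub_range(start, end);
        ++done;
        if (end == last) break;
        start = end + 1;
    }
    return done;
}
```

Keeping each request bounded means a slow replica delays only one range, and a cancelled scrub stops at the next range boundary instead of running to completion.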

1. **DeepScrubTest**: Verifies detection of:
   - Missing blobs on followers
   - Missing shards on followers
   - Corrupted blob data (IO errors)
   - Inconsistent blob hashes across replicas

2. **MPMCPriorityQueue Tests**: Lock-free queue validation
   - Concurrent push/pop operations
   - Priority ordering verification
   - Thread safety under contention