feat: add verify_integrity() for full-file checksum verification#275
feat: add verify_integrity() for full-file checksum verification#275polaz wants to merge 2 commits intofjall-rs:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a new public Changes
Sequence Diagram(s)sequenceDiagram
participant User as User
participant Verify as verify_integrity()
participant Manifest as Tree/Manifest
participant Files as SST/Blob Files
participant Stream as stream_checksum()
User->>Verify: verify_integrity(tree)
Verify->>Manifest: Read manifest metadata / enumerate files
loop For each SST table
Verify->>Stream: stream_checksum(sst_path)
Stream->>Files: Read file in chunks
Stream-->>Verify: Computed checksum
Verify->>Manifest: Compare checksum to expected
alt Mismatch
Verify-->>Verify: Record SstFileCorrupted
end
end
loop For each blob file
Verify->>Stream: stream_checksum(blob_path)
Stream->>Files: Read file in chunks
Stream-->>Verify: Computed checksum
Verify->>Manifest: Compare checksum to expected
alt Mismatch
Verify-->>Verify: Record BlobFileCorrupted
end
end
Verify-->>User: IntegrityReport (sst_files_checked, blob_files_checked, errors)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Pull request overview
This PR introduces an integrity-verification capability for lsm-tree that scans all on-disk SST and blob files, recomputes their full-file XXH3 128-bit checksums in a streaming fashion, and reports any mismatches or I/O failures back to the caller.
Changes:
- Added
src/verify.rswithverify::verify_integrity(&impl AbstractTree) -> IntegrityReport, plusIntegrityError/IntegrityReporttypes and streaming checksum calculation. - Exported the new
verifymodule fromsrc/lib.rs. - Added integration tests covering clean trees, SST/blob corruption detection, missing-file I/O errors, and
Display/Error::sourcebehavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/verify.rs |
New integrity verification module: streaming checksum recomputation + structured reporting. |
src/lib.rs |
Publicly exports the new verify module. |
tests/verify_integrity.rs |
New integration test suite for integrity verification scenarios and error formatting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/config/mod.rs (1)
304-315: 🧹 Nitpick | 🔵 TrivialConsider using
Fromtrait for consistency.
Default::default()usesSharedSequenceNumberGenerator::from()explicitly, butnew()usesArc::new()relying on unsized coercion. Both work correctly, but using theFromtrait consistently improves clarity.♻️ Suggested change for consistency
pub fn new<P: AsRef<Path>>( path: P, seqno: SequenceNumberCounter, visible_seqno: SequenceNumberCounter, ) -> Self { Self { path: absolute_path(path.as_ref()), - seqno: Arc::new(seqno), - visible_seqno: Arc::new(visible_seqno), + seqno: SharedSequenceNumberGenerator::from(seqno), + visible_seqno: SharedSequenceNumberGenerator::from(visible_seqno), ..Default::default() } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config/mod.rs` around lines 304 - 315, Replace the direct Arc::new(...) construction in the Config::new constructor with the From implementation for consistency with Default::default(); specifically, change the fields seqno: Arc::new(seqno) and visible_seqno: Arc::new(visible_seqno) to use SharedSequenceNumberGenerator::from(seqno) and SharedSequenceNumberGenerator::from(visible_seqno) (or the appropriate SequenceNumberCounter::from(...) if that type implements From) so both new() and Default use the From trait consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/copilot-instructions.md:
- Around line 36-41: The fenced code block containing the template
"<type>(scope): <description>" is missing a language tag which trips
markdownlint MD040; update that block to include a language identifier (e.g.,
add ```text before "<type>(scope): <description>" and close with ``` after the
list) so the block becomes a labeled fenced code block and satisfies the linter.
In @.github/workflows/coordinode-ci.yml:
- Around line 37-54: Replace the explicit pinned ref for the rust toolchain
action to a generic branch ref and keep the matrix-driven toolchain parameter:
locate the uses entry referencing dtolnay/rust-toolchain@stable and change the
ref (for example to dtolnay/rust-toolchain@main or `@master`) while retaining the
with: toolchain: ${{ matrix.rust_version }} block so the selected toolchain
still comes from the matrix.
In @.github/workflows/upstream-monitor.yml:
- Around line 37-41: The conditional in the step "Try merge and create PR or
issue" uses steps.check.outputs.behind > 0 which does string comparison in
GitHub Actions; change it to perform an explicit numeric comparison by parsing
the output to a number (e.g., using fromJSON or converting to an int) before
comparing so that steps.check.outputs.behind is compared numerically (refer to
the step name "Try merge and create PR or issue" and the output key
steps.check.outputs.behind).
- Around line 53-67: The heredoc uses a quoted delimiter <<'EOF' (which prevents
shell expansion) but also references $BEHIND, causing inconsistency; update the
gh pr create block to use the GitHub Actions expression consistently by
replacing the $BEHIND shell variable with the pre-evaluated expression ${{
steps.check.outputs.behind }} (keep the <<'EOF' heredoc) so the commits-behind
value is reliably expanded, and ensure the gh pr create command and its heredoc
remain intact.
In `@src/seqno.rs`:
- Around line 245-250: Add a short doc-comment to clarify that SeqNo::fetch_max
silently clamps inputs to MAX_SEQNO (to avoid reserved MSB range) while
SeqNo::set will panic on out-of-range values; update the comments above the
fetch_max method (and optionally above set) to explicitly state this behavioral
difference so callers know fetch_max tolerates and clamps recovery/overflow
values whereas set enforces range and panics.
---
Outside diff comments:
In `@src/config/mod.rs`:
- Around line 304-315: Replace the direct Arc::new(...) construction in the
Config::new constructor with the From implementation for consistency with
Default::default(); specifically, change the fields seqno: Arc::new(seqno) and
visible_seqno: Arc::new(visible_seqno) to use
SharedSequenceNumberGenerator::from(seqno) and
SharedSequenceNumberGenerator::from(visible_seqno) (or the appropriate
SequenceNumberCounter::from(...) if that type implements From) so both new() and
Default use the From trait consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ba34b925-bbc0-41a0-963a-c93f61f38b09
⛔ Files ignored due to path filters (1)
assets/usdt-qr.svgis excluded by!**/*.svg
📒 Files selected for processing (43)
.github/copilot-instructions.md.github/dependabot.yml.github/instructions/code-review.instructions.md.github/workflows/coordinode-ci.yml.github/workflows/coordinode-release.yml.github/workflows/upstream-monitor.yml.release-plz.tomlCargo.tomlREADME.mdclippy.tomlsrc/abstract_tree.rssrc/blob_tree/mod.rssrc/compaction/leveled/mod.rssrc/compaction/leveled/test.rssrc/compaction/worker.rssrc/compression.rssrc/config/mod.rssrc/error.rssrc/lib.rssrc/manifest.rssrc/seqno.rssrc/slice/slice_bytes/mod.rssrc/table/block/mod.rssrc/table/data_block/iter.rssrc/table/data_block/iter_test.rssrc/table/data_block/mod.rssrc/table/filter/mod.rssrc/table/iter.rssrc/table/mod.rssrc/table/util.rssrc/tree/mod.rssrc/verify.rssrc/version/mod.rssrc/version/run.rssrc/version/super_version.rssrc/vlog/blob_file/reader.rssrc/vlog/blob_file/writer.rssrc/vlog/mod.rstests/custom_seqno_generator.rstests/ingestion_seqno.rstests/multi_get.rstests/tree_contains_prefix.rstests/verify_integrity.rs
Add a public verify module with verify_integrity() that streams full-file xxh3 checksums over all segment and blob files in a tree, comparing them against the checksums stored in the version manifest. This enables detection of silent bit-rot, partial writes, and other on-disk corruption without reading individual blocks. Returns IntegrityReport with per-file pass/fail results and detailed IntegrityError variants. Implements std::error::Error for ergonomic error handling. Closes #187
36644eb to
baf6975
Compare
|
Branch cleaned up. The previous state of this branch accidentally included a merge of the fork's What was done:
|
There was a problem hiding this comment.
Pull request overview
Adds a new public integrity-verification API to the lsm_tree crate that scans all on-disk SST and blob files in a tree and validates their full-file XXH3-128 checksums against the manifest, returning a structured report of any corruption or I/O failures.
Changes:
- Introduce
verify::verify_integrity(&impl AbstractTree) -> IntegrityReportplusIntegrityError/IntegrityReporttypes. - Implement streaming full-file checksum computation to avoid loading entire files into memory.
- Add integration tests covering clean trees, SST/blob corruption detection, missing files, and
Display/Errorbehavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/verify.rs |
New verification module: streaming checksum + report/error types. |
src/lib.rs |
Exposes the new verify module as part of the public API. |
tests/verify_integrity.rs |
Integration tests validating correctness across corruption and missing-file scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Replace manual read loop + buffer with BufReader::fill_buf() - Eliminates per-file 64 KiB allocation (BufReader owns the buffer) - Removes silent no-op on theoretical get(..n) failure — fill_buf() returns slices directly, errors propagate via ?
There was a problem hiding this comment.
Pull request overview
Adds a new public verify module that can scan an AbstractTree’s on-disk SST + blob files, recompute full-file XXH3-128 checksums in a streaming manner, and return a structured IntegrityReport with per-file failures.
Changes:
- Introduces
verify::verify_integrity(&impl AbstractTree) -> IntegrityReportplus supportingIntegrityError/IntegrityReporttypes. - Implements streaming full-file checksum computation (
stream_checksum) to avoid loading entire files into memory. - Adds integration tests covering clean trees, corruption detection, missing files, and
Display/Error::sourcebehavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/verify.rs |
New integrity verification API (report + error types + streaming checksum + verification logic). |
src/lib.rs |
Exposes the new verify module from the crate root. |
tests/verify_integrity.rs |
New integration test suite for integrity verification across SST + blob scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| table_id: TableId, | ||
| /// Path to the corrupted file | ||
| path: PathBuf, | ||
| /// Checksum stored in the manifest | ||
| expected: Checksum, | ||
| /// Checksum computed from disk | ||
| got: Checksum, | ||
| }, | ||
|
|
||
| /// Full-file checksum mismatch for a blob file. | ||
| BlobFileCorrupted { | ||
| /// Blob file ID | ||
| blob_file_id: u64, | ||
| /// Path to the corrupted file | ||
| path: PathBuf, | ||
| /// Checksum stored in the manifest | ||
| expected: Checksum, | ||
| /// Checksum computed from disk | ||
| got: Checksum, | ||
| }, | ||
|
|
||
| /// I/O error while reading a file during verification. | ||
| IoError { | ||
| /// Path to the file that could not be read | ||
| path: PathBuf, | ||
| /// The underlying I/O error | ||
| error: std::io::Error, |
| BlobFileCorrupted { | ||
| /// Blob file ID | ||
| blob_file_id: u64, | ||
| /// Path to the corrupted file | ||
| path: PathBuf, | ||
| /// Checksum stored in the manifest | ||
| expected: Checksum, | ||
| /// Checksum computed from disk | ||
| got: Checksum, | ||
| }, |
| error: std::io::Error, | ||
| }, | ||
| } | ||
|
|
| } | ||
|
|
||
| // Verify all blob files | ||
| for blob_file in version.blob_files.iter() { |
Summary
verify::verify_integrity()— a module-level function that accepts any&impl AbstractTreeand streams full-file xxh3 checksums over all segments and blob filesIntegrityReportwith per-file pass/fail results and detailed error variantsstd::error::ErrorforIntegrityErrorfor ergonomic error handlingTest plan
verify_integritycovering intact tree, corrupted segment, and corrupted blob file scenarioscargo test --all-featuresgreenCloses #187
Supersedes #259 (clean rebased branch — sorry about the mess in that one).
Summary by CodeRabbit
New Features
Tests