feat: add verify_integrity() for full-file checksum verification#259
feat: add verify_integrity() for full-file checksum verification#259polaz wants to merge 116 commits intofjall-rs:mainfrom
Conversation
Adds a public verify module with verify_integrity() that scans all SST and blob files in a tree, computing XXHash-3 128-bit checksums and comparing them against the checksums stored in the version manifest. This enables detection of silent bit-rot, partial writes, and other on-disk corruption without reading individual blocks. The function returns an IntegrityReport with: - sst_files_checked / blob_files_checked counters - Per-file errors (SstFileCorrupted, BlobFileCorrupted, IoError) - is_ok() / files_checked() convenience methods Tests cover: - Clean tree verification (standard + blob tree) - SST corruption detection - Blob file corruption detection - Multiple SST files Closes fjall-rs#187
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new public verification module that streams SST and blob files, computes XXH3-128 checksums, compares them to manifest values, and returns an IntegrityReport with counts and collected IntegrityError entries. Includes tests for clean and corrupted SST/blob scenarios and missing files. Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test Suite
participant Verify as verify::verify_integrity
participant Tree as AbstractTree (manifest)
participant FS as FileSystem
participant XXH as XXH3-128
Test->>Verify: call verify_integrity(tree)
Verify->>Tree: list tables & blob files
loop for each file
Verify->>FS: open & stream file bytes
FS-->>Verify: bytes / read error
Verify->>XXH: feed bytes (streaming)
XXH-->>Verify: checksum
Verify->>Tree: read expected checksum
alt checksums match
Verify-->>Verify: increment checked count
else mismatch or IoError
Verify-->>Verify: record IntegrityError
end
end
Verify-->>Test: return IntegrityReport (counts + errors)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/verify.rs`:
- Around line 121-175: Extract the duplicated read/hash/compare/error logic in
the two loops into a single helper function (e.g., verify_file(path: &Path,
expected: Checksum, on_corrupt: impl FnOnce(Checksum)->IntegrityError,
report_counter: &mut usize, report: &mut Report) ) that performs std::fs::read,
computes Checksum::from_raw(crate::hash::hash128(&data)), compares to expected
and pushes either IntegrityError::SstFileCorrupted or
IntegrityError::BlobFileCorrupted (use the provided on_corrupt closure to create
the correct variant) or IntegrityError::IoError (with path.to_path_buf() and
e.to_string()), then increments the appropriate counter; replace the SST loop
(iter_tables(), table.path, table.checksum(), table.id(),
report.sst_files_checked) and the blob loop (version.blob_files,
blob_file.path(), blob_file.checksum(), blob_file.id(),
report.blob_files_checked) with calls to this helper so behavior and error
reporting is shared.
- Around line 108-111: Update the Rustdoc for verify_integrity to reflect that
it returns an IntegrityReport unconditionally rather than Result/Err; change the
`# Errors` section wording to explain that fatal I/O errors are recorded inside
the returned IntegrityReport (e.g., in IntegrityReport::errors) and that the
function does not return Err, referencing the function name `verify_integrity`
and the type `IntegrityReport` so readers can locate the code.
- Around line 126-128: The code currently uses std::fs::read(path) to read
entire file bytes then calls Checksum::from_raw(crate::hash::hash128(&data)),
which will OOM for large SST/blob files; change to stream the file by opening
with std::fs::File::open(path) and reading in a fixed-size buffer (e.g.,
8KB/64KB) while feeding each chunk into the incremental hashing API instead of
hash128(&data), then finalize the hasher and call Checksum::from_raw on the
resulting digest; apply the same streaming replacement wherever std::fs::read is
used for file verification (including the similar block around lines 154–156).
In `@tests/verify_integrity.rs`:
- Around line 10-16: Extract the repeated Config::new(...).open()? pattern into
a small helper function (e.g., fn open_test_tree(folder: &Path) ->
Result<YourTreeType, Error>) and replace each occurrence in tests (the blocks
using Config::new(...).open()? around SequenceNumberCounter::default()) with a
call to that helper; ensure the helper accepts the folder argument and returns
the same result type so tests remain unchanged otherwise, and update use/imports
if necessary to make the helper visible to all tests in the file.
- Around line 31-144: Add a test that deletes or makes unreachable a referenced
SST/blob file before calling verify::verify_integrity to exercise
IntegrityError::IoError: after creating and flushing the tree (reuse pattern in
verify_integrity_detect_sst_corruption and
verify_integrity_detect_blob_corruption), get the target using
version.iter_tables().next().unwrap() for SST or
version.blob_files.iter().next().unwrap() for blob, call
std::fs::remove_file(table.path()) (or blob_file.path()) to simulate an IO
failure, then call verify::verify_integrity(&tree) and assert the report is not
Ok and that report.errors contains an entry matching
verify::IntegrityError::IoError with the expected table_id/blob_file_id (use
table.id() / blob_file.id() for comparison).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1aa71f4f-c41d-43cd-bcb7-be6d611863f7
📒 Files selected for processing (3)
src/lib.rssrc/verify.rstests/verify_integrity.rs
There was a problem hiding this comment.
Pull request overview
This PR introduces an integrity verification facility to the lsm_tree crate, allowing callers to scan SST and blob files and detect on-disk corruption by comparing full-file checksums against the version manifest.
Changes:
- Added a new
verifymodule withverify_integrityplusIntegrityReport/IntegrityErrortypes. - Implemented checksum verification for both SST table files and blob (value log) files.
- Added integration tests covering clean trees, SST corruption detection, blob corruption detection, and multi-table scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
src/verify.rs |
New public API for integrity scanning and reporting checksum mismatches / I/O issues. |
src/lib.rs |
Exposes the new verify module from the crate root. |
tests/verify_integrity.rs |
New integration tests validating integrity checks for standard and blob trees. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Compute XXH3 via BufReader instead of loading entire files into memory (prevents OOM on large SST/blob files) - Store std::io::Error directly in IoError variant instead of String - Use BlobFileId type alias instead of raw u64 - Add #[non_exhaustive] to IntegrityError for forward compatibility - Fix rustdoc: remove incorrect # Errors section - Add test for missing file (IoError path)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/verify.rs`:
- Around line 8-43: Add an implementation of std::error::Error for the
IntegrityError enum: implement impl std::error::Error for IntegrityError { fn
source(&self) -> Option<&(dyn std::error::Error + 'static)> { match self {
IntegrityError::IoError { error, .. } => Some(error), _ => None, } } } so
callers can inspect the underlying I/O error via source(); keep existing
Debug/Display impls and ensure the IoError variant returns the inner
std::io::Error as the source while other variants return None.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9aa732f6-c06f-4b07-861b-3a87f6467fdd
📒 Files selected for processing (2)
src/verify.rstests/verify_integrity.rs
There was a problem hiding this comment.
Pull request overview
Adds a public integrity verification API to the lsm_tree crate that scans all SST and blob files and reports full-file checksum mismatches and I/O errors.
Changes:
- Introduces
src/verify.rswithverify_integrity()plusIntegrityReport/IntegrityErrortypes. - Exposes the new
verifymodule fromsrc/lib.rs. - Adds integration tests covering clean trees, SST/blob corruption, multiple SSTs, and missing-file behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/verify_integrity.rs | New integration tests for integrity verification behavior across SST and blob modes. |
| src/verify.rs | New verification implementation that streams files from disk and compares XXH3-128 checksums to manifest values. |
| src/lib.rs | Exposes pub mod verify; from the crate root. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
BlobFileId is defined in the private vlog module and not re-exported, so using it in the public IntegrityError enum would leak a private type to external consumers.
There was a problem hiding this comment.
Pull request overview
This PR introduces a public integrity verification API for lsm_tree that scans on-disk SST and blob files and reports checksum mismatches and I/O errors, along with a new integration test suite for the verifier.
Changes:
- Added
verifymodule withverify_integrity,IntegrityReport, andIntegrityError. - Implemented streaming XXH3-128 file checksumming to avoid loading whole files into memory.
- Added integration tests covering clean trees, corrupted SST/blob files, multiple tables, and missing SST files.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/verify.rs |
New public verifier API and streaming checksum implementation. |
src/lib.rs |
Exposes the new verify module as part of the public crate API. |
tests/verify_integrity.rs |
New integration tests validating verifier behavior across success and failure cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
Enables error chain integration (anyhow, thiserror, ? propagation). IoError variant returns the underlying std::io::Error via source().
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/verify.rs`:
- Line 139: The function verify_integrity currently returns an IntegrityReport
that can be silently ignored; add the #[must_use] attribute to the
verify_integrity function declaration (the pub fn verify_integrity(tree: &impl
crate::AbstractTree) -> IntegrityReport signature) so callers are warned if they
discard the returned IntegrityReport, preserving API ergonomics and preventing
accidental ignores.
There was a problem hiding this comment.
Pull request overview
This PR introduces an integrity verification API to scan on-disk SST and blob files, comparing full-file XXH3-128 checksums against the version manifest, along with integration tests that validate detection of corruption and missing files.
Changes:
- Add
verifymodule withverify_integrity()that streams file contents to compute/check full-file checksums. - Introduce
IntegrityError/IntegrityReporttypes to report per-file verification outcomes without aborting the scan. - Add integration tests covering clean trees, SST/blob corruption, multiple tables, and missing SST files.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/verify.rs |
New verification implementation + public report/error types for checksum validation across SST/blob files. |
src/lib.rs |
Exposes the new verify module as part of the public API. |
tests/verify_integrity.rs |
Adds integration tests exercising verification success and failure scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Pull request overview
This PR introduces a new public integrity verification facility for lsm-tree, allowing callers to scan SST and blob files and detect on-disk corruption or unreadable files, with accompanying integration tests to validate expected behavior.
Changes:
- Added
verifymodule withverify_integrityAPI that computes full-file checksums and returns a structuredIntegrityReport. - Introduced
IntegrityErrorvariants for SST corruption, blob corruption, and I/O read failures. - Added integration tests covering clean trees, induced corruption, missing files, and multi-table scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/verify_integrity.rs | New integration tests validating integrity verification across SST/blob scenarios. |
| src/verify.rs | Implements file streaming checksum verification and report/error types. |
| src/lib.rs | Exposes the new verify module publicly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Pull request overview
Adds a new public verify module to perform full-file integrity verification of SST and blob files by recomputing and comparing stored checksums, along with integration tests that exercise clean/corrupt/missing-file scenarios.
Changes:
- Introduces
src/verify.rswithIntegrityReport,IntegrityError, andverify_integrity()to scan SST/blob files and collect per-file errors. - Exposes the new module via
pub mod verify;insrc/lib.rs. - Adds
tests/verify_integrity.rsintegration tests covering OK, corruption, and missing-file cases for both standard and blob trees.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/verify_integrity.rs | New integration tests for integrity verification across multiple scenarios. |
| src/verify.rs | New implementation for streaming checksum verification and structured error reporting. |
| src/lib.rs | Exposes the new verify module publicly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Switch stream_checksum to Xxh3Default matching ChecksummedWriter - Use path reference comparison instead of deref in test assertion - Remove redundant doc comment on enum field visibility
- Document that new(MAX_SEQNO) is allowed but exhausts the counter - Remove #[allow(clippy::missing_panics_doc)] from trait impl (lint does not fire on trait impl methods, and the reason text was stale)
…insprefix feat: add optimized contains_prefix() method
…a-block-seqno-aware-seek
…bergenerator-trait
- Rename tests to follow <what>_<condition>_<expected> convention - Replace magic hex literals with MAX_SEQNO constant in tests
- Remove BufReader wrapper — 64KB app buffer makes it redundant - Add comment explaining why .get(..n) is used over direct indexing
…ock-seqno-aware-seek perf: seqno-aware seek in data block point reads
…bergenerator-trait
- Add Display format tests for all three IntegrityError variants - Test Error::source returns None for corruption, Some for IoError - Add missing blob file test covering blob IoError path
- next() doc: explicitly state pre-increment (fetch-then-add) return - get() doc: clarify it returns value that next next() will return
…structured-world/lsm-tree into feat/#174-sequencenumbergenerator-trait
feat: add multi_get() for batch point reads
… feat/#187-verify-integrity
Retry on ErrorKind::Interrupted instead of propagating — signals can interrupt read() syscalls on Unix systems.
…diagnostics - Use #[should_panic(expected = "...")] instead of shorthand = form - Trait doc: clarify get() must not return reserved values, set/fetch_max must not store them (previously conflated get with mutation methods) - OffsetGenerator: include current value in panic message
…structured-world/lsm-tree into feat/#174-sequencenumbergenerator-trait
…generator-trait feat: add SequenceNumberGenerator trait
|
Sorry about the noise here — the branch accumulated unrelated changes from our fork. Opened a clean replacement rebased on upstream main: #275 |
Summary
Adds a
verifymodule withverify_integrity()that scans all SST and blob files in a tree, computing XXHash-3 128-bit checksums and comparing them against the checksums stored in the version manifest.This enables detection of silent bit-rot, partial writes, and other on-disk corruption.
API
IntegrityReport:sst_files_checked/blob_files_checked— counterserrors: Vec<IntegrityError>— per-file errorsis_ok()—trueif no errorsfiles_checked()— total filesIntegrityErrorvariants:SstFileCorrupted { table_id, path, expected, got }BlobFileCorrupted { blob_file_id, path, expected, got }IoError { path, error }Test plan
Closes #187
Summary by CodeRabbit
New Features
Tests