feat: add verify_integrity() for full-file checksum verification by polaz · Pull Request #275 · fjall-rs/lsm-tree

polaz · 2026-03-16T13:45:43Z

Summary

- Add verify::verify_integrity(), a module-level function that accepts any &impl AbstractTree and streams full-file xxh3 checksums over all segments and blob files.
- Returns IntegrityReport with per-file pass/fail results and detailed error variants.
- Streaming verification: reads files in chunks without loading entire files into memory.
- Implements std::error::Error for IntegrityError for ergonomic error handling.
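
To illustrate the last point, here is a minimal sketch of how an error enum shaped like the one in this PR could implement Display and std::error::Error. The variant and field names follow the diff excerpts in the review comments further down this page; the type name `IntegrityErrorSketch`, the `u64`/`u128` placeholder aliases, and the message wording are assumptions made for illustration, not the PR's actual code.

```rust
use std::error::Error;
use std::fmt;
use std::io;
use std::path::PathBuf;

// Placeholder aliases (assumption): the PR uses the crate's own TableId and Checksum types.
type TableId = u64;
type Checksum = u128;

/// Hypothetical stand-in for the PR's IntegrityError enum.
#[derive(Debug)]
enum IntegrityErrorSketch {
    SstFileCorrupted { table_id: TableId, path: PathBuf, expected: Checksum, got: Checksum },
    BlobFileCorrupted { blob_file_id: u64, path: PathBuf, expected: Checksum, got: Checksum },
    IoError { path: PathBuf, error: io::Error },
}

impl fmt::Display for IntegrityErrorSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Message wording is illustrative only.
        match self {
            Self::SstFileCorrupted { table_id, path, expected, got } => write!(
                f,
                "SST file {table_id} at {} is corrupted: expected {expected:032x}, got {got:032x}",
                path.display()
            ),
            Self::BlobFileCorrupted { blob_file_id, path, expected, got } => write!(
                f,
                "blob file {blob_file_id} at {} is corrupted: expected {expected:032x}, got {got:032x}",
                path.display()
            ),
            Self::IoError { path, error } => {
                write!(f, "I/O error while reading {}: {error}", path.display())
            }
        }
    }
}

impl Error for IntegrityErrorSketch {
    // Only the I/O variant wraps an underlying error; checksum mismatches have no source.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::IoError { error, .. } => Some(error),
            _ => None,
        }
    }
}
```

With an implementation along these lines, callers can box the error, propagate it with `?` in functions returning `Box<dyn Error>`, and walk `source()` to reach the underlying `std::io::Error`.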

Test plan

- New integration test verify_integrity covering intact tree, corrupted segment, and corrupted blob file scenarios
- All existing tests pass
- Full cargo test --all-features green

Closes #187

Supersedes #259 (clean rebased branch — sorry about the mess in that one).

Summary by CodeRabbit

New Features
- Added integrity verification to check health of SST and blob files and produce detailed reports with per-file counts and aggregated errors.
Tests
- Added a comprehensive test suite covering corruption detection, missing files, multi-file scenarios, error reporting, and display/source behavior.

coderabbitai · 2026-03-16T13:45:50Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c0b7da72-aa21-46b3-9020-beba2420d0ff

📥 Commits

Reviewing files that changed from the base of the PR and between baf6975 and 8e623a9.

📒 Files selected for processing (1)

src/verify.rs

📝 Walkthrough

Walkthrough

Adds a new public verify module that performs streaming XXH3-128 integrity checks over SST and blob files, compares results to manifest expectations, and returns an aggregated report listing per-file counts and any corruption or IO errors.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Module Export `src/lib.rs` | Exported new public `verify` module (doc-comment added). |
| Verification Implementation `src/verify.rs` | New integrity verification module: `IntegrityError` enum (`SstFileCorrupted`, `BlobFileCorrupted`, `IoError`), `IntegrityReport` struct, `verify_integrity(tree: &impl crate::AbstractTree) -> IntegrityReport`, and `stream_checksum()` helper implementing streaming XXH3-128 checksums and full-file scan with error aggregation. |
| Integration Tests `tests/verify_integrity.rs` | New comprehensive integration tests covering clean checks, SST/blob corruption detection, missing-file IO errors, multiple-file handling, and Display/Error trait behavior validations. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Verify as verify_integrity()
    participant Manifest as Tree/Manifest
    participant Files as SST/Blob Files
    participant Stream as stream_checksum()

    User->>Verify: verify_integrity(tree)
    Verify->>Manifest: Read manifest metadata / enumerate files
    loop For each SST table
        Verify->>Stream: stream_checksum(sst_path)
        Stream->>Files: Read file in chunks
        Stream-->>Verify: Computed checksum
        Verify->>Manifest: Compare checksum to expected
        alt Mismatch
            Verify-->>Verify: Record SstFileCorrupted
        end
    end
    loop For each blob file
        Verify->>Stream: stream_checksum(blob_path)
        Stream->>Files: Read file in chunks
        Stream-->>Verify: Computed checksum
        Verify->>Manifest: Compare checksum to expected
        alt Mismatch
            Verify-->>Verify: Record BlobFileCorrupted
        end
    end
    Verify-->>User: IntegrityReport (sst_files_checked, blob_files_checked, errors)
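
The loop in the diagram above can be sketched as follows. This is a simplified illustration, not the PR's implementation: `ReportSketch`, `IntegrityErrorSketch`, and `verify_files_sketch` are hypothetical names, checksums are plain `u64` values, the manifest is modelled as a slice of `(path, expected checksum)` pairs, and the checksum routine is passed in as a closure. Whether the real report counts files that failed with an I/O error as "checked" is not stated on this page; the sketch assumes it does.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical, simplified error type: one mismatch variant instead of separate SST/blob ones.
#[derive(Debug)]
enum IntegrityErrorSketch {
    Corrupted { path: PathBuf, expected: u64, got: u64 },
    Io { path: PathBuf, error: io::Error },
}

/// Hypothetical report: a counter plus the aggregated errors.
#[derive(Debug, Default)]
struct ReportSketch {
    files_checked: usize,
    errors: Vec<IntegrityErrorSketch>,
}

/// Checks every manifest entry and keeps going after a failure,
/// so a single bad file does not hide problems in later files.
fn verify_files_sketch<F>(manifest: &[(PathBuf, u64)], compute: F) -> ReportSketch
where
    F: Fn(&Path) -> io::Result<u64>,
{
    let mut report = ReportSketch::default();
    for (path, expected) in manifest {
        match compute(path) {
            // Checksum matches the manifest: nothing to record.
            Ok(got) if got == *expected => {}
            // Checksum differs: record a corruption error.
            Ok(got) => report.errors.push(IntegrityErrorSketch::Corrupted {
                path: path.clone(),
                expected: *expected,
                got,
            }),
            // File could not be read: record the I/O error and continue.
            Err(error) => report.errors.push(IntegrityErrorSketch::Io {
                path: path.clone(),
                error,
            }),
        }
        // Assumption: every attempted file counts as checked.
        report.files_checked += 1;
    }
    report
}
```

Collecting errors into the report instead of returning on the first failure matches the "error aggregation" behaviour described in the walkthrough: the caller gets one report covering all files.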

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a verify_integrity() function for full-file checksum verification, which is the primary focus of the changeset. |
| Linked Issues check | ✅ Passed | The PR fully implements issue `#187` requirements: iterates all tables and blob files, computes XXH3 checksums, compares against manifest checksums, and reports detailed per-file verification results with corruption detection. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to integrity verification: public verify module, IntegrityError/IntegrityReport types, verify_integrity() function, and comprehensive integration tests; no unrelated modifications present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


Copilot

Pull request overview

This PR introduces an integrity-verification capability for lsm-tree that scans all on-disk SST and blob files, recomputes their full-file XXH3 128-bit checksums in a streaming fashion, and reports any mismatches or I/O failures back to the caller.

Changes:

Added src/verify.rs with verify::verify_integrity(&impl AbstractTree) -> IntegrityReport, plus IntegrityError/IntegrityReport types and streaming checksum calculation.
Exported the new verify module from src/lib.rs.
Added integration tests covering clean trees, SST/blob corruption detection, missing-file I/O errors, and Display/Error::source behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| `src/verify.rs` | New integrity verification module: streaming checksum recomputation + structured reporting. |
| `src/lib.rs` | Publicly exports the new `verify` module. |
| `tests/verify_integrity.rs` | New integration test suite for integrity verification scenarios and error formatting. |


coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/config/mod.rs (1)

304-315: 🧹 Nitpick | 🔵 Trivial

Consider using From trait for consistency.

Default::default() uses SharedSequenceNumberGenerator::from() explicitly, but new() uses Arc::new() relying on unsized coercion. Both work correctly, but using the From trait consistently improves clarity.

♻️ Suggested change for consistency

     pub fn new<P: AsRef<Path>>(
         path: P,
         seqno: SequenceNumberCounter,
         visible_seqno: SequenceNumberCounter,
     ) -> Self {
         Self {
             path: absolute_path(path.as_ref()),
-            seqno: Arc::new(seqno),
-            visible_seqno: Arc::new(visible_seqno),
+            seqno: SharedSequenceNumberGenerator::from(seqno),
+            visible_seqno: SharedSequenceNumberGenerator::from(visible_seqno),
             ..Default::default()
         }
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/config/mod.rs` around lines 304 - 315, Replace the direct Arc::new(...)
construction in the Config::new constructor with the From implementation for
consistency with Default::default(); specifically, change the fields seqno:
Arc::new(seqno) and visible_seqno: Arc::new(visible_seqno) to use
SharedSequenceNumberGenerator::from(seqno) and
SharedSequenceNumberGenerator::from(visible_seqno) (or the appropriate
SequenceNumberCounter::from(...) if that type implements From) so both new() and
Default use the From trait consistently.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/copilot-instructions.md:
- Around line 36-41: The fenced code block containing the template
"<type>(scope): <description>" is missing a language tag which trips
markdownlint MD040; update that block to include a language identifier (e.g.,
add ```text before "<type>(scope): <description>" and close with ``` after the
list) so the block becomes a labeled fenced code block and satisfies the linter.

In @.github/workflows/coordinode-ci.yml:
- Around line 37-54: Replace the explicit pinned ref for the rust toolchain
action to a generic branch ref and keep the matrix-driven toolchain parameter:
locate the uses entry referencing dtolnay/rust-toolchain@stable and change the
ref (for example to dtolnay/rust-toolchain@main or `@master`) while retaining the
with: toolchain: ${{ matrix.rust_version }} block so the selected toolchain
still comes from the matrix.

In @.github/workflows/upstream-monitor.yml:
- Around line 37-41: The conditional in the step "Try merge and create PR or
issue" uses steps.check.outputs.behind > 0 which does string comparison in
GitHub Actions; change it to perform an explicit numeric comparison by parsing
the output to a number (e.g., using fromJSON or converting to an int) before
comparing so that steps.check.outputs.behind is compared numerically (refer to
the step name "Try merge and create PR or issue" and the output key
steps.check.outputs.behind).
- Around line 53-67: The heredoc uses a quoted delimiter <<'EOF' (which prevents
shell expansion) but also references $BEHIND, causing inconsistency; update the
gh pr create block to use the GitHub Actions expression consistently by
replacing the $BEHIND shell variable with the pre-evaluated expression ${{
steps.check.outputs.behind }} (keep the <<'EOF' heredoc) so the commits-behind
value is reliably expanded, and ensure the gh pr create command and its heredoc
remain intact.

In `@src/seqno.rs`:
- Around line 245-250: Add a short doc-comment to clarify that SeqNo::fetch_max
silently clamps inputs to MAX_SEQNO (to avoid reserved MSB range) while
SeqNo::set will panic on out-of-range values; update the comments above the
fetch_max method (and optionally above set) to explicitly state this behavioral
difference so callers know fetch_max tolerates and clamps recovery/overflow
values whereas set enforces range and panics.

---

Outside diff comments:
In `@src/config/mod.rs`:
- Around line 304-315: Replace the direct Arc::new(...) construction in the
Config::new constructor with the From implementation for consistency with
Default::default(); specifically, change the fields seqno: Arc::new(seqno) and
visible_seqno: Arc::new(visible_seqno) to use
SharedSequenceNumberGenerator::from(seqno) and
SharedSequenceNumberGenerator::from(visible_seqno) (or the appropriate
SequenceNumberCounter::from(...) if that type implements From) so both new() and
Default use the From trait consistently.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ba34b925-bbc0-41a0-963a-c93f61f38b09

📥 Commits

Reviewing files that changed from the base of the PR and between aae89a0 and 36644eb.

⛔ Files ignored due to path filters (1)

assets/usdt-qr.svg is excluded by !**/*.svg

📒 Files selected for processing (43)

.github/copilot-instructions.md
.github/dependabot.yml
.github/instructions/code-review.instructions.md
.github/workflows/coordinode-ci.yml
.github/workflows/coordinode-release.yml
.github/workflows/upstream-monitor.yml
.release-plz.toml
Cargo.toml
README.md
clippy.toml
src/abstract_tree.rs
src/blob_tree/mod.rs
src/compaction/leveled/mod.rs
src/compaction/leveled/test.rs
src/compaction/worker.rs
src/compression.rs
src/config/mod.rs
src/error.rs
src/lib.rs
src/manifest.rs
src/seqno.rs
src/slice/slice_bytes/mod.rs
src/table/block/mod.rs
src/table/data_block/iter.rs
src/table/data_block/iter_test.rs
src/table/data_block/mod.rs
src/table/filter/mod.rs
src/table/iter.rs
src/table/mod.rs
src/table/util.rs
src/tree/mod.rs
src/verify.rs
src/version/mod.rs
src/version/run.rs
src/version/super_version.rs
src/vlog/blob_file/reader.rs
src/vlog/blob_file/writer.rs
src/vlog/mod.rs
tests/custom_seqno_generator.rs
tests/ingestion_seqno.rs
tests/multi_get.rs
tests/tree_contains_prefix.rs
tests/verify_integrity.rs

Add a public verify module with verify_integrity() that streams full-file xxh3 checksums over all segment and blob files in a tree, comparing them against the checksums stored in the version manifest. This enables detection of silent bit-rot, partial writes, and other on-disk corruption without reading individual blocks. Returns IntegrityReport with per-file pass/fail results and detailed IntegrityError variants. Implements std::error::Error for ergonomic error handling. Closes #187

polaz · 2026-03-16T15:26:17Z

Branch cleaned up. The previous state of this branch accidentally included a merge of the fork's origin/main instead of upstream/main, which pulled in the entire fork history (dependabot configs, copilot instructions, etc.).

What was done:

- Created a fresh branch from upstream/main
- Cherry-picked only the original clean commit (20568354)
- Verified git diff upstream/main shows only the 3 expected files: src/lib.rs, src/verify.rs, tests/verify_integrity.rs
- Verified no fork-specific code leaked (no zstd, coordinode, dependabot, or upstream-monitor references)
- All tests pass (cargo test --all-features: 0 failures)
- Force-pushed to replace the corrupted branch history

Copilot

Pull request overview

Adds a new public integrity-verification API to the lsm_tree crate that scans all on-disk SST and blob files in a tree and validates their full-file XXH3-128 checksums against the manifest, returning a structured report of any corruption or I/O failures.

Changes:

Introduce verify::verify_integrity(&impl AbstractTree) -> IntegrityReport plus IntegrityError/IntegrityReport types.
Implement streaming full-file checksum computation to avoid loading entire files into memory.
Add integration tests covering clean trees, SST/blob corruption detection, missing files, and Display/Error behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `src/verify.rs` | New verification module: streaming checksum + report/error types. |
| `src/lib.rs` | Exposes the new `verify` module as part of the public API. |
| `tests/verify_integrity.rs` | Integration tests validating correctness across corruption and missing-file scenarios. |


- Replace manual read loop + buffer with BufReader::fill_buf()
- Eliminates per-file 64 KiB allocation (BufReader owns the buffer)
- Removes silent no-op on theoretical get(..n) failure: fill_buf() returns slices directly, errors propagate via ?
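
A minimal sketch of the fill_buf()-based read loop this commit message describes. To stay free of external crates it swaps the PR's XXH3-128 hasher for a small FNV-1a 64-bit hasher; `Fnv1a` and `stream_checksum_sketch` are hypothetical names, and the 64 KiB capacity simply mirrors the buffer size mentioned above. The real `stream_checksum()` in src/verify.rs may differ.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Stand-in streaming hasher (FNV-1a, 64-bit). The PR itself uses XXH3-128.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Self(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }

    fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn digest(&self) -> u64 {
        self.0
    }
}

/// Hashes everything `reader` yields without holding the whole input in memory.
fn stream_checksum_sketch<R: Read>(reader: R) -> io::Result<u64> {
    // BufReader owns the buffer, so no separate chunk buffer is allocated here.
    let mut reader = BufReader::with_capacity(64 * 1024, reader);
    let mut hasher = Fnv1a::new();
    loop {
        // fill_buf() returns the currently buffered bytes; I/O errors propagate via `?`.
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break; // an empty slice means end of input
        }
        hasher.update(chunk);
        let consumed = chunk.len();
        // Mark the bytes as used so the next fill_buf() call reads fresh data.
        reader.consume(consumed);
    }
    Ok(hasher.digest())
}
```

For a file on disk, the call would look like `stream_checksum_sketch(std::fs::File::open(path)?)`. Because the hasher is fed incrementally, the result is the same as hashing the whole input in one call.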

Copilot

Pull request overview

Adds a new public verify module that can scan an AbstractTree’s on-disk SST + blob files, recompute full-file XXH3-128 checksums in a streaming manner, and return a structured IntegrityReport with per-file failures.

Changes:

Introduces verify::verify_integrity(&impl AbstractTree) -> IntegrityReport plus supporting IntegrityError / IntegrityReport types.
Implements streaming full-file checksum computation (stream_checksum) to avoid loading entire files into memory.
Adds integration tests covering clean trees, corruption detection, missing files, and Display/Error::source behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `src/verify.rs` | New integrity verification API (report + error types + streaming checksum + verification logic). |
| `src/lib.rs` | Exposes the new `verify` module from the crate root. |
| `tests/verify_integrity.rs` | New integration test suite for integrity verification across SST + blob scenarios. |


+        table_id: TableId,
+        /// Path to the corrupted file
+        path: PathBuf,
+        /// Checksum stored in the manifest
+        expected: Checksum,
+        /// Checksum computed from disk
+        got: Checksum,
+    },
+
+    /// Full-file checksum mismatch for a blob file.
+    BlobFileCorrupted {
+        /// Blob file ID
+        blob_file_id: u64,
+        /// Path to the corrupted file
+        path: PathBuf,
+        /// Checksum stored in the manifest
+        expected: Checksum,
+        /// Checksum computed from disk
+        got: Checksum,
+    },
+
+    /// I/O error while reading a file during verification.
+    IoError {
+        /// Path to the file that could not be read
+        path: PathBuf,
+        /// The underlying I/O error
+        error: std::io::Error,


+    BlobFileCorrupted {
+        /// Blob file ID
+        blob_file_id: u64,
+        /// Path to the corrupted file
+        path: PathBuf,
+        /// Checksum stored in the manifest
+        expected: Checksum,
+        /// Checksum computed from disk
+        got: Checksum,
+    },


+        error: std::io::Error,
+    },
+}
+


+    }
+
+    // Verify all blob files
+    for blob_file in version.blob_files.iter() {


Copilot AI review requested due to automatic review settings March 16, 2026 13:45

polaz mentioned this pull request Mar 16, 2026

feat: add verify_integrity() for full-file checksum verification #259

Closed

5 tasks

Copilot started reviewing on behalf of polaz March 16, 2026 13:46 View session

Copilot AI reviewed Mar 16, 2026

View reviewed changes

Comment thread src/lib.rs

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

Comment thread .github/copilot-instructions.md Outdated

Comment thread .github/workflows/coordinode-ci.yml Outdated

Comment thread .github/workflows/upstream-monitor.yml Outdated

Comment thread .github/workflows/upstream-monitor.yml Outdated

Comment thread src/seqno.rs Outdated

polaz force-pushed the feat/verify-integrity-clean branch from 36644eb to baf6975 Compare March 16, 2026 15:25

polaz requested a review from Copilot March 16, 2026 21:21

Copilot started reviewing on behalf of polaz March 16, 2026 21:22 View session

Copilot AI reviewed Mar 16, 2026

View reviewed changes

Comment thread src/verify.rs

Comment thread src/verify.rs Outdated

polaz requested a review from Copilot March 18, 2026 01:38

Copilot started reviewing on behalf of polaz March 18, 2026 01:38 View session

Copilot AI reviewed Mar 18, 2026

View reviewed changes

polaz wants to merge 2 commits into fjall-rs:main from structured-world:feat/verify-integrity-clean
