feat: add verify_integrity() for full-file checksum verification #275

Open
polaz wants to merge 2 commits into fjall-rs:main from structured-world:feat/verify-integrity-clean
Conversation

polaz commented Mar 16, 2026

Summary

  • Add verify::verify_integrity() — a module-level function that accepts any &impl AbstractTree and streams full-file xxh3 checksums over all segments and blob files
  • Returns IntegrityReport with per-file pass/fail results and detailed error variants
  • Streaming verification — reads files in chunks without loading entire files into memory
  • Implements std::error::Error for IntegrityError for ergonomic error handling
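
As a rough illustration of the last bullet, here is a minimal sketch of an error enum with the three variants named in this PR, implementing Display and std::error::Error. The field names follow the excerpts quoted in the review threads; the `TableId` and `Checksum` aliases, the message wording, and the choice to expose the I/O error through `source()` are assumptions made for this sketch, not the PR's actual code.

```rust
use std::fmt;
use std::path::PathBuf;

// Stand-ins for the crate's own types (assumptions for this sketch).
type TableId = u64;
type Checksum = u128;

/// One failed check, mirroring the variants described in the PR.
#[derive(Debug)]
pub enum IntegrityError {
    SstFileCorrupted { table_id: TableId, path: PathBuf, expected: Checksum, got: Checksum },
    BlobFileCorrupted { blob_file_id: u64, path: PathBuf, expected: Checksum, got: Checksum },
    IoError { path: PathBuf, error: std::io::Error },
}

impl fmt::Display for IntegrityError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The message wording is illustrative only.
        match self {
            Self::SstFileCorrupted { table_id, path, expected, got } => write!(
                f,
                "SST file {} ({}) corrupted: expected {:032x}, got {:032x}",
                table_id,
                path.display(),
                expected,
                got
            ),
            Self::BlobFileCorrupted { blob_file_id, path, expected, got } => write!(
                f,
                "blob file {} ({}) corrupted: expected {:032x}, got {:032x}",
                blob_file_id,
                path.display(),
                expected,
                got
            ),
            Self::IoError { path, error } => {
                write!(f, "I/O error reading {}: {}", path.display(), error)
            }
        }
    }
}

impl std::error::Error for IntegrityError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        // Only the I/O variant wraps an underlying error.
        match self {
            Self::IoError { error, .. } => Some(error),
            _ => None,
        }
    }
}
```

With both traits in place, a caller can box these values as `Box<dyn std::error::Error>` or walk the `source()` chain to reach the original `std::io::Error`.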

Test plan

  • New integration test verify_integrity covering intact tree, corrupted segment, and corrupted blob file scenarios
  • All existing tests pass
  • Full cargo test --all-features green
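
The corrupted-segment scenario can be approximated outside the crate with a sketch like the one below: record a checksum, flip one byte on disk, and confirm the recomputed checksum no longer matches. The helper names `corrupt_byte` and `checksum_file` are hypothetical, and the FNV-1a hash is a std-only stand-in for the XXH3 checksum the PR uses. For brevity this test helper reads the whole file at once, which the PR's streaming implementation deliberately avoids.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Stand-in 64-bit FNV-1a hash (the PR uses XXH3 instead).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical test helper: inverts every bit of the byte at `offset`.
fn corrupt_byte(path: &Path, offset: usize) -> io::Result<()> {
    let mut data = fs::read(path)?;
    let byte = data.get_mut(offset).ok_or_else(|| {
        io::Error::new(io::ErrorKind::UnexpectedEof, "offset past end of file")
    })?;
    *byte = !*byte;
    fs::write(path, data)
}

/// Hypothetical test helper: checksum of the whole file, read in one go.
fn checksum_file(path: &Path) -> io::Result<u64> {
    Ok(fnv1a(&fs::read(path)?))
}
```

A test would write a file, store `checksum_file(&path)` as the expected value, call `corrupt_byte(&path, 3)`, and assert that the new checksum differs from the stored one.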

Closes #187

Supersedes #259 (clean rebased branch — sorry about the mess in that one).

Summary by CodeRabbit

  • New Features

    • Added integrity verification to check health of SST and blob files and produce detailed reports with per-file counts and aggregated errors.
  • Tests

    • Added a comprehensive test suite covering corruption detection, missing files, multi-file scenarios, error reporting, and display/source behavior.

Copilot AI review requested due to automatic review settings March 16, 2026 13:45
coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c0b7da72-aa21-46b3-9020-beba2420d0ff

📥 Commits

Reviewing files that changed from the base of the PR and between baf6975 and 8e623a9.

📒 Files selected for processing (1)
  • src/verify.rs

📝 Walkthrough

Walkthrough

Adds a new public verify module that performs streaming XXH3-128 integrity checks over SST and blob files, compares results to manifest expectations, and returns an aggregated report listing per-file counts and any corruption or IO errors.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Module Export (src/lib.rs) | Exported new public verify module (doc-comment added). |
| Verification Implementation (src/verify.rs) | New integrity verification module: IntegrityError enum (SstFileCorrupted, BlobFileCorrupted, IoError), IntegrityReport struct, verify_integrity(tree: &impl crate::AbstractTree) -> IntegrityReport, and stream_checksum() helper implementing streaming XXH3-128 checksums and full-file scan with error aggregation. |
| Integration Tests (tests/verify_integrity.rs) | New comprehensive integration tests covering clean checks, SST/blob corruption detection, missing-file IO errors, multiple-file handling, and Display/Error trait behavior validations. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Verify as verify_integrity()
    participant Manifest as Tree/Manifest
    participant Files as SST/Blob Files
    participant Stream as stream_checksum()

    User->>Verify: verify_integrity(tree)
    Verify->>Manifest: Read manifest metadata / enumerate files
    loop For each SST table
        Verify->>Stream: stream_checksum(sst_path)
        Stream->>Files: Read file in chunks
        Stream-->>Verify: Computed checksum
        Verify->>Manifest: Compare checksum to expected
        alt Mismatch
            Verify-->>Verify: Record SstFileCorrupted
        end
    end
    loop For each blob file
        Verify->>Stream: stream_checksum(blob_path)
        Stream->>Files: Read file in chunks
        Stream-->>Verify: Computed checksum
        Verify->>Manifest: Compare checksum to expected
        alt Mismatch
            Verify-->>Verify: Record BlobFileCorrupted
        end
    end
    Verify-->>User: IntegrityReport (sst_files_checked, blob_files_checked, errors)
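
The flow in the diagram can be sketched in self-contained Rust as follows. This is a sketch under stated assumptions rather than the PR's implementation: `FileEntry` and `verify_files` are hypothetical stand-ins for the manifest metadata and for verify_integrity(), the checksum is passed in as a closure and narrowed to u64, and counting a file as "checked" even when reading it fails is a guess. The report field names come from the diagram; `is_ok()` is a hypothetical convenience.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical manifest entry: file location plus the recorded checksum.
pub struct FileEntry {
    pub id: u64,
    pub path: PathBuf,
    pub expected: u64,
}

#[derive(Debug)]
pub enum IntegrityError {
    SstFileCorrupted { table_id: u64, path: PathBuf, expected: u64, got: u64 },
    BlobFileCorrupted { blob_file_id: u64, path: PathBuf, expected: u64, got: u64 },
    IoError { path: PathBuf, error: io::Error },
}

#[derive(Debug, Default)]
pub struct IntegrityReport {
    pub sst_files_checked: usize,
    pub blob_files_checked: usize,
    pub errors: Vec<IntegrityError>,
}

impl IntegrityReport {
    /// Hypothetical helper: true when no file failed verification.
    pub fn is_ok(&self) -> bool {
        self.errors.is_empty()
    }
}

/// Checks every file and records failures instead of stopping at the first.
pub fn verify_files(
    tables: &[FileEntry],
    blob_files: &[FileEntry],
    checksum: impl Fn(&Path) -> io::Result<u64>,
) -> IntegrityReport {
    let mut report = IntegrityReport::default();

    for t in tables {
        match checksum(&t.path) {
            Ok(got) if got == t.expected => {}
            Ok(got) => report.errors.push(IntegrityError::SstFileCorrupted {
                table_id: t.id,
                path: t.path.clone(),
                expected: t.expected,
                got,
            }),
            Err(error) => report
                .errors
                .push(IntegrityError::IoError { path: t.path.clone(), error }),
        }
        report.sst_files_checked += 1;
    }

    for b in blob_files {
        match checksum(&b.path) {
            Ok(got) if got == b.expected => {}
            Ok(got) => report.errors.push(IntegrityError::BlobFileCorrupted {
                blob_file_id: b.id,
                path: b.path.clone(),
                expected: b.expected,
                got,
            }),
            Err(error) => report
                .errors
                .push(IntegrityError::IoError { path: b.path.clone(), error }),
        }
        report.blob_files_checked += 1;
    }

    report
}
```

Collecting errors into a Vec, rather than returning on the first failure, is what lets a single pass report every damaged or unreadable file.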

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a verify_integrity() function for full-file checksum verification, which is the primary focus of the changeset. |
| Linked Issues check | ✅ Passed | The PR fully implements issue #187 requirements: iterates all tables and blob files, computes XXH3 checksums, compares against manifest checksums, and reports detailed per-file verification results with corruption detection. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to integrity verification: public verify module, IntegrityError/IntegrityReport types, verify_integrity() function, and comprehensive integration tests; no unrelated modifications present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


Copilot AI left a comment

Pull request overview

This PR introduces an integrity-verification capability for lsm-tree that scans all on-disk SST and blob files, recomputes their full-file XXH3 128-bit checksums in a streaming fashion, and reports any mismatches or I/O failures back to the caller.

Changes:

  • Added src/verify.rs with verify::verify_integrity(&impl AbstractTree) -> IntegrityReport, plus IntegrityError/IntegrityReport types and streaming checksum calculation.
  • Exported the new verify module from src/lib.rs.
  • Added integration tests covering clean trees, SST/blob corruption detection, missing-file I/O errors, and Display/Error::source behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/verify.rs | New integrity verification module: streaming checksum recomputation + structured reporting. |
| src/lib.rs | Publicly exports the new verify module. |
| tests/verify_integrity.rs | New integration test suite for integrity verification scenarios and error formatting. |

Comment thread src/lib.rs
coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/config/mod.rs (1)

304-315: 🧹 Nitpick | 🔵 Trivial

Consider using From trait for consistency.

Default::default() uses SharedSequenceNumberGenerator::from() explicitly, but new() uses Arc::new() relying on unsized coercion. Both work correctly, but using the From trait consistently improves clarity.

♻️ Suggested change for consistency
     pub fn new<P: AsRef<Path>>(
         path: P,
         seqno: SequenceNumberCounter,
         visible_seqno: SequenceNumberCounter,
     ) -> Self {
         Self {
             path: absolute_path(path.as_ref()),
-            seqno: Arc::new(seqno),
-            visible_seqno: Arc::new(visible_seqno),
+            seqno: SharedSequenceNumberGenerator::from(seqno),
+            visible_seqno: SharedSequenceNumberGenerator::from(visible_seqno),
             ..Default::default()
         }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/mod.rs` around lines 304 - 315, Replace the direct Arc::new(...)
construction in the Config::new constructor with the From implementation for
consistency with Default::default(); specifically, change the fields seqno:
Arc::new(seqno) and visible_seqno: Arc::new(visible_seqno) to use
SharedSequenceNumberGenerator::from(seqno) and
SharedSequenceNumberGenerator::from(visible_seqno) (or the appropriate
SequenceNumberCounter::from(...) if that type implements From) so both new() and
Default use the From trait consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/copilot-instructions.md:
- Around line 36-41: The fenced code block containing the template
"<type>(scope): <description>" is missing a language tag which trips
markdownlint MD040; update that block to include a language identifier (e.g.,
add ```text before "<type>(scope): <description>" and close with ``` after the
list) so the block becomes a labeled fenced code block and satisfies the linter.

In @.github/workflows/coordinode-ci.yml:
- Around line 37-54: Replace the explicit pinned ref for the rust toolchain
action to a generic branch ref and keep the matrix-driven toolchain parameter:
locate the uses entry referencing dtolnay/rust-toolchain@stable and change the
ref (for example to dtolnay/rust-toolchain@main or `@master`) while retaining the
with: toolchain: ${{ matrix.rust_version }} block so the selected toolchain
still comes from the matrix.

In @.github/workflows/upstream-monitor.yml:
- Around line 37-41: The conditional in the step "Try merge and create PR or
issue" uses steps.check.outputs.behind > 0 which does string comparison in
GitHub Actions; change it to perform an explicit numeric comparison by parsing
the output to a number (e.g., using fromJSON or converting to an int) before
comparing so that steps.check.outputs.behind is compared numerically (refer to
the step name "Try merge and create PR or issue" and the output key
steps.check.outputs.behind).
- Around line 53-67: The heredoc uses a quoted delimiter <<'EOF' (which prevents
shell expansion) but also references $BEHIND, causing inconsistency; update the
gh pr create block to use the GitHub Actions expression consistently by
replacing the $BEHIND shell variable with the pre-evaluated expression ${{
steps.check.outputs.behind }} (keep the <<'EOF' heredoc) so the commits-behind
value is reliably expanded, and ensure the gh pr create command and its heredoc
remain intact.

In `@src/seqno.rs`:
- Around line 245-250: Add a short doc-comment to clarify that SeqNo::fetch_max
silently clamps inputs to MAX_SEQNO (to avoid reserved MSB range) while
SeqNo::set will panic on out-of-range values; update the comments above the
fetch_max method (and optionally above set) to explicitly state this behavioral
difference so callers know fetch_max tolerates and clamps recovery/overflow
values whereas set enforces range and panics.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ba34b925-bbc0-41a0-963a-c93f61f38b09

📥 Commits

Reviewing files that changed from the base of the PR and between aae89a0 and 36644eb.

⛔ Files ignored due to path filters (1)
  • assets/usdt-qr.svg is excluded by !**/*.svg
📒 Files selected for processing (43)
  • .github/copilot-instructions.md
  • .github/dependabot.yml
  • .github/instructions/code-review.instructions.md
  • .github/workflows/coordinode-ci.yml
  • .github/workflows/coordinode-release.yml
  • .github/workflows/upstream-monitor.yml
  • .release-plz.toml
  • Cargo.toml
  • README.md
  • clippy.toml
  • src/abstract_tree.rs
  • src/blob_tree/mod.rs
  • src/compaction/leveled/mod.rs
  • src/compaction/leveled/test.rs
  • src/compaction/worker.rs
  • src/compression.rs
  • src/config/mod.rs
  • src/error.rs
  • src/lib.rs
  • src/manifest.rs
  • src/seqno.rs
  • src/slice/slice_bytes/mod.rs
  • src/table/block/mod.rs
  • src/table/data_block/iter.rs
  • src/table/data_block/iter_test.rs
  • src/table/data_block/mod.rs
  • src/table/filter/mod.rs
  • src/table/iter.rs
  • src/table/mod.rs
  • src/table/util.rs
  • src/tree/mod.rs
  • src/verify.rs
  • src/version/mod.rs
  • src/version/run.rs
  • src/version/super_version.rs
  • src/vlog/blob_file/reader.rs
  • src/vlog/blob_file/writer.rs
  • src/vlog/mod.rs
  • tests/custom_seqno_generator.rs
  • tests/ingestion_seqno.rs
  • tests/multi_get.rs
  • tests/tree_contains_prefix.rs
  • tests/verify_integrity.rs

Comment thread .github/copilot-instructions.md Outdated
Comment thread .github/workflows/coordinode-ci.yml Outdated
Comment thread .github/workflows/upstream-monitor.yml Outdated
Comment thread .github/workflows/upstream-monitor.yml Outdated
Comment thread src/seqno.rs Outdated
Add a public verify module with verify_integrity() that streams full-file
xxh3 checksums over all segment and blob files in a tree, comparing them
against the checksums stored in the version manifest.

This enables detection of silent bit-rot, partial writes, and other
on-disk corruption without reading individual blocks.

Returns IntegrityReport with per-file pass/fail results and detailed
IntegrityError variants. Implements std::error::Error for ergonomic
error handling.

Closes #187
polaz force-pushed the feat/verify-integrity-clean branch from 36644eb to baf6975 on March 16, 2026 15:25
polaz commented Mar 16, 2026

Branch cleaned up. The previous state of this branch accidentally included a merge of the fork's origin/main instead of upstream/main, which pulled in the entire fork history (dependabot configs, copilot instructions, etc.).

What was done:

  1. Created a fresh branch from upstream/main
  2. Cherry-picked only the original clean commit (20568354)
  3. Verified git diff upstream/main shows only the 3 expected files: src/lib.rs, src/verify.rs, tests/verify_integrity.rs
  4. Verified no fork-specific code leaked (no zstd, coordinode, dependabot, or upstream-monitor references)
  5. All tests pass (cargo test --all-features — 0 failures)
  6. Force-pushed to replace the corrupted branch history

Copilot AI left a comment

Pull request overview

Adds a new public integrity-verification API to the lsm_tree crate that scans all on-disk SST and blob files in a tree and validates their full-file XXH3-128 checksums against the manifest, returning a structured report of any corruption or I/O failures.

Changes:

  • Introduce verify::verify_integrity(&impl AbstractTree) -> IntegrityReport plus IntegrityError/IntegrityReport types.
  • Implement streaming full-file checksum computation to avoid loading entire files into memory.
  • Add integration tests covering clean trees, SST/blob corruption detection, missing files, and Display/Error behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| src/verify.rs | New verification module: streaming checksum + report/error types. |
| src/lib.rs | Exposes the new verify module as part of the public API. |
| tests/verify_integrity.rs | Integration tests validating correctness across corruption and missing-file scenarios. |

Comment thread src/verify.rs
Comment thread src/verify.rs Outdated
- Replace manual read loop + buffer with BufReader::fill_buf()
- Eliminates per-file 64 KiB allocation (BufReader owns the buffer)
- Removes silent no-op on theoretical get(..n) failure — fill_buf()
  returns slices directly, errors propagate via ?
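
A minimal sketch of the fill_buf()-based loop this commit describes, assuming a 64 KiB buffer capacity (the commit only says the BufReader now owns the buffer) and using an incremental FNV-1a hasher as a std-only stand-in for the streaming XXH3 hasher. The `Fnv1a` type is hypothetical, and the function name is borrowed from the PR's stream_checksum() helper without claiming to match its signature.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Incremental 64-bit FNV-1a, standing in for a streaming XXH3 hasher.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Self(0xcbf2_9ce4_8422_2325)
    }

    fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn digest(&self) -> u64 {
        self.0
    }
}

/// Hashes a file chunk by chunk without a separately allocated read buffer.
fn stream_checksum(path: &Path) -> io::Result<u64> {
    // Capacity is an assumption; BufReader's default is smaller.
    let mut reader = BufReader::with_capacity(64 * 1024, File::open(path)?);
    let mut hasher = Fnv1a::new();

    loop {
        // fill_buf() hands back the reader's internal buffer as a slice;
        // any I/O error propagates through `?`.
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break; // end of file
        }
        hasher.update(chunk);
        let consumed = chunk.len();
        // Mark the bytes as used so the next fill_buf() reads fresh data.
        reader.consume(consumed);
    }

    Ok(hasher.digest())
}
```

Because the slice comes straight from the reader, there is no `get(..n)` step that could silently yield nothing, which is the failure mode the commit message calls out.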
Copilot AI left a comment

Pull request overview

Adds a new public verify module that can scan an AbstractTree’s on-disk SST + blob files, recompute full-file XXH3-128 checksums in a streaming manner, and return a structured IntegrityReport with per-file failures.

Changes:

  • Introduces verify::verify_integrity(&impl AbstractTree) -> IntegrityReport plus supporting IntegrityError / IntegrityReport types.
  • Implements streaming full-file checksum computation (stream_checksum) to avoid loading entire files into memory.
  • Adds integration tests covering clean trees, corruption detection, missing files, and Display/Error::source behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| src/verify.rs | New integrity verification API (report + error types + streaming checksum + verification logic). |
| src/lib.rs | Exposes the new verify module from the crate root. |
| tests/verify_integrity.rs | New integration test suite for integrity verification across SST + blob scenarios. |

Comment thread src/verify.rs
Comment on lines +15 to +41
        table_id: TableId,
        /// Path to the corrupted file
        path: PathBuf,
        /// Checksum stored in the manifest
        expected: Checksum,
        /// Checksum computed from disk
        got: Checksum,
    },

    /// Full-file checksum mismatch for a blob file.
    BlobFileCorrupted {
        /// Blob file ID
        blob_file_id: u64,
        /// Path to the corrupted file
        path: PathBuf,
        /// Checksum stored in the manifest
        expected: Checksum,
        /// Checksum computed from disk
        got: Checksum,
    },

    /// I/O error while reading a file during verification.
    IoError {
        /// Path to the file that could not be read
        path: PathBuf,
        /// The underlying I/O error
        error: std::io::Error,
Comment thread src/verify.rs
Comment on lines +25 to +34
    BlobFileCorrupted {
        /// Blob file ID
        blob_file_id: u64,
        /// Path to the corrupted file
        path: PathBuf,
        /// Checksum stored in the manifest
        expected: Checksum,
        /// Checksum computed from disk
        got: Checksum,
    },
Comment thread src/verify.rs
        error: std::io::Error,
    },
}

Comment thread src/verify.rs
}

// Verify all blob files
for blob_file in version.blob_files.iter() {

Development

Successfully merging this pull request may close these issues.

Tree checksum verification
