Skip to content

feat: add verify_integrity() for full-file checksum verification#259

Closed
polaz wants to merge 116 commits intofjall-rs:mainfrom
structured-world:feat/#187-verify-integrity
Closed

feat: add verify_integrity() for full-file checksum verification#259
polaz wants to merge 116 commits intofjall-rs:mainfrom
structured-world:feat/#187-verify-integrity

Conversation

@polaz
Copy link
Copy Markdown

@polaz polaz commented Mar 14, 2026

Summary

Adds a verify module with verify_integrity() that scans all SST and blob files in a tree, computing XXHash-3 128-bit checksums and comparing them against the checksums stored in the version manifest.

This enables detection of silent bit-rot, partial writes, and other on-disk corruption.

API

use lsm_tree::verify;

let report = verify::verify_integrity(&tree);
if !report.is_ok() {
    for error in &report.errors {
        eprintln!("Corruption: {error}");
    }
}

IntegrityReport:

  • sst_files_checked / blob_files_checked — counters
  • errors: Vec<IntegrityError> — per-file errors
  • is_ok()true if no errors
  • files_checked() — total files

IntegrityError variants:

  • SstFileCorrupted { table_id, path, expected, got }
  • BlobFileCorrupted { blob_file_id, path, expected, got }
  • IoError { path, error }

Test plan

  • Clean tree verification (standard + blob tree)
  • SST file corruption detection (corrupt byte → detected)
  • Blob file corruption detection (corrupt byte → detected)
  • Multiple SST files across flushes
  • All existing tests pass

Closes #187

Summary by CodeRabbit

  • New Features

    • Added on-disk integrity verification for storage files: scans table and blob files, computes and compares checksums, reports corrupted or unreadable files, and returns a summary with per-file counts. Scans continue to collect full diagnostics even if some files error.
  • Tests

    • Added comprehensive tests covering clean trees, table/blob corruption detection, missing files, multiple tables, and blob-enabled configurations.

Adds a public verify module with verify_integrity() that scans all SST
and blob files in a tree, computing XXHash-3 128-bit checksums and
comparing them against the checksums stored in the version manifest.

This enables detection of silent bit-rot, partial writes, and other
on-disk corruption without reading individual blocks.

The function returns an IntegrityReport with:
- sst_files_checked / blob_files_checked counters
- Per-file errors (SstFileCorrupted, BlobFileCorrupted, IoError)
- is_ok() / files_checked() convenience methods

Tests cover:
- Clean tree verification (standard + blob tree)
- SST corruption detection
- Blob file corruption detection
- Multiple SST files

Closes fjall-rs#187
Copilot AI review requested due to automatic review settings March 14, 2026 09:41
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new public verification module that streams SST and blob files, computes XXH3-128 checksums, compares them to manifest values, and returns an IntegrityReport with counts and collected IntegrityError entries. Includes tests for clean and corrupted SST/blob scenarios and missing files.

Changes

Cohort / File(s) Summary
Module Declaration
src/lib.rs
Adds pub mod verify; and documentation comment to expose the verification module.
Verification Implementation
src/verify.rs
New public module implementing IntegrityError (SstFileCorrupted, BlobFileCorrupted, IoError), IntegrityReport (fields, is_ok(), files_checked()), streaming stream_checksum helper, and verify_integrity(tree: &impl crate::AbstractTree) which enumerates tables and blob files, streams each file to compute XXH3-128 checksums, compares to manifest checksums, records mismatches or I/O errors, and returns an aggregated report while continuing on per-file errors.
Tests
tests/verify_integrity.rs
Adds tests: clean SST, SST corruption detection, clean blob-tree, blob corruption detection, multiple tables, and missing SST file (I/O error). Tests use temporary dirs and existing helpers to validate report counts and error variants.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Suite
    participant Verify as verify::verify_integrity
    participant Tree as AbstractTree (manifest)
    participant FS as FileSystem
    participant XXH as XXH3-128

    Test->>Verify: call verify_integrity(tree)
    Verify->>Tree: list tables & blob files
    loop for each file
        Verify->>FS: open & stream file bytes
        FS-->>Verify: bytes / read error
        Verify->>XXH: feed bytes (streaming)
        XXH-->>Verify: checksum
        Verify->>Tree: read expected checksum
        alt checksums match
            Verify-->>Verify: increment checked count
        else mismatch or IoError
            Verify-->>Verify: record IntegrityError
        end
    end
    Verify-->>Test: return IntegrityReport (counts + errors)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add verify_integrity() for full-file checksum verification' is clear, concise, and directly describes the main feature being added—a function for verifying file integrity through checksums.
Linked Issues check ✅ Passed The pull request fully implements the requirements from issue #187: iterates all SST and blob files, computes XXHash-3 128-bit checksums, compares against manifest values, detects corruption, and reports per-file errors.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing the verify_integrity feature: new verify module, error types, report struct, integrity check function, and comprehensive tests—no out-of-scope changes detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/verify.rs`:
- Around line 121-175: Extract the duplicated read/hash/compare/error logic in
the two loops into a single helper function (e.g., verify_file(path: &Path,
expected: Checksum, on_corrupt: impl FnOnce(Checksum)->IntegrityError,
report_counter: &mut usize, report: &mut Report) ) that performs std::fs::read,
computes Checksum::from_raw(crate::hash::hash128(&data)), compares to expected
and pushes either IntegrityError::SstFileCorrupted or
IntegrityError::BlobFileCorrupted (use the provided on_corrupt closure to create
the correct variant) or IntegrityError::IoError (with path.to_path_buf() and
e.to_string()), then increments the appropriate counter; replace the SST loop
(iter_tables(), table.path, table.checksum(), table.id(),
report.sst_files_checked) and the blob loop (version.blob_files,
blob_file.path(), blob_file.checksum(), blob_file.id(),
report.blob_files_checked) with calls to this helper so behavior and error
reporting is shared.
- Around line 108-111: Update the Rustdoc for verify_integrity to reflect that
it returns an IntegrityReport unconditionally rather than Result/Err; change the
`# Errors` section wording to explain that fatal I/O errors are recorded inside
the returned IntegrityReport (e.g., in IntegrityReport::errors) and that the
function does not return Err, referencing the function name `verify_integrity`
and the type `IntegrityReport` so readers can locate the code.
- Around line 126-128: The code currently uses std::fs::read(path) to read
entire file bytes then calls Checksum::from_raw(crate::hash::hash128(&data)),
which will OOM for large SST/blob files; change to stream the file by opening
with std::fs::File::open(path) and reading in a fixed-size buffer (e.g.,
8KB/64KB) while feeding each chunk into the incremental hashing API instead of
hash128(&data), then finalize the hasher and call Checksum::from_raw on the
resulting digest; apply the same streaming replacement wherever std::fs::read is
used for file verification (including the similar block around lines 154–156).

In `@tests/verify_integrity.rs`:
- Around line 10-16: Extract the repeated Config::new(...).open()? pattern into
a small helper function (e.g., fn open_test_tree(folder: &Path) ->
Result<YourTreeType, Error>) and replace each occurrence in tests (the blocks
using Config::new(...).open()? around SequenceNumberCounter::default()) with a
call to that helper; ensure the helper accepts the folder argument and returns
the same result type so tests remain unchanged otherwise, and update use/imports
if necessary to make the helper visible to all tests in the file.
- Around line 31-144: Add a test that deletes or makes unreachable a referenced
SST/blob file before calling verify::verify_integrity to exercise
IntegrityError::IoError: after creating and flushing the tree (reuse pattern in
verify_integrity_detect_sst_corruption and
verify_integrity_detect_blob_corruption), get the target using
version.iter_tables().next().unwrap() for SST or
version.blob_files.iter().next().unwrap() for blob, call
std::fs::remove_file(table.path()) (or blob_file.path()) to simulate an IO
failure, then call verify::verify_integrity(&tree) and assert the report is not
Ok and that report.errors contains an entry matching
verify::IntegrityError::IoError with the expected table_id/blob_file_id (use
table.id() / blob_file.id() for comparison).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1aa71f4f-c41d-43cd-bcb7-be6d611863f7

📥 Commits

Reviewing files that changed from the base of the PR and between e11c791 and 947828b.

📒 Files selected for processing (3)
  • src/lib.rs
  • src/verify.rs
  • tests/verify_integrity.rs

Comment thread src/verify.rs Outdated
Comment thread src/verify.rs
Comment thread src/verify.rs Outdated
Comment thread tests/verify_integrity.rs
Comment thread tests/verify_integrity.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an integrity verification facility to the lsm_tree crate, allowing callers to scan SST and blob files and detect on-disk corruption by comparing full-file checksums against the version manifest.

Changes:

  • Added a new verify module with verify_integrity plus IntegrityReport / IntegrityError types.
  • Implemented checksum verification for both SST table files and blob (value log) files.
  • Added integration tests covering clean trees, SST corruption detection, blob corruption detection, and multi-table scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
src/verify.rs New public API for integrity scanning and reporting checksum mismatches / I/O issues.
src/lib.rs Exposes the new verify module from the crate root.
tests/verify_integrity.rs New integration tests validating integrity checks for standard and blob trees.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/verify.rs Outdated
Comment thread src/verify.rs Outdated
Comment thread src/verify.rs Outdated
Comment thread src/verify.rs
Comment thread src/verify.rs Outdated
Comment thread src/verify.rs
Comment thread src/verify.rs
Comment thread tests/verify_integrity.rs Outdated
- Compute XXH3 via BufReader instead of loading entire files
  into memory (prevents OOM on large SST/blob files)
- Store std::io::Error directly in IoError variant instead of String
- Use BlobFileId type alias instead of raw u64
- Add #[non_exhaustive] to IntegrityError for forward compatibility
- Fix rustdoc: remove incorrect # Errors section
- Add test for missing file (IoError path)
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/verify.rs`:
- Around line 8-43: Add an implementation of std::error::Error for the
IntegrityError enum: implement impl std::error::Error for IntegrityError { fn
source(&self) -> Option<&(dyn std::error::Error + 'static)> { match self {
IntegrityError::IoError { error, .. } => Some(error), _ => None, } } } so
callers can inspect the underlying I/O error via source(); keep existing
Debug/Display impls and ensure the IoError variant returns the inner
std::io::Error as the source while other variants return None.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9aa732f6-c06f-4b07-861b-3a87f6467fdd

📥 Commits

Reviewing files that changed from the base of the PR and between 947828b and 3fa5f56.

📒 Files selected for processing (2)
  • src/verify.rs
  • tests/verify_integrity.rs

Comment thread src/verify.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a public integrity verification API to the lsm_tree crate that scans all SST and blob files and reports full-file checksum mismatches and I/O errors.

Changes:

  • Introduces src/verify.rs with verify_integrity() plus IntegrityReport/IntegrityError types.
  • Exposes the new verify module from src/lib.rs.
  • Adds integration tests covering clean trees, SST/blob corruption, multiple SSTs, and missing-file behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/verify_integrity.rs New integration tests for integrity verification behavior across SST and blob modes.
src/verify.rs New verification implementation that streams files from disk and compares XXH3-128 checksums to manifest values.
src/lib.rs Exposes pub mod verify; from the crate root.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/verify.rs
Comment thread src/verify.rs
Comment thread tests/verify_integrity.rs Outdated
BlobFileId is defined in the private vlog module and not re-exported,
so using it in the public IntegrityError enum would leak a private
type to external consumers.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a public integrity verification API for lsm_tree that scans on-disk SST and blob files and reports checksum mismatches and I/O errors, along with a new integration test suite for the verifier.

Changes:

  • Added verify module with verify_integrity, IntegrityReport, and IntegrityError.
  • Implemented streaming XXH3-128 file checksumming to avoid loading whole files into memory.
  • Added integration tests covering clean trees, corrupted SST/blob files, multiple tables, and missing SST files.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/verify.rs New public verifier API and streaming checksum implementation.
src/lib.rs Exposes the new verify module as part of the public crate API.
tests/verify_integrity.rs New integration tests validating verifier behavior across success and failure cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread tests/verify_integrity.rs Outdated
Comment thread src/verify.rs
Comment thread src/verify.rs Outdated
Enables error chain integration (anyhow, thiserror, ? propagation).
IoError variant returns the underlying std::io::Error via source().
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/verify.rs`:
- Line 139: The function verify_integrity currently returns an IntegrityReport
that can be silently ignored; add the #[must_use] attribute to the
verify_integrity function declaration (the pub fn verify_integrity(tree: &impl
crate::AbstractTree) -> IntegrityReport signature) so callers are warned if they
discard the returned IntegrityReport, preserving API ergonomics and preventing
accidental ignores.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 921bb59e-16df-494b-9113-ae415b6db1aa

📥 Commits

Reviewing files that changed from the base of the PR and between ddf8e9d and 221ea21.

📒 Files selected for processing (1)
  • src/verify.rs

Comment thread src/verify.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an integrity verification API to scan on-disk SST and blob files, comparing full-file XXH3-128 checksums against the version manifest, along with integration tests that validate detection of corruption and missing files.

Changes:

  • Add verify module with verify_integrity() that streams file contents to compute/check full-file checksums.
  • Introduce IntegrityError / IntegrityReport types to report per-file verification outcomes without aborting the scan.
  • Add integration tests covering clean trees, SST/blob corruption, multiple tables, and missing SST files.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
src/verify.rs New verification implementation + public report/error types for checksum validation across SST/blob files.
src/lib.rs Exposes the new verify module as part of the public API.
tests/verify_integrity.rs Adds integration tests exercising verification success and failure scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread tests/verify_integrity.rs
Comment thread tests/verify_integrity.rs
Comment thread tests/verify_integrity.rs Outdated
Comment thread src/verify.rs
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new public integrity verification facility for lsm-tree, allowing callers to scan SST and blob files and detect on-disk corruption or unreadable files, with accompanying integration tests to validate expected behavior.

Changes:

  • Added verify module with verify_integrity API that computes full-file checksums and returns a structured IntegrityReport.
  • Introduced IntegrityError variants for SST corruption, blob corruption, and I/O read failures.
  • Added integration tests covering clean trees, induced corruption, missing files, and multi-table scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/verify_integrity.rs New integration tests validating integrity verification across SST/blob scenarios.
src/verify.rs Implements file streaming checksum verification and report/error types.
src/lib.rs Exposes the new verify module publicly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread tests/verify_integrity.rs Outdated
Comment thread tests/verify_integrity.rs Outdated
Comment thread src/verify.rs Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new public verify module to perform full-file integrity verification of SST and blob files by recomputing and comparing stored checksums, along with integration tests that exercise clean/corrupt/missing-file scenarios.

Changes:

  • Introduces src/verify.rs with IntegrityReport, IntegrityError, and verify_integrity() to scan SST/blob files and collect per-file errors.
  • Exposes the new module via pub mod verify; in src/lib.rs.
  • Adds tests/verify_integrity.rs integration tests covering OK, corruption, and missing-file cases for both standard and blob trees.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/verify_integrity.rs New integration tests for integrity verification across multiple scenarios.
src/verify.rs New implementation for streaming checksum verification and structured error reporting.
src/lib.rs Exposes the new verify module publicly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread src/verify.rs Outdated
Comment thread src/verify.rs Outdated
Comment thread tests/verify_integrity.rs Outdated
- Switch stream_checksum to Xxh3Default matching ChecksummedWriter
- Use path reference comparison instead of deref in test assertion
- Remove redundant doc comment on enum field visibility
polaz added 26 commits March 15, 2026 22:32
- Document that new(MAX_SEQNO) is allowed but exhausts the counter
- Remove #[allow(clippy::missing_panics_doc)] from trait impl (lint
  does not fire on trait impl methods, and the reason text was stale)
…insprefix

feat: add optimized contains_prefix() method
- Rename tests to follow <what>_<condition>_<expected> convention
- Replace magic hex literals with MAX_SEQNO constant in tests
- Remove BufReader wrapper — 64KB app buffer makes it redundant
- Add comment explaining why .get(..n) is used over direct indexing
…ock-seqno-aware-seek

perf: seqno-aware seek in data block point reads
- Add Display format tests for all three IntegrityError variants
- Test Error::source returns None for corruption, Some for IoError
- Add missing blob file test covering blob IoError path
- next() doc: explicitly state pre-increment (fetch-then-add) return
- get() doc: clarify it returns value that next next() will return
…structured-world/lsm-tree into feat/#174-sequencenumbergenerator-trait
feat: add multi_get() for batch point reads
Retry on ErrorKind::Interrupted instead of propagating — signals
can interrupt read() syscalls on Unix systems.
…diagnostics

- Use #[should_panic(expected = "...")] instead of shorthand = form
- Trait doc: clarify get() must not return reserved values, set/fetch_max
  must not store them (previously conflated get with mutation methods)
- OffsetGenerator: include current value in panic message
…structured-world/lsm-tree into feat/#174-sequencenumbergenerator-trait
…generator-trait

feat: add SequenceNumberGenerator trait
@polaz
Copy link
Copy Markdown
Author

polaz commented Mar 16, 2026

Sorry about the noise here — the branch accumulated unrelated changes from our fork. Opened a clean replacement rebased on upstream main: #275

@polaz polaz closed this Mar 16, 2026
@polaz polaz deleted the feat/#187-verify-integrity branch March 16, 2026 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tree checksum verification

2 participants