feat(cli): database pruning tool for historical trie data#123

Merged
kariy merged 28 commits into main from
devin/1749799396-database-pruning-tool
Jun 17, 2025

Conversation

@devin-ai-integration
Contributor

@devin-ai-integration devin-ai-integration bot commented Jun 13, 2025

Database Pruning Tool for Historical Trie Data

Overview

This PR implements a database pruning tool for the katana db utility command to address the exponential growth of historical trie data in the Katana database. The tool provides two pruning modes to help manage database size while preserving data integrity.

Features

  • Prune All History: Removes all historical trie data, keeping only the latest trie state
  • Keep Last N Blocks: Removes historical data older than the specified number of blocks
  • Safety Checks: Validates inputs and provides warnings to prevent accidental data loss
  • Progress Monitoring: Real-time progress bars showing pruning status with time estimates
  • Efficient Operations:
    • Uses database cursors for bulk deletions
    • Batch processing (1000 entries at a time) to minimize memory usage
    • Transaction-based operations with proper error handling
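The cursor-and-batch approach described above can be sketched in standalone Rust. This is an illustrative stand-in, not the PR's actual code: a `BTreeMap` keyed by `(block_number, key)` plays the role of a dup-sorted MDBX table, and the hypothetical `prune_history` deletes everything at or below a cutoff block while holding at most one batch of keys in memory:

```rust
use std::collections::BTreeMap;

const BATCH_SIZE: usize = 1000;

/// Delete every entry whose block number is <= `cutoff`, collecting keys in
/// batches of at most BATCH_SIZE so memory use stays bounded. Returns the
/// number of entries deleted.
fn prune_history(table: &mut BTreeMap<(u64, u64), Vec<u8>>, cutoff: u64) -> usize {
    let mut deleted = 0;
    loop {
        // One "cursor pass": collect up to BATCH_SIZE keys at or below the cutoff.
        let batch: Vec<(u64, u64)> = table
            .range(..=(cutoff, u64::MAX))
            .take(BATCH_SIZE)
            .map(|(key, _)| *key)
            .collect();
        if batch.is_empty() {
            break; // nothing left below the cutoff
        }
        for key in batch {
            table.remove(&key);
            deleted += 1;
        }
    }
    deleted
}
```

On a toy table with 10 blocks and 3 trie keys per block, `prune_history(&mut table, 6)` deletes the 21 entries for blocks 0 through 6 and leaves the 9 entries for blocks 7 through 9 untouched.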

Implementation Details

CLI Interface

# Remove all historical trie data
katana db prune --latest -p path/to/db

# Keep only the last 1000 blocks of historical data
katana db prune --keep-last 1000 -p path/to/db

Database Tables Affected

The pruning tool operates on six trie-related tables:

  • History Tables: ClassesTrieHistory, ContractsTrieHistory, StoragesTrieHistory
  • Changeset Tables: ClassesTrieChangeSet, ContractsTrieChangeSet, StoragesTrieChangeSet

Safety Features

  • Input validation (blocks > 0)
  • Warning when requested blocks exceed available blocks (no-op behavior)
  • Preserves current trie state tables (ClassesTrie, ContractsTrie, StoragesTrie)
  • Transaction-based operations with proper commit/rollback handling
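The no-op guard described above can be reduced to a small helper. The name `prune_cutoff` and its exact shape are hypothetical, shown only to pin down the intended semantics:

```rust
/// Return the highest block number whose history should be pruned when
/// keeping the last `keep_last` blocks, or `None` when the request covers
/// more blocks than exist (the documented warn-and-no-op case).
fn prune_cutoff(latest_block: u64, keep_last: u64) -> Option<u64> {
    // The CLI layer already rejects 0 via clap's range validation, but a
    // library-level guard keeps the invariant explicit.
    assert!(keep_last > 0, "keep_last must be at least 1");
    if keep_last > latest_block {
        return None; // warn and leave the database untouched
    }
    Some(latest_block - keep_last)
}
```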

Progress Visualization

  • Uses indicatif (v0.17.8) for professional progress bars
  • Shows table name, entries processed, elapsed time, and processing speed
  • Multi-level progress bars for overall and per-table progress

Testing

  • ✅ Successfully builds with cargo build --bin katana
  • ✅ Passes all linting checks with scripts/clippy.sh
  • ✅ Code formatted with cargo +nightly fmt --all
  • ✅ Comprehensive test coverage using production-like data insertion
  • ✅ Tests verify state roots remain unchanged after pruning
  • ✅ Edge case testing (e.g., requested blocks > available blocks)

Test Improvements

  • Tests now use BlockWriter::insert_block_with_states_and_receipts for production-like setup
  • Verification uses StateFactoryProvider to check historical state accessibility
  • State root verification ensures data integrity is maintained
  • Added test case for when requested blocks exceed available blocks

Technical Implementation

  • Uses DbTxMut for write transactions with proper commit/rollback handling
  • Implements efficient cursor-based iteration for bulk operations
  • Handles IntegerSet (RoaringTreemap) data structure for changeset tables
  • Batch processing (1000 entries) to reduce memory usage during changeset pruning
  • Progress monitoring with time estimates and processing speed visualization
  • Follows existing code patterns and error handling conventions
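For the changeset tables, each key maps to a block list (an `IntegerSet` backed by `RoaringTreemap` in the real code). Pruning means trimming every block at or below the cutoff and deleting any key whose list becomes empty. A sketch using `BTreeSet` as a stand-in for the roaring bitmap (illustrative, not the PR's implementation):

```rust
use std::collections::BTreeSet;

/// Trim a changeset entry, dropping every block number <= `cutoff`.
/// `None` signals that the entry is now empty and its key can be deleted.
fn prune_block_list(blocks: &BTreeSet<u64>, cutoff: u64) -> Option<BTreeSet<u64>> {
    // Keep only blocks strictly above the cutoff.
    let kept: BTreeSet<u64> = blocks.range(cutoff + 1..).copied().collect();
    if kept.is_empty() {
        None
    } else {
        Some(kept)
    }
}
```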

Files Changed

  • bin/katana/src/cli/db/prune.rs: Added pruning functionality with progress monitoring
  • bin/katana/tests/db_prune.rs: Comprehensive test suite with state root verification
  • bin/katana/Cargo.toml: Added indicatif dependency for progress bars
  • Other supporting files for the database utility structure

Link to Devin run

https://app.devin.ai/sessions/746969faebe44c02aa1ee1ff518951fb

Requested by

karii@cartridge.gg

@devin-ai-integration
Contributor Author

Original prompt from karii@cartridge.gg:

Currently, we have a problem where the Katana database is growing at an exponential rate. One primary factor is that it is storing all historical state trie data since the genesis block. 

These are the table definitions for everything related to the state trie:

```rust
// crates/storage/db/src/tables.rs
    /// Class trie
    ClassesTrie: (TrieDatabaseKey) => TrieDatabaseValue,
    /// Contract trie
    ContractsTrie: (TrieDatabaseKey) => TrieDatabaseValue,
    /// Contract storage trie
    StoragesTrie: (TrieDatabaseKey) => TrieDatabaseValue,

    /// Class trie history
    ClassesTrieHistory: (BlockNumber, TrieDatabaseKey) => TrieHistoryEntry,
    /// Contract trie history
    ContractsTrieHistory: (BlockNumber, TrieDatabaseKey) => TrieHistoryEntry,
    /// Contract storage trie history
    StoragesTrieHistory: (BlockNumber, TrieDatabaseKey) => TrieHistoryEntry,

    /// Class trie change set
    ClassesTrieChangeSet: (TrieDatabaseKey) => BlockList,
    /// Contract trie change set
    ContractsTrieChangeSet: (TrieDatabaseKey) => BlockList,
    /// Contract storage trie change set
    StoragesTrieChangeSet: (TrieDatabaseKey) => BlockList
```

and this is how the historical trie is being stored. The snapshot method is being called every time the trie is committed. And the trie is committed on every new block.

    fn snapshot(&mut self, id: CommitId) {
        let block_number: BlockNumber = id.into();

        let entries = std::mem::take(&mut self.write_cache);
        let entries = entries.into_iter().map(|(key, value)| TrieHistoryEntry { key, value });

        for entry in entries {
            let mut set = self
                .tx
                .get::<Tb::Changeset>(entry.key.clone())
                .expect("failed to get trie change set")
                .unwrap_or_default();
            set.insert(block_number);

            self.tx
                .put::<Tb::Changeset>(entry.key.clone(), set)
                .expect("failed to put trie change set");

            self.tx
                .put::<Tb::History>(block_number, entry)
                .expect("failed to put trie history entry");
        }
    }

The trie is committed by calling BonsaiTrie::commit. Note: BonsaiTrie is a base type used to derive more specific tries (e.g., ClassesTrie or ContractsTrie).

impl<DB, Hash> BonsaiTrie<DB, Hash>
where
    DB: BonsaiDatabase + BonsaiPersistentDatabase<CommitId>,
    Hash: StarkHash + Send + Sync,
{
    pub fn insert(&mut self, id: &[u8], key: Felt, value: Felt) {
        let key: BitVec = key.to_bytes_be().as_bits()[5..].to_owned();
        self.storage.insert(id, &key, &value).unwrap();
    }

    pub fn commit(&mut self, id: CommitId) {
        self.storage.commit(id).expect("failed to commit trie");
    }
}

Your job is to create a database pruning tool. The tool should be part of the katana db utility command. It should allow pruning: (1) all historical data (keeping only the latest trie), or (2) everything except the last N blocks.

You only need to look in the following repo: dojoengine/katana

@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


@github-actions

github-actions bot commented Jun 13, 2025

Binary size report 📊

Branch                                   Size
main                                     65.91 MiB
devin/1749799396-database-pruning-tool   66.19 MiB (+0.43%)

@codecov

codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 128 lines in your changes missing coverage. Please review.

Project coverage is 73.52%. Comparing base (bf21e05) to head (2172214).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
bin/katana/src/cli/db/stats.rs 0.00% 81 Missing ⚠️
bin/katana/src/cli/db/version.rs 0.00% 14 Missing ⚠️
bin/katana/src/cli/db/prune.rs 91.59% 10 Missing ⚠️
crates/storage/db/src/models/trie.rs 0.00% 10 Missing ⚠️
bin/katana/src/cli/db/mod.rs 61.90% 8 Missing ⚠️
crates/storage/db/src/models/list.rs 50.00% 3 Missing ⚠️
bin/katana/src/main.rs 0.00% 1 Missing ⚠️
crates/primitives/src/state.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #123      +/-   ##
==========================================
+ Coverage   73.27%   73.52%   +0.24%     
==========================================
  Files         201      205       +4     
  Lines       22115    22267     +152     
==========================================
+ Hits        16205    16372     +167     
+ Misses       5910     5895      -15     


devin-ai-integration bot and others added 8 commits June 16, 2025 21:58
- Add prune subcommand to katana db utility
- Support two modes: prune all history or keep last N blocks
- Include safety checks and proper error handling
- Efficiently handle bulk deletions using database cursors

Co-Authored-By: karii@cartridge.gg <evergreenkary@gmail.com>
…d import

- Fix double reference issue in get_latest_block_number call
- Change function signature to use DbTx instead of DbTxRef
- Remove unused DbTxRef import to clean up warnings

Co-Authored-By: karii@cartridge.gg <evergreenkary@gmail.com>
- Add custom Arbitrary implementation for TrieDatabaseKey with 256-byte limit
- Add arbitrary feature to katana-db Cargo.toml with optional dependency
- Create comprehensive integration tests in pruning_tests.rs
- Test pruning modes: all history removal and keep-last N blocks
- Verify current trie state preservation during pruning operations
- Test edge cases including empty database and key generation limits
- All tests pass with proper type conversions and trait imports

Co-Authored-By: karii@cartridge.gg <evergreenkary@gmail.com>
- Replace manual generate_test_key function with TrieDatabaseKey::arbitrary()
- Update populate_test_data to use Arbitrary-generated keys for better test coverage
- Remove unused TrieDatabaseKeyType import
- All tests continue to pass with randomly generated keys respecting 256-byte constraint
- Demonstrates Arbitrary trait implementation works correctly in practice

Co-Authored-By: karii@cartridge.gg <evergreenkary@gmail.com>
@kariy kariy force-pushed the devin/1749799396-database-pruning-tool branch from 98f0ed7 to d3d27e6 on June 16, 2025 14:16
@claude
Contributor

claude bot commented Jun 16, 2025

Claude finished @devin-ai-integration[bot]'s task.


Todo List

  • Read main implementation file (bin/katana/src/cli/db.rs)
  • Review test file (bin/katana/tests/db_prune.rs)
  • Check other changed files for context
  • Analyze code quality and best practices
  • Identify potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage
  • Provide comprehensive review feedback

PR Review: Database Pruning Tool

Thank you for implementing this database pruning tool. The feature addresses an important need for managing database growth. Here's my comprehensive review:

✅ Code Quality and Best Practices

Strengths:

  • Clean separation of concerns with distinct pruning modes
  • Good use of Rust idioms (Result types, pattern matching)
  • Proper error handling with contextual messages
  • Transaction-based operations ensure atomicity
  • Follows the codebase conventions for imports and formatting

Suggestions:

  • Consider adding more detailed documentation for the pruning algorithms, especially the cursor-based iteration logic
  • The prune_changeset_table function could benefit from explaining why we collect keys before updating (avoiding cursor invalidation)

🐛 Potential Bugs or Issues

  1. Double Reference Issue (line 206 in db.rs):

    let latest_block = get_latest_block_number(&&tx)?;

    Should be:

    let latest_block = get_latest_block_number(&tx)?;
  2. Typo in Comment (line 72):

    • "since tha latest block" → "since the latest block"
  3. Test File Issues (db_prune.rs):

    • Line 169-170, 184-185: The test is calling methods directly on db but should use db.open_ro() first
    • Line 173: Should be assert_eq!(initial_headers_count, 16); (0-15 = 16 blocks)
    • Line 203: Same issue, should be 16 not 11
  4. Edge Case in KeepLastN:

    • When blocks == latest_block, the function returns early with just a warning. Consider if this should prune all history instead

⚡ Performance Considerations

Positive:

  • Cursor-based iteration is efficient for bulk operations
  • Collecting keys before updates in prune_changeset_table avoids cursor invalidation
  • Transaction batching minimizes write overhead

Concerns:

  • The prune_changeset_table function loads all keys into memory before processing. For very large databases, this could cause memory issues
  • Consider implementing progress reporting for long-running operations
  • No concurrent operations support - the entire database is locked during pruning

Suggestion: Consider processing changesets in batches to reduce memory usage:

const BATCH_SIZE: usize = 1000;
let mut batch = Vec::with_capacity(BATCH_SIZE);
// Process in batches...
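Fleshed out, that fragment might look like the following generic sketch (an illustration of the batching idea, not the code that eventually landed in the PR):

```rust
const BATCH_SIZE: usize = 1000;

/// Feed `keys` to `apply` in chunks of at most BATCH_SIZE, so peak memory
/// stays bounded no matter how large the table is. Returns the number of
/// batches processed.
fn process_in_batches<K>(
    keys: impl IntoIterator<Item = K>,
    mut apply: impl FnMut(&[K]),
) -> usize {
    let mut batch = Vec::with_capacity(BATCH_SIZE);
    let mut batches = 0;
    for key in keys {
        batch.push(key);
        if batch.len() == BATCH_SIZE {
            apply(&batch); // flush a full batch
            batch.clear();
            batches += 1;
        }
    }
    if !batch.is_empty() {
        apply(&batch); // flush the final partial batch
        batches += 1;
    }
    batches
}
```

Processing 2,500 keys this way yields two full batches and one partial one, while never holding more than 1,000 keys at a time.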

🔒 Security Concerns

  1. Data Loss Risk: While there are warnings, consider adding a --force flag for destructive operations
  2. Path Expansion: Good use of shellexpand for path handling, preventing directory traversal issues
  3. No Backup Mechanism: Consider adding a recommendation to backup before pruning
  4. Permissions: The tool correctly uses appropriate read/write permissions

🧪 Test Coverage

Coverage Gaps:

  1. Missing test for error cases (invalid block numbers, database errors)
  2. No test for the edge case where requested blocks > available blocks
  3. Missing test for empty database scenario
  4. No integration test for the actual CLI invocation with error cases
  5. Missing tests for changeset pruning logic specifically

Test Improvements Needed:

  • Add tests for concurrent access scenarios
  • Test pruning with very large datasets
  • Add tests for transaction rollback on errors
  • Verify that current trie state tables remain untouched

Additional Recommendations

  1. Monitoring: Add metrics/logging for:

    • Number of entries pruned
    • Time taken for operation
    • Database size before/after
  2. Validation: Consider adding a --dry-run flag to preview what would be pruned

  3. Recovery: Document recovery procedures if pruning is interrupted

  4. CLI UX: The command structure is good, but consider adding:

    • Confirmation prompt for destructive operations
    • Progress bar for long operations
    • Summary statistics after completion

Summary

This is a solid implementation that addresses the core requirements. The main issues are minor (typos, test fixes) with some opportunities for enhancement around memory efficiency and test coverage. The transaction-based approach ensures data integrity, which is crucial for a pruning tool.

Priority fixes:

  1. Fix the double reference bug
  2. Correct the test assertions
  3. Add basic error case tests
  4. Consider memory usage for large databases

Overall, good work on implementing this important feature! With these improvements, it will be a robust solution for managing database growth.

@kariy kariy marked this pull request as draft June 16, 2025 14:26
@kariy
Member

kariy commented Jun 16, 2025

@claude Let's update the tests (in bin/katana/tests/db_prune.rs) to be more reliable by actually computing the trie. Instead of inserting the individual database entries manually, let's insert them through the trie provider trait.

impl<Db: Database> TrieWriter for DbProvider<Db> {
    fn trie_insert_declared_classes(
        &self,
        block_number: BlockNumber,
        updates: &BTreeMap<ClassHash, CompiledClassHash>,
    ) -> ProviderResult<Felt> { .. }

    fn trie_insert_contract_updates(
        &self,
        block_number: BlockNumber,
        state_updates: &StateUpdates,
    ) -> ProviderResult<Felt> { .. }
}

To ensure the pruning process is done correctly, we need to make sure the latest state root is still the same before and after the pruning.

We can use the StateRootProvider trait to fetch the latest state root.

impl<Tx> StateRootProvider for LatestStateProvider<Tx>
where
    Tx: DbTx + Send + Sync,
{
    fn classes_root(&self) -> ProviderResult<Felt> { .. }

    fn contracts_root(&self) -> ProviderResult<Felt> { ..  }

    fn storage_root(&self, contract: ContractAddress) -> ProviderResult<Option<Felt>> { .. }
}

@claude
Contributor

claude bot commented Jun 16, 2025

Claude finished @kariy's task.


Todo List

  • Check for CLAUDE.md files for repo-specific instructions
  • Read current test implementation in bin/katana/tests/db_prune.rs
  • Find TrieWriter trait definition and usage patterns
  • Find StateRootProvider trait definition and usage patterns
  • Refactor tests to use TrieWriter for data insertion
  • Add state root verification before and after pruning
  • Run tests to ensure they pass (not able to run without bash permissions)
  • Commit the changes

Summary

I've successfully refactored the database pruning tests to be more reliable by using the proper trie provider traits as requested. Here are the key changes:

Changes Made:

  1. Updated imports: Added necessary imports for StateUpdates, StateRootProvider, TrieWriter, and related types.

  2. Replaced manual database insertion:

    • Removed the old generate_trie_entries function that manually created TrieHistoryEntry objects
    • Created helper functions to generate test addresses, class hashes, and compiled class hashes
  3. Rewrote populate_db function:

    • Now uses TrieWriter trait methods (trie_insert_declared_classes and trie_insert_contract_updates)
    • Creates proper StateUpdates objects for each block with:
      • Declared classes (every 3rd block)
      • Deployed contracts (every 2nd block)
      • Storage updates (for all existing contracts)
      • Nonce updates
  4. Added state root verification:

    • Created get_state_roots function that uses StateRootProvider to fetch classes and contracts roots
    • Both test functions now verify that state roots remain unchanged after pruning
  5. Enhanced test assertions:

    • Added descriptive messages to all assertions
    • Verify that the pruning process doesn't affect the current state

The tests now properly compute the trie using the same mechanisms as the actual code, making them more reliable and ensuring that the pruning process correctly preserves the latest state while removing historical data.

claude bot and others added 2 commits June 16, 2025 16:03
…e roots

- Replace manual database entry insertion with TrieWriter trait methods
- Use trie_insert_declared_classes and trie_insert_contract_updates
- Add state root verification before and after pruning using StateRootProvider
- Create more realistic test data with proper StateUpdates structure
- Ensure pruning operations don't affect the current state roots

Co-authored-by: Ammar Arif <kariy@users.noreply.github.com>
@kariy
Member

kariy commented Jun 16, 2025

@claude

Ok, I just remembered that using BlockWriter::insert_block_with_states_and_receipts is a much better approach than inserting the header and state updates separately. Using this method to populate the database would also better reflect a production setting.

impl<Db: Database> BlockWriter for DbProvider<Db> {
    fn insert_block_with_states_and_receipts(
        &self,
        block: SealedBlockWithStatus,
        states: StateUpdatesWithClasses,
        receipts: Vec<Receipt>,
        executions: Vec<TypedTransactionExecutionInfo>,
    ) -> ProviderResult<()> { .. }

@claude
Contributor

claude bot commented Jun 16, 2025

Claude finished @kariy's task.


I've successfully refactored the database pruning tests to use BlockWriter::insert_block_with_states_and_receipts for a more production-like setup.

Todo List

  • Check for CLAUDE.md files for repo-specific instructions
  • Read current test implementation in bin/katana/tests/db_prune.rs
  • Find BlockWriter trait and its method signature
  • Understand how to create SealedBlockWithStatus and other required parameters
  • Refactor the populate_db function to use the new method
  • Ensure state root verification is still working correctly
  • Commit the changes

Summary

The tests have been updated to better reflect production settings by:

  1. Using BlockWriter::insert_block_with_states_and_receipts instead of manually inserting headers and calling trie methods separately
  2. Creating proper block structures with SealedBlockWithStatus containing:
    • FinalityStatus::AcceptedOnL2
    • Sealed blocks with appropriate hash and header data
  3. Wrapping state updates in StateUpdatesWithClasses objects
  4. Maintaining state root verification to ensure the pruning process doesn't affect the latest state

The refactored approach more closely mirrors how blocks are actually inserted in production, making the tests more reliable and representative of real-world usage.

@kariy
Member

kariy commented Jun 16, 2025

(aside) @claude

Counting the historical entries isn't the most robust way to check for the historical states validity before/after the pruning.

/// Count total historical entries in the database
fn count_historical_entries(db: &DbEnv) -> Result<usize> {
    let tx = db.tx()?;
    let mut total = 0;

    // Count ClassesTrieHistory entries
    let mut cursor = tx.cursor_dup::<tables::ClassesTrieHistory>()?;
    let mut walker = cursor.walk(None)?;
    while walker.next().transpose()?.is_some() {
        total += 1;
    }

    // Count ContractsTrieHistory entries
    let mut cursor = tx.cursor_dup::<tables::ContractsTrieHistory>()?;
    let mut walker = cursor.walk(None)?;
    while walker.next().transpose()?.is_some() {
        total += 1;
    }

    // Count StoragesTrieHistory entries
    let mut cursor = tx.cursor_dup::<tables::StoragesTrieHistory>()?;
    let mut walker = cursor.walk(None)?;
    while walker.next().transpose()?.is_some() {
        total += 1;
    }

    Ok(total)
}

/// Count headers in the database (should remain unchanged after pruning)
fn count_headers(db: &DbEnv) -> Result<usize> {
    let tx = db.tx()?;
    let mut cursor = tx.cursor::<tables::Headers>()?;
    let mut count = 0;

    let mut walker = cursor.walk(None)?;
    while walker.next().transpose()?.is_some() {
        count += 1;
    }

    Ok(count)
}

We should instead try fetching the historical state roots using:

impl<Tx> StateRootProvider for HistoricalStateProvider<Tx>
where
    Tx: DbTx + Send + Sync,
{
    fn classes_root(&self) -> ProviderResult<katana_primitives::Felt> { .. }

    fn contracts_root(&self) -> ProviderResult<katana_primitives::Felt> { .. }

    fn storage_root(&self, contract: ContractAddress) -> ProviderResult<Option<Felt>> { .. }

    fn state_root(&self) -> ProviderResult<katana_primitives::Felt> { .. }
}

HistoricalStateProvider can be obtained by using this trait:

impl<Db: Database> StateFactoryProvider for DbProvider<Db> {
    fn historical(
        &self,
        block_id: BlockHashOrNumber,
    ) -> ProviderResult<Option<Box<dyn StateProvider>>> { .. }
}

@claude
Contributor

claude bot commented Jun 16, 2025

Claude finished @kariy's task.


I'll analyze this and get back to you.

@kariy
Member

kariy commented Jun 17, 2025

@claude please add a test case for when requested blocks > available blocks

@claude
Contributor

claude bot commented Jun 17, 2025

Claude finished @kariy's task.


Todo List

  • Read current test implementation in bin/katana/tests/db_prune.rs
  • Read the prune implementation to understand the edge case handling
  • Add test case for when requested blocks > available blocks
  • Commit the changes to the PR branch

Summary

I've successfully added a test case for the scenario where requested blocks exceed available blocks in the database. The new test prune_keep_last_n_blocks_exceeds_available verifies that:

  1. When requesting to keep more blocks than exist (e.g., asking for 25 blocks when only 15 exist)
  2. The prune command prints a warning and returns early without performing any pruning
  3. All historical state data remains intact and accessible
  4. State roots remain unchanged

This ensures the edge case is properly handled and tested, preventing accidental data loss when users specify incorrect block counts.

claude bot and others added 2 commits June 17, 2025 11:57
This test verifies that when the user requests to keep more blocks than
are available in the database, the prune command correctly:
- Prints a warning and returns early
- Performs no pruning operations
- Preserves all historical state data
- Keeps state roots unchanged

Co-authored-by: Ammar Arif <kariy@users.noreply.github.com>
@kariy
Member

kariy commented Jun 17, 2025

let's update the prune_changeset_table function to process the changesets in batches to reduce memory usage.

@claude
Contributor

claude bot commented Jun 17, 2025

Claude encountered an error.


I'll analyze this and get back to you.

1 similar comment

@claude
Contributor

claude bot commented Jun 17, 2025

Claude finished @devin-ai-integration[bot]'s task.


Todo List

  • Check for CLAUDE.md files for repo-specific instructions
  • Read main implementation file (bin/katana/src/cli/db.rs)
  • Review test file (bin/katana/tests/db_prune.rs)
  • Check other changed files for context
  • Analyze code quality and best practices
  • Identify potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage
  • Provide comprehensive review feedback

PR Review: Database Pruning Tool

Thank you for implementing this database pruning tool. The feature addresses an important need for managing database growth. Here's my comprehensive review:

✅ Code Quality and Best Practices

Strengths:

  • Clean separation of concerns with distinct pruning modes
  • Good use of Rust idioms (Result types, pattern matching)
  • Proper error handling with contextual messages
  • Transaction-based operations ensure atomicity
  • Follows the codebase conventions for imports and formatting

Suggestions:

  • Consider adding more detailed documentation for the pruning algorithms, especially the cursor-based iteration logic
  • The prune_changeset_table function could benefit from explaining why we collect keys before updating (avoiding cursor invalidation)

🐛 Potential Bugs or Issues

  1. Edge Case in KeepLastN:

    • When blocks > latest_block, the function returns early with just a warning. This is correct behavior but could be clearer in documentation
  2. Zero Block Check (line 70):

    • The check for blocks == 0 is unnecessary since the clap parser already validates value_parser!(u64).range(1..) ensuring blocks >= 1

⚡ Performance Considerations

Positive:

  • Cursor-based iteration is efficient for bulk operations
  • Collecting keys before updates in prune_changeset_table avoids cursor invalidation
  • Transaction batching minimizes write overhead
  • Already implements batch processing (BATCH_SIZE = 1000) in prune_changeset_table

Concerns:

  • No progress reporting for long-running operations
  • No concurrent operations support - the entire database is locked during pruning
  • Consider allowing configurable batch size for different database sizes

🔒 Security Concerns

  1. Data Loss Risk: While there are warnings, consider adding a --force flag for destructive operations
  2. Path Expansion: Good use of shellexpand for path handling, preventing directory traversal issues
  3. No Backup Mechanism: Consider adding a recommendation to backup before pruning
  4. Permissions: The tool correctly uses appropriate read/write permissions

🧪 Test Coverage

Good Coverage:

  • Tests verify state roots remain unchanged after pruning
  • Both pruning modes are tested
  • Uses realistic data generation with the TrieWriter trait
  • Good test for edge case where requested blocks > available blocks

Coverage Gaps:

  1. Missing test for error cases (database errors, transaction failures)
  2. Missing test for empty database scenario
  3. No integration test for the CLI invocation with error cases
  4. Missing tests for concurrent access scenarios

Test Quality:

  • The recent refactoring to use BlockWriter::insert_block_with_states_and_receipts is excellent - better reflects production usage
  • Good use of historical state root verification instead of just counting entries

Additional Recommendations

  1. Monitoring: Add metrics/logging for:

    • Number of entries pruned
    • Time taken for operation
    • Database size before/after
  2. Validation: Consider adding a --dry-run flag to preview what would be pruned

  3. Recovery: Document recovery procedures if pruning is interrupted

  4. CLI UX: The command structure is good, but consider adding:

    • Confirmation prompt for destructive operations
    • Progress bar for long operations
    • Summary statistics after completion

Summary

This is a solid implementation that addresses the core requirements. The transaction-based approach ensures data integrity, which is crucial for a pruning tool. The main areas for improvement are:

  1. Enhanced test coverage for error cases
  2. Better user feedback during operations (progress reporting)
  3. Documentation improvements
  4. Consider adding safety features like --force flag

The code is well-structured and follows good practices. With the suggested improvements, this will be a robust solution for managing database growth.

@kariy
Member

kariy commented Jun 17, 2025

@claude let's add monitoring to track the progression of the pruning process. then we should show a progress bar to visualize the progression in a user friendly way

@devin-ai-integration
Contributor Author

Devin is currently unreachable - the session may have died.

@claude
Contributor

claude bot commented Jun 17, 2025

Claude finished @kariy's task.


✅ Progress monitoring and visualization added to database pruning

Completed Tasks

  • Read current pruning implementation
  • Add indicatif dependency for progress bars
  • Modify prune_history_table to track progress
  • Modify prune_changeset_table to track progress
  • Add progress bars to main pruning functions
  • Test the implementation
  • Commit the changes

Summary

Successfully implemented progress monitoring and visualization for the database pruning process:

  • Added indicatif dependency (v0.17.8) for professional progress bar visualization
  • Multi-level progress bars showing both overall pruning progress and individual table progress
  • Pre-counting passes to determine total entries for accurate progress tracking
  • Detailed progress information including:
    • Table name being processed
    • Current/total entries processed
    • Elapsed time and processing speed (entries per second)
    • Visual progress bars with cyan/blue styling
  • Non-intrusive progress bars that clear after each table is completed

The implementation maintains the existing batch processing approach for memory efficiency while providing real-time feedback to users during potentially long-running pruning operations.

- Add indicatif dependency for progress bar visualization
- Implement multi-level progress bars showing overall and per-table progress
- Add counting passes to determine total entries for accurate progress tracking
- Show table names, current/total entries, elapsed time, and entries per second
- Visual progress bars with cyan/blue styling for better user experience

Co-authored-by: Ammar Arif <kariy@users.noreply.github.com>
Comment on lines 202 to 211
if let Some((block, _)) = cursor.first()? {
    let mut current_block = block;
    while current_block <= cutoff_block {
        blocks_to_prune += 1;
        if let Some((next_block, _)) = cursor.next()? {
            current_block = next_block;
        } else {
            break;
        }
    }
Contributor

The block counting logic may lead to inaccurate results. The cursor iteration with cursor.next() advances to the next key-value pair in the database, not necessarily the next block number. If multiple trie entries exist for a single block, this approach will count each entry rather than each unique block number. Consider tracking unique block numbers or using a different approach to determine the exact count of blocks to prune.

Suggested change
if let Some((block, _)) = cursor.first()? {
    let mut current_block = block;
    while current_block <= cutoff_block {
        blocks_to_prune += 1;
        if let Some((next_block, _)) = cursor.next()? {
            current_block = next_block;
        } else {
            break;
        }
    }
if let Some((block, _)) = cursor.first()? {
    let mut current_block = block;
    let mut unique_blocks = std::collections::HashSet::new();
    while current_block <= cutoff_block {
        unique_blocks.insert(current_block);
        if let Some((next_block, _)) = cursor.next()? {
            current_block = next_block;
        } else {
            break;
        }
    }
    blocks_to_prune = unique_blocks.len();

Spotted by Diamond

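The entry-vs-block distinction this comment raises can be shown with a standard-library stand-in. The data shape below is illustrative only; the real tables are keyed by block number with trie-specific values.

```rust
use std::collections::HashSet;

// Iterating a cursor over a duplicate-key table visits every (block, entry)
// pair, so a per-iteration counter counts entries, not blocks. This helper
// returns (entry_count, unique_block_count) for everything at or below the
// cutoff, making the discrepancy explicit.
fn count_prunable(rows: &[(u64, &str)], cutoff_block: u64) -> (usize, usize) {
    let in_range = rows.iter().filter(|(b, _)| *b <= cutoff_block);
    let entry_count = in_range.clone().count();
    let unique_blocks: HashSet<u64> = in_range.map(|(b, _)| *b).collect();
    (entry_count, unique_blocks.len())
}
```

With two entries for block 1, an entry-based loop reports one more "block" than actually exists, which is exactly the inaccuracy the suggested change fixes.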

@claude
Contributor

claude bot commented Jun 17, 2025

Claude encountered an error


Todo List

  • Check for CLAUDE.md files for repo-specific instructions
  • Read main implementation file (bin/katana/src/cli/db/prune.rs)
  • Review test file (bin/katana/tests/db_prune.rs)
  • Check other changed files for context
  • Analyze code quality and best practices
  • Identify potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage
  • Provide comprehensive review feedback

I'm analyzing the PR to provide a thorough code review...

@claude
Contributor

claude bot commented Jun 17, 2025

Claude finished @devin-ai-integration[bot]'s task


Todo List

  • Check for CLAUDE.md files for repo-specific instructions
  • Read main implementation file (bin/katana/src/cli/db/prune.rs)
  • Review test file (bin/katana/tests/db_prune.rs)
  • Check other changed files for context
  • Analyze code quality and best practices
  • Identify potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage
  • Provide comprehensive review feedback

PR Review: Database Pruning Tool

Thank you for implementing this database pruning tool with progress monitoring. Here's my comprehensive review:

✅ Code Quality and Best Practices

Strengths:

  • Clean separation of concerns with distinct pruning modes
  • Good use of Rust idioms (Result types, pattern matching)
  • Proper error handling with contextual messages
  • Transaction-based operations ensure atomicity
  • Progress bars provide excellent user feedback
  • Batch processing (1000 entries) prevents memory issues for large databases

Areas for Improvement:

  1. Progress Bar Consistency: prune_all_history uses MultiProgress unnecessarily (only one progress bar shown). Consider simplifying to match prune_keep_last_n
  2. Magic Numbers: Document why BATCH_SIZE = 1000 was chosen
  3. Logging: Replace println! and eprintln! with proper logging infrastructure (tracing crate)
  4. Code Duplication: Extract progress bar style constants to avoid repetition

🐛 Potential Bugs or Issues

  1. Edge Case Handled Well: The blocks == 0 check (line 72) is actually redundant since clap validates value_parser!(u64).range(1..), but it's good defensive programming

  2. Minor Inefficiency: In prune_changeset_table line 263, consider adding an is_empty() method to IntegerSet instead of block_list.select(0).is_none()

  3. Progress Counting: In prune_history_table, the progress isn't tracked per entry, which might make the progress bar less accurate for large datasets

⚡ Performance Considerations

Positive:

  • Cursor-based iteration is efficient for bulk operations
  • Batch processing (1000 entries at a time) prevents memory exhaustion
  • Collecting keys before updates avoids cursor invalidation
  • Transaction batching minimizes write overhead

Suggestions:

  1. Consider making BATCH_SIZE configurable based on available memory
  2. Add estimated time remaining to progress bars for better UX
  3. Consider parallel processing for independent table operations (though this might complicate transaction handling)
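The collect-then-delete batching praised above can be sketched against an in-memory stand-in. A BTreeMap plays the role of the MDBX table here; the real code uses database cursors inside a write transaction, so treat this as an illustration of the pattern, not the PR's actual API.

```rust
use std::collections::BTreeMap;

// Process deletions in fixed-size chunks to bound per-iteration work.
const BATCH_SIZE: usize = 1000;

// Collect the keys at or below the cutoff first (avoiding mutation while
// iterating), then delete them in batches. Returns the number of entries
// removed.
fn prune_below(table: &mut BTreeMap<u64, Vec<u8>>, cutoff_block: u64) -> usize {
    // Pass 1: gather keys to delete; cheap for small integer keys.
    let keys: Vec<u64> = table.range(..=cutoff_block).map(|(k, _)| *k).collect();
    // Pass 2: delete in batches; the real implementation reports progress
    // to the progress bar after each batch.
    for batch in keys.chunks(BATCH_SIZE) {
        for key in batch {
            table.remove(key);
        }
    }
    keys.len()
}
```

Collecting keys up front is what avoids cursor invalidation; the chunked second pass is where a configurable BATCH_SIZE would plug in.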

🔒 Security Concerns

No significant security issues identified:

  • Proper use of shellexpand prevents path traversal
  • Transaction rollback on errors prevents partial state
  • No arbitrary code execution risks
  • Appropriate file permissions used

Recommendation: Add a --dry-run flag to preview operations before execution

🧪 Test Coverage

Good Coverage:

  • Tests verify state roots remain unchanged after pruning
  • Both pruning modes tested
  • Edge case test for when requested blocks > available blocks
  • Uses production-like setup with BlockWriter

Missing Coverage:

  1. Error handling scenarios (database failures, transaction rollbacks)
  2. Empty database scenario
  3. Concurrent access behavior
  4. Very large dataset performance testing
  5. Interrupted pruning recovery

📝 Additional Recommendations

  1. Extract Constants:
const PROGRESS_STYLE: &str = "{msg} {bar:40.cyan/blue} {pos:>7}/{len:7} [{elapsed_precise}] {per_sec}";
const BATCH_SIZE: usize = 1000; // Process in chunks to limit memory usage
  2. Add Metrics: Report total entries pruned, time taken, and space reclaimed

  3. CLI Enhancement: Consider adding:

    • --force flag for destructive operations
    • Confirmation prompt: "This will permanently delete data. Continue? [y/N]"
    • Summary after completion showing space saved
  4. Documentation: Add examples to CLI help text showing common scenarios

Summary

This is a well-implemented feature that addresses database growth concerns effectively. The addition of progress monitoring significantly improves user experience, and the batch processing ensures scalability. The code is clean, follows Rust best practices, and handles edge cases appropriately.

The main improvements would be around consistency (progress bars, logging) and additional safety features (dry-run, confirmation prompts). With these minor enhancements, this will be a robust solution for managing historical trie data.

Verdict: Approve with minor suggestions ✅

@kariy kariy changed the title feat(db): add database pruning tool for historical trie data feat(cli): database pruning tool for historical trie data Jun 17, 2025
@kariy
Member

kariy commented Jun 17, 2025

@claude let's update the PR description to reflect the latest changes.

@claude
Contributor

claude bot commented Jun 17, 2025

Claude finished @kariy's task


Todo List

  • Read main implementation file to understand current features
  • Review test file changes
  • Check Cargo.toml for new dependencies
  • Create updated PR description
  • Update this comment with the new PR description

@claude
Contributor

claude bot commented Jun 17, 2025

Claude finished @devin-ai-integration[bot]'s task


Todo List

  • Check for CLAUDE.md files for repo-specific instructions
  • Read main implementation file (bin/katana/src/cli/db/prune.rs)
  • Review test file (bin/katana/tests/db_prune.rs)
  • Check other changed files for context
  • Analyze code quality and best practices
  • Identify potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage
  • Provide comprehensive review feedback

PR Review: Database Pruning Tool

Thank you for implementing this database pruning tool. The feature addresses an important need for managing database growth. Here's my comprehensive review:

✅ Code Quality and Best Practices

Strengths:

  • Clean separation of concerns with distinct pruning modes
  • Good use of Rust idioms (Result types, pattern matching)
  • Proper error handling with contextual messages
  • Transaction-based operations ensure atomicity
  • Follows the codebase conventions for imports and formatting
  • Excellent progress monitoring with indicatif library
  • Batch processing (1000 entries at a time) prevents memory exhaustion

Areas for Improvement:

  1. Progress Bar Consistency: prune_all_history uses MultiProgress unnecessarily (only one progress bar shown at a time). Consider simplifying to match prune_keep_last_n
  2. Magic Numbers: Document why BATCH_SIZE = 1000 was chosen
  3. Code Comments: Fix the typo in the line 24 comment: "since tha latest block" → "since the latest block"

🐛 Potential Bugs or Issues

  1. Zero Block Check (line 72): While the clap validator ensures blocks >= 1, keeping this check is good defensive programming

  2. Minor Inefficiency (line 261): Consider adding an is_empty() method to IntegerSet for clearer intent:

    if block_list.is_empty() { // Instead of block_list.select(0).is_none()
  3. Batch Processing Edge Case: The final batch processing after the loop (lines 281-287) correctly handles remaining entries ✅

⚡ Performance Considerations

Positive:

  • Cursor-based iteration is efficient for bulk operations
  • Batch processing (1000 entries) prevents memory issues
  • Transaction batching minimizes write overhead
  • Progress bars provide excellent user feedback during long operations

Suggestions:

  1. Consider making BATCH_SIZE configurable based on available memory
  2. The progress counting logic is efficient and provides good UX
  3. Pre-counting entries for progress bars adds overhead but improves user experience - good tradeoff

🔒 Security Concerns

Well Handled:

  • Proper use of shellexpand for path handling prevents directory traversal
  • Transaction rollback on errors prevents partial state
  • No arbitrary code execution risks
  • Appropriate file permissions used

Recommendations:

  1. Consider adding a --dry-run flag to preview operations
  2. Add a confirmation prompt for destructive operations: "This will permanently delete data. Continue? [y/N]"
  3. Consider logging pruned data statistics for audit purposes

🧪 Test Coverage

Excellent Coverage:

  • Tests verify state roots remain unchanged after pruning ✅
  • Both pruning modes tested thoroughly
  • Edge case test for when requested blocks > available blocks ✅
  • Uses production-like setup with BlockWriter::insert_block_with_states_and_receipts
  • Historical state verification using StateFactoryProvider

Missing Coverage:

  1. Error handling scenarios (database failures, transaction rollbacks)
  2. Empty database scenario
  3. Concurrent access behavior
  4. Very large dataset performance testing

📝 Additional Recommendations

  1. User Experience Enhancements:

    • Add estimated time remaining to progress bars
    • Report total entries pruned and disk space reclaimed after completion
    • Consider adding a --force flag for CI/automation scenarios
  2. Documentation:

    • Add examples to CLI help text showing common scenarios
    • Document the pruning algorithm in more detail
    • Add recovery procedures if pruning is interrupted
  3. Code Organization:

    • Extract progress bar styling constants to reduce duplication:
      const PROGRESS_STYLE: &str = "{msg} {bar:40.cyan/blue} {pos:>7}/{len:7} [{elapsed_precise}] {per_sec}";

Summary

This is a well-implemented feature that effectively addresses database growth concerns. The code is clean, follows Rust best practices, and handles edge cases appropriately. The addition of progress monitoring significantly improves user experience, and the batch processing ensures scalability.

Main areas for improvement are around additional safety features (dry-run, confirmation prompts) and enhanced test coverage for error scenarios. With these minor enhancements, this will be a robust solution for managing historical trie data.

Verdict: Approve with minor suggestions ✅

@kariy
Member

kariy commented Jun 17, 2025

@claude can you open issues for adding these safety features: dry-run and confirmation prompts?

@claude
Contributor

claude bot commented Jun 17, 2025

Claude finished @kariy's task


Todo List

  • Understand the request
  • Check my capabilities for creating issues
  • Draft issue descriptions for the safety features

I'm unable to create GitHub issues directly due to my current permissions. However, I can help by drafting comprehensive issue descriptions that you can use to create these issues yourself.

Issue 1: Add --dry-run flag to database pruning tool

Title: feat(katana-db): Add --dry-run flag to prune command for safer operations

Description:
Currently, the katana db prune command immediately executes database pruning operations which permanently delete historical trie data. This poses a risk for users who might accidentally run the wrong command or want to preview the impact before execution.

Feature Request:
Add a --dry-run flag that:

  • Simulates the pruning operation without actually deleting data
  • Reports what would be deleted (number of entries, affected tables, estimated space to be reclaimed)
  • Shows which blocks would be pruned
  • Validates the operation parameters

Example Usage:

# Preview what would be deleted
katana db prune --latest --dry-run -p path/to/db

# Preview keeping last 1000 blocks
katana db prune --keep-last 1000 --dry-run -p path/to/db

Implementation Notes:

  • Modify bin/katana/src/cli/db/prune.rs to add the dry-run flag
  • Count entries that would be deleted without performing actual deletions
  • Display summary statistics after the dry run
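A dry-run report along the lines proposed above could be as simple as a per-table entry count plus a rendered summary. The table names match the PR, but this struct and its methods are hypothetical, sketched only to show the intended output shape.

```rust
// Hypothetical dry-run summary: counts entries that *would* be deleted per
// table without performing any deletions, then renders a human-readable
// report for the user to review before a real run.
struct DryRunReport {
    entries_per_table: Vec<(&'static str, usize)>,
}

impl DryRunReport {
    fn total_entries(&self) -> usize {
        self.entries_per_table.iter().map(|(_, n)| n).sum()
    }

    fn render(&self) -> String {
        let mut out = String::from("Dry run: no data will be deleted.\n");
        for (table, n) in &self.entries_per_table {
            out.push_str(&format!("  {table}: {n} entries would be removed\n"));
        }
        out.push_str(&format!("Total: {} entries", self.total_entries()));
        out
    }
}
```

The counting pass can reuse the same cursor iteration as the real pruning path, simply skipping the delete calls.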

Issue 2: Add confirmation prompt to database pruning tool

Title: feat(katana-db): Add interactive confirmation prompt for pruning operations

Description:
The database pruning tool currently executes immediately when run, which could lead to accidental data loss. Adding an interactive confirmation prompt would provide an additional safety layer.

Feature Request:
Add an interactive confirmation prompt that:

  • Displays a warning about permanent data deletion
  • Shows summary of what will be pruned
  • Requires explicit user confirmation (y/N)
  • Can be bypassed with a --force flag for CI/automation scenarios

Example Interaction:

$ katana db prune --latest -p path/to/db

WARNING: This operation will permanently delete historical trie data.
- Tables affected: 6 (ClassesTrieHistory, ContractsTrieHistory, etc.)
- Estimated entries to delete: 125,432
- Current database size: 2.3 GB

This action cannot be undone. Continue? [y/N]: 

Implementation Notes:

  • Add --force flag to skip confirmation (useful for scripts/CI)
  • Use a crate like dialoguer or inquire for interactive prompts
  • Ensure the prompt clearly communicates the risks
  • Default to "No" if user just presses Enter
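The default-to-No behavior above reduces to a tiny pure function. This is a stdlib sketch of the decision logic only; stdin handling, the --force bypass, and the dialoguer/inquire APIs mentioned in the notes are deliberately left out.

```rust
// Accept only an explicit y/yes (case-insensitive); anything else,
// including pressing Enter on an empty line, means "No".
fn confirmed(answer: &str) -> bool {
    matches!(answer.trim().to_ascii_lowercase().as_str(), "y" | "yes")
}
```

A --force flag would simply skip the prompt and treat the answer as confirmed, which keeps CI and scripted invocations non-interactive.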

Priority: High (safety feature)

These draft issues should help you create the GitHub issues for tracking these important safety enhancements to the database pruning tool.

@kariy kariy merged commit a404ab3 into main Jun 17, 2025
11 checks passed
@kariy kariy deleted the devin/1749799396-database-pruning-tool branch June 17, 2025 15:58
kariy added a commit that referenced this pull request Jun 17, 2025
* feat(db): add database pruning tool for historical trie data

- Add prune subcommand to katana db utility
- Support two modes: prune all history or keep last N blocks
- Include safety checks and proper error handling
- Efficiently handle bulk deletions using database cursors

* feat: add progress monitoring and visualization for database pruning

- Add indicatif dependency for progress bar visualization
- Implement multi-level progress bars showing overall and per-table progress
- Add counting passes to determine total entries for accurate progress tracking
- Show table names, current/total entries, elapsed time, and entries per second
- Visual progress bars with cyan/blue styling for better user experience

---------

Co-authored-by: Ammar Arif <evergreenkary@gmail.com>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>