Design Draft: Refactor RocksDB Schema for Reduced Read/Write Amplification #5087

@eval-exec

1. Problem Statement

The current CKB database schema relies heavily on the Block Hash as the primary key for block-related data (Headers, Bodies, Uncles, etc.). The Block Hash is unique and essential for verifying data integrity, but using it as a key in RocksDB (an LSM-tree-based storage engine) causes significant performance problems:

  • Random Writes: Block hashes are effectively uniformly random, so newly inserted blocks land at arbitrary positions in the key space. Every flushed SSTable therefore spans almost the whole key range and overlaps with most existing files.
  • Write Amplification: Because of that overlap, RocksDB has to merge and rewrite large amounts of existing data during compaction to keep the SSTables sorted.
  • Read Amplification: Data for neighbouring blocks is scattered across many SSTables, which increases the cost of point lookups and makes range scans by height impossible without a separate index.

2. Proposed Solution

The core proposal is to refactor the database schema to use Composite Keys based on Block Number (Big Endian) + Block Hash.

Why Block Number?

Block numbers increase monotonically along the chain (competing fork blocks can share a number, which is why the hash stays in the key). By using the block number as the prefix of the key:

  1. Sequential Writes: New blocks are appended at the end of the key space, so freshly flushed SSTables rarely overlap older ones.
  2. Reduced Compaction: With little overlap between files, compaction has far less existing data to rewrite, which lowers Write Amplification.
  3. Data Locality: Blocks with similar heights are stored close together, improving cache efficiency and range scan performance.
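The big-endian requirement follows from RocksDB's default bytewise key comparator: only a big-endian encoding makes byte order agree with numeric order. A minimal sketch of that property, using hypothetical helper names (`number_key`, `number_key_le`, `order_preserved`) that are not part of CKB:

```rust
/// Encode a block number as 8 big-endian bytes (the proposed key prefix).
fn number_key(number: u64) -> [u8; 8] {
    number.to_be_bytes()
}

/// Little-endian encoding, shown only for contrast.
fn number_key_le(number: u64) -> [u8; 8] {
    number.to_le_bytes()
}

/// Returns true when the byte-wise (lexicographic) order of the encoded
/// keys agrees with the numeric order of the two block numbers.
fn order_preserved(a: u64, b: u64, encode: fn(u64) -> [u8; 8]) -> bool {
    (a < b) == (encode(a) < encode(b))
}
```

With this sketch, `order_preserved(255, 256, number_key)` holds, while the little-endian variant fails for the same pair because its first byte drops from 0xff to 0x00.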

3. Detailed Schema Changes

The refactoring introduces a new key structure for block-related Column Families.

3.1 New COLUMN_INDEX (Col 0)

This acts as the primary "index" to map random hashes to sequential numbers.

  • Key: Block Hash (32 bytes)
  • Value:
    • Block Number (8 bytes, Big Endian)
    • Main Chain Flag (1 byte): 0x01 if on main chain, 0x00 otherwise.
  • Benefit:
    • Allows looking up the Block Number when only the hash is known.
    • Answers is_main_chain(hash) from the same point lookup, with no additional read.
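The 9-byte value layout described above could be encoded and decoded roughly as follows. This is a sketch under the stated layout only; the type `IndexValue` and both function names are assumptions for illustration, not existing CKB code.

```rust
/// Hypothetical in-memory form of a COLUMN_INDEX value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct IndexValue {
    number: u64,
    on_main_chain: bool,
}

/// Serialize as: block number (8 bytes, big endian) + main chain flag (1 byte).
fn encode_index_value(v: IndexValue) -> [u8; 9] {
    let mut out = [0u8; 9];
    out[..8].copy_from_slice(&v.number.to_be_bytes());
    out[8] = if v.on_main_chain { 0x01 } else { 0x00 };
    out
}

/// Parse a stored value; returns None on a wrong length or an unknown flag byte.
fn decode_index_value(bytes: &[u8]) -> Option<IndexValue> {
    if bytes.len() != 9 {
        return None;
    }
    let mut number = [0u8; 8];
    number.copy_from_slice(&bytes[..8]);
    let on_main_chain = match bytes[8] {
        0x00 => false,
        0x01 => true,
        _ => return None,
    };
    Some(IndexValue {
        number: u64::from_be_bytes(number),
        on_main_chain,
    })
}
```

A single read of this value gives both the block number needed to build the composite key and the main chain status.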

3.2 Block Data Columns (Cols 1, 2, 3, 6, 7, 8, 15, 17, 18)

These columns store the actual block content. They now use a composite key.

  • Key Format: Block Number (BE) + Block Hash
  • Affected Columns:
    • COLUMN_BLOCK_HEADER (1): Header + Hash
    • COLUMN_BLOCK_BODY (2): Transactions
    • COLUMN_BLOCK_UNCLE (3): Uncle Blocks
    • COLUMN_BLOCK_EXT (6): Block Extension (verified, total difficulty)
    • COLUMN_BLOCK_PROPOSAL_IDS (7)
    • COLUMN_BLOCK_EPOCH (8)
    • COLUMN_BLOCK_EXTENSION (15)
    • COLUMN_BLOCK_FILTER (17)
    • COLUMN_BLOCK_FILTER_HASH (18)
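The composite key format for these columns can be sketched as a fixed 40-byte array. The names `block_key` and `parse_block_key` are hypothetical, and the sketch assumes a 32-byte hash as stated for COLUMN_INDEX.

```rust
/// Length of the composite key: 8-byte number + 32-byte hash.
const BLOCK_KEY_LEN: usize = 8 + 32;

/// Build the composite key: Block Number (big endian) followed by Block Hash.
fn block_key(number: u64, hash: &[u8; 32]) -> [u8; BLOCK_KEY_LEN] {
    let mut key = [0u8; BLOCK_KEY_LEN];
    key[..8].copy_from_slice(&number.to_be_bytes());
    key[8..].copy_from_slice(hash);
    key
}

/// Split a composite key back into (number, hash); None if the length is wrong.
fn parse_block_key(key: &[u8]) -> Option<(u64, [u8; 32])> {
    if key.len() != BLOCK_KEY_LEN {
        return None;
    }
    let mut number = [0u8; 8];
    number.copy_from_slice(&key[..8]);
    let mut hash = [0u8; 32];
    hash.copy_from_slice(&key[8..]);
    Some((u64::from_be_bytes(number), hash))
}
```

Because the number comes first, keys for a lower block always sort before keys for a higher block, regardless of the hash bytes.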

3.3 Other Changes

  • COLUMN_NUMBER_HASH (13): Deprecated. A prefix scan over the composite keys yields the number -> hash mapping directly, and it also covers forks, since every hash stored for a given number shares the same key prefix.
  • Unchanged Columns: Columns that don't key off blocks (e.g., COLUMN_META, COLUMN_CELL) remain largely unchanged or have minor adjustments.
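The number -> hash lookup that replaces COLUMN_NUMBER_HASH amounts to a prefix scan. The sketch below uses an in-memory `BTreeMap` as a stand-in for an ordered RocksDB column family (an assumption made to keep the example self-contained); `Store`, `block_key_vec`, and `hashes_at` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Stand-in for one ordered column family: sorted byte keys to byte values.
type Store = BTreeMap<Vec<u8>, Vec<u8>>;

/// Composite key as a Vec: Block Number (big endian) + Block Hash.
fn block_key_vec(number: u64, hash: &[u8; 32]) -> Vec<u8> {
    let mut key = Vec::with_capacity(40);
    key.extend_from_slice(&number.to_be_bytes());
    key.extend_from_slice(hash);
    key
}

/// Collect every block hash stored under `number`, fork blocks included,
/// by seeking to the 8-byte prefix and reading while the prefix matches.
fn hashes_at(store: &Store, number: u64) -> Vec<[u8; 32]> {
    let prefix = number.to_be_bytes();
    store
        .range(prefix.to_vec()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        .filter(|(key, _)| key.len() == 40) // skip malformed keys
        .map(|(key, _)| {
            let mut hash = [0u8; 32];
            hash.copy_from_slice(&key[8..40]);
            hash
        })
        .collect()
}
```

The scan returns all candidates at a height; deciding which one is on the main chain would still rely on the flag stored in COLUMN_INDEX.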

4. Migration Strategy

(To be determined; still under consideration.)

5. Benefits Summary

  • Performance: Higher write throughput and lower latency during block synchronization are expected, because writes become sequential.
  • Resource Usage: Lower CPU and I/O usage are expected, because compaction has less data to rewrite.
