Skip to content

Conversation

@Zorglub4242
Copy link

Summary

This PR implements a comprehensive solution for running Kaspa archive nodes on HDD storage, addressing Issue #681.

The implementation adds two main features:

  1. RocksDB Preset System - Pre-configured database settings optimized for different storage types (SSD vs HDD)
  2. WAL Directory Support - Ability to place Write-Ahead Logs on separate high-speed storage for hybrid setups

These features enable efficient archive nodes on HDDs while maintaining the option for hybrid NVMe+HDD configurations.

Features

1. RocksDB Preset System (--rocksdb-preset)

Two configuration presets for different deployment scenarios:

Default Preset (SSD/NVMe):

  • 64MB write buffer
  • Standard compression
  • Optimized for fast storage
  • Default behavior (no flag needed)

Archive Preset (HDD):

  • 256MB write buffer (4x larger for better batching)
  • Aggressive compression (LZ4 + ZSTD with 64KB dictionaries)
  • BlobDB enabled for large values (>512 bytes)
  • 256MB SST files (reduces file count: 500K → 16K for 4TB)
  • Rate limiting (12 MB/s) to prevent I/O spikes
  • Based on production testing by @Callidon

Usage:

# Default (SSD/NVMe) - no flag needed
kaspad --archival

# Archive preset for HDD
kaspad --archival --rocksdb-preset=archive

2. WAL Directory Support (--rocksdb-wal-dir)

Enables hybrid storage configurations by placing Write-Ahead Logs on fast storage (SSD/NVME or memory based like tmpfs) while keeping database files on HDDs.
Enable faster synchronization process on archival nodes.
On regular nodes, using tmpfs (or lmDisk on windows) allow “small” performance improvements but also reduce wear / tear of nvme / SSD storage devices.
Using tmpfs or memory based storage could lead to database corruption on restart ! Use with caution... (A wal recovery process was tested but would require more extensive work / review so, if needed, we could have it implemented on a separate issue)

Features:

  • Custom WAL directory location
  • Auto-generated unique subdirectories per database (consensus, meta, utxoindex) : When using fast wal storage (eg tmpfs), this avoid some race conditions experienced during testing.
  • Works with both presets

Usage:

# Place WAL on NVMe, data on HDD
kaspad --archival \
       --rocksdb-preset=archive \
       --rocksdb-wal-dir=/mnt/nvme/kaspa-wal

# Hybrid setup for maximum performance
kaspad --archival \
       --rocksdb-preset=archive \
       --rocksdb-wal-dir=/mnt/nvme/kaspa-wal \
       --appdir=/mnt/hdd/kaspa-data

Benefits:

  • Fast write bursts to NVMe WAL
  • Bulk data on cheaper HDD storage
  • Optimal I/O distribution
  • Cost-effective for large archives

Implementation Details

Files Modified

Database Layer:

  • database/src/db.rs - Export RocksDbPreset
  • database/src/db/conn_builder.rs - Add preset and wal_dir support
  • database/src/db/rocksdb_preset.rs - NEW - Preset configurations
  • database/src/lib.rs - Module exports

Application Layer:

  • kaspad/src/args.rs - CLI arguments for --rocksdb-preset and --rocksdb-wal-dir
  • kaspad/src/daemon.rs - Parse and apply configuration
  • consensus/src/consensus/factory.rs - Pass settings to consensus databases

Testing:

  • testing/integration/src/consensus_integration_tests.rs - Updated test parameters

Archive Preset Configuration Details

Based on extensive testing and community feedback (Issue #681):

Memory & Write Buffers:

  • write_buffer_size: 256MB (4x default)
  • Re-applied after optimize_level_style_compaction() to prevent override

LSM Tree Structure:

  • target_file_size_base: 256MB (reduces file count dramatically)
  • target_file_size_multiplier: 1 (consistent size across levels)
  • max_bytes_for_level_base: 1GB
  • level_compaction_dynamic_level_bytes: true (minimizes space amplification)

Compaction:

  • level_zero_file_num_compaction_trigger: 1 (minimize write amplification)
  • compaction_pri: OldestSmallestSeqFirst
  • compaction_readahead_size: 4MB (optimized for sequential HDD reads)

Compression Strategy:

  • Default: LZ4 (fast)
  • Bottommost level: ZSTD level 22 with 64KB dictionaries
  • zstd_max_train_bytes: 8MB (125x dictionary size)

Block Cache:

  • 2GB LRU cache for frequently accessed blocks
  • Partitioned Bloom filters (18 bits per key)
  • Two-level index search for large databases
  • 256KB block size (better for sequential HDD reads)

BlobDB:

  • Enabled for values >512 bytes
  • 256MB blob files
  • ZSTD compression
  • Garbage collection at 90% age cutoff

Rate Limiting:

  • 12 MB/s for background writes (prevents HDD saturation)

Testing

Unit Tests

  • ✅ Preset parsing (default, archive, invalid)
  • ✅ Preset display formatting
  • ✅ Configuration application to RocksDB options

Integration Tests

  • ✅ Consensus integration tests updated with new parameters
  • ✅ Verified backward compatibility (no preset = default behavior)
  • ✅ Passed cargo fmt, cargo check & cargo clippy

Production Testing

Archive preset based on real-world deployment:

  • Tested by @Callidon
  • Successfully running on HDD storage
  • Proven effective for large archives
  • Tested on a local hdd to confirm it stable (several day run)

Backward Compatibility

Fully backward compatible:

  • No flags = default preset (current behavior)
  • Existing deployments unaffected
  • WAL directory optional (defaults to database directory)

Performance Impact

Archive Preset Benefits (HDD):

  • 30-50% better compression (ZSTD on bottommost level)
  • Reduced write amplification (larger buffers, aggressive compaction)
  • 96% fewer files (256MB SST files vs default)
  • Smoother I/O (rate limiting prevents spikes)
  • Better caching (2GB block cache)

Hybrid Setup Benefits (NVMe + HDD):

  • Fast write bursts (WAL on NVMe)
  • Cost-effective bulk storage (data on HDD)
  • Minimal latency for write-heavy workloads

Documentation

User-facing documentation has been kept separate from code and will be added to the wiki/docs repository as appropriate.

Migration Notes

Existing Archive Nodes:
Compression settings cannot be changed retroactively. For optimal results with the archive preset:

  1. Fresh deployments: Use --rocksdb-preset=archive from the start
  2. Existing nodes: Continue with current settings, or start fresh if storage savings are critical

Note: Switching presets on an existing database will apply new settings to new data only. For full benefits, a fresh sync is recommended.

Related Issues

Checklist

  • Code follows project style guidelines
  • Unit tests added/updated
  • Integration tests updated
  • Backward compatibility maintained
  • CLI arguments documented (--help text)
  • Performance tested in production
  • No breaking changes

@Zorglub4242 Zorglub4242 force-pushed the feature/hdd-archive-optimization branch 2 times, most recently from a446af4 to 2e3ee80 Compare December 1, 2025 18:08
@michaelsutton
Copy link
Contributor

Exciting to see this!

This commit introduces a comprehensive solution for running Kaspa archive nodes
on HDD storage, addressing performance challenges through two key features:

1. RocksDB Preset System
   - Default preset: Optimized for SSD/NVMe (existing behavior)
   - Archive preset: Optimized for HDD with:
     * 256MB write buffer (reduced write amplification)
     * BlobDB for large values (efficient UTXO storage)
     * Aggressive compression (2.5x space savings)
     * 256MB SST files (reduced file count from 500K to 16K)
     * Rate limiting (100 MB/s to prevent I/O saturation)

2. WAL Directory Support
   - Allows placing Write-Ahead Logs on separate fast storage
   - Recommended: NVMe for WAL + HDD for data
   - Provides near-SSD performance for writes while using HDD for bulk storage

Configuration:
- --rocksdb-preset=archive    Enable HDD optimizations
- --rocksdb-wal-dir=/path     Place WAL on fast storage

This enables archive nodes to run efficiently on HDD, reducing storage costs
from ~$400 (4TB NVMe) to ~$80 (8TB HDD) while maintaining acceptable performance.
@Zorglub4242 Zorglub4242 force-pushed the feature/hdd-archive-optimization branch from 2e3ee80 to 7c7fb2b Compare December 3, 2025 08:54
@Zorglub4242
Copy link
Author

Zorglub4242 commented Dec 3, 2025

Exciting to see this!

thanks, credits to callidon to for his settings. did a lot of tweaking / testing but his settings where already top.
It may be good also for nvme users btw (reduce impacts / prolong lifespan).
It may be interesting to open another issue to have a WAL corruption recovery process implemented / tested (would allow to safely use tmpfs even on production nodes)

Got a clippy issue on the automated tests so resubmitted the commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RocksDB] Preset & RAM‑backed Disk Virtualization

2 participants