A comprehensive Rust-based system for managing 20+ blockchain nodes with health monitoring, automated pruning using cosmos-pruner, network-based snapshot management with auto-restore, log monitoring, and Hermes relayer management through a web interface.
- Health Monitoring: Real-time RPC status checks with configurable intervals
- Automated Pruning: Integration with the cosmos-pruner tool for efficient blockchain data management
- Network-Based Snapshot System: Create, restore, and manage network-wide LZ4-compressed blockchain snapshots with cross-node recovery and validator state preservation
- Log Monitoring: Pattern-based log monitoring with configurable alerts and context extraction
- Hermes Management: Smart relayer restarts with RPC-based dependency validation
- Web Interface: RESTful API with comprehensive endpoints for all operations
- HTTP Agent Management: Direct HTTP communication with agents for all operations
- Configuration: Hot-reload capability with multi-server support
- Parallel Operations: Execute maintenance across multiple servers simultaneously
- Dependency Validation: Hermes restarts only when dependent nodes are healthy and synced
- Scheduled Maintenance: Cron-based automation with timezone awareness
- Real-time Monitoring: Continuous health checks with database persistence
- Batch Operations: Execute pruning/restarts across multiple nodes efficiently
- Maintenance Tracking: Track operation status with duration estimates and stuck operation detection
- Cross-Node Recovery: Network-based snapshots allow any node on the same network to restore from shared snapshots
- Validator State Preservation: Auto-restore preserves current validator signing state to prevent double-signing
- Scheduled Snapshots: Automatic network snapshot creation with configurable retention policies
- Emergency Cleanup: Force cleanup of stuck operations and maintenance windows
- Network-Based Naming: Snapshots are named by network (e.g., pirin-1_20250101_120000), enabling cross-node recovery
- Validator Safety: Current validator state is preserved during restore to prevent consensus violations
- LZ4 Compression: Fast background compression with good ratios
- Automatic Backups: Scheduled network snapshot creation
- Retention Management: Configurable cleanup of old network snapshots
- Cross-Node Recovery: Any node on the same network can restore from the same snapshot
- Long Operation Support: 24-hour timeout for large snapshots
- Process Monitoring: Detect stuck pruning processes and silent failures
- Log Pattern Detection: Monitor logs for specific error patterns with context extraction
- Rate-Limited Alerting: Prevent alarm spam with configurable rate limiting
- Maintenance Windows: Visual indication when nodes are undergoing maintenance
- Health Recovery Notifications: Automatic notifications when nodes recover
- Auto-Restore Triggers: Automatic restoration from snapshots when corruption patterns detected
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Web Interface  │  │ Health Monitor  │  │   Maintenance   │
│  (Axum + API)   │  │ (RPC Polling)   │  │    Scheduler    │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
┌────────────────────────────────────────────────────────────┐
│                        Core Engine                          │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────┐  ┌───────┐ │
│  │Config Mgmt  │  │  Database   │  │HTTP Agent│  │  Log  │ │
│  │(Hot Reload) │  │  (SQLite)   │  │ Manager  │  │Monitor│ │
│  └─────────────┘  └─────────────┘  └──────────┘  └───────┘ │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────┐            │
│  │Maintenance  │  │  Network    │  │   Auto   │            │
│  │Tracker      │  │  Snapshot   │  │ Restore  │            │
│  │             │  │  Manager    │  │          │            │
│  └─────────────┘  └─────────────┘  └──────────┘            │
└────────────────────────────────────────────────────────────┘
                              │
┌────────────────────────────────────────────────────────────┐
│                  Blockchain Infrastructure                  │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────┐            │
│  │   Cosmos    │  │   Hermes    │  │  Remote  │            │
│  │   Nodes     │  │  Relayers   │  │ Servers  │            │
│  └─────────────┘  └─────────────┘  └──────────┘            │
└────────────────────────────────────────────────────────────┘
- Rust 1.70+
- cosmos-pruner tool installed on target servers
- HTTP agent access to all blockchain servers
- SQLite3
- lz4 compression tool installed on target servers (for snapshots)
# Clone repository
git clone https://github.com/nolus-protocol/nodes-manager.git
cd nodes-manager
# Build release version
cargo build --release
# Create required directories
mkdir -p data static config
# Set up configuration files
mkdir -p config
# Create your configuration files based on the examples below
# Ensure proper API key setup for HTTP agents
export AGENT_API_KEY="your-secure-api-key-here"
Create config/main.toml with:
host = "0.0.0.0"
port = 8095
check_interval_seconds = 90
rpc_timeout_seconds = 10
alarm_webhook_url = "http://your-n8n-instance/webhook/node-alarm"
hermes_min_uptime_minutes = 5
# Auto-restore trigger words (optional)
auto_restore_trigger_words = [
"AppHash",
"wrong Block.Header.AppHash",
"database corruption",
"state sync failed"
]
# Log monitoring configuration (optional)
log_monitoring_enabled = true
log_monitoring_patterns = [
"Possibly no price is available!",
"failed to lock fees to pay for",
"consensus failure",
"panic:"
]
log_monitoring_interval_minutes = 5
log_monitoring_context_lines = 2
Create files like config/discovery.toml with this structure:
[server]
host = "192.168.11.206"
agent_port = 8745
api_key = "your-secure-api-key-here"
request_timeout_seconds = 300
max_concurrent_requests = 5
[nodes.osmosis-1]
rpc_url = "http://192.168.11.206:26657"
network = "osmosis-1"
server_host = "discovery"
enabled = true
# Pruning configuration
pruning_enabled = true
pruning_schedule = "0 0 6 * * 2" # Tuesdays at 6AM UTC
pruning_keep_blocks = 8000
pruning_keep_versions = 8000
pruning_deploy_path = "/opt/deploy/osmosis"
pruning_service_name = "osmosis"
# Log configuration (for log monitoring)
log_path = "/var/log/osmosis"
truncate_logs_enabled = false
# Network-based snapshot configuration (optional)
snapshots_enabled = true
snapshot_backup_path = "/backup/snapshots/osmosis"
auto_restore_enabled = true
# Scheduled network snapshots (optional)
snapshot_schedule = "0 0 2 * * 0" # Sundays at 2AM UTC
snapshot_retention_count = 7 # Keep 7 most recent network snapshots
[hermes.relay-discovery]
server_host = "discovery"
service_name = "hermes"
log_path = "/var/log/hermes"
restart_schedule = "0 0 16 * * 2" # Tuesdays at 4PM UTC
dependent_nodes = ["discovery-osmosis-1", "discovery-neutron-1"]
./target/release/nodes-manager
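Once the service is up, a quick smoke test against the health and status endpoints (documented in the API reference below) confirms it is reachable. This sketch assumes the default host and port from config/main.toml above:
# Basic liveness check
curl http://localhost:8095/health
# Overall system status (nodes, agents, running operations)
curl http://localhost:8095/api/system/status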
# Get all nodes health
GET /api/nodes/health
# Get specific node health
GET /api/nodes/{name}/health
# Get node health history
GET /api/nodes/{name}/history?limit=50
# Force health check
POST /api/nodes/{name}/check
# Create network snapshot (named by network, e.g., pirin-1_20250101_120000)
POST /api/snapshots/{node_name}/create
# List all network snapshots (any node on network can see same snapshots)
GET /api/snapshots/{node_name}/list
# Restore from latest network snapshot (preserves current validator state)
POST /api/snapshots/{node_name}/restore
# Delete specific network snapshot
DELETE /api/snapshots/{node_name}/{filename}
# Get network snapshot statistics
GET /api/snapshots/{node_name}/stats
# Check auto-restore triggers
POST /api/snapshots/{node_name}/check-restore
# Cleanup old network snapshots (keep N most recent)
POST /api/snapshots/{node_name}/cleanup?retention_count=5
# Execute immediate pruning
POST /api/maintenance/run-now
{
"operation_type": "pruning",
"target_name": "discovery-osmosis-1",
"schedule": "immediate"
}
# Execute immediate network snapshot creation
POST /api/maintenance/run-now
{
"operation_type": "snapshot_creation",
"target_name": "discovery-osmosis-1",
"schedule": "immediate"
}
# Batch pruning multiple nodes
POST /api/maintenance/prune-multiple
{
"node_names": ["discovery-osmosis-1", "enterprise-neutron-1"]
}
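As a command-line sketch of how these maintenance endpoints are called (assuming the default port 8095 and the example node names used above):
# Trigger immediate pruning for a single node
curl -X POST http://localhost:8095/api/maintenance/run-now \
  -H "Content-Type: application/json" \
  -d '{"operation_type": "pruning", "target_name": "discovery-osmosis-1", "schedule": "immediate"}'
# Prune several nodes in one batch
curl -X POST http://localhost:8095/api/maintenance/prune-multiple \
  -H "Content-Type: application/json" \
  -d '{"node_names": ["discovery-osmosis-1", "enterprise-neutron-1"]}'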
# Get maintenance logs
GET /api/maintenance/logs?limit=100
# Get scheduled operations
GET /api/maintenance/schedule
# Schedule network snapshot creation
POST /api/maintenance/schedule-snapshot
{
"operation_type": "snapshot_creation",
"target_name": "discovery-osmosis-1",
"schedule": "0 0 2 * * 0"
}
# Get active maintenance operations
GET /api/maintenance/active
# Get maintenance statistics
GET /api/maintenance/stats
# Get detailed maintenance report
GET /api/maintenance/report
# Check for stuck operations
GET /api/maintenance/stuck
# Emergency kill stuck processes
POST /api/maintenance/kill-stuck
# Emergency clear all maintenance windows
POST /api/maintenance/emergency-clear
# Clear specific node maintenance
POST /api/maintenance/clear/{node_name}
# Get all Hermes instances
GET /api/hermes/instances
# Restart Hermes instance
POST /api/hermes/{name}/restart
# Get Hermes status with uptime
GET /api/hermes/{name}/status
# Restart all Hermes instances
POST /api/hermes/restart-all
# Get all node configurations
GET /api/config/nodes
# Update node configuration
PUT /api/config/nodes/{name}
{
"snapshots_enabled": true,
"snapshot_schedule": "0 0 2 * * 0",
"snapshot_retention_count": 5
}
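For example, updating a node's snapshot settings from the command line (a sketch assuming the default port and the example node name used earlier):
# Enable scheduled snapshots for a node
curl -X PUT http://localhost:8095/api/config/nodes/discovery-osmosis-1 \
  -H "Content-Type: application/json" \
  -d '{"snapshots_enabled": true, "snapshot_schedule": "0 0 2 * * 0", "snapshot_retention_count": 5}'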
# Get all Hermes configurations
GET /api/config/hermes
# Get all server configurations
GET /api/config/servers
# Reload configuration
POST /api/config/reload
# Validate configuration
POST /api/config/validate
# Overall system status
GET /api/system/status
# HTTP agent connections status
GET /api/system/agent-connections
# Running operations
GET /api/system/operations
# Health check endpoint
GET /api/system/health
# Test server connectivity
GET /api/system/connectivity
The system uses the cosmos-pruner tool with extended timeouts for large datasets:
cosmos-pruner prune /opt/deploy/osmosis/data --blocks=8000 --versions=8000
Process:
- Start maintenance tracking (5-hour timeout)
- Stop blockchain service via HTTP agent
- Optional: Truncate logs if enabled
- Execute cosmos-pruner with configured parameters
- Start blockchain service via HTTP agent
- Verify service health
- Send completion notification
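A pruning run can be followed from the outside by polling the maintenance endpoints listed in the API section; a minimal sketch assuming the default port:
# Follow a pruning run while it is in progress
curl http://localhost:8095/api/maintenance/active
# Review the result afterwards
curl "http://localhost:8095/api/maintenance/logs?limit=10"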
Features:
- Network-Based Naming: Snapshots named by network (e.g., pirin-1_20250101_120000) for cross-node compatibility
- Cross-Node Recovery: Any node on the same network can restore from shared network snapshots
- Validator State Preservation: Current validator signing state is backed up and restored to prevent double-signing
- LZ4 Compression: Fast background compression with good ratios
- Automatic Backups: Scheduled network snapshot creation
- Retention Management: Configurable cleanup of old network snapshots
- Long Operation Support: 24-hour timeout for large snapshots
Network Snapshot Process:
- Start maintenance tracking (24-hour timeout)
- Stop blockchain service via HTTP agent
- Create network-named directory:
{network}_{timestamp}
- Copy data and wasm directories (excluding validator state)
- Remove any validator state files from snapshot
- Start blockchain service via HTTP agent
- Apply network retention policy if configured
- Background LZ4 compression (optional)
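Sketched with curl against the snapshot endpoints listed earlier (port and node name are placeholders from the examples above):
# Create a network snapshot for the node's network
curl -X POST http://localhost:8095/api/snapshots/discovery-osmosis-1/create
# List snapshots visible to that network
curl http://localhost:8095/api/snapshots/discovery-osmosis-1/list
# Clean up old snapshots, keeping the 5 most recent
curl -X POST "http://localhost:8095/api/snapshots/discovery-osmosis-1/cleanup?retention_count=5"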
Cross-Node Restore Process:
- Start maintenance tracking (24-hour timeout)
- Stop blockchain service via HTTP agent
- Backup current validator state (critical for validator safety)
- Delete existing data and wasm directories
- Copy data and wasm from network snapshot
- Restore backed up validator state (prevents double-signing)
- Set proper permissions
- Start blockchain service via HTTP agent
- Verify service health
Auto-Restore System:
- Monitors /var/log/{log_path}/out1.log for trigger words
- Automatically restores from latest network snapshot when corruption detected
- Preserves current validator state during auto-restore
- Prevents infinite loops with 2-hour cooldown between attempts
- Sends critical alerts if auto-restore fails
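The restore-trigger check can also be invoked manually via the endpoints shown earlier; a minimal sketch assuming the default port:
# Manually evaluate auto-restore trigger words for a node
curl -X POST http://localhost:8095/api/snapshots/discovery-osmosis-1/check-restore
# If a restore is warranted, it can also be started explicitly
curl -X POST http://localhost:8095/api/snapshots/discovery-osmosis-1/restore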
Features:
- Pattern-Based Detection: Monitor logs for specific error patterns
- Context Extraction: Include configurable lines before/after matches
- Rate-Limited Alerts: Same rate limiting as health alerts (0, 6, 12, 24, 48 hours)
- Healthy Nodes Only: Only monitors logs when nodes are healthy
- HTTP Agent Communication: Each check uses HTTP agent for log access
Configuration:
log_monitoring_enabled = true
log_monitoring_patterns = [
"Possibly no price is available!",
"failed to lock fees to pay for"
]
log_monitoring_interval_minutes = 5
log_monitoring_context_lines = 2
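These settings live in config/main.toml; after editing them, the configuration can be checked and reloaded through the config endpoints listed above, taking advantage of the hot-reload capability:
# Validate the edited configuration before applying it
curl -X POST http://localhost:8095/api/config/validate
# Reload configuration without restarting the service
curl -X POST http://localhost:8095/api/config/reload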
Hermes relayers restart only when ALL dependent nodes are:
- Healthy (RPC status check passes)
- Synced (not catching up)
- Recent (health data less than 5 minutes old)
- Minimum Uptime: a configurable minimum uptime (hermes_min_uptime_minutes) is enforced before a restart is performed
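Before forcing a restart, the same preconditions can be checked by hand using the health and Hermes endpoints above (names taken from the example configuration):
# Check the dependent nodes first
curl http://localhost:8095/api/nodes/discovery-osmosis-1/health
curl http://localhost:8095/api/nodes/discovery-neutron-1/health
# Check the relayer's current status and uptime
curl http://localhost:8095/api/hermes/relay-discovery/status
# Restart once the dependencies look healthy
curl -X POST http://localhost:8095/api/hermes/relay-discovery/restart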
Real-time Status:
- Track all operations with start time and duration estimates
- Visual indication in web interface when nodes are in maintenance
- Automatic cleanup of expired maintenance windows (25-hour maximum)
- Detection of stuck operations with process monitoring
Emergency Features:
- Force kill stuck pruning processes
- Emergency clear all maintenance windows
- Manual maintenance window cleanup per node
- Overdue operation detection (3x estimated duration)
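The corresponding emergency endpoints are listed in the API section; a sketch assuming the default port and example node name:
# Detect operations running far past their estimates
curl http://localhost:8095/api/maintenance/stuck
# Force-kill stuck processes
curl -X POST http://localhost:8095/api/maintenance/kill-stuck
# Clear the maintenance window for a single node
curl -X POST http://localhost:8095/api/maintenance/clear/discovery-osmosis-1
# Last resort: clear all maintenance windows
curl -X POST http://localhost:8095/api/maintenance/emergency-clear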
Direct Communication Model:
- Each operation uses HTTP POST to dedicated agents
- Automatic operation cleanup after completion
- No persistent connection pooling (prevents conflicts)
- Configurable timeouts per server
- Parallel execution across different servers
- Sequential execution on same server for safety
Important: All cron schedules are evaluated in the timezone of the server where the Node Manager is deployed (UTC in the examples in this document).
Time Conversion Example:
- Local time: 10:00 EEST (UTC+3)
- Config schedule: "0 0 7 * * 2" (7:00 AM UTC)
- Result: Runs at 10:00 AM local time
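When writing schedules, a quick check of the server clock helps avoid off-by-hours mistakes; a small shell sketch using the standard date utility:
# Show the current time in UTC and in the server's local timezone
date -u "+%H:%M %Z"
date "+%H:%M %Z"
# Example: to run at 10:00 EEST (UTC+3) on a UTC server, schedule 07:00 UTC -> "0 0 7 * * 2"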
# Create snapshot on node1 - creates network-based snapshot
curl -X POST http://localhost:8095/api/snapshots/pirin-node-1/create
# Result: Creates snapshot named "pirin-1_20250101_120000"
# This snapshot can be used by ANY node on pirin-1 network
# Node7 can restore from snapshot created by Node1
curl -X POST http://localhost:8095/api/snapshots/pirin-node-7/restore
# Result: Restores from latest pirin-1 network snapshot
# Preserves pirin-node-7's current validator state
# During restore, the system:
# 1. Backs up current validator state from pirin-node-7
# 2. Restores blockchain data from network snapshot
# 3. Restores pirin-node-7's validator state (not from snapshot)
# 4. Node resumes with correct signing state - no double-signing risk
# Overall system status with network snapshot info
curl http://localhost:8095/api/system/status
# Maintenance tracking status
curl http://localhost:8095/api/maintenance/active
# Stuck operation detection
curl http://localhost:8095/api/maintenance/stuck
# Network snapshot statistics
curl http://localhost:8095/api/snapshots/{node_name}/stats
curl http://localhost:8095/health
- Health checks run every 90 seconds (configurable)
- Log monitoring runs every 5 minutes (configurable)
- Failed operations are logged with detailed error messages
- HTTP agent connection failures automatically trigger retries
- Database cleanup runs hourly for old records
- Maintenance windows automatically expire after 25 hours
- Network snapshot operations support up to 24-hour timeouts
- Auto-restore attempts have 2-hour cooldown periods
- Cross-node recovery capability eliminates single points of failure
- API key authentication for all HTTP agent communications
- Config files may contain sensitive information
- Use firewalls to restrict API access
- Monitor HTTP agent connection limits per server
- Regular security updates for all dependencies
- Network snapshot backup paths should be secured
- Log monitoring may capture sensitive information in context
- Validator state files are never included in shared snapshots
- Database: Regular backups of data/nodes.db
- Configuration: Back up config/*.toml files separately
- Network Snapshots: Configure separate backup storage for network snapshots
- Validator States: Individual validator states are preserved per node
- Log Rotation: Set up log rotation for maintenance logs
- Database: ~10-50MB for typical deployments
- Logs: Variable based on retention policies
- Network Snapshots: Can be very large (GBs to TBs depending on blockchain data)
- Per-network Storage: One snapshot location serves all nodes on same network
- LZ4 Compression: Faster than gzip, good balance of speed/compression
- Network Snapshot Retention: Configure appropriate retention counts to manage disk usage
- Log Monitoring: Monitor disk I/O impact of frequent log reads via HTTP agents
- Maintenance Windows: Plan maintenance schedules to avoid conflicts
- Cross-Node Recovery: Reduces individual node storage requirements
- Health checks: 20+ nodes in <5 seconds (parallel execution)
- HTTP agent operations: Direct communication per operation
- Database: SQLite with indexed queries for fast access
- Memory usage: ~50-100MB typical operation (includes network snapshot management)
- Pruning operations: 10-300 minutes depending on node size (5-hour timeout)
- Network snapshot creation: 30-1440 minutes depending on data size (24-hour timeout)
- Cross-node restore: 30-60 minutes depending on snapshot size with validator state preservation
- Log monitoring: ~1-5 seconds per node every 5 minutes via HTTP agents
- LZ4 compression: 50-200 MB/s typical compression speed
[nodes.pirin-mainnet]
# Basic configuration
rpc_url = "http://192.168.1.100:26657"
network = "pirin-1" # Network name used for snapshot naming
server_host = "mainnet-server"
enabled = true
# Pruning configuration
pruning_enabled = true
pruning_schedule = "0 0 6 * * 2" # Tuesdays at 6AM
pruning_keep_blocks = 8000
pruning_keep_versions = 8000
pruning_deploy_path = "/opt/deploy/pirin"
pruning_service_name = "pirin"
# Log configuration
log_path = "/var/log/pirin"
truncate_logs_enabled = false
# Network-based snapshot configuration
snapshots_enabled = true
snapshot_backup_path = "/backup/snapshots/pirin-network" # Shared location for network
auto_restore_enabled = true
# Scheduled network snapshots
snapshot_schedule = "0 0 2 * * 0" # Weekly on Sunday at 2AM
snapshot_retention_count = 4 # Keep 4 most recent network snapshots
# All nodes on pirin-1 network share the same snapshot location
[nodes.pirin-node-1]
network = "pirin-1"
snapshot_backup_path = "/backup/snapshots/pirin-network"
snapshots_enabled = true # Can create snapshots
[nodes.pirin-node-2]
network = "pirin-1"
snapshot_backup_path = "/backup/snapshots/pirin-network"
snapshots_enabled = false # Only restores, doesn't create
[nodes.pirin-node-7]
network = "pirin-1"
snapshot_backup_path = "/backup/snapshots/pirin-network"
auto_restore_enabled = true # Can auto-restore from network snapshots
# Common error patterns to monitor
log_monitoring_patterns = [
# Price feed issues
"Possibly no price is available!",
"failed to lock fees to pay for",
# Consensus issues
"consensus failure",
"failed to verify block",
"invalid block",
# System issues
"panic:",
"out of memory",
"disk full",
"database corruption",
# Network issues
"connection refused",
"timeout",
"network unreachable"
]
When the service is running, comprehensive API documentation is available at:
GET /api/docs
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details.
For issues, questions, or contributions:
- GitHub Issues: Create an issue
- API Documentation: GET /api/docs when the service is running
- cosmos-pruner - Blockchain state pruning tool
- Hermes - IBC relayer
- Cosmos SDK - Blockchain application framework
- LZ4 - Fast compression algorithm