diff --git a/core/summary.md b/core/summary.md new file mode 100644 index 000000000..17fc197dc --- /dev/null +++ b/core/summary.md @@ -0,0 +1,53 @@ +# Core Crate Summary + +The core crate contains foundational infrastructure and shared utilities that +all other crates depend on. We use this crate for shared types, utilities, and +abstractions that need to be accessible across the entire project. + +## What This Crate Contains + +- **Substate System** (`src/substate.rs`): Type-safe access patterns for Redux + state slicing and mutation control +- **WASM Threading** (`src/thread.rs`): Main thread task delegation system + needed for browser environments +- **Network Configuration** (`src/network.rs`): Static network constants and + config for mainnet/devnet +- **Core Domain Types**: Basic blockchain types (blocks, SNARKs, requests, + consensus) that everything uses + - `src/block/` - Block structures and validation + - `src/snark/` - SNARK work and job management + - `src/transaction/` - Transaction helper types and metadata + - `src/consensus.rs` - Consensus types and fork decision logic +- **Request Management** (`src/requests/`): Type-safe request ID generation and + lifecycle tracking +- **Channel Wrappers** (`src/channels.rs`): Abstractions over flume channels for + message passing +- **Distributed Pool** (`src/distributed_pool.rs`): BTreeMap-based data + structure for network-synchronized state + +## Technical Debt and Issues + +### Medium Priority + +**Hardcoded Network Constants** + +- Fork constants are hardcoded in the `NetworkConfig` struct +- Impact: Can't deploy flexibly, need code changes for network updates + +### Low Priority + +**Unsafe Redux Access** + +- `unsafe_get_state()` method in `Substate` struct breaks Redux safety +- Only used by the transaction pool state machine which requires refactoring + +**Inconsistent Error Handling** + +- Various instances of `panic!`, `unwrap()`, or `expect()` calls throughout core + +## Known Limitations + +1. 
**Network Config**: Can't dynamically configure network parameters without + code changes +2. **WASM Constraints**: Browser limitations require specialized threading + patterns diff --git a/docs/handover/README.md b/docs/handover/README.md new file mode 100644 index 000000000..e41e0ddea --- /dev/null +++ b/docs/handover/README.md @@ -0,0 +1,449 @@ +# OpenMina Handover Documentation + +This directory contains comprehensive documentation for understanding and +working with the OpenMina codebase. The documents are designed to provide a +structured onboarding experience for developers new to the project. + +## Quick Start + +### 👋 New to OpenMina? + +**Start here**: [Architecture Walkthrough](architecture-walkthrough.md) → +[State Machine Structure](state-machine-structure.md) → +[State Machine Patterns](state-machine-patterns.md) → +[Project Organization](organization.md) + +### 🔍 Looking for Something Specific? + +- **Add new features**: + [State Machine Development Guide](state-machine-development-guide.md) - + includes quick reference checklists +- **Add RPC endpoints**: [Adding RPC Endpoints](adding-rpc-endpoints.md) - HTTP + routing and service patterns +- **Testing framework**: [Testing Infrastructure](testing-infrastructure.md) - + scenario-based testing with extensive examples +- **Browser deployment**: [Webnode Implementation](webnode.md) - WebAssembly + build target for running OpenMina in browsers +- **Ledger implementation overview**: [Ledger Crate](ledger-crate.md) - OCaml + port with Rust adaptations +- **Service integration**: [Services](services.md) - complete service inventory + and patterns + +### 📚 Quick Reference + +- **Architecture**: Redux pattern, actions, reducers, effects - see + [Glossary](#glossary-of-key-terms) for definitions +- **Services**: External I/O, threading, event-driven communication - see + [Services](services.md) for complete inventory +- **Technical Debt**: [Services](services-technical-debt.md) (blocking + operations, error 
handling) | [State Machine](state-machine-technical-debt.md) + (architecture migration, anti-patterns) +- **Testing**: Scenario-based, multi-node simulation, fuzzing - see + [Testing Infrastructure](testing-infrastructure.md) for extensive test + scenarios + +## Document Overview + +- [Architecture Walkthrough](#architecture-walkthrough) +- [State Machine Structure](#state-machine-structure) +- [State Machine Patterns](#state-machine-patterns) +- [Project Organization](#project-organization) +- [Services](#services) +- [Ledger Crate](#ledger-crate) +- [Testing Infrastructure](#testing-infrastructure) +- [State Machine Development Guide](#state-machine-development-guide) +- [State Machine Debugging Guide](#state-machine-debugging-guide) +- [Adding RPC Endpoints](#adding-rpc-endpoints) +- [Fuzzing Infrastructure](#fuzzing-infrastructure) +- [Services Technical Debt](#services-technical-debt) +- [State Machine Technical Debt](#state-machine-technical-debt) +- [Webnode Implementation](#webnode-implementation) +- [Circuits](#circuits) +- [Debug Block Proof Generation](#debug-block-proof-generation) +- [Persistence](#persistence) +- [Mainnet Readiness](#mainnet-readiness) +- [Release Process](#release-process) +- [Component Summaries](#component-summaries) +- [Git Workflow](#git-workflow) +- [P2P Evolution Plan](#p2p-evolution-plan) +- [OCaml Coordination](#ocaml-coordination) + +--- + +### [Architecture Walkthrough](architecture-walkthrough.md) + +**Start here** - Provides a high-level overview of the OpenMina architecture, +including: + +- Architecture philosophy and design principles +- Redux-style state machine architecture +- Core concepts (actions, enabling conditions, reducers, effects) +- Network configuration system (devnet/mainnet) +- Key component overview +- Development guidelines + +### [State Machine Structure](state-machine-structure.md) + +Deep dive into the state machine implementation details: + +- Action types and hierarchies +- Reducer patterns (new 
vs old style) +- Substate contexts and access control +- Callback systems +- Migration from old to new architecture + +### [State Machine Patterns](state-machine-patterns.md) + +Analysis of common patterns across OpenMina's state machines: + +- Lifecycle Pattern (Init/Pending/Success/Error) - Most common for async + operations +- Request/Response Pattern - For communication-heavy components +- Custom Domain-Specific Patterns - For complex workflows +- Pattern selection guide and implementation best practices + +### [Project Organization](organization.md) + +Overview of the codebase structure and component purposes: + +- Entry points and CLI structure +- Major components (node, p2p, snark, ledger) +- Supporting libraries and cryptographic primitives +- Development tools and utilities +- Dependency hierarchy + +### [Services](services.md) + +Detailed documentation of all system services: + +- Service architecture principles +- Complete inventory of services +- Service trait definitions vs implementations +- Threading and communication patterns +- Service lifecycle management + +### [Ledger Crate](ledger-crate.md) + +High-level overview of the ledger implementation: + +- Direct port from OCaml with Rust adaptations +- Core components (BaseLedger, Mask, Database) +- Transaction processing and validation +- Proof system integration +- Future refactoring plans + +### [Testing Infrastructure](testing-infrastructure.md) + +Comprehensive testing approaches and tools: + +- Scenario-based testing +- Multi-node simulation +- Fuzz testing and differential testing +- Debugging capabilities +- Testing best practices + +### [State Machine Development Guide](state-machine-development-guide.md) + +Practical guide for implementing features with Redux patterns: + +- Making changes to existing components +- Adding new state machines and actions +- Component communication patterns + +### [State Machine Debugging Guide](state-machine-debugging-guide.md) + +Comprehensive troubleshooting 
and investigation tools: + +- ActionEvent macro and structured logging +- Recording, replay, and testing frameworks +- Common error patterns and solutions + +### [Adding RPC Endpoints](adding-rpc-endpoints.md) + +Focused guide for RPC-specific implementation patterns: + +- RPC request/response type system +- HTTP routing with Warp framework +- Service interface integration +- WASM API exposure patterns + +### [Fuzzing Infrastructure](fuzzing.md) + +Basic fuzzing infrastructure for transaction processing (documentation +incomplete): + +- Differential testing setup against OCaml implementation +- Limited mutation strategies and reproduction capabilities +- Basic debugging tools for fuzzer +- Note: Document is incomplete and contains unverified claims + +### [Services Technical Debt](services-technical-debt.md) + +Analysis of technical debt in the services layer: + +- Service-by-service debt inventory +- Cross-cutting concerns (error handling, blocking operations) +- Prioritized recommendations for improvements +- Critical issues including intentional panics and synchronous operations + +### [State Machine Technical Debt](state-machine-technical-debt.md) + +Systemic architectural issues in state machine implementations: + +- Architecture migration status (old vs new style) +- Anti-patterns and monolithic reducers +- Enabling conditions and service integration issues +- Safety and linting improvements (including clippy lints) +- Prioritized refactoring roadmap + +### [Webnode Implementation](webnode.md) + +WebAssembly build target for browser deployment: + +- WASM compilation with browser-specific threading adaptations +- WebRTC-based P2P networking with pull-based protocol +- JavaScript API for web application integration +- Block production capabilities with browser-based SNARK proving +- Technical constraints and workarounds for browser environment + +### [Circuits](circuits.md) + +Circuit generation process and distribution: + +- Circuit generation requires OCaml 
(OpenMina fork) +- Circuit blob repository and GitHub releases +- On-demand downloading and caching system +- Network-specific circuit configuration +- Verifier index loading and validation + +### [Debug Block Proof Generation](debug-block-proof-generation.md) + +Technical procedure for debugging failed block proofs: + +- Decrypting and preparing failed proof dumps +- Running proof generation in both Rust and OCaml +- Comparing outputs for debugging discrepancies + +### [Persistence](persistence.md) + +Design for ledger persistence (not yet implemented): + +- Memory reduction strategy for mainnet scale +- Fast restart capabilities +- SNARK verification result caching +- Critical for webnode browser constraints + +### [Mainnet Readiness](mainnet-readiness.md) + +Requirements and gaps for mainnet deployment: + +- Critical missing features (persistence, wide merkle queries) +- Security audit requirements and error sink service integration +- Protocol compliance gaps +- Webnode-specific requirements +- Future compatibility considerations +- Rollout plan with testing requirements and deployment phases + +### [Release Process](release-process.md) + +Comprehensive release workflow: + +- Monthly release cadence during active development +- Version management across all Cargo.toml files +- Changelog and Docker image updates +- CI/CD automation for multi-architecture builds + +### [Component Summaries](component-summaries.md) + +Tree view of all component technical debt documentation: + +- Complete hierarchy of summary.md files +- Links to refactoring plans where available +- Organized by node, p2p, and snark subsystems + +### [Git Workflow](git-workflow.md) + +Git workflow and pull request policy used in the repository: + +- Branch naming conventions and management +- PR development workflow and commit squashing policy +- Merge strategy and best practices +- Commit message format and examples + +### [P2P Evolution Plan](p2p-evolution.md) + +Evolution plan for Mina's P2P 
networking layer: + +- Unified pull-based P2P design for the entire Mina ecosystem +- Current dual P2P architecture challenges (WebRTC + libp2p) +- Four-phase implementation strategy with QUIC transport integration +- Migration from libp2p to unified protocol across OCaml and Rust nodes +- Requires coordination with OCaml Mina team for ecosystem adoption + +### [OCaml Coordination](ocaml-coordination.md) + +Coordination needs between OCaml and Rust implementations: + +- Maintenance burden coordination (circuit generation and fuzzer branches) +- Cross-implementation testing challenges and potential improvements +- Shared infrastructure dependencies (P2P evolution, archive service) +- Protocol compatibility coordination (hardfork handling) + +## Recommended Reading Order + +### For New Developers + +- **Architecture Walkthrough** - Get the big picture +- **State Machine Structure** - Learn the core programming model +- **State Machine Patterns** - Understand common patterns and when to use them +- **Project Organization** - Understand the codebase layout +- **Services** - Understand external interactions +- **State Machine Development Guide** - Learn practical development patterns +- **Adding RPC Endpoints** - Learn to implement new API endpoints +- **Testing Infrastructure** - Learn how to test your changes +- **State Machine Technical Debt** - Understand known architectural issues and + ongoing improvements + +### For Protocol Developers + +- **Architecture Walkthrough** - Understand the system design +- **Ledger Crate** - High-level overview of ledger implementation +- **State Machine Structure** - Learn state management patterns +- **Services** - Understand proof verification and block production +- **Services Technical Debt** - Be aware of service layer limitations + +### For Quick Reference + +- **Project Organization** - Find where components are located +- **Services** - Look up specific service interfaces +- **State Machine Structure** - Reference for 
action/reducer patterns +- **Technical Debt Documents** - Check current known issues and planned + improvements + +## Glossary of Key Terms + +### Core Architecture Terms + +**Redux Pattern** - State management architecture where all state changes happen +through actions processed by reducers. Provides predictable state updates and +easy debugging. + +**Action** - Data structure representing a state change request. Can be stateful +(handled by reducers) or effectful (handled by effects). + +**Reducer** - Pure function that takes current state and an action, returns new +state. In new-style components, also handles action dispatching. + +**Effect** - Side-effect handler that interacts with external services. Should +be thin wrappers around service calls. + +**Enabling Condition** - Function that determines if an action is valid in the +current state. Prevents invalid state transitions. + +**State Machine** - Component that manages a specific domain's state through +actions and reducers (e.g., P2P, block producer). + +**Stateful Action** - Action that modifies state and is processed by reducers. + +**Effectful Action** - Action that triggers side effects (service calls) and is +processed by effects. + +### State Management Terms + +**Substate** - Abstraction that gives components access to their specific +portion of global state without coupling to global state structure. + +**Dispatcher** - Interface for dispatching new actions from within reducers. + +**Callback** - Mechanism for components to respond to async operations without +tight coupling. + +**bug_condition!** - Macro for defensive programming that marks code paths that +should be unreachable if enabling conditions work correctly. + +### Service Architecture Terms + +**Service** - Component that handles external I/O, heavy computation, or async +operations. Runs in separate threads. + +**Event** - Result from a service operation that gets converted to an action and +fed back to the state machine. 
+ +**EventSource** - Central service that aggregates events from all other services +and forwards them to the state machine. + +**Deterministic Execution** - Principle that given the same inputs, the system +behaves identically. Achieved by isolating non-determinism to services. + +### Development Terms + +**New Style** - Current architecture pattern with unified reducers that handle +both state updates and action dispatching. + +**Old Style** - Legacy architecture pattern with separate reducer and effects +files. Still used in transition frontier. For migration instructions, see +[ARCHITECTURE.md](../../ARCHITECTURE.md). + +**Component** - Self-contained state machine handling a specific domain (e.g., +transaction pool, P2P networking). + +**Summary.md** - File in each component directory documenting purpose, technical +debt, and implementation notes. + +**ActionEvent** - Derive macro that generates structured logging for actions. + +### Network and Protocol Terms + +**Network Configuration** - System supporting multiple networks (devnet/mainnet) +with different parameters. + +**OCaml Compatibility** - Many components are direct ports from the OCaml Mina +implementation. + +**P2P** - Peer-to-peer networking layer using libp2p with custom WebRTC +transport. + +**SNARK** - Zero-knowledge proof system used for blockchain verification. + +**Ledger** - Blockchain account state management system. + +**Transition Frontier** - Core consensus and blockchain state management +component. + +### Testing Terms + +**Scenario** - Structured test case that can be recorded, saved, and replayed +deterministically. + +**Recording/Replay** - System for capturing execution traces and replaying them +exactly for debugging. + +**Differential Testing** - Comparing OpenMina behavior against the OCaml +implementation. + +**Fuzzing** - Automated testing with random inputs to find edge cases. 
+ +## Key Concepts to Understand + +Before diving into the documentation, familiarize yourself with these core +concepts: + +1. **Redux Pattern** - State management through actions and reducers +2. **Deterministic Execution** - Separation of pure state logic from side + effects +3. **Network Configuration** - Support for multiple networks (devnet/mainnet) + with different parameters +4. **OCaml Compatibility** - Many components are direct ports from the OCaml + implementation +5. **Service Architecture** - External interactions handled by services, not + state machines + +## Additional Resources + +- **Source Code Comments** - Many modules have detailed inline documentation +- **Summary Files** - Look for `summary.md` files in component directories for + technical debt and implementation notes +- **P2P WebRTC Documentation** - See [p2p/readme.md](../../p2p/readme.md) for + details on the WebRTC implementation +- **Technical Debt Analysis** - See the technical debt documents for + comprehensive analysis of known issues diff --git a/docs/handover/adding-rpc-endpoints.md b/docs/handover/adding-rpc-endpoints.md new file mode 100644 index 000000000..1f8707953 --- /dev/null +++ b/docs/handover/adding-rpc-endpoints.md @@ -0,0 +1,110 @@ +# Adding RPC Endpoints Guide + +This guide explains how to navigate the codebase when adding new RPC endpoints +to OpenMina, focusing on the files and architectural patterns specific to RPC +functionality. 
+ +## Prerequisites + +Before adding RPC endpoints, understand: + +- [State Machine Development Guide](state-machine-development-guide.md) - Core + Redux patterns +- [Architecture Walkthrough](architecture-walkthrough.md) - Overall system + design +- [Services](services.md) - Service layer integration + +## RPC Architecture Overview + +OpenMina's RPC system follows this flow: + +``` +HTTP Request → RPC Action → RPC Reducer → RPC Effect → Service → Response +``` + +The RPC layer uses the same Redux patterns as other components but adds HTTP +server integration and service response handling. This guide shows you where to +make changes for each step. + +## Files to Modify + +1. **Request/Response Types** (`node/src/rpc/mod.rs`) - Add to `RpcRequest` enum + and define response type alias +2. **RPC Actions** (`node/src/rpc/rpc_actions.rs`) - Add variant with `rpc_id` + field and enabling condition +3. **RPC Reducer** (`node/src/rpc/rpc_reducer.rs`) - Implement business logic + and dispatch effectful action with response data +4. **Effectful Action** (`node/src/rpc_effectful/rpc_effectful_action.rs`) - Add + variant that carries the response data +5. **Effects** (`node/src/rpc_effectful/rpc_effectful_effects.rs`) - Thin + wrapper that calls service respond method +6. **Service Interface** (`node/src/rpc_effectful/rpc_service.rs`) - Add + `respond_*` method to `RpcService` trait +7. **Service Implementation** (`node/native/src/http_server.rs`) - Implement + service method (usually just calls `self.respond`) +8. 
**HTTP Routes** (`node/native/src/http_server.rs`) - Add endpoint to + `rpc_router` function + +## Reference Examples + +Study these existing endpoints in the codebase to understand the patterns: + +**Simple Endpoints:** + +- `StatusGet` - Basic node status information +- `HeartbeatGet` - Simple health check +- `SyncStatsGet` - Component statistics + +**Parameterized Endpoints:** + +- `ActionStatsGet` - Takes query parameters +- `LedgerAccountsGet` - Filtered data retrieval + +**Streaming Endpoints:** + +- Look at WebRTC signaling endpoints for `multishot_request` patterns + +Each endpoint follows the same 8-step pattern above. The business logic in +effects varies, but the structural pattern is consistent. + +## Key RPC-Specific Patterns + +### Request/Response Flow + +- HTTP requests come in via `oneshot_request` or `multishot_request` +- RPC actions include an `rpc_id` field for tracking +- Effects call service `respond_*` methods to send responses back +- The framework handles request state tracking automatically + +### Service Layer Integration + +- The `RpcService` trait abstracts over different transport mechanisms (HTTP, + WASM) +- HTTP implementation is in `node/native/src/http_server.rs` +- WASM bindings are in `node/common/src/service/rpc/sender.rs` + +### Streaming vs Single Responses + +- Use `oneshot_request` for endpoints that return one response +- Use `multishot_request` for endpoints that return multiple responses over time +- Most endpoints use `oneshot_request` + +## WASM Frontend Integration + +WASM bindings are in `node/common/src/service/rpc/` organized across multiple +helper structs (`State`, `Stats`, `Ledger`, etc.) plus direct methods on +`RpcSender`. Check `frontend/src/app/core/services/web-node.service.ts` to see +how they're accessed from the frontend (e.g., `webnode.state().peers()`, +`webnode.stats().sync()`). 
+ +## Testing + +Test RPC endpoints with curl against the HTTP server: + +```bash +curl http://localhost:3000/your-endpoint +``` + +The RPC layer follows standard OpenMina Redux patterns with the addition of HTTP +routing and service response handling. Study existing endpoints to understand +the complete flow. diff --git a/docs/handover/architecture-walkthrough.md b/docs/handover/architecture-walkthrough.md new file mode 100644 index 000000000..72c3ef077 --- /dev/null +++ b/docs/handover/architecture-walkthrough.md @@ -0,0 +1,899 @@ +# OpenMina Architecture & Code Walk-through + +## Table of Contents + +1. [Introduction](#introduction) +2. [Architecture Philosophy](#architecture-philosophy) +3. [State Machine Architecture](#state-machine-architecture) +4. [Core Components Overview](#core-components-overview) +5. [Network Configuration System](#network-configuration-system) +6. [Code Organization Patterns](#code-organization-patterns) +7. [Testing & Debugging](#testing--debugging) +8. [Development Guidelines](#development-guidelines) +9. [Communication Patterns](#communication-patterns) + +## Introduction + +OpenMina uses a Redux-inspired architecture pattern where application state is +centralized and all state changes flow through a predictable action dispatch +system. The system is designed as one large state machine composed of smaller, +domain-specific state machines (P2P networking, block production, consensus, +etc.) that work together. + +All CPU-intensive operations, I/O, and non-deterministic operations are moved to +services - separate components that interact with the outside world and run in +their own threads. This separation ensures the core state machine remains +deterministic, making the system predictable, testable, and debuggable. + +> **Next Steps**: After this overview, read +> [State Machine Structure](state-machine-structure.md) for implementation +> details, then [Project Organization](organization.md) for codebase navigation. 
+ +### Key Design Principles + +- **Deterministic execution** - Given same inputs, behavior is always identical +- **Pure state management** - State changes only through reducers +- **Effect isolation** - Side effects separated from business logic +- **Component decoupling** - Clear boundaries between subsystems + +## Architecture Philosophy + +The architecture distinguishes between two fundamental types of components: + +### State Machine Components (Stateful Actions) + +- Manage core application state through pure functions +- Business logic resides in reducers with controlled state access +- Designed for determinism and predictability +- Interact with services only via effectful actions + +### Service Components (Effectful Actions) + +- Handle "outside world" interactions (network, disk, heavy computation) +- Run asynchronously to keep state machine responsive +- Minimal internal state - decision-making stays in state machine +- Communicate back via Events wrapped in actions + +This separation ensures the core state management remains deterministic and +testable while side effects are handled in a controlled manner. + +## State Machine Architecture + +### Core Concepts + +#### State + +State is the central concept in the architecture - it represents the entire +application's data at any point in time. The global state is composed of smaller +domain-specific states: + +```rust +pub struct State { + pub p2p: P2pState, + pub transition_frontier: TransitionFrontierState, + pub snark_pool: SnarkPoolState, + pub transaction_pool: TransactionPoolState, + pub block_producer: BlockProducerState, + // ... 
etc +} +``` + +Each component manages its own state structure, often using enums to represent +different stages of operations: + +```rust +pub enum ConnectionState { + Disconnected, + Connecting { attempt: u32, started_at: Timestamp }, + Connected { peer_info: PeerInfo }, + Error { reason: String }, +} +``` + +State is directly mutable by reducers as an optimization - rather than returning +new state, reducers modify the existing state in place. + +#### Actions + +Actions represent state transitions in the system. They are nested +hierarchically by context: + +```rust +pub enum Action { + CheckTimeouts(CheckTimeoutsAction), + P2p(P2pAction), + Ledger(LedgerAction), + TransitionFrontier(TransitionFrontierAction), + // ... etc +} +``` + +Actions are divided into two categories: + +- **Stateful Actions**: Update state and dispatch other actions (handled by + reducers) +- **Effectful Actions**: Thin wrappers for service interactions (handled by + effects) + +#### Enabling Conditions + +Every action must implement `EnablingCondition` to prevent invalid state +transitions: + +```rust +pub trait EnablingCondition { + fn is_enabled(&self, state: &State, time: Timestamp) -> bool; +} +``` + +Reducers, after performing a state update, will attempt to advance the state +machine in all directions that make sense from that point by dispatching +multiple potential next actions. However, it is the enabling conditions that +ultimately decide which of these transitions actually proceed. This creates a +natural flow where reducers propose all possible next steps, and enabling +conditions act as gates that filter out invalid paths based on the current +state. + +For example, a reducer might dispatch actions to send messages to all connected +peers, but enabling conditions will filter out actions for peers that have since +disconnected. 
+ +#### Reducers (New Style) + +In the new architecture, reducers handle both state updates and action +dispatching: + +```rust +impl ComponentState { + pub fn reducer( + mut state_context: crate::Substate, + action: ComponentActionWithMetaRef<'_>, + ) { + let Ok(state) = state_context.get_substate_mut() else { return }; + + match action { + ComponentAction::SomeAction { data } => { + // Phase 1: State updates + state.field = data.clone(); + + // Phase 2: Dispatch follow-up actions + let dispatcher = state_context.into_dispatcher(); + // Or use into_dispatcher_and_state() for global state access + // let (dispatcher, global_state) = state_context.into_dispatcher_and_state(); + dispatcher.push(ComponentAction::NextAction { ... }); + } + } + } +} +``` + +The `Substate` context enforces separation between state mutation and action +dispatching phases. + +#### Effects (New Style) + +Effects are now thin wrappers that call service methods: + +```rust +impl EffectfulAction { + pub fn effects(&self, _: &ActionMeta, store: &mut Store) { + match self { + EffectfulAction::LoadData { id } => { + store.service.load_data(id.clone()); + } + EffectfulAction::ComputeProof { input } => { + store.service.compute_proof(input.clone()); + } + } + } +} +``` + +Effects do NOT dispatch actions - they only interact with services. Services +communicate results back via Events. + +### Execution Model + +#### Single-Threaded Concurrent State Machines + +A critical architectural principle: **all state machines run in a single thread +but operate concurrently**. 
+ +**Concurrent, Not Parallel:** + +- Multiple state machines can be in different phases of their lifecycles + simultaneously +- A connection may be `Pending` while VRF evaluation is `InProgress` and block + production is `Idle` +- Only one action processes at a time - no race conditions or synchronization + needed + +**Single-Threaded Benefits:** + +- **Deterministic execution** - Actions process in a predictable order +- **Simplified debugging** - No thread synchronization issues +- **State consistency** - No locks or atomic operations needed +- **Replay capability** - Exact reproduction of execution sequences + +**Example Flow:** + +``` +Time 1: P2pConnectionAction::Initialize → P2P state becomes Connecting +Time 2: VrfEvaluatorAction::BeginEpoch → VRF state becomes Evaluating +Time 3: P2pConnectionAction::Success → P2P state becomes Ready +Time 4: VrfEvaluatorAction::Continue → VRF continues evaluation +``` + +Each action executes atomically, but multiple state machines progress +independently. + +**Services and Threading:** While the state machine is single-threaded, +CPU-intensive work runs in dedicated service threads: + +- Main thread: Redux store, all state transitions +- Service threads: Proof generation, cryptographic operations, I/O +- Communication: Services send events back via channels + +This design keeps the state machine responsive while isolating non-deterministic +operations. + +### Defensive Programming with `bug_condition!` + +The codebase uses a `bug_condition!` macro for defensive programming and +invariant checking: + +```rust +P2pChannelsRpcAction::RequestSend { .. } => { + let Self::Ready { local, .. } = rpc_state else { + bug_condition!( + "Invalid state for `P2pChannelsRpcAction::RequestSend`, state: {:?}", + rpc_state + ); + return Ok(()); + }; + // Continue processing... +} +``` + +**Purpose**: `bug_condition!` marks code paths that should be unreachable if +enabling conditions work correctly. 
It provides a safety net for catching +programming logic errors. + +**Behavior**: + +- **Development** (`OPENMINA_PANIC_ON_BUG=true`): Panics immediately to catch + bugs early +- **Production** (default): Logs error and continues execution gracefully + +**Relationship to Enabling Conditions**: + +1. Enabling conditions prevent invalid actions from reaching reducers +2. `bug_condition!` double-checks the same invariants in reducers +3. If `bug_condition!` triggers, it indicates a mismatch between enabling + condition logic and reducer assumptions + +This is **not error handling** - it's invariant checking for scenarios that +should never occur in correct code. + +### State Machine Inputs + +The state machine has three types of inputs ensuring deterministic behavior: + +1. **Events** - External data from services wrapped in + `EventSourceNewEventAction` +2. **Time** - Attached to every action via `ActionMeta` +3. **Synchronous service returns** - Avoided when possible + +This determinism enables recording and replay for debugging. 
+ +## Core Components Overview + +### Node (`node/`) + +The main orchestrator containing: + +**State Machine Components:** + +- Core state/reducer/action management +- Block producer scheduling +- Transaction/SNARK pools +- Transition frontier (blockchain state) - _Note: Still uses old-style + architecture_ +- RPC request handling +- Fork resolution logic (with core consensus rules in `core/src/consensus.rs`) + +**Service Components:** + +- Block production service (prover interactions) +- Ledger service (database operations) +- External SNARK worker coordination +- Event source (aggregates external events) + +### P2P Networking (`p2p/`) + +Manages peer connections and communication through two distinct network layers: + +**State Machine Components:** + +- Connection lifecycle management +- Channel state machines that abstract over the differences between the two + networks +- Channel management (RPC, streaming) +- Peer discovery (Kademlia DHT) +- Message routing (Gossipsub) + +**Dual Network Architecture:** + +_libp2p-based Network:_ + +- Used by native node implementations +- Transport protocols (TCP, custom WebRTC) +- Security (Noise handshake) +- Multiplexing (Yamux) +- Protocol negotiation + +_WebRTC-based Network:_ + +- Used by webnode (browser-based node) +- Direct WebRTC transport implementation +- Different design pattern from libp2p +- Optimized for browser constraints + +### SNARK Verification (`snark/`) + +Handles zero-knowledge proof verification: + +**State Machine Components:** + +- Block verification state +- SNARK work verification +- Transaction proof verification + +**Service Components:** + +- Async proof verification services +- Batching for efficiency + +### Ledger (`ledger/`) + +A comprehensive Rust port of the OCaml Mina ledger with identical business +logic: + +**Core Components:** + +- **BaseLedger trait** - Fundamental ledger interface for account management and + Merkle operations +- **Mask system** - Layered ledger views with 
copy-on-write semantics for + efficient state management +- **Database** - In-memory account storage and Merkle tree management + +**Transaction Processing:** + +- **Transaction Pool** - Fee-based ordering, sender queue management, nonce + tracking +- **Staged Ledger** - Transaction application and block validation +- **Scan State** - Parallel scan tree for SNARK work coordination + +**Advanced Features:** + +- **Proof System Integration** - Transaction, block, and zkApp proof + verification using Kimchi +- **zkApp Support** - Full zkApp transaction processing with account updates and + permissions +- **Sparse Ledger** - Efficient partial ledger representation for SNARK proof + generation + +**OCaml Compatibility:** + +- Direct port maintaining same Merkle tree structure, transaction validation + rules, and account model +- Memory-only implementation adapted to Rust idioms (Result types, ownership + model) + +_For detailed documentation, see [`ledger-crate.md`](ledger-crate.md)_ + +### Supporting Components + +- **Core Types** (`core/`) - Shared data structures +- **Cryptography** (`vrf/`, `poseidon/`) - Crypto primitives +- **Serialization** (`mina-p2p-messages/`) - Network messages + +## Network Configuration System + +OpenMina supports multiple networks through a centralized configuration system +defined in `core/src/network.rs`: + +### Network Types + +- **Devnet** (`NetworkId::TESTNET`) - Development and testing network +- **Mainnet** (`NetworkId::MAINNET`) - Production Mina network + +### Configuration Components + +Each network configuration includes: + +- **Cryptographic Parameters**: Network-specific signature prefixes and hash + parameters +- **Circuit Configuration**: Directory names and circuit blob identifiers for + each proof type +- **Default Peers**: Bootstrap peers for initial P2P connection +- **Constraint Constants**: Consensus parameters like ledger depth, work delay, + block timing +- **Fork Configuration**: Hard fork parameters including state 
hash and + blockchain length + +### Configuration Initialization + +1. **Global Access**: `NetworkConfig::global()` provides access to the active + configuration +2. **Network Selection**: `NetworkConfig::init(network_name)` sets the global + config once +3. **Service Integration**: All services access network parameters through the + global config + +This design ensures OpenMina can operate on different Mina networks while +maintaining protocol compatibility. + +## Code Organization Patterns + +### New Architecture Style + +Most state machine components follow the new pattern with: + +1. **Substate Access** - Fine-grained state control +2. **Unified Reducers** - Handle both state updates and action dispatching in + two enforced phases +3. **Thin Effects** - Only wrap service calls +4. **Callbacks** - Enable decoupled component communication +5. **Clear Separation** - Stateful vs Effectful actions + +Example structure: + +```rust +// Stateful action with reducer +impl WatchedAccountsState { + pub fn reducer( + mut state_context: crate::Substate, + action: WatchedAccountsActionWithMetaRef<'_>, + ) { + let Ok(state) = state_context.get_substate_mut() else { return }; + + match action { + WatchedAccountsAction::Add { pub_key } => { + // Update state + state.insert(pub_key.clone(), WatchedAccountState { ... 
}); + + // Dispatch follow-up + let dispatcher = state_context.into_dispatcher(); + dispatcher.push(WatchedAccountsAction::LedgerInitialStateGetInit { + pub_key: pub_key.clone() + }); + } + } + } +} + +// Effectful action with service interaction +impl LedgerEffectfulAction { + pub fn effects(&self, _: &ActionMeta, store: &mut Store) { + match self { + LedgerEffectfulAction::Write { request } => { + store.service.write_ledger(request.clone()); + } + } + } +} +``` + +### Old Architecture Style (Transition Frontier) + +The transition frontier still uses the original Redux pattern: + +- **Reducers** only update state (no action dispatching) +- **Effects** handle all follow-up action dispatching after state changes +- Separate reducer and effects functions + +This pattern matches traditional Redux but creates challenges in following the +flow since state updates and the resulting next actions are separated across +different files. The new style was introduced to improve code locality and make +the execution flow easier to follow. + +### Callbacks Pattern + +Callbacks enable dynamic action composition, allowing callers to specify +different flows after completion of the same underlying action. 
This pattern +solves several architectural problems: + +**Before Callbacks:** + +- All action flows were static and hardcoded +- Same actions needed to be duplicated for different completion flows +- Components were tightly coupled since actions had fixed next steps +- Adding new use cases required modifying existing actions + +**With Callbacks:** + +- Callers can reuse the same action with different completion behaviors +- Reduces component coupling by making actions more generic +- Eliminates action duplication across different contexts +- Easy to extend with new flows without modifying existing code + +```rust +// Same action, different completion flows based on caller context +dispatcher.push(SnarkBlockVerifyAction::Init { + req_id, + block: block.clone(), + on_success: redux::callback!( + on_verify_success(hash: BlockHash) -> Action { + ConsensusAction::BlockVerifySuccess { hash } // Flow for consensus + } + ), + on_error: redux::callback!( + on_verify_error((hash: BlockHash, error: Error)) -> Action { + ConsensusAction::BlockVerifyError { hash, error } + } + ), +}); + +// Same verification action, but different completion flow for RPC context +dispatcher.push(SnarkBlockVerifyAction::Init { + req_id, + block: block.clone(), + on_success: redux::callback!( + on_rpc_verify_success(hash: BlockHash) -> Action { + RpcAction::BlockVerifyResponse { hash, success: true } // Flow for RPC + } + ), + on_error: redux::callback!( + on_rpc_verify_error((hash: BlockHash, error: Error)) -> Action { + RpcAction::BlockVerifyResponse { hash, success: false, error } + } + ), +}); +``` + +### Directory Structure + +Each major component follows a consistent pattern: + +``` +component/ +├── component_state.rs # State definition +├── component_actions.rs # Stateful action types +├── component_reducer.rs # State transitions + dispatching +└── component_effectful/ # Effectful actions + ├── component_effectful_actions.rs + ├── component_effectful_effects.rs + └── component_service.rs # 
Service interface +``` + +## Testing & Debugging + +Testing benefits from the deterministic execution model: + +### Testing Approaches + +1. **Scenarios** - Specific network setups testing behaviors +2. **Simulator** - Multi-node controlled environments +3. **Fuzz Testing** - Random inputs finding edge cases +4. **Differential Fuzz Testing** - Comparing ledger implementation against the + original OCaml version +5. **Invariant Checking** - Ensuring state consistency + +### Debugging Features + +1. **State Recording** - All inputs can be recorded +2. **Replay Capability** - Reproduce exact execution +3. **State Inspection** - Direct state examination in tests +4. **Deterministic Behavior** - Same inputs = same outputs + +### Key Testing Properties + +- **Determinism** - Predictable state transitions +- **Isolation** - State logic testable without services +- **Composability** - Complex scenarios from simple actions +- **Observability** - Full state visibility + +## Development Guidelines + +### Understanding the Codebase + +1. **Start with State** - State definitions reveal the flow +2. **Follow Actions** - Stateful vs effectful distinction +3. **Check Enabling Conditions** - Understand validity rules +4. **Trace Callbacks** - See component interactions + +### Adding New Features + +1. **Design State First** - State should represent the flow +2. **Categorize Actions** - Stateful or effectful? +3. **Strict Enabling Conditions** - Prevent invalid states +4. **Use Callbacks** - For decoupled responses +5. **Keep Effects Thin** - Only service calls + +### Best Practices + +1. **State Represents Flow** - Make state self-documenting +2. **Actions Match Transitions** - Consistent naming conventions +3. **Reducers Handle Logic** - State updates + dispatching +4. **Effects Only Call Services** - No business logic +5. **Services Stay Minimal** - I/O and computation only + +### Common Patterns + +1. 
**Async Operations** - Effectful action → Service → Event → New action + dispatch +2. **State Machines** - Enum variants representing stages +3. **Timeouts** - CheckTimeouts action triggers checks +4. **Error States** - Explicit error variants in state + +### Architecture Evolution + +The state machine components have been transitioning from old to new style: + +- **New Style**: Unified reducers, thin effects, callbacks - most components + have been migrated +- **Old Style**: Separate reducers/effects - transition frontier still uses this + pattern +- **Migration Path**: State machine components updated incrementally + +For detailed migration instructions, see +[ARCHITECTURE.md](../../ARCHITECTURE.md). + +## Communication Patterns + +The architecture provides several patterns for components to communicate while +maintaining decoupling and predictability. + +### Direct Action Dispatching + +Components can dispatch actions to trigger behavior in other components. This is +the primary pattern for synchronous communication. + +**Example: Ledger to Block Producer Communication** + +```rust +// From node/src/ledger/read/ledger_read_reducer.rs +// After receiving delegator table, notify block producer +match table { + None => { + dispatcher.push( + BlockProducerVrfEvaluatorAction::FinalizeDelegatorTableConstruction { + delegator_table: Default::default(), + }, + ); + } + Some(table) => { + dispatcher.push( + BlockProducerVrfEvaluatorAction::FinalizeDelegatorTableConstruction { + delegator_table: table.into(), + }, + ); + } +} +``` + +**Example: P2P Best Tip Propagation** + +```rust +// From p2p/src/channels/best_tip/p2p_channels_best_tip_reducer.rs +// When best tip is received, update peer state +dispatcher.push(P2pPeerAction::BestTipUpdate { peer_id, best_tip }); +``` + +### Callback Pattern + +Components can register callbacks that get invoked when asynchronous operations +complete. This enables loose coupling between components. 
+ +**Example: P2P Channel Initialization** + +```rust +// From p2p/src/channels/best_tip/p2p_channels_best_tip_reducer.rs +dispatcher.push(P2pChannelsEffectfulAction::InitChannel { + peer_id, + id: ChannelId::BestTipPropagation, + on_success: redux::callback!( + on_best_tip_channel_init(peer_id: PeerId) -> crate::P2pAction { + P2pChannelsBestTipAction::Pending { peer_id } + } + ), +}); +``` + +**Example: Transaction Pool Account Fetching** + +```rust +// From node/src/transaction_pool/transaction_pool_reducer.rs +dispatcher.push(TransactionPoolEffectfulAction::FetchAccounts { + account_ids, + ledger_hash: best_tip_hash.clone(), + on_result: callback!( + fetch_to_verify((accounts: BTreeMap, id: Option, from_source: TransactionPoolMessageSource)) + -> crate::Action { + TransactionPoolAction::StartVerifyWithAccounts { accounts, pending_id: id.unwrap(), from_source } + } + ), + pending_id: Some(pending_id), + from_source: *from_source, +}); +``` + +### Event Source Pattern + +Services communicate results back through events that get converted to actions. +The event source acts as the bridge between the async service world and the +synchronous state machine. + +**Note:** Currently, all event handling is centralized in +`node/src/event_source/`. The architectural intention is to eventually +distribute this logic across the individual effectful state machines that care +about specific events, making the system more modular and maintainable. 
+ +**Example: Service Event Processing** + +```rust +// From node/src/event_source/event_source_effects.rs +Event::Ledger(event) => match event { + LedgerEvent::Write(response) => { + store.dispatch(LedgerWriteAction::Success { response }); + } + LedgerEvent::Read(id, response) => { + store.dispatch(LedgerReadAction::Success { id, response }); + } +}, +Event::Snark(event) => match event { + SnarkEvent::BlockVerify(req_id, result) => match result { + Err(error) => { + store.dispatch(SnarkBlockVerifyAction::Error { req_id, error }); + } + Ok(()) => { + store.dispatch(SnarkBlockVerifyAction::Success { req_id }); + } + }, +} +``` + +### State Callbacks Pattern + +Components can expose callbacks in their state that other components can +register to. This enables dynamic subscription to events. + +**Example: P2P RPC Response Handling** + +```rust +// From p2p/src/channels/rpc/p2p_channels_rpc_reducer.rs +let (dispatcher, state) = state_context.into_dispatcher_and_state(); +let p2p_state: &P2pState = state.substate()?; + +// Notify interested components about RPC response +if let Some(callback) = &p2p_state.callbacks.on_p2p_channels_rpc_response_received { + dispatcher.push_callback(callback.clone(), (peer_id, rpc_id, response)); +} + +// Handle timeout notifications +if let Some(callback) = &p2p_state.callbacks.on_p2p_channels_rpc_timeout { + dispatcher.push_callback(callback.clone(), (peer_id, id)); +} +``` + +### Service Request with Callbacks + +Components can make service requests and provide callbacks for handling both +success and error cases. 
+ +**Example: SNARK Verification Request** + +```rust +// From node/src/transaction_pool/transaction_pool_reducer.rs +dispatcher.push(SnarkUserCommandVerifyAction::Init { + req_id, + commands: verifiable, + from_source: *from_source, + on_success: callback!( + on_snark_user_command_verify_success( + (req_id: SnarkUserCommandVerifyId, valids: Vec, from_source: TransactionPoolMessageSource) + ) -> crate::Action { + TransactionPoolAction::VerifySuccess { + valids, + from_source, + } + } + ), + on_error: callback!( + on_snark_user_command_verify_error( + (req_id: SnarkUserCommandVerifyId, errors: Vec) + ) -> crate::Action { + TransactionPoolAction::VerifyError { errors } + } + ) +}); +``` + +### State Machine Lifecycle + +#### Initialization + +``` +Main Node Init ──> Subsystem Creation ──> Service Spawning ──> Ready State +``` + +#### Action Processing + +``` +Event ──> Action Queue ──> Next Action ──> Enabling Check ──┐ + ▲ │ │ + │ │ ▼ + │ │ Rejected + │ ▼ + │ Reducer + │ │ + │ ┌───────────┴───────────┐ + │ │ │ + │ ▼ ▼ + │ State Update 0+ Effectful Actions + │ │ │ + │ ▼ ▼ + └──────── 0+ Stateful Actions Service Calls + │ + ▼ + Queue Empty ──> Listen for Events <─── Result Events +``` + +#### Effect Handling + +``` +Effectful Action ──> Service Call ──> Service Thread ──> Processing ──> Event + │ + ▼ + Action Queue +``` + +### Mental Model + +When working with this architecture, shift from imperative to declarative +thinking: + +**State-First Design:** + +- State enums represent the flow: `Idle → Pending → Success/Error` +- Actions represent transitions: "what event happened?" not "what should I do?" +- Reducers answer two questions: + 1. "Given this state and event, what's the new state?" + 2. "What are all possible next steps from here?" 
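As a rough illustration of this mental model (hypothetical names, not actual OpenMina types), the state enum encodes the flow, the enabling condition gates transitions, and the reducer only answers "what is the new state?":

```rust
// Illustrative only: a request lifecycle modeled the state-first way.
#[derive(Debug, PartialEq)]
enum RequestState {
    Idle,
    Pending,
    Success(String),
    Error(String),
}

#[derive(Debug)]
enum RequestAction {
    Start,
    Finished(Result<String, String>),
}

// Enabling condition: is this action valid in the current state?
fn is_enabled(state: &RequestState, action: &RequestAction) -> bool {
    match (state, action) {
        (RequestState::Idle, RequestAction::Start) => true,
        (RequestState::Pending, RequestAction::Finished(_)) => true,
        _ => false, // everything else is filtered before the reducer runs
    }
}

// Reducer: given (state, action), what's the new state?
fn reduce(state: RequestState, action: RequestAction) -> RequestState {
    let _ = state; // state already validated by the enabling condition
    match action {
        RequestAction::Start => RequestState::Pending,
        RequestAction::Finished(Ok(v)) => RequestState::Success(v),
        RequestAction::Finished(Err(e)) => RequestState::Error(e),
    }
}

fn main() {
    let mut state = RequestState::Idle;
    for action in [
        RequestAction::Start,
        RequestAction::Finished(Ok("response".into())),
    ] {
        // Invalid actions never reach the reducer.
        if is_enabled(&state, &action) {
            state = reduce(state, action);
        }
    }
    assert_eq!(state, RequestState::Success("response".into()));
}
```

In the real codebase the reducer would also dispatch follow-up actions, and a mismatch between `is_enabled` and the reducer's assumptions is exactly what `bug_condition!` exists to flag.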
+ +**Reducer Orchestration:** + +- Reducers update state AND dispatch multiple potential next actions +- Enabling conditions act as gates - only actions valid for the current state + proceed +- This creates a branching execution where reducers propose paths and conditions + filter them + +**Action Classification:** + +- **Stateful**: Updates state, dispatches other actions (business logic) +- **Effectful**: Calls services, never updates state directly (I/O boundary) +- **Events**: External inputs wrapped in actions (deterministic replay) + +**Async Operations Pattern:** + +``` +1. Dispatch Effectful Action → 2. Service processes → 3. Event generated → 4. Action dispatched +``` + +**Debugging Mental Model:** + +- Logs show the exact sequence of actions - trace execution flow +- State inspection reveals current system state at any moment +- Actions can be recorded for deterministic replay (when enabled) +- Common bugs: missing enabling conditions, incorrect state transitions + +**Common Mental Shift:** Instead of "call API then update state", think +"dispatch action, let reducer update state and propose next actions, enabling +conditions filter valid paths based on current state, services report back via +events that trigger new actions". + +The architecture may feel unusual initially, but its benefits in correctness, +testability, and debuggability make it powerful for building reliable +distributed systems. diff --git a/docs/handover/circuits.md b/docs/handover/circuits.md new file mode 100644 index 000000000..4a3ef7632 --- /dev/null +++ b/docs/handover/circuits.md @@ -0,0 +1,196 @@ +# Circuit Generation and Management + +## Overview + +OpenMina has ported the circuit logic from the Mina protocol, but with an +important architectural distinction: the implementation only handles witness +production, not constraint generation. This means that while OpenMina can +produce proofs using existing circuits, it cannot generate the circuit +definitions themselves. 
+ +For an overview of the proof system implementation in `ledger/src/proofs/`, see +[`ledger/src/proofs/summary.md`](../../ledger/src/proofs/summary.md). + +## Architecture + +### Proof Generation Implementation and Limitations + +The OpenMina codebase includes complete proof generation capabilities with one +key limitation: + +**What OpenMina Can Do:** + +- **Witness generation**: Full implementation for producing witnesses needed for + proof generation +- **Proof production**: Complete capability to create proofs using pre-existing + circuit definitions +- **Circuit logic**: Equivalent to the OCaml implementation for all proof types +- **Proof verification**: Can verify proofs using precomputed verification + indices + +**What OpenMina Cannot Do:** + +- **Circuit constraints**: Missing the constraint declarations from the OCaml + code that define circuit structure +- **Constraint compilation/evaluation**: Missing the functionality to + compile/evaluate constraint declarations into circuit constraints +- **Verification key generation**: Cannot generate verification keys for new + circuits + +**Practical Implications:** + +- Can generate proofs and witnesses for existing circuits +- Cannot create new circuits or modify existing circuit definitions +- Relies on OCaml implementation for all circuit creation and constraint + processing +- Uses precomputed verification indices from the OCaml implementation + +The circuit logic is equivalent to the OCaml implementation except both the +constraint declarations and the constraint compilation/evaluation functionality +are missing - these were not ported due to time constraints during development, +not technical limitations, and could be added for full independence. + +### Circuit Generation Process + +Since these constraint capabilities are missing, OpenMina requires externally +generated circuit data. The following process describes how circuits are created +and distributed using the original Mina codebase: + +1. 
Using a custom branch in the OpenMina fork of Mina: + https://github.com/openmina/mina + - Branch: `utils/dump-extra-circuit-data-devnet301` + - This branch contains modifications to export circuit data in a format + consumable by OpenMina + - **Note**: This branch is very messy and should be cleaned up and integrated + into mina mainline to ease the process + +2. Running the circuit generation process using the branch above + - Launch the OCaml node which produces circuit cache data in + `/tmp/coda_cache_dir` + - The branch dumps the usual circuit data plus extra data specifically + required by OpenMina + - The process also dumps blocks for use in tests + - Integration with mainline Mina would streamline future circuit generation + +3. The generated circuit blobs are then: + - Committed to the dedicated repository: + https://github.com/openmina/circuit-blobs + - Released as GitHub releases for versioning and distribution + +### Circuit Distribution + +OpenMina nodes handle circuits dynamically: + +- When a node needs a circuit that isn't locally available, it automatically + downloads it from the circuit-blobs repository +- Downloaded circuits are cached locally for future use +- This on-demand approach keeps the base installation size minimal while + ensuring all necessary circuits are available when needed + +## Circuit Blob Repository Structure + +The https://github.com/openmina/circuit-blobs repository serves as the central +distribution point for all circuit definitions used by OpenMina. 
The repository: + +- Contains pre-generated circuit blobs for all supported proof types +- Uses GitHub releases for versioning +- Provides a stable download source for OpenMina nodes + +## Future Considerations + +Potential future improvements include: + +- Completing the constraint generation implementation in OpenMina for a fully + self-contained system +- Automating the circuit generation and publishing process +- Implementing circuit versioning strategies for protocol upgrades + +## Circuit Loading and Caching + +### Circuit Loading Process + +1. **Circuit Discovery**: When a circuit is needed, the system searches for it + in several locations: + - Environment variable `OPENMINA_CIRCUIT_BLOBS_BASE_DIR` (if set) + - Current manifest directory (for development) + - User's home directory: `~/.openmina/circuit-blobs/` + - System-wide installation: `/usr/local/lib/openmina/circuit-blobs/` + +2. **Automatic Download**: If no local circuit is found, the system: + - Downloads the circuit blob from + `https://github.com/openmina/circuit-blobs/releases/download/` + - Caches it to `~/.openmina/circuit-blobs/` for future use + - Logs the download and caching process + +3. **WASM Handling**: For WebAssembly builds, circuits are loaded via HTTP from + `/assets/webnode/circuit-blobs/` (configurable via + `CIRCUIT_BLOBS_HTTP_PREFIX`) + +### Circuit Components + +Each circuit consists of multiple components that are loaded and cached +independently: + +- **Gates**: Circuit constraint definitions in JSON format (`*_gates.json`) +- **Internal Variables**: Constraint variable mappings in binary format + (`*_internal_vars.bin`) +- **Rows Reverse**: Row-wise constraint data in binary format (`*_rows_rev.bin`) +- **Verifier Indices**: Pre-computed verification data with SHA256 integrity + checks + +### Verifier Index Caching + +The system implements a two-level caching strategy for verifier indices: + +1. 
**Source Validation**: Each verifier index is validated against a SHA256 + digest of the source JSON +2. **Index Validation**: The processed verifier index is also validated with its + own SHA256 digest +3. **Cache Storage**: Valid indices are cached with both digests for rapid + future access + +### Circuit Types + +The system supports multiple circuit types for different proof operations: + +- **Transaction Circuits**: For transaction proof generation and verification +- **Block Circuits**: For block proof generation and verification +- **Merge Circuits**: For combining multiple proofs +- **ZkApp Circuits**: For zero-knowledge application proofs with various + signature patterns + +### Performance Optimization + +Circuit loading is optimized through: + +- **Lazy Loading**: Circuits are only loaded when actually needed +- **Static Caching**: Once loaded, circuits are cached in static variables for + the lifetime of the process +- **Concurrent Access**: Multiple threads can safely access the same cached + circuit data +- **Integrity Verification**: SHA256 checksums ensure data integrity without + performance penalties + +### Network-Specific Circuit Configuration + +Circuit loading is controlled by the network configuration system in +`core/src/network.rs`: + +- **Directory Selection**: Each network has a specific circuit directory (e.g., + `3.0.1devnet`, `3.0.0mainnet`) +- **Circuit Blob Names**: Network-appropriate circuit blob identifiers for each + proof type +- **Verifier Indices**: Network-specific JSON files embedded in the binary + (`ledger/src/proofs/data/`) +- **Cache Isolation**: Different networks cache circuits in separate + subdirectories + +The `CircuitsConfig` struct defines all circuit blob names for each network, +ensuring the correct circuit versions are loaded for the target environment. 
+ +## Related Documentation + +- For debugging block proof generation, see + [debug-block-proof-generation.md](./debug-block-proof-generation.md) +- For mainnet readiness considerations, see + [mainnet-readiness.md](./mainnet-readiness.md) diff --git a/docs/handover/component-summaries.md b/docs/handover/component-summaries.md new file mode 100644 index 000000000..bd2d81d14 --- /dev/null +++ b/docs/handover/component-summaries.md @@ -0,0 +1,139 @@ +# Component Summaries + +This document provides a tree view of all component summary documentation +throughout the OpenMina codebase. Each component's `summary.md` file contains +technical debt analysis and implementation notes. + +## Component Tree + +- **openmina/** + - **core/** + - [summary.md](../../core/summary.md) + - **ledger/** + - [summary.md](../../ledger/summary.md) + - **src/** + - **proofs/** + - [summary.md](../../ledger/src/proofs/summary.md) + - **node/** + - **src/** + - [summary.md](../../node/src/summary.md) + - **block_producer/** + - [summary.md](../../node/src/block_producer/summary.md) + - **vrf_evaluator/** + - [summary.md](../../node/src/block_producer/vrf_evaluator/summary.md) + - **event_source/** + - [summary.md](../../node/src/event_source/summary.md) + - **external_snark_worker/** + - [summary.md](../../node/src/external_snark_worker/summary.md) + - **ledger/** + - [summary.md](../../node/src/ledger/summary.md) + - **read/** + - [summary.md](../../node/src/ledger/read/summary.md) + - **write/** + - [summary.md](../../node/src/ledger/write/summary.md) + - **rpc/** + - [summary.md](../../node/src/rpc/summary.md) + - **snark_pool/** + - [summary.md](../../node/src/snark_pool/summary.md) + - **candidate/** + - [summary.md](../../node/src/snark_pool/candidate/summary.md) + - **transaction_pool/** + - [summary.md](../../node/src/transaction_pool/summary.md) + - [transaction_pool_refactoring.md](../../node/src/transaction_pool/transaction_pool_refactoring.md) + - **candidate/** + - 
[summary.md](../../node/src/transaction_pool/candidate/summary.md) + - **transition_frontier/** + - [summary.md](../../node/src/transition_frontier/summary.md) + - **candidate/** + - [summary.md](../../node/src/transition_frontier/candidate/summary.md) + - **genesis/** + - [summary.md](../../node/src/transition_frontier/genesis/summary.md) + - **sync/** + - [summary.md](../../node/src/transition_frontier/sync/summary.md) + - **ledger/** + - [summary.md](../../node/src/transition_frontier/sync/ledger/summary.md) + - **snarked/** + - [summary.md](../../node/src/transition_frontier/sync/ledger/snarked/summary.md) + - **staged/** + - [summary.md](../../node/src/transition_frontier/sync/ledger/staged/summary.md) + - **watched_accounts/** + - [summary.md](../../node/src/watched_accounts/summary.md) + - **p2p/** + - **src/** + - [summary.md](../../p2p/src/summary.md) + - **channels/** + - [summary.md](../../p2p/src/channels/summary.md) + - **best_tip/** + - [summary.md](../../p2p/src/channels/best_tip/summary.md) + - **rpc/** + - [summary.md](../../p2p/src/channels/rpc/summary.md) + - **signaling/** + - **discovery/** + - [summary.md](../../p2p/src/channels/signaling/discovery/summary.md) + - **exchange/** + - [summary.md](../../p2p/src/channels/signaling/exchange/summary.md) + - **snark/** + - [summary.md](../../p2p/src/channels/snark/summary.md) + - **snark_job_commitment/** + - [summary.md](../../p2p/src/channels/snark_job_commitment/summary.md) + - **streaming_rpc/** + - [summary.md](../../p2p/src/channels/streaming_rpc/summary.md) + - **transaction/** + - [summary.md](../../p2p/src/channels/transaction/summary.md) + - **connection/** + - [summary.md](../../p2p/src/connection/summary.md) + - **incoming/** + - [summary.md](../../p2p/src/connection/incoming/summary.md) + - **outgoing/** + - [summary.md](../../p2p/src/connection/outgoing/summary.md) + - **disconnection/** + - [summary.md](../../p2p/src/disconnection/summary.md) + - **network/** + - 
[summary.md](../../p2p/src/network/summary.md) + - **identify/** + - [summary.md](../../p2p/src/network/identify/summary.md) + - **stream/** + - [summary.md](../../p2p/src/network/identify/stream/summary.md) + - **kad/** + - [summary.md](../../p2p/src/network/kad/summary.md) + - **bootstrap/** + - [summary.md](../../p2p/src/network/kad/bootstrap/summary.md) + - **request/** + - [summary.md](../../p2p/src/network/kad/request/summary.md) + - **stream/** + - [summary.md](../../p2p/src/network/kad/stream/summary.md) + - **noise/** + - [summary.md](../../p2p/src/network/noise/summary.md) + - [p2p_network_noise_refactoring.md](../../p2p/src/network/noise/p2p_network_noise_refactoring.md) + - **pnet/** + - [summary.md](../../p2p/src/network/pnet/summary.md) + - [p2p_network_pnet_refactoring.md](../../p2p/src/network/pnet/p2p_network_pnet_refactoring.md) + - **pubsub/** + - [summary.md](../../p2p/src/network/pubsub/summary.md) + - [p2p_network_pubsub_refactoring.md](../../p2p/src/network/pubsub/p2p_network_pubsub_refactoring.md) + - **rpc/** + - [summary.md](../../p2p/src/network/rpc/summary.md) + - **scheduler/** + - [summary.md](../../p2p/src/network/scheduler/summary.md) + - **select/** + - [summary.md](../../p2p/src/network/select/summary.md) + - **yamux/** + - [summary.md](../../p2p/src/network/yamux/summary.md) + - [p2p_network_yamux_refactoring.md](../../p2p/src/network/yamux/p2p_network_yamux_refactoring.md) + - **snark/** + - **src/** + - [summary.md](../../snark/src/summary.md) + - **block_verify/** + - [summary.md](../../snark/src/block_verify/summary.md) + - **user_command_verify/** + - [summary.md](../../snark/src/user_command_verify/summary.md) + - **work_verify/** + - [summary.md](../../snark/src/work_verify/summary.md) + +## Navigation Tips + +- Each `summary.md` file contains technical debt analysis and implementation + notes for that component +- Files ending with `_refactoring.md` contain detailed refactoring plans +- The tree structure reflects the 
actual code organization in the repository + - Components are organized into three main areas: `node`, `p2p`, and `snark` diff --git a/docs/handover/debug-block-proof-generation.md b/docs/handover/debug-block-proof-generation.md new file mode 100644 index 000000000..7686cad6f --- /dev/null +++ b/docs/handover/debug-block-proof-generation.md @@ -0,0 +1,61 @@ +# Using failed block proof dumps to debug block proofs + +1. First, save our private RSA key: + +``` +$ cp private_key $HOME/.openmina/debug/rsa.priv +``` + +2. To decrypt the producer's private key: + +``` +$ cp failed_block_proof_input_$HASH.binprot /tmp/block_proof.binprot +$ cd openmina/ledger +$ cargo test --release add_private_key_to_block_proof_input -- --nocapture +# This creates the file /tmp/block_proof_with_key.binprot +``` + +3. Run proof generation in Rust: + Apply these changes to the test `test_block_proof`: + +```diff +modified ledger/src/proofs/transaction.rs +@@ -4679,10 +4679,11 @@ pub(super) mod tests { + #[test] + fn test_block_proof() { + let Ok(data) = std::fs::read( +- Path::new(env!("CARGO_MANIFEST_DIR")) +- .join(devnet_circuit_directory()) +- .join("tests") +- .join("block_input-2483246-0.bin"), ++ "/tmp/block_proof_with_key.binprot" + ) else { + eprintln!("request not found"); + panic_in_ci(); +@@ -4690,7 +4691,8 @@ pub(super) mod tests { + }; + + let blockchain_input: v2::ProverExtendBlockchainInputStableV2 = +- read_binprot(&mut data.as_slice()); ++ v2::ProverExtendBlockchainInputStableV2::binprot_read(&mut data.as_slice()).unwrap(); ++ // read_binprot(&mut data.as_slice()); + + let BlockProver { + block_step_prover, +``` + +Then you can run: + +``` +$ cd openmina/ledger +$ cargo test --release test_block_proof -- --nocapture +``` + +4. 
Run proof generation in OCaml: + Use this branch: https://github.com/openmina/mina/tree/proof-devnet + +``` +$ cd mina +$ export CC=gcc CXX=g++ RUST_BACKTRACE=1 DUNE_PROFILE=devnet +$ make build && _build/default/src/app/cli/src/mina.exe internal run-prover-binprot < /tmp/block_proof_with_key.binprot 2>&1 | tee /tmp/LOG.txt +``` diff --git a/docs/handover/fuzzing.md b/docs/handover/fuzzing.md new file mode 100644 index 000000000..9f20aff18 --- /dev/null +++ b/docs/handover/fuzzing.md @@ -0,0 +1,234 @@ +# OpenMina Fuzzing Infrastructure + +**Note: This document is very incomplete and contains unverified claims.** + +This document explains the fuzzing infrastructure for testing OpenMina's +transaction processing logic against the reference OCaml implementation. + +## Overview + +The OpenMina fuzzer is a differential testing system that validates the Rust +implementation of the Mina Protocol by comparing it against the OCaml reference +implementation. It focuses on transaction processing, validation, and ledger +state management. + +## Architecture + +### Components + +The fuzzer is located in `tools/fuzzing/` and implements differential fuzzing +between the OCaml and Rust implementations of some components. 
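At its core, differential fuzzing means feeding the same input to both implementations and flagging any disagreement. A minimal sketch of that loop follows (hypothetical names and a toy "ledger rule"; in the real fuzzer the Rust side calls the ledger crate directly and the OCaml side is a subprocess spoken to over binprot):

```rust
// Hypothetical sketch of one differential-fuzzing iteration. The names and
// the toy acceptance rule are stand-ins, not the fuzzer's actual API.

#[derive(Debug, PartialEq)]
enum Verdict {
    Accepted,
    Rejected(String),
}

// Stand-in for applying a transaction with the Rust implementation.
fn rust_apply(tx: u64) -> Verdict {
    if tx % 2 == 0 {
        Verdict::Accepted
    } else {
        Verdict::Rejected("invalid nonce".to_string())
    }
}

// Stand-in for the OCaml reference implementation's answer.
fn ocaml_apply(tx: u64) -> Verdict {
    if tx % 2 == 0 {
        Verdict::Accepted
    } else {
        Verdict::Rejected("invalid nonce".to_string())
    }
}

// One fuzz iteration: both implementations must agree; a mismatch is a
// divergence worth saving as a fuzzcase for reproduction.
fn fuzz_iteration(tx: u64) -> Result<(), String> {
    let (rust, ocaml) = (rust_apply(tx), ocaml_apply(tx));
    if rust == ocaml {
        Ok(())
    } else {
        Err(format!("divergence on tx {tx}: {rust:?} vs {ocaml:?}"))
    }
}

fn main() {
    for tx in 0..1_000 {
        fuzz_iteration(tx).expect("implementations diverged");
    }
    println!("no divergence found");
}
```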
+ +## What Gets Fuzzed + +### Primary Targets + +**Transaction Pool Validation** (`pool_verify`) + +- Validates transactions before adding to mempool + +**Transaction Application** (`apply_transaction`) + +- Applies transactions to ledger state + +## Mutation Strategies + +The fuzzer uses mutation strategies implemented in +`tools/fuzzing/src/transaction_fuzzer/mutator.rs`: + +**Weighted Random Selection:** + +- Uses `rand_elements()` function that gives more weight to fewer mutations +- Comment in code: "We give more weight to smaller amount of elements since in + general we want to perform fewer mutations" + +## Running the Fuzzer + +### Prerequisites + +**Rust Nightly Toolchain:** + +```bash +rustup toolchain install nightly +rustup override set nightly +``` + +**OCaml Reference Implementation:** + +The OCaml fuzzer loop is implemented in +[transaction_fuzzer.ml](https://github.com/openmina/mina/blob/openmina/fuzzer/src/app/transaction_fuzzer/transaction_fuzzer.ml). + +```bash +# Use the openmina/fuzzer branch from https://github.com/openmina/mina +# Branch: openmina/fuzzer +# Build the transaction fuzzer executable in that branch: +# dune build src/app/transaction_fuzzer/transaction_fuzzer.exe + +# Then set the path to the built executable: +export OCAML_TRANSACTION_FUZZER_PATH=/path/to/mina/_build/default/src/app/transaction_fuzzer/transaction_fuzzer.exe +``` + +**Note**: The `openmina/fuzzer` branch is messy and should be cleaned up and +integrated into mina mainline to ease the process. 
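The weighted selection described under Mutation Strategies above can be illustrated with a small self-contained sketch (hypothetical code, not the actual `rand_elements()` implementation): the number of mutations `k` is sampled with weight proportional to `1/k`, biasing each run toward fewer mutations.

```rust
// Illustration only: sample a mutation count where picking k mutations has
// weight proportional to 1/k. This mirrors the *idea* behind the fuzzer's
// weighted selection, not its real code.

// Tiny deterministic PRNG so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Pick how many mutations to apply, in 1..=max, weighting count k by 1/k.
fn pick_mutation_count(rng: &mut Lcg, max: u64) -> u64 {
    let weights: Vec<f64> = (1..=max).map(|k| 1.0 / k as f64).collect();
    let total: f64 = weights.iter().sum();
    // Uniform point in [0, total), then walk the cumulative weights.
    let mut x = (rng.next_u64() % 1_000_000) as f64 / 1_000_000.0 * total;
    for (i, w) in weights.iter().enumerate() {
        if x < *w {
            return i as u64 + 1;
        }
        x -= *w;
    }
    max
}

fn main() {
    let mut rng = Lcg(42);
    let mut hist = [0u32; 5];
    for _ in 0..10_000 {
        hist[(pick_mutation_count(&mut rng, 5) - 1) as usize] += 1;
    }
    // Count 1 dominates count 5 (relative weights 1.0 vs 0.2).
    println!("{hist:?}");
}
```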
+ +### Basic Usage + +**Default Fuzzing:** + +```bash +cd tools/fuzzing +cargo run --release +``` + +**With Specific Configuration:** + +```bash +# Use specific random seed for reproducibility +cargo run --release -- --seed 12345 + +# Enable specific fuzzing modes +cargo run --release -- --pool-fuzzing true --transaction-application-fuzzing true + +# Reproduce a specific failing case +cargo run --release -- --fuzzcase /path/to/fuzzcase.file +``` + +### Configuration Options + +**Command Line Arguments:** + +- `--seed <number>` - Set random seed for reproducible runs +- `--pool-fuzzing <bool>` - Enable/disable pool validation fuzzing +- `--transaction-application-fuzzing <bool>` - Enable/disable transaction + application fuzzing +- `--fuzzcase <path>` - Reproduce specific failing test case + +**Environment Variables:** + +- `OCAML_TRANSACTION_FUZZER_PATH` - Path to OCaml transaction fuzzer executable +- `FUZZCASES_PATH` - Directory to save failing cases (default: `/tmp/`) + +### Internal Configuration + +**Verified Parameters (from main.rs):** + +- **Initial Accounts:** 1000 accounts created at startup +- **Minimum Fee:** 1,000,000 currency units (default in context.rs) +- **Default Seed:** 42 +- **Coverage Updates:** Every 1000 iterations +- **Snapshots:** Every 10000 iterations + +**Additional Configuration:** See +`tools/fuzzing/src/transaction_fuzzer/context.rs` for cache sizes and other +parameters. + +## Coverage Analysis + +The fuzzer includes coverage tracking using LLVM instrumentation. 
+ +**Usage:** + +```bash +# Coverage collection is built into the fuzzer when using nightly toolchain +# The fuzzer automatically tracks coverage and generates reports +# See coverage implementation in main.rs CoverageStats struct +cargo run --release +``` + +## Technical Implementation + +### Binary Protocol Communication + +**OCaml Interoperability:** + +- Uses `binprot` serialization for data exchange +- Implements length-prefixed message framing +- Command-based interaction model with stdin/stdout communication + +**Communication Protocol:** The OCaml fuzzer supports multiple action types: + +- `SetConstraintConstants` - Configure blockchain constraint parameters +- `InitializeAccounts` - Setup initial account states +- `SetupTransactionPool` - Initialize transaction pool +- `VerifyPoolTransaction` - Validate transactions for pool admission +- `ApplyTransaction` - Apply transactions to ledger state +- `GetAccounts` - Retrieve account information +- `Exit` - Terminate fuzzer process + +**Rust Integration:** See `main.rs` functions: + +- `ocaml_pool_verify()` - Pool validation testing +- `ocaml_apply_transaction()` - Transaction application testing +- `serialize()`/`deserialize()` - Binary protocol communication + +### OCaml Fuzzer Architecture + +**Core Components:** + +- **Ledger Simulation**: Creates ephemeral ledgers for isolated testing +- **Mock Components**: Uses `Mock_transition_frontier` for controlled blockchain + simulation +- **Async Operations**: Leverages OCaml's Async library for non-blocking + operations +- **Error Tracking**: Comprehensive error handling with backtrace generation + +**Testing Environment:** + +- Simulated blockchain ledger with configurable constraint constants +- Account initialization and management +- Transaction pool setup and verification +- Isolated transaction application testing + +**Communication Model:** + +- Loop-based command processing from stdin +- Binary-encoded responses to stdout +- Supports graceful termination 
via `Exit` command + +### Error Handling and Debugging + +**Panic Detection:** Implemented in `main.rs` `fuzz()` function using +`panic::catch_unwind()` to detect panics and save fuzzcases for reproduction. + +**OCaml Error Handling:** + +- Comprehensive error tracking with backtrace generation +- Structured error responses for debugging +- Async error propagation for non-blocking operations + +## Basic Usage Guide + +1. **Reproduce with Fixed Seed** - Use `--seed` to reproduce specific runs +2. **Examine Saved Cases** - Check `/tmp/` for automatically saved failing cases +3. **Check OCaml Connectivity** - Ensure OCaml fuzzer path is correct + +## Troubleshooting + +**OCaml Process Communication Failures:** + +```bash +# Check OCaml fuzzer path +ls -la $OCAML_TRANSACTION_FUZZER_PATH + +# Test OCaml fuzzer directly +$OCAML_TRANSACTION_FUZZER_PATH --help +``` + +**Coverage Collection Problems:** + +```bash +# Ensure nightly toolchain +rustup show + +# Coverage is built into the fuzzer, no additional tools needed +``` + +**Permission Denied Errors:** + +```bash +# Check write permissions for fuzzcase directory +ls -la /tmp/ + +# Use alternative directory +export FUZZCASES_PATH=/path/to/writable/directory +``` diff --git a/docs/handover/git-workflow.md b/docs/handover/git-workflow.md new file mode 100644 index 000000000..51e0896af --- /dev/null +++ b/docs/handover/git-workflow.md @@ -0,0 +1,174 @@ +# Git Workflow and PR Policy + +This document outlines the git workflow and pull request policy used in the +OpenMina repository. + +## Branch Management + +### Main Branches + +- **`develop`** - Main integration branch for active development +- **`main`** - Stable branch that receives periodic merges from develop + +**Note**: Unlike the OCaml Mina node which has a `compatible` branch, OpenMina +does not maintain a compatibility branch because we haven't had to support two +different protocol versions simultaneously. 
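The branch lifecycle this document describes (branch from `develop`, commit incrementally, rebase to stay current) can be exercised end-to-end in a throwaway repository. This is illustrative only; the file names and commit messages are made up, and only standard git commands are used:

```shell
# Simulate the feature-branch flow in a temporary repo.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b develop
git config user.email dev@example.com && git config user.name Dev

echo base > base.txt && git add base.txt
git commit -qm "chore: initial commit"

# 1. Create a feature branch from the latest develop
git checkout -q -b feat/error-sink-service
echo work > sink.txt && git add sink.txt
git commit -qm "feat(core): add error sink service"

# 2. develop moves on in the meantime...
git checkout -q develop
echo more > other.txt && git add other.txt
git commit -qm "fix(p2p): unrelated fix"

# 3. ...so the feature branch is rebased to stay current
git checkout -q feat/error-sink-service
git rebase -q develop

git log --oneline   # the feature commit now sits on top of develop's history
```

After the rebase, the feature branch contains develop's new commit plus the feature commit, which is the state a PR should be in before the final review and merge.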
+ +### Branch Naming Conventions + +**Feature Branches:** + +- `feat/feature-name` - New features and enhancements +- `fix/issue-description` - Bug fixes and corrections +- `tweaks/component-name` - Improvements and optimizations +- `chore/description` - Maintenance, CI, and tooling updates + +**Release Branches:** + +- `prepare-release/vX.Y.Z` - Release preparation branches + +**Examples:** + +``` +feat/error-sink-service +fix/initial-peers-resolve-failure +tweaks/yamux +chore/ci-runner-images +prepare-release/v0.16.0 +``` + +## Pull Request Development Workflow + +### 1. Development Phase + +- **Create feature branch** from latest `develop` +- **Make incremental commits** without squashing +- **Use descriptive commit messages** following conventional commits format +- **Push all commits** to remote branch regularly +- **Rebase branch** regularly to stay current with develop +- **Use WIP/tmp commits** for work-in-progress saves + +### 2. Review Phase + +- **Keep linear intermediary commits** during review process +- **All commits remain visible** to ease reviewers' job +- **Reviewers can see progression** of changes and iterations +- **Address review feedback** with additional commits +- **Regular rebasing** to keep branch current with develop + +### 3. Pre-Merge Phase + +- **Code review** - PRs should ideally be reviewed by another team member +- **Squash related commits** that don't make sense alone: + - Fixup commits (`fixup tests`, `more refactor`) + - WIP commits (`tmp`, `WIP`) + - Incremental improvements to the same feature + - Review feedback commits +- **Preserve meaningful commits** that represent separate logical changes +- **Final rebase** against latest develop +- **Clean history for posterity** - helps when checking history after merge + +### 4. 
Merge Phase + +- **Merge with merge commit** (no fast-forward) +- **Delete feature branch** using GitHub's UI after merge + +## Commit Message Format + +Follow conventional commits format: `type(scope): description` + +**Common Types:** + +- `feat` - New features +- `fix` - Bug fixes +- `chore` - Maintenance tasks +- `refactor` - Code restructuring +- `tweak` - Minor improvements + +**Examples:** + +``` +feat(yamux): Split incoming frame handling into multiple actions +fix(p2p): Do not fail when an initial peer address cannot be resolved +chore(ci): Upgrade jobs to use Ubuntu 22.04 +tweak(yamux): Set max window size to 256kb +``` + +## Commit Squashing Policy + +### Purpose + +- **During review**: Keep all commits visible to help reviewers understand the + development process +- **Before merge**: Squash commits to create clean history for future reference + and debugging + +### When to Squash + +- **Fixup commits** - Commits that fix issues in previous commits +- **WIP commits** - Temporary work-in-progress commits +- **Incremental improvements** - Multiple commits that refine the same feature +- **Review feedback** - Commits that address review comments + +### When NOT to Squash + +- **Separate logical changes** - Commits that represent distinct features or + fixes +- **Different components** - Changes that affect different parts of the system +- **Meaningful progression** - Commits that show logical development steps + +### Examples + +**Single-commit PRs** (no squashing needed): + +``` +fix(p2p): Do not fail when an initial peer address cannot be resolved +``` + +**Multi-commit PRs** (preserve logical units): + +``` +chore(ci): Upgrade jobs to use Ubuntu 22.04 +chore(ci): Install libssl3 from bookworm in mina bullseye images +``` + +**Complex PRs** (squash related work): + +``` +Before squashing: +- feat(yamux): Set max window size to 256kb +- tweak(yamux): Simplify reducer a bit +- tweak(yamux): Simplify reducer a bit more +- feat(yamux): Split incoming frame 
handling into multiple actions +- feat(yamux): Add tests +- fixup tests +- more refactor +- fixup tests (clippy) + +After squashing: +- feat(yamux): Improve reducer and add comprehensive tests +- feat(yamux): Split incoming frame handling into multiple actions +``` + +## Best Practices + +1. **Rebase regularly** - Keep feature branches up-to-date with develop +2. **Commit often** - Make small, focused commits during development +3. **Clean before merge** - Ensure final commit history is logical and readable +4. **Descriptive messages** - Write clear, specific commit messages +5. **Review history** - Check that squashed commits tell a coherent story +6. **Test before merge** - Ensure all commits in the final history build and + pass tests + +## Merge Strategy + +- **Merge commits** are created for all PRs: + `Merge pull request #XXXX from openmina/branch-name` +- **No fast-forward merges** - Merge commits preserve PR context and history +- **Rebase before merge** - Branches are rebased to develop before merging +- **Delete merged branches** - Use GitHub's UI to delete feature branches after + successful merge + +This workflow balances development flexibility with clean version history, +allowing for iterative development while ensuring the final merged result has a +clear, logical commit structure. diff --git a/docs/handover/ledger-crate.md b/docs/handover/ledger-crate.md new file mode 100644 index 000000000..e12a45681 --- /dev/null +++ b/docs/handover/ledger-crate.md @@ -0,0 +1,168 @@ +# Ledger Crate + +## Overview + +The `ledger` crate is a comprehensive Rust implementation of the Mina protocol's +ledger, transaction pool, staged ledger, scan state, proof verification, and +zkApp functionality, providing a direct port of the OCaml implementation. For +developers familiar with the OCaml codebase, this maintains the same +architecture and business logic while adapting to Rust idioms. 
+ +For technical debt and critical issues, see +[`ledger/summary.md`](../../ledger/summary.md). + +## Architecture + +### Core Components + +**BaseLedger Trait** (`src/base.rs`) + +- Direct mapping to OCaml's `Ledger_intf.S` +- Defines the fundamental ledger interface for account management, Merkle tree + operations, and state queries +- All ledger implementations (Database, Mask) implement this trait + +**Mask System** (`src/mask/`) + +- Port of OCaml's `Ledger.Mask` with identical copy-on-write semantics +- Provides layered ledger views for efficient state management +- Uses `Arc` for cheap reference counting; `Mask::clone()` is + fast +- Used extensively in transaction processing to create temporary ledger states + +**Database** (`src/database/`) + +- In-memory implementation (ondisk module exists but is not used) +- Corresponds to OCaml's `Ledger.Db` interface +- Handles account storage and Merkle tree management + +### Transaction Processing + +**Transaction Pool** (`src/transaction_pool.rs`) + +- Complete port of `Transaction_pool.Indexed_pool` with identical behavior: + - Fee-based transaction ordering + - Sender queue management with nonce tracking + - Revalidation on best tip changes + - `VkRefcountTable` for verification key reference counting +- Handles transaction mempool operations, expiration, and replacement logic + +**Staged Ledger** (`src/staged_ledger/`) + +- Maps directly to OCaml's staged ledger implementation +- `Diff` corresponds to `Staged_ledger_diff` with same partitioning +- Manages transaction application and block validation +- Handles pre-diff info for coinbase and fee transfers + +**Scan State** (`src/scan_state/`) + +- Direct port of the parallel scan tree structure +- `transaction_logic` module maps to OCaml's `Transaction_logic` +- Manages SNARK work coordination and pending coinbase +- Maintains same proof requirements as OCaml implementation + +### Proof System Integration + +**Proof Generation and Verification** (`src/proofs/`) 
+ +- Transaction proof generation and verification (`transaction.rs`) +- Block proof generation and verification (`block.rs`) +- zkApp proof generation and handling (`zkapp.rs`) +- Merge proof generation for scan state +- Witness generation for circuits (`witness.rs`) +- Uses Kimchi proof system via proof-systems crate +- Maintains protocol compatibility with OCaml proofs + +Note: The crate implements witness generation for circuits but not the +constraint generation, so circuits cannot be fully generated from this crate +alone. + +For detailed technical information about the proof system implementation, see +[`ledger/src/proofs/summary.md`](../../ledger/src/proofs/summary.md). + +**zkApp Support** (`src/zkapps/`) + +- Full zkApp transaction processing +- Account update validation +- Permission and authorization checks +- SNARK verification for zkApp proofs + +### Additional Components + +**Account Management** (`src/account/`) + +- Account structure with balances, permissions, and timing +- Token support with owner tracking +- Delegate and voting rights management + +**Sparse Ledger** (`src/sparse_ledger/`) + +- Efficient partial ledger representation +- Used for witness generation in SNARK proofs +- Maintains minimal account set needed for proof creation + +## Key Differences from OCaml + +1. **Memory-only implementation** - No persistent disk storage used +2. **Rust idioms**: + - `Result` instead of OCaml's `Or_error.t` + - `HashMap`/`BTreeMap` instead of OCaml's `Map`/`Hashtbl` + - Ownership model instead of garbage collection +3. **Serialization** - Uses serde for state machine persistence and network + communication + +**Note**: The FFI code present in the crate is stale and unused - it was from +earlier integration attempts before implementing a full node was even planned. 
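To make the mask semantics from the Core Components section concrete, here is a minimal copy-on-write layering sketch (hypothetical types; the crate's real `Mask`/`BaseLedger` API is much richer and also tracks Merkle hashes):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical sketch of mask-style layering: reads fall through to the
// parent layer, writes land only in the top layer.
#[derive(Clone, Default)]
struct Layer {
    accounts: HashMap<String, u64>,    // this layer's own writes
    parent: Option<Arc<Mutex<Layer>>>, // read-through to the layer below
}

impl Layer {
    // A child mask over an existing layer; cloning the Arc is cheap.
    fn child(parent: Arc<Mutex<Layer>>) -> Layer {
        Layer { accounts: HashMap::new(), parent: Some(parent) }
    }

    fn set(&mut self, id: &str, balance: u64) {
        self.accounts.insert(id.to_string(), balance);
    }

    fn get(&self, id: &str) -> Option<u64> {
        self.accounts.get(id).copied().or_else(|| {
            self.parent
                .as_ref()
                .and_then(|p| p.lock().unwrap().get(id))
        })
    }
}

fn main() {
    let base = Arc::new(Mutex::new(Layer::default()));
    base.lock().unwrap().set("alice", 100);

    let mut mask = Layer::child(base.clone());
    mask.set("alice", 60); // copy-on-write: the base still sees 100

    assert_eq!(mask.get("alice"), Some(60));
    assert_eq!(base.lock().unwrap().get("alice"), Some(100));
    println!("mask overrides, base untouched");
}
```

Dropping the top layer discards its writes, which is the shape the real mask system relies on for temporary ledger states during transaction processing.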
+ +## Compatibility + +The crate maintains full protocol compatibility with the OCaml implementation: + +- Same Merkle tree structure and hashing +- Identical transaction validation rules +- Compatible proof verification +- Same account model and permissions + +The `port_ocaml` module provides implementations compatible with the OCaml +runtime, including: + +- Hash functions that match OCaml's behavior +- Hash table implementation that behaves like Jane Street's `Base.Hashtbl` for + compatibility + +## Future Refactoring + +The ledger crate is currently monolithic and should ideally be split into +separate crates. At a minimum, it could be split into: + +- `mina-account` - Account structures and management +- `mina-ledger` - Base ledger implementation, staged ledger, masks, sparse + ledger, Merkle tree infrastructure +- `mina-transaction-logic` - Transaction application and validation logic, + currency types +- `mina-scan-state` - SNARK work coordination and parallel scan (depends on + transaction-logic) +- `mina-transaction-pool` - Transaction mempool logic (depends on ledger for + masks) +- `mina-proofs` - Proof generation and verification + +**Note**: The staged ledger remains with the core ledger as it's tightly coupled +with the mask system and represents the fundamental "next state" computation. +Attempting to separate it would create circular dependencies and break the +natural layering of Database → Mask → StagedLedger. + +This would improve compilation times, enable better testing isolation, and allow +other components to depend only on what they need. + +## Usage in OpenMina + +The ledger crate is primarily used by: + +- Block production for transaction selection +- Block application for state transitions +- P2P for transaction pool management +- SNARK workers for proof generation +- RPC endpoints for balance and account queries + +All ledger operations go through the state machine actions defined in the node +crate, ensuring deterministic execution. 
diff --git a/docs/handover/mainnet-readiness.md b/docs/handover/mainnet-readiness.md new file mode 100644 index 000000000..25e7354d4 --- /dev/null +++ b/docs/handover/mainnet-readiness.md @@ -0,0 +1,379 @@ +# Mainnet Readiness + +This document outlines the key features and improvements required for OpenMina +to be ready for mainnet deployment. + +## Critical Requirements + +### 1. Persistence Implementation + +**Status**: Draft design +([Issue #522](https://github.com/openmina/openmina/issues/522), see +[persistence.md](persistence.md)) + +The ledger is currently kept entirely in memory, which is not sustainable for +mainnet's scale. Persistence is required for: + +- Reducing memory usage to handle mainnet-sized ledgers and SNARK volumes +- Enabling fast node restarts without full resync +- Supporting webnodes with browser storage constraints +- Providing a clean foundation for implementing SNARK verification deduplication + +**Note**: There is a very old implementation for on-disk storage in +`ledger/src/ondisk` that was never used - a lightweight key-value store +implemented to avoid the RocksDB dependency. This is unrelated to the new +persistence design which intends to solve persistence for everything, not just +the ledger. But the old implementation may be worth revisiting anyway. + +**Performance Impact**: The importance of SNARK verification deduplication for +mainnet performance has been demonstrated in the OCaml node, where we achieved +dramatic improvements (8-14 seconds → 0.015 seconds for block application). See +the "SNARK Verification Deduplication" section in +[persistence.md](persistence.md) for details. + +### 2. Wide Merkle Queries + +**Status**: Not implemented +([Issue #1086](https://github.com/openmina/openmina/issues/1086)) + +Wide Merkle queries are needed for: + +- Protocol compatibility +- Faster synchronization + +### 3. 
Delta Chain Proof Verification + +**Status**: Not implemented +([Issue #1017](https://github.com/openmina/openmina/issues/1017)) + +When verifying blocks, OpenMina should verify the delta chain proofs. + +### 4. Automatic Hardfork Handling + +**Status**: Not implemented + +OpenMina needs a mechanism to automatically handle protocol hardforks to +maintain compatibility with the Mina network. There is an implementation in +progress for the OCaml node, and the Rust node should implement something +compatible. However, we lack detailed knowledge of the OCaml implementation to +provide specific guidance. + +### 5. Security Audit + +**Status**: Not performed + +A comprehensive security audit by qualified third-party security experts is +essential before mainnet deployment. This should cover cryptographic +implementations, consensus logic, networking protocols, and potential attack +vectors specific to the Rust implementation. + +### 6. Mainnet Genesis Ledger Distribution + +**Status**: Solution exists for devnet, mainnet needs implementation + +The node requires efficient genesis ledger loading for practical operation. A +binary genesis ledger must be produced for mainnet and included in the node +distribution (or made downloadable from a location where the node can fetch it). +Currently, mainnet genesis ledgers would be too large and expensive to process +in JSON format. 
+ +**Current Implementation**: + +- Genesis ledgers are available in JSON format at + [openmina-genesis-ledgers](https://github.com/openmina/openmina-genesis-ledgers) + repository +- Devnet uses a prebuilt binary format (`genesis_ledgers/devnet.bin`) that loads + very quickly +- The `tools/ledger-tool` utility can generate these binary formats + +**Requirements for Mainnet**: + +- Generate equivalent binary format for mainnet genesis ledger to ensure fast + node startup +- Establish workflow for creating and distributing mainnet binary ledgers +- Implement process for updating binary ledgers after hardforks +- Handle potential devnet relaunch scenarios requiring new binary ledgers + +**Note**: These binary ledger files would become deprecated once persistence is +implemented, as the persisted database itself could be provided for initial node +setup instead. + +### 7. Error Sink Service Integration + +**Status**: Partially implemented (PR #1097) + +Currently, the node intentionally panics in recoverable error situations (such +as block proof failures) to make errors highly visible during development. For +mainnet deployment, this needs to transition to using the error sink service to +report errors and continue operation instead of forcing node shutdown. This is +critical for operational stability in production environments. + +## Protocol Compliance and Feature Parity + +### 1. Block Processing Deviation + +Currently, only the best block is processed and broadcasted. See +[this analysis](https://gist.github.com/tizoc/4a364dc2f8f29396a4097428a07f58d8) +for details on this deviation from the full protocol. + +### 2. SNARK Work Partitioner + +Feature parity requirement for full node capabilities. + +### 3. GraphQL API for SNARK Workers + +Feature parity requirement for supporting external SNARK workers. 
SNARK workers +need a proper API interface to: + +- Submit completed work +- Query work requirements +- Coordinate with block producers + +## Future Mina Compatibility + +### 1. Dependency Updates + +- Update proof-systems, ark, and other cryptographic dependencies to latest + versions +- Ensure compatibility with future Mina protocol changes + +### 2. App State Field Increase + +Support for increasing app state from 8 to 32 fields, enabling more complex +smart contracts and zkApps. + +## Webnode-Specific Requirements + +For webnodes to handle mainnet: + +- Persistence implementation (see [persistence.md](persistence.md)) - critical + for webnodes due to memory constraints, intermittent connectivity, and + frequent restarts +- Memory usage constraints are manageable without block proving +- Block production for webnodes is not a priority since block producers are + unlikely to use browser-based setups for production operations, so the + separate WASM memory block + ([Issue #1128](https://github.com/openmina/openmina/issues/1128)) may not be + necessary + +### zkApp Integration + +A popular feature request from users is direct zkApp and webnode integration. +This would allow zkApps to: + +- Get better visibility into the network state +- Query the blockchain directly without relying on third-party nodes +- Avoid query limits imposed by external services +- Provide a more decentralized and reliable infrastructure for zkApp developers + +Additionally, the webnode could be packaged as a Node.js library, enabling zkApp +developers to build testing frameworks that take advantage of OpenMina's +simulator capabilities for more comprehensive and realistic testing +environments. In such testing setups, block production would use dummy proofs +rather than full proof generation. + +### Known Issues from Community Testing + +During testing with ~100 webnode operators from the community, a few critical +issues were identified. 
See the +[official retrospective](https://minaprotocol.com/blog/retro-mina-web-node-testing) +for complete details. + +#### 1. Seed Node Performance + +- **Issue**: Performance problems on the first day due to UDP socket management + issues in the webrtc-rs library. +- **Resolution**: Switched from webrtc-rs to the C++ "datachannel" + implementation, which performs worse but is more stable. +- **Future Improvement**: Using QUIC transport for webnode-to-server + communication would also help with seed node performance (see + [P2P Evolution Plan](p2p-evolution.md)). +- **Status**: Resolved for testing, but seed node scalability should be + monitored for mainnet deployment. + +#### 2. Memory Limitations + +- **Issue**: Webnodes are limited to 4GB of memory due to WASM constraints. When + nodes sometimes reached this limit during block proving, they became stuck or + experienced major thread crashes. +- **Root Cause**: + - Block proving operations consuming excessive memory within the same memory + space + - WASM memory allocator limitations leading to fragmentation +- **Solutions**: + - Moving the prover to its own WASM heap would alleviate the issue + ([Issue #1128](https://github.com/openmina/openmina/issues/1128)) + - Memory limitations are less critical for nodes that don't produce and prove + blocks + - Implementing persistence (as per [persistence.md](persistence.md)) would + considerably improve the situation + +#### 3. Network Connectivity and Bootstrap Issues + +- **Issue**: Many nodes had difficulty completing initial bootstrap, especially + when the webnode or peers providing staged ledgers experienced network + connectivity problems (instability, low bandwidth, high latency). +- **Impact**: Hard to finish initial sync for many community operators. +- **Note**: The RPC used to fetch the staged ledger is particularly problematic + because it is very heavy and needs to download a lot of data in a single + request. 
It would be a good idea to redesign this RPC, ideally enabling it to + fetch parts from multiple peers in the same way that the snarked ledger sync + process does. +- **Planned Solutions**: + - Prefer server-side nodes for fetching initial ledgers instead of relying on + other webnodes + - Add support for wide merkle queries to reduce roundtrips and improve sync + efficiency + - Better peer selection algorithms for initial bootstrap + +## Rollout Plan + +### Testing Requirements + +Before mainnet deployment, OpenMina requires extensive testing to ensure +protocol compatibility and reliable operation. Note that OpenMina has already +demonstrated stability in devnet environments: block producer nodes have run +continuously for over two months without issues (only restarting for upgrades), +and webnodes without block production have maintained perfect uptime for two +continuous weeks without memory issues. + +However, there are rare situations where produced blocks cannot be proven (or +more precisely, where invalid proofs are produced). This is one of the primary +areas requiring further investigation and resolution. 
+ +#### Scenario Testing Expansion + +- **Significantly increase scenario tests** to cover edge cases and protocol + interactions +- **Multi-node scenarios** testing various network configurations and failure + modes +- **Long-running stability tests** to validate node behavior over extended + periods +- **Compatibility testing** with OCaml nodes to ensure seamless network + operation +- **Stress testing** under high transaction volume and network load + +#### Protocol Compatibility Validation + +- **Comprehensive testing against OCaml implementation** for all protocol + interactions +- **Cross-implementation consensus testing** to verify identical blockchain + state +- **P2P protocol compatibility** testing for message handling and propagation +- **RPC API compatibility** testing to ensure client applications work + seamlessly + +### Prover/Verifier Implementation Strategy Options + +OpenMina has a full implementation of both prover and verifier in Rust. The +prover includes many optimizations that make proving significantly faster than +the original OCaml implementation, including optimizations for server-side +proving. However, these implementations have not been audited, and auditing plus +ongoing maintenance requires significant time and effort. + +#### Option 1: Complete Rust Implementation + +Continue using OpenMina's Rust prover and verifier implementations. 
+ +**Pros**: + +- Full control and better integration with the Rust codebase +- Significant performance improvements over OCaml implementation +- Single codebase without external dependencies + +**Cons**: + +- Requires comprehensive security audit before mainnet deployment +- Ongoing maintenance and compatibility responsibilities +- More implementation work for any missing features + +#### Option 2: OCaml Subprocess Integration + +Reuse the proven OCaml prover and verifier implementations through subprocess +services with wrapper services in the state machine to handle communication +(similar to what OpenMina used before it had its own prover). + +**Pros**: + +- Leverages battle-tested OCaml prover/verifier code +- Reduces implementation work and compatibility risks +- Faster path to mainnet readiness +- Proven correctness and performance + +**Cons**: + +- Additional complexity in service management +- Cross-process communication overhead +- Dependency on OCaml runtime + +**Implementation Approach for Option 2**: + +- Create service interfaces for prover/verifier subprocess communication +- Implement subprocess lifecycle management (start, restart, health checks) +- Design efficient data serialization for cross-process communication +- Add comprehensive error handling and fallback mechanisms + +#### Webnode-Specific Considerations + +**Essential Verifier Requirements**: All webnodes need verifiers for: + +- Block proof verification +- zkApp proof verification +- Transaction proof verification + +**Block Production Capability**: Webnodes can produce and prove blocks, but this +may not be a rollout priority. 
+ +**Implementation Strategy for Webnodes**: + +- **Rust verifier implementation**: Much smaller and simpler than prover, making + it easier to review, verify correctness, and maintain +- **Prover considerations**: If webnode block production isn't prioritized + initially, prover implementation can be deferred +- **Alternative approaches**: If Rust verifier proves challenging, alternative + verification methods need investigation +- **Hybrid approach**: Rust verifiers for webnodes, OCaml subprocess for full + node provers when needed + +### Deployment Phases + +1. **Extended Testnet Testing** - Months of testing with comprehensive scenario + coverage +2. **Limited Mainnet Beta** - Controlled deployment with selected nodes, + potentially with block production disabled +3. **Initial Mainnet Rollout** - Non-block-producing nodes for network health + safety + - Verification-only deployment for both server-side and webnodes + - Allows earlier rollout while building confidence + - Reduces risk to network stability +4. **Block Production Enablement** - Enable block proving in later releases when + confidence is established +5. 
**Full Production** - Complete mainnet readiness with all features enabled + +**Rollout Strategy Benefits**: + +- **Earlier deployment**: Verification-only nodes can be deployed sooner +- **Network safety**: Reduces risk of consensus issues during initial rollout +- **Gradual feature introduction**: Block production can be added once the core + node functionality is proven stable +- **Confidence building**: Allows time to validate node behavior before enabling + block production + +## Summary + +The most critical items for mainnet readiness are: + +- **Persistence** - Without this, nodes may not be able to handle mainnet's + ledger and snark pool size +- **Wide Merkle Queries** - Needed for compatibility and faster sync +- **Delta Chain Verification** - Required for protocol compliance +- **Hardfork Handling** - Essential for network compatibility +- **Security Audit** - Third-party security review before mainnet deployment +- **Mainnet Genesis Ledger** - Binary format required for mainnet distribution + and hardfork updates +- **Error Sink Service** - Replace intentional panics with graceful error + reporting +- **Comprehensive Testing** - Extensive scenario testing for protocol + compatibility +- **Prover/Verifier Strategy** - Decision on Rust implementation vs OCaml + subprocess integration diff --git a/docs/handover/ocaml-coordination.md b/docs/handover/ocaml-coordination.md new file mode 100644 index 000000000..ce970410e --- /dev/null +++ b/docs/handover/ocaml-coordination.md @@ -0,0 +1,116 @@ +# OCaml Node Coordination for Rust Development + +This document outlines features and improvements that the OCaml node could +implement to enhance Rust node development and testing workflows, based on +limitations and needs identified in the handover documentation. 
+
+## Overview
+
+The OpenMina (Rust) and OCaml Mina implementations need to work together in
+several key areas:
+
+- **Cross-implementation testing** for protocol compatibility
+- **Circuit generation** workflows that rely on the OCaml implementation
+- **Fuzzing infrastructure** for differential testing
+- **P2P protocol evolution** toward unified networking
+- **Development workflows** that involve both implementations
+
+This document consolidates all identified areas where OCaml improvements would
+benefit Rust development.
+
+## Maintenance Burden Coordination
+
+OpenMina maintains custom branches in the https://github.com/openmina/mina
+repository for features not yet integrated into mainline Mina:
+
+### Circuit Generation Branch
+
+OpenMina's circuit generation process requires launching a custom build of the
+OCaml node from the `utils/dump-extra-circuit-data-devnet301` branch, which
+produces circuit cache data in `/tmp/coda_cache_dir` and dumps both the usual
+circuit data and the extra data specifically required by OpenMina.
+
+Without mainline integration, the OpenMina team must manually maintain this
+branch, making it a high-priority coordination need. The branch requires
+significant cleanup before integration into mainline Mina. Once integrated, the
+circuit generation functionality could be added as a node subcommand that
+exports the required circuit data without starting the full node.
+
+For detailed information about the circuit generation process, see
+[Circuit Generation Process](circuits.md#circuit-generation-process).
+
+### Fuzzer Branch
+
+The `openmina/fuzzer` branch contains the OCaml transaction fuzzer
+implementation used for differential testing between the OCaml and Rust
+implementations. Like the circuit generation branch, it requires manual
+maintenance by the OpenMina team when not integrated into mainline Mina.
While +lower priority than circuit generation, integration would reduce maintenance +overhead and streamline the fuzzing setup process. + +For detailed information about the fuzzing infrastructure and setup, see +[Fuzzing Infrastructure](fuzzing.md). + +## Cross-Implementation Testing Challenges + +Cross-implementation testing between OCaml and Rust nodes faces several +challenges due to architectural differences. The OCaml node was not designed to +integrate with the Rust testing framework: + +- **Time Control**: Cannot be controlled via test framework time advancement +- **State Inspection**: Cannot be inspected by Rust testing infrastructure +- **Network Control**: Cannot manually control P2P connections from test + framework +- **Behavioral Control**: No control over internal execution flow from test + framework + +These differences restrict the types of cross-implementation testing that can be +performed. Currently, OCaml nodes can be used for basic interoperability +validation rather than comprehensive protocol behavior testing. Addressing these +differences through coordination with the OCaml Mina team could enable more +thorough cross-implementation testing and better validation of protocol +compatibility between the implementations. + +For detailed information about these limitations and potential improvements, see +[Testing Infrastructure - OCaml Node Limitations](testing-infrastructure.md#ocaml-node-limitations). + +## Shared Infrastructure Dependencies + +### P2P Evolution + +The documented vision includes replacing the current Golang `libp2p_helper` with +a Rust implementation that reuses OpenMina's P2P code, creating a unified +networking layer across all Mina implementations. + +For detailed information about the P2P evolution plan and coordination +requirements, see [P2P Evolution Plan](p2p-evolution.md). + +### Archive Service Integration + +OpenMina uses the same archive node helper processes as the OCaml node. 
Any +incompatible changes to the archive interface would require coordinated updates +to ensure both implementations continue to work with the shared archive +infrastructure. + +## Protocol Compatibility Coordination + +### Hardfork Compatibility + +An OCaml implementation for automatic hardfork handling is currently in +progress, and the Rust node needs to implement compatible behavior. Without +coordination, incompatible hardfork implementations could lead to network splits +where OCaml and Rust nodes follow different protocol rules, breaking consensus +and network unity. + +Coordination is needed to ensure both implementations handle hardforks +identically and maintain network compatibility during protocol upgrades. + +## Related Documentation + +- [Testing Infrastructure](testing-infrastructure.md) - OCaml node limitations + in testing +- [P2P Evolution Plan](p2p-evolution.md) - Unified P2P layer vision +- [Fuzzing Infrastructure](fuzzing.md) - Current fuzzing setup and limitations +- [Circuits](circuits.md) - Circuit generation process dependencies +- [Mainnet Readiness](mainnet-readiness.md) - Cross-implementation compatibility + requirements diff --git a/docs/handover/organization.md b/docs/handover/organization.md new file mode 100644 index 000000000..568d0fd68 --- /dev/null +++ b/docs/handover/organization.md @@ -0,0 +1,260 @@ +# OpenMina Project Organization + +This document provides a navigation guide to the OpenMina codebase structure, +focusing on entry points, supporting libraries, and build organization. + +> **Prerequisites**: Read +> [Architecture Walkthrough](architecture-walkthrough.md) and +> [State Machine Structure](state-machine-structure.md) first. **Next Steps**: +> After understanding the codebase layout, dive into specific components via +> [Services](services.md) or start developing with +> [State Machine Development Guide](state-machine-development-guide.md). 
+ +## Component Details + +For detailed information about state machine components and their interactions, +see [State Machine Structure](state-machine-structure.md). This document focuses +on codebase navigation and supporting infrastructure. + +## Entry Points + +### CLI (`cli/`) + +The main entry point for running OpenMina nodes. Contains: + +- **`src/main.rs`** - Application entry point with memory allocator setup and + signal handling +- **`src/commands/`** - CLI command implementations: + - `node/` - Node startup and configuration + - `replay/` - State replay functionality for debugging + - `snark/` - SNARK-related utilities and precalculation + - `build_info/` - Build information and version details + - `misc.rs` - Miscellaneous utilities + +The CLI supports different networks (devnet/mainnet) and provides the +server-side node functionality. + +## Core Components + +### Node (`node/`) + +The main orchestrating component containing both state machine logic and service +implementations. For state machine component details, see +[State Machine Structure](state-machine-structure.md). + +**Special Implementations:** + +- **`node/web/`** - WebAssembly-compatible service layer for browser deployment + - Exports a default `Service` for web (WASM) environments + - Provides Rayon-based parallelism configuration for web workers + - Enables OpenMina to run as a light client in browsers + +### P2P Networking (`p2p/`) + +OpenMina includes two distinct P2P implementations. For state machine details, +see [State Machine Structure](state-machine-structure.md). + +#### libp2p Implementation + +Traditional P2P networking for server-to-server communication and OCaml node +compatibility with custom WebRTC transport, Noise security, Yamux multiplexing, +and standard libp2p protocols. + +#### WebRTC-Based P2P Implementation + +Pull-based (long-polling) P2P protocol designed for webnode deployment and +browser environments. 
+ +**Design Features:** + +- Pull-based message flow where recipients request messages instead of receiving + unsolicited pushes (long-polling approach) +- 8 specialized channels (BestTipPropagation, TransactionPropagation, etc.) +- Efficient pool propagation with eventual consistency +- DDOS resilience through fairness mechanisms +- WebRTC transport for browser-to-browser communication + +**Implementation:** + +- Core WebRTC code: `p2p/src/webrtc/` and `p2p/src/service_impl/webrtc/` +- Channel implementations: `p2p/src/channels/` (8 specialized state machines) +- Multi-backend support: Rust WebRTC, C++ WebRTC, and Web/WASM + +**Future Enhancements:** + +- QUIC transport integration for protocol consolidation +- Block propagation optimization with header/body splitting +- Advanced bandwidth reduction using local pool references +- See [P2P Evolution Plan](p2p-evolution.md) for detailed plans + +**Documentation:** See [p2p/readme.md](../../p2p/readme.md) for design overview + +### SNARK Verification (`snark/`) + +State machine components for managing proof verification workflows. For +component details, see [State Machine Structure](state-machine-structure.md). + +**Note:** The actual proof system implementations and cryptographic proof +generation/verification are located in the `ledger` crate's `proofs/` module. +This crate only contains the state machine logic for orchestrating proof +verification workflows. + +### Ledger (`ledger/`) + +Blockchain state management library. For detailed information, see +[Ledger Crate](ledger-crate.md). 
+ +## Supporting Libraries + +### Core Types (`core/`) + +Foundational shared types and utilities used across the entire codebase: + +- **Block types** - Applied blocks, genesis configuration, prevalidation +- **Transaction types** - Transaction info and hash wrappers +- **SNARK types** - Job commitments, IDs, and comparison utilities +- **Network types** - P2P configuration and network utilities +- **Request types** - Request and RPC ID management +- **Constants** - Constraint constants and protocol parameters +- **Substate system** - Fine-grained state access control for the state machine +- **Logging and threading utilities** + +This crate provides the common foundation that all other components depend on. + +### Cryptographic Primitives + +#### VRF (`vrf/`) + +Verifiable Random Function implementation for block producer selection: + +- Implements the cryptographic VRF used in Proof of Stake consensus +- Generates verifiable random numbers for fair block producer selection +- Provides threshold evaluation and message handling +- Compatible with the OCaml node's VRF implementation + +#### Poseidon Hash (`poseidon/`) + +Poseidon hash function implementation optimized for zero-knowledge proofs: + +- Sponge construction constants and parameters +- Field arithmetic over Mina's base field (Fp) and scalar field (Fq) +- ZK-friendly hash function used throughout the protocol +- Compatible with Kimchi proof system requirements + +### Message Serialization (`mina-p2p-messages/`) + +Comprehensive message format definitions for network communication, generated +from OCaml binprot shapes: + +**Code Generation:** + +- Types auto-generated from OCaml `bin_prot` shapes stored in `shapes/` + directory +- Generated code in `src/v2/generated.rs` with OCaml source references for every + type +- Manual implementations in `src/v2/manual.rs` for complex types requiring + custom logic +- Configuration files (`default-v2.toml`) control the generation process + +**Protocol Support:** + +- 
**v2 protocol** - Current Mina protocol version with full type coverage +- **RPC messages** - Method definitions and request/response types +- **Gossip messages** - Network propagation message formats +- **Binary compatibility** - Full `bin_prot` serialization compatibility with + OCaml + +**Current Limitations:** + +- **Monomorphized types** - All generic types have been specialized, leading to + code duplication +- **Manual maintenance** - Complex types require hand-written implementations + that must be kept in sync +- **Code bloat** - Many similar wrapper types for different contained types + +**Future Improvements Needed:** + +- **Transition to manual maintenance** - Move away from code generation to + hand-written types +- **Polymorphize types to match OCaml** - Where OCaml uses polymorphic types + (generics), Rust should use matching generic definitions rather than + monomorphized variants +- **Maintain structural compatibility** - Ensure type definitions match the + original OCaml structure and polymorphism +- **Preserve protocol compatibility** - Ensure binary serialization + compatibility is maintained during manual refactoring + +### Development Tools + +#### Macros (`macros/`) + +Procedural macros for code generation: + +- Action and event system macros +- Serialization helpers for OCaml compatibility + +#### Testing Infrastructure (`node/testing/`) + +Comprehensive testing framework for multi-node scenarios. For details, see +[`testing-infrastructure.md`](testing-infrastructure.md). 
+ +## Additional Components + +### Frontend (`frontend/`) + +Angular-based web interface providing: + +- Node monitoring dashboard +- Network visualization +- Block and transaction exploration +- Real-time metrics and debugging tools + +### Tools (`tools/`) + +Various utilities for development and analysis: + +- **`ledger-tool/`** - Ledger inspection and manipulation +- **`hash-tool/`** - Hash verification utilities +- **`bootstrap-sandbox/`** - Network bootstrapping testing +- **`fuzzing/`** - Fuzz testing infrastructure +- **And many more specialized tools** + +### Producer Dashboard (`producer-dashboard/`) + +Block producer monitoring and metrics collection system. + +## Build and Configuration + +### Root Configuration + +- **`Cargo.toml`** - Workspace configuration defining all crates +- **Docker configurations** - Various deployment scenarios +- **Helm charts** - Kubernetes deployment configurations (probably stale) + +### Development Support + +- **`tests/`** - Integration test files and test data +- **`genesis_ledgers/`** - Genesis ledger data for devnet +- **Scripts and tooling** - Build, deployment, and analysis scripts + +## Dependencies and Build Order + +The project follows a clear dependency hierarchy: + +1. **Foundation**: `core`, `macros`, `poseidon`, `vrf` +2. **Protocols**: `mina-p2p-messages`, `ledger` +3. **Networking**: `p2p` +4. **Services**: `snark`, `node` +5. **Applications**: `cli`, frontend tools + +This organization enables: + +- **Incremental compilation** - Changes to high-level components don't rebuild + foundations +- **Clear boundaries** - Each component has well-defined responsibilities +- **Testability** - Components can be tested in isolation +- **Modularity** - Browser deployment through `node/web`, various tool builds + +The architecture supports multiple deployment targets: native nodes, browser +light clients, testing frameworks, and various specialized tools, all sharing +the same core protocol implementation. 
diff --git a/docs/handover/p2p-evolution.md b/docs/handover/p2p-evolution.md new file mode 100644 index 000000000..39838b67e --- /dev/null +++ b/docs/handover/p2p-evolution.md @@ -0,0 +1,268 @@ +# P2P Layer Evolution Plan + +This document outlines the evolution plan for Mina's P2P networking layer, +building on the successful pull-based design already implemented for OpenMina +webnodes. The idea of using QUIC as a transport was originally proposed by +George in his "Networking layer 2.0" document. + +**Status**: The pull-based P2P protocol is implemented and operational. This +document proposes enhancements including QUIC transport, block propagation +optimizations, and integration with the OCaml node to create a unified +networking layer across all Mina implementations. Coordination with OCaml Mina +team required for ecosystem-wide adoption. + +## Current State + +### The Problem: Divergent P2P Architectures + +The Mina ecosystem currently has divergent P2P implementations: + +1. **Mina (OCaml) nodes** + - Use libp2p exclusively via external Golang helper process (`libp2p_helper`) + - Push-based GossipSub protocol + - Known weaknesses in network performance and scalability + +2. 
**OpenMina (Rust) nodes** + - Support both libp2p (for OCaml compatibility) AND pull-based WebRTC + - Must internally normalize between push and pull models, adding complexity + - Webnodes use WebRTC exclusively and require Rust nodes as bridges to libp2p + network + - Maintenance burden of supporting two different protocol designs + +This creates significant complexity: + +- OpenMina maintains two protocol implementations +- Webnodes cannot directly communicate with OCaml nodes +- Different security and performance characteristics +- Inconsistent behavior and debugging challenges + +## Vision: Unified Pull-Based P2P Layer + +The goal is to evolve OpenMina's pull-based P2P design to improve webnode +networking immediately and potentially become the universal networking layer for +all Mina nodes (both Rust and OCaml), with multiple transport options. Full +ecosystem adoption would require coordination and agreement with the OCaml Mina +team. + +### Core Design Principles + +The pull-based model (detailed in [p2p/readme.md](../../p2p/readme.md)) +provides: + +- **Security**: Recipients grant permission to send, preventing flooding +- **Fairness**: Protocol-enforced resource allocation among peers +- **Simplicity**: No message queues or dropping strategies needed +- **Consistency**: Senders know what recipients have processed + +### Target Architecture + +The plan is for Mina OCaml nodes to replace `libp2p_helper` with OpenMina's Rust +P2P implementation. 
+ +| Node Type | Current P2P | Target P2P | Transport | +| ---------------- | ------------------- | ------------------- | ----------------------------------------------------- | +| OpenMina Webnode | Pull-based | Same protocol | WebRTC (browser-to-browser), QUIC (browser-to-server) | +| OpenMina Server | libp2p + Pull-based | Pull-based only | QUIC + WebRTC signaling | +| Mina OCaml Node | libp2p via Golang | Pull-based via Rust | QUIC + WebRTC signaling | + +_Note: Server nodes primarily use QUIC for direct communication, with WebRTC +signaling infrastructure maintained to help webnodes discover peers._ + +## Evolution Phases + +### Phase 1: Add QUIC Transport to OpenMina + +**Goal**: Extend OpenMina's existing pull-based protocol to support QUIC as an +additional transport. + +**Transport Quality Note**: Server-side WebRTC implementations are generally of +lower quality compared to QUIC implementations. This led to consideration of +implementing a minimalistic custom WebRTC library with only features needed for +the webnode protocol. However, if QUIC is adopted for webnode-to-server +communication, this custom implementation becomes unnecessary. Server nodes +would primarily use QUIC for direct communication, maintaining WebRTC signaling +infrastructure only to help webnodes discover and connect to peers. + +**Scope**: + +- Research and select Rust QUIC library (preference for minimal dependencies) +- Extend existing P2P channels abstraction to support QUIC transport +- Implement QUIC transport for server-to-server communication +- Maintain existing WebRTC support for browser compatibility + +**Benefits**: + +- Direct server connections without WebRTC signaling overhead +- Better performance (0-RTT, improved congestion control) +- Foundation for replacing libp2p + +**Research needed**: + +- Channel-to-stream mapping strategy +- Integration of QUIC flow control with pull-based model +- Library evaluation (quinn, s2n-quic, etc.) 
+ +### Phase 2: Create Rust P2P Library for OCaml + +**Goal**: Package OpenMina's P2P implementation as a library for OCaml +integration. + +**Prerequisites**: Before integration, the Rust libp2p implementation must be: + +- Thoroughly reviewed and cleaned up +- Stress tested for production readiness +- Extended with testing features from `libp2p_helper` if still required (e.g., + gating and other ITN testing features) +- Validated for feature parity with current `libp2p_helper` functionality + +**Scope**: + +- Create Rust P2P helper process (similar to `libp2p_helper`) or OCaml-Rust FFI + integration +- Design integration approach (helper process vs direct FFI) +- Create migration path from `libp2p_helper` to Rust P2P +- Implement configuration for transport selection + +Mina OCaml nodes would replace `libp2p_helper` with a Rust program that reuses +OpenMina's P2P implementation. The integration could be either as a helper +process (like current `libp2p_helper`) or through OCaml-Rust FFI. Initially, +this Rust P2P helper would support both libp2p (for backward compatibility) and +the pull-based protocol, allowing for a gradual transition. + +**Architecture**: + +``` +Mina OCaml Node + ↓ +Rust P2P Helper (replaces `libp2p_helper`) + ↓ +Pull-based Protocol + libp2p (initially) + ↓ +QUIC / WebRTC Signaling / libp2p transport +``` + +**Benefits**: + +- Eliminate Golang dependency in Mina +- Single P2P implementation across ecosystem +- Direct integration without external processes + +### Phase 3: Dual Protocol Support Period + +**Goal**: Support both libp2p and pull-based protocols while proving the new +system in production. 
+ +**Scope**: + +- Dual protocol support maintained (libp2p + pull-based) +- QUIC transport initially used only for webnode-to-server communication +- Extensive testing of server-to-server pull-based communication on private + networks or devnet +- Production validation before wider adoption + +**Testing Strategy**: + +- Private network deployment with full server-to-server pull-based communication +- Devnet testing under realistic load conditions +- Performance comparison between libp2p and pull-based protocols +- Stability and reliability validation over extended periods + +**Success Criteria**: + +- Proven performance and stability of pull-based server-to-server communication +- Successful integration with OCaml nodes via Rust P2P library +- Demonstrable benefits over current libp2p implementation + +### Phase 4: libp2p Deprecation + +**Goal**: Complete transition to unified pull-based P2P layer. + +**Important Note**: Full replacement of libp2p across the Mina ecosystem +requires coordination with the OCaml Mina team. This evolution plan represents +OpenMina's vision for improving P2P networking, starting with immediate benefits +for webnode-to-server communication and potentially becoming the new P2P +standard for Mina if adopted ecosystem-wide. + +**Scope**: + +- Coordinate network-wide migration timeline with OCaml Mina team +- Remove libp2p support from OpenMina (after ecosystem coordination) +- Remove libp2p protocol support from both implementations +- Simplify both codebases to single protocol + +**End state**: + +- All nodes use pull-based protocol +- Multiple transport options (WebRTC, QUIC) +- Unified implementation via Rust library + +## Technical Enhancements + +### Block Propagation Optimization + +Independent of transport changes, an important optimization is planned +([Issue #998](https://github.com/openmina/openmina/issues/998)): + +**Problem**: Mina blocks are large, causing slow propagation as nodes must +verify before forwarding. 
+ +**Solution**: + +1. Propagate only consensus-critical headers first +2. Fetch full block body after consensus validation +3. Download only missing transactions/snarks not in local pools + +**Advanced Optimization**: Since nodes maintain local pools of transactions and +snarks, many items in a new block may already be present locally. The protocol +could be enhanced to: + +- Reference pool items by hash/ID rather than including full data +- Download only missing transactions and snarks not in local pools +- Leverage the existing efficient pool propagation mechanisms + +**Impact**: Significant bandwidth reduction and faster block propagation. + +## Implementation Considerations + +### QUIC Library Selection + +- Minimal dependencies (preferably avoiding async runtimes like tokio) +- Active maintenance and security updates + +### Rust P2P Library Preparation + +- Review and cleanup of existing libp2p implementation +- Stress testing for production readiness +- Extend with testing features from `libp2p_helper` if still required (e.g., + gating, ITN features) +- Validate feature parity with current `libp2p_helper` + +### OCaml Integration + +- Main challenge: architectural shift from push-based gossip to pull-based + protocol +- Choose between helper process (like `libp2p_helper`) vs direct OCaml-Rust FFI +- Memory management across language boundaries (if FFI approach chosen) +- Error handling and recovery +- Shared implementation benefits: same Rust P2P code used by both Rust and OCaml + nodes + +### Testing Strategy + +- Private network testing of server-to-server communication +- Devnet testing under realistic load +- Performance comparison between libp2p and pull-based protocols + +### Network Transition + +- Potential hardfork or softfork required for operator transition +- Gradual approach: nodes support both protocols, default switches to pull-based +- Eventually deprecate and remove libp2p support +- Network governance coordination needed for transition 
timeline + +## References + +- [OpenMina WebRTC P2P Implementation](../../p2p/readme.md) +- George's Networking layer 2.0 Proposal +- [Block Propagation Optimization (Issue #998)](https://github.com/openmina/openmina/issues/998) +- [Current libp2p Architecture](../../p2p/libp2p.md) diff --git a/docs/handover/persistence.md b/docs/handover/persistence.md new file mode 100644 index 000000000..de5562687 --- /dev/null +++ b/docs/handover/persistence.md @@ -0,0 +1,127 @@ +# Persistence Design (Not Yet Implemented) + +This document outlines the proposed design for persisting the Mina ledger and +other critical state to disk, reducing memory usage and enabling faster node +restarts. + +**Status**: Not yet implemented - this is a design proposal only. + +**Critical for Mainnet**: This is one of the most important changes required to +make the webnode mainnet-ready. + +## Overview + +Currently, OpenMina keeps the entire ledger in memory, which creates scalability +issues for mainnet deployment where the ledger can be large. A persistent +storage solution is needed to: + +- Reduce memory usage for both server-side nodes and webnodes +- Enable faster node restarts by avoiding full ledger reconstruction +- Deduplicate SNARK verification work across blocks and pools +- Support partial ledger storage for light clients + +## Design Reference + +A draft design for the persistence database is outlined in +[Issue #522](https://github.com/openmina/openmina/issues/522), which proposes an +approach for efficiently storing, updating, and retrieving accounts and hashes. + +**Note**: There is a very old implementation for on-disk storage in +`ledger/src/ondisk` that was never used - a lightweight key-value store +implemented to avoid the RocksDB dependency. This is unrelated to the new +persistence design which intends to solve persistence for everything, not just +the ledger. But the old implementation may be worth revisiting anyway. 
+ +**Database Design Resources**: For those implementing persistence, +[Database Internals](https://www.databass.dev/) and +[Designing Data-Intensive Applications](https://dataintensive.net/) are +excellent books on database design and implementation. However, for Mina's +storage needs, nothing terribly advanced is required. + +## Key Design Principles (from Issue #522) + +1. **Simplicity First**: The design prioritizes simplicity over optimal + performance +2. **Fixed-Size Storage**: Most data (except zkApp accounts) uses fixed-size + slots for predictable access patterns +3. **Sequential Account Creation**: Mina creates accounts sequentially, filling + leaves from left to right in the Merkle tree, enabling an append-only design +4. **Selective Persistence**: Only epoch ledgers and the root ledger need + persistence; masks can remain in-memory +5. **Infrequent Updates**: Root ledger updates occur only when the transition + frontier root moves (at most once per slot during high traffic) +6. **Hashes in Memory**: All Merkle tree hashes remain in RAM for quick access +7. **Recoverable**: Data corruption is not catastrophic as ledgers can be + reconstructed from the network, but corruption must be easily detectable + (e.g., through checksums or hash verification) + +## Problems to be Solved + +### 1. Memory Usage Reduction + +- **Current**: Entire ledger in memory +- **Proposed Solution**: Only active masks and hashes in memory +- **Expected Impact**: Would dramatically reduce memory footprint for + mainnet-scale ledgers + +### 2. 
Faster Node Restarts + +- **Current**: Must reconstruct ledger from genesis or snapshot +- **Proposed Solution**: Load persisted ledger directly from disk +- **Expected Impact**: Could reduce restart times from minutes to seconds +- **Critical for Webnodes**: Network sync is particularly expensive for webnodes + due to limited bandwidth and connection quality typical in browser + environments, making fast restarts essential for usability + +### 3. Webnode Scalability + +- **Current**: Limited by browser memory constraints - cannot handle + mainnet-scale ledgers +- **Proposed Solution**: Store ledger through browser storage APIs + (IndexedDB/OPFS) +- **Expected Impact**: Would enable true browser-based full nodes (hard + requirement for mainnet support) + +### 4. SNARK Verification Deduplication + +- **Current**: OpenMina re-verifies all SNARKs every time they appear, even if + previously seen + - When a SNARK arrives in the snark pool, it's verified + - When the same SNARK appears in a block, it's verified again + - When the same SNARK appears in another block, it's verified yet again +- **Proposed Solution**: Store verified SNARKs in the persistence database + - When a SNARK arrives, check if it exists in the database + - If found in database, skip verification (already verified) + - If not found, verify and store the result + - Simple database lookup replaces expensive re-verification +- **Expected Impact**: Would significantly reduce redundant verification work, + especially during high network activity +- **Reference Implementation**: We implemented this optimization in the OCaml + node ([PR #12522](https://github.com/MinaProtocol/mina/pull/12522)) and + demonstrated dramatic performance improvements: block application time for + blocks with many completed works was reduced from ~8-14 seconds to ~0.015 + seconds by avoiding re-verification of SNARKs already present in the SNARK + pool. 
This was not implemented in OpenMina yet as it was planned to be done as + part of the persistence implementation. + +### 5. Reduced Network Traffic and Improved Pool Consistency + +- **Current**: Nodes frequently need to sync ledgers and pools from peers, + creating network overhead +- **Proposed Solution**: Persist ledgers, SNARK pools, and transaction pools to + disk + - Nodes maintain state across restarts without full resync + - Combined with webnode's pull-based P2P layer, enables better pool + convergence + - Less frequent ledger synchronization reduces network bandwidth usage + - Especially beneficial for webnodes that may be restarted more often than + server nodes +- **Expected Impact**: Would reduce overall network traffic and help nodes + maintain consistent views of transaction and SNARK pools + +## Open Questions + +1. Exact zkApp slot size (depends on 8 vs 32 field implementation and + verification key maximum size) +2. Optimal prefetching strategies for block producers? +3. Integration with existing mask hierarchy? diff --git a/docs/handover/release-process.md b/docs/handover/release-process.md new file mode 100644 index 000000000..321f3d92e --- /dev/null +++ b/docs/handover/release-process.md @@ -0,0 +1,293 @@ +# OpenMina Release Process + +This document outlines the release process for OpenMina, including version +management, tagging, and automated Docker image builds. + +## Overview + +The OpenMina release process involves: + +1. Creating a release preparation branch from `develop` +2. Updating version numbers across all Cargo.toml files +3. Updating the changelog with release notes and comparison links +4. Updating Docker Compose files with new version tags +5. Creating a PR to merge release changes to `develop` +6. Creating a PR to merge `develop` to `main` +7. Creating a git tag from `main` with the proper metadata +8. 
Automated CI/CD workflows that build and publish Docker images + +## Branch Strategy + +- **develop**: All changes between releases go to the `develop` branch +- **main**: Stable release branch, updated only during releases +- **prepare-release/vX.X.X**: Temporary branch for preparing release changes +- Public releases are always tagged from the `main` branch after merging from + `develop` +- Internal/non-public patch releases can be tagged directly from `develop` + +## Release Cadence + +During active development, OpenMina follows a monthly release schedule. At the +end of each month, all changes that have been merged to `develop` are packaged +into a new release. This regular cadence ensures: + +- Predictable release cycles for users +- Regular integration of new features and fixes +- Manageable changelog sizes +- Consistent testing and deployment rhythm + +## Prerequisites + +- All desired changes merged to `develop` branch +- All tests passing on `develop` +- Access to create and push git tags +- Permission to merge to `main` branch + +## Release Steps + +### 1. Create Release Preparation Branch + +Create a new branch from `develop` for preparing the release: + +```bash +git checkout develop +git pull origin develop +git checkout -b prepare-release/vX.Y.Z +``` + +### 2. Update Version Numbers + +Use the `versions.sh` script to update all Cargo.toml files with the new +version: + +```bash +./versions.sh X.Y.Z +``` + +This script will: + +- Find all Cargo.toml files in the project +- Update the version field in each file (except `mina-p2p-messages/Cargo.toml` + which is handled manually) +- Display the version changes for each file + +### 3. Update Changelog + +Update the CHANGELOG.md file following the +[Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format: + +1. **Move unreleased changes to new version section**: + - Change `## [Unreleased]` to `## [X.Y.Z] - YYYY-MM-DD` (use current date) + - Add a new empty `## [Unreleased]` section at the top + +2. 
**Organize changes by category**: + - `### Added` - for new features + - `### Changed` - for changes in existing functionality + - `### Deprecated` - for soon-to-be removed features + - `### Removed` - for now removed features + - `### Fixed` - for bug fixes + - `### Security` - for vulnerability fixes + +3. **Update comparison links at the bottom**: + - Add a new comparison link for the release: + ```markdown + [X.Y.Z]: https://github.com/openmina/openmina/compare/vA.B.C...vX.Y.Z + ``` + - Update the `[Unreleased]` link to compare from the new version to develop: + ```markdown + [Unreleased]: https://github.com/openmina/openmina/compare/vX.Y.Z...develop + ``` + + The release link compares the previous version tag with the new version tag. + +### 4. Update Docker Compose Files + +Update the image versions in all docker-compose files. For example, in +`docker-compose.local.producers.yml`: + +```yaml +image: openmina/openmina:X.Y.Z +``` + +### 5. Commit Version Changes + +Commit all the version updates, changelog, and docker-compose changes: + +```bash +git add CHANGELOG.md +git add Cargo.toml */Cargo.toml */*/Cargo.toml # Add all Cargo.toml files +git add Cargo.lock +git add docker-compose*.yml +git commit -m "chore: Prepare release vX.Y.Z" +``` + +**Note**: Avoid using `git add .` to prevent accidentally committing unrelated +files. + +### 6. Create PR to Develop + +Push the release preparation branch and create a PR to merge it into `develop`: + +```bash +git push origin prepare-release/vX.Y.Z +``` + +Then create a PR from `prepare-release/vX.Y.Z` to `develop` on GitHub. Once +approved and merged, continue with the next steps. + +### 7. Create PR to Main + +After the release preparation has been merged to `develop`, create a PR to merge +`develop` into `main`: + +1. Create the PR from `develop` to `main` on GitHub +2. Title it something like "Release vX.Y.Z" +3. Once approved and merged, continue to tagging + +### 8. 
Create Release Tag + +After `develop` has been merged into `main`, create the release tag from `main`: + +```bash +git checkout main +git pull origin main +env GIT_COMMITTER_DATE=$(git log -n1 --pretty=%aD) git tag -a -f -m "Release X.Y.Z" vX.Y.Z +``` + +**Important**: The tag must follow the format `v[0-9]+.[0-9]+.[0-9]+` to trigger +the CI workflows. + +### 9. Push Tag + +Push the tag to trigger the release workflows: + +```bash +git push origin vX.Y.Z +``` + +## Automated Release Process + +Once the tag is pushed, the following automated processes occur: + +### GitHub Release Creation (.github/workflows/release.yaml) + +This workflow: + +1. Triggers on version tags matching `v[0-9]+.[0-9]+.[0-9]+` +2. Creates a versioned directory with Docker Compose files +3. Packages the files into both .zip and .tar.gz archives +4. Creates a **draft** GitHub release +5. Uploads the archives as release assets + +**Note**: The release is created as a draft and must be manually published +through GitHub's UI. + +### Docker Image Building (.github/workflows/docker.yaml) + +This workflow: + +1. Builds multi-architecture Docker images (linux/amd64 and linux/arm64) +2. Creates images for: + - OpenMina node (`openmina/openmina`) + - Frontend (`openmina/frontend`) +3. Tags images with: + - Branch name (for branches) + - Short SHA + - Semantic version (for tags) + - `latest` (for main branch) + - `staging` (for develop branch) + +## Post-Release Steps + +### 1. Review Draft Release + +1. Go to the GitHub releases page +2. Find the draft release created by the workflow +3. Review the release notes and assets +4. Ensure the release assets include: + - `openmina-vX.Y.Z-docker-compose.zip` + - `openmina-vX.Y.Z-docker-compose.tar.gz` +5. Add any additional release notes or documentation if needed + +### 2. Publish Release + +Click "Publish release" in GitHub's UI to make the release publicly available. + +### 3. 
Verify Docker Images + +Verify the Docker images are available on Docker Hub: + +```bash +docker pull openmina/openmina:vX.Y.Z +docker pull openmina/frontend:vX.Y.Z +``` + +Check that both `amd64` and `arm64` architectures are available: + +```bash +docker manifest inspect openmina/openmina:vX.Y.Z +``` + +## Version Tagging Best Practices + +- Always use semantic versioning: `vMAJOR.MINOR.PATCH` +- Use annotated tags (`-a` flag) for releases +- Include a meaningful message with `-m` +- Preserve commit date with `GIT_COMMITTER_DATE` for traceability + +## Troubleshooting + +### CI Workflow Not Triggered + +Ensure: + +- The tag format matches exactly: `v[0-9]+.[0-9]+.[0-9]+` +- The tag was pushed to the remote repository +- Check GitHub Actions for any workflow errors + +### Docker Build Failures + +- Check the GitHub Actions logs for specific error messages +- Ensure all tests pass before creating a release +- Verify Dockerfile syntax and dependencies + +## Example Release Commit Structure + +A typical release PR (like #1134) includes these commits: + +1. **chore: Update CHANGELOG** - Updates the changelog with release notes and + comparison link +2. **chore: Bump version to X.X.X** - Result of running `versions.sh` +3. **chore: Update Cargo.lock** - Updated dependencies lock file +4. **chore: Update version in docker compose files** - Docker image version + updates + +## Internal/Patch Releases + +For internal or non-public patch releases (e.g., vX.Y.Z+1), you can tag directly +from `develop`: + +1. Follow steps 1-5 above (create release branch, update versions, commit + changes, PR to develop) +2. After merging to `develop`, tag directly from `develop`: + ```bash + git checkout develop + git pull origin develop + env GIT_COMMITTER_DATE=$(git log -n1 --pretty=%aD) git tag -a -f -m "Release X.Y.Z+1" vX.Y.Z+1 + git push origin vX.Y.Z+1 + ``` +3. 
The same CI/CD workflows will trigger and create draft releases + +This approach is useful for: + +- Quick fixes that need immediate deployment +- Internal testing releases +- Patch releases that don't warrant a full main branch update + +## Reference + +For examples of previous releases, see: + +- PR #1134 (Release v0.16.0) and similar release PRs +- The git tag history: `git tag -l` +- The CHANGELOG.md file for release note formats diff --git a/docs/handover/services-technical-debt.md b/docs/handover/services-technical-debt.md new file mode 100644 index 000000000..4baac421f --- /dev/null +++ b/docs/handover/services-technical-debt.md @@ -0,0 +1,322 @@ +# OpenMina Services Technical Debt Analysis + +This document covers technical debt across OpenMina services. + +## Executive Summary + +The services layer has accumulated technical debt from rapid development with +deferred decisions and incomplete error handling. Key issues include: + +- Use of `todo!()` in production code paths (EventSourceService, + BlockProducerVrfEvaluatorService) +- Intentional panics for block proof failures that need error sink service + integration +- Inconsistent error propagation between services and state machines +- Synchronous operations that should be async (LedgerService) +- Raw protocol implementation in Archive Service with hardcoded byte sequences +- Resource management gaps (unbounded buffers in P2P services, missing timeouts) +- WebRTC 1-second delay workaround for message loss + +## Service-by-Service Analysis + +### 1. 
EventSourceService + +**Location**: `node/src/event_source/` + +#### Critical Issues + +- **Missing Error Actions** (event_source_effects.rs - `P2pChannelEvent::Opened` + handler): P2P channel opening failures are logged but not dispatched as error + actions +- **Unimplemented Error Paths** (event_source_effects.rs - + `BlockProducerEvent::BlockProve` and `Event::GenesisLoad` handlers): Using + `todo!()` for block proof and genesis load failures + +#### Moderate Issues + +- **Genesis Loading Order** (event_source_effects.rs - + `TransitionFrontierGenesisAction::ProveSuccess` dispatch): Documented need to + refactor genesis inject dispatch order +- **Incomplete Error Strategy**: Errors are logged but not consistently + propagated through the action system + +### 2. LedgerService & LedgerManager + +**Location**: `node/src/ledger/` + +#### Critical Issues + +- **Blocking Operations** (ledger_manager.rs - `get_accounts()` method): + Synchronous account retrieval in async context - "TODO: this should be + asynchronous" + +#### Moderate Issues + +- **Error Handling TODOs** (ledger_manager.rs - `LedgerService::run()` staged + ledger reconstruction): Staged ledger reconstruction failures not properly + handled +- **Network Constants** (ledger_manager.rs - `LedgerRequest` enum definition): + FIXME for hardcoded network-specific values +- **Silent Failures**: Multiple locations where errors are logged but not + propagated + +#### Code Quality + +- Dead code with TODO comments (ledger_manager.rs - `LedgerRequest` enum) +- Tuple returns that should be proper structs (ledger_manager.rs - various + handler methods) + +### 3. 
P2P Services

**Location**: `p2p/src/service_impl/`

#### Moderate Issues

- **WebRTC Message Loss** (webrtc/mod.rs - `peer_start()` connection auth and
  `peer_loop()` channel handler): 1-second sleep workaround after channel
  opening
  - Root cause: Messages sent immediately after channel open are lost
  - Impact: Adds unnecessary latency to all connections
  - Proper fix needed: Ensure channel is fully established before sending
  - Maybe this was only an issue with the webrtc-rs (Rust) library, and not the
    C++ "datachannel" library used now (or the browser implementation). Worth
    revisiting.
- **Fake Network Detection** (webrtc/mod.rs - network interface detection):
  "TODO: detect interfaces properly"
- **Missing Bounds Checks** (webrtc/mod.rs - `peer_loop()` function): Buffer
  resizing without upper bounds
- **Unwrap Operations** (webrtc/mod.rs - `peer_loop()` function): Multiple
  unwrap calls that could panic

#### Architectural Debt

- **Stream Cleanup** (webrtc/mod.rs - `RTCChannelConfig` struct): TODO for
  cleaning up after libp2p channels/streams
- **Connection Types**: Temporary vs normal connection distinction poorly
  implemented

### 4. SNARK Verification Services

**Location**: `snark/src/`

#### Moderate Issues

- **Error Propagation** (throughout): Multiple "TODO: log or propagate" comments
- **Missing Callbacks** (snark_user_command_verify_reducer.rs -
  `SnarkUserCommandVerifyAction::Error` handler): Error callback dispatch not
  implemented
- **Debug Output** (block_verify module): TODO to display hashes instead of full
  state

#### Code Organization

- **Crate Dependencies** (snark_work_verify_state.rs -
  `SnarkWorkVerifyStatus::Init` struct): p2p identity needs to move to shared
  crate

### 5. 
Block Producer Services + +**Location**: `node/src/block_producer/` + +#### Critical Issues + +- **Intentional Panic on Block Proof Failure** (event_source_effects.rs - + `BlockProducerEvent::BlockProve` handler): When block proof generation fails, + the system intentionally panics with `todo!()` to make failures highly visible + for debugging + - **Current behavior**: Block proof failures cause deliberate node shutdown to + ensure failures are noticed + - **Planned improvement**: Should use error sink service (partially + implemented in PR #1097) to make failures easily visible without forcing + node exit + - **Service layer**: Properly handles failures by logging errors and dumping + comprehensive debug data with encrypted private keys +- **Unimplemented States** + (vrf_evaluator/block_producer_vrf_evaluator_reducer.rs - + `BlockProducerVrfEvaluatorState::reducer()` SelectInitialSlot handler): + `todo!()` for "Waiting" epoch context +- **Currency Overflow** (block_producer_reducer.rs - + `reduce_block_unproved_build()` method): `todo!()` for total_currency overflow + handling + +#### Moderate Issues + +- **Hardcoded Values** (vrf_evaluator module): slots_per_epoch hardcoded with + TODO +- **Fork Assumptions** (block_producer_reducer.rs - blockchain fork handling): + TODO assumes short range fork +- **Potential Panics** (block_producer_reducer.rs - state update logic): Fix + unwrap that could panic + +#### Code Quality + +- **Dead Code** (vrf_evaluator module): Multiple redundant functions marked for + removal +- **Test Infrastructure** (vrf_evaluator module - test sections): Genesis best + tip update tests need rework +- **Missing Tests** (block_producer_reducer.rs - test module): Test coverage + gaps + +### 6. 
Pool Management Services + +#### VerifyUserCommandsService (Transaction Pool) + +**Location**: `node/src/transaction_pool/` + +- **Dead Code** (transaction_pool_service.rs): Trait defined but never + implemented - transaction verification is actually handled by + SnarkUserCommandVerifyService + +### 7. Archive Service + +**Location**: `node/common/src/service/archive/` + +**Purpose**: Forwards block application results to external archive process +(reuses same archive process as OCaml node) via Jane Street's async-rpc protocol +when archive mode is enabled. Also supports filesystem storage, GCP, and other +backends. + +**Integration**: Called from `transition_frontier_sync_effects.rs` via +`BlocksSendToArchive` action after successful block application in +`ledger_write_reducer.rs`. Runs in separate thread to avoid blocking sync +process. + +#### Critical Issues + +- **Raw Protocol Implementation** (rpc.rs): Manual async-rpc protocol handling + with hardcoded byte sequences instead of proper protocol implementation + - Magic bytes without documentation: `[2, 253, 82, 80, 67, 0, 1]`, + `[2, 1, 0, 1, 0]` + - Complex manual state machine with boolean flags (`handshake_received`, + `handshake_sent`) + - Manual message parsing with potential buffer overflows and panics +- **Poor Connection Management** (rpc.rs): Creates new TCP connection for each + message instead of connection pooling +- **Resource Management** (rpc.rs): Unbounded memory growth in message + buffering, no cleanup guarantees + +#### Moderate Issues + +- **Missing State Machine Structure**: Should follow standard + actions/reducer/effects pattern (consider during transition frontier + refactoring) +- **Service Architecture**: Mixed blocking/async patterns, dedicated thread + instead of proper async service +- **Error Handling**: Simplistic retry logic without exponential backoff or + circuit breaker patterns +- **Configuration**: Hard-coded values (retry counts, timeouts) and environment + variable 
dependencies + +#### Code Quality + +- **Data Conversion**: Inefficient cloning in `BlockApplyResult` to + `ArchiveTransitionFrontierDiff` conversion +- **Serialization**: Poor error handling in binprot serialization with typos in + error messages +- **Service Lifecycle**: No graceful shutdown or health monitoring mechanisms + +**Priority**: Low - works in practice but the implementation (especially the RPC +part) needs thorough review and cleanup + +### 8. External SNARK Worker Service + +**Location**: `node/common/src/service/snark_worker.rs` + +- **Error Handling** (snark_worker.rs - `ExternalSnarkWorkerFacade::start()` + method): Terminal errors sent through channel instead of proper exit + +## Cross-Cutting Concerns (Services Layer) + +### Service Error Handling + +- Services return results but errors often not propagated back through events +- Missing error event types for several service operations (e.g., + EventSourceService TODO for error dispatch) +- Inconsistent error handling across service trait implementations +- Block producer failures intentionally panic for visibility - need error sink + service integration + +### LedgerService Blocking Operations + +- LedgerService `get_accounts()` method performs synchronous retrieval that + should be async +- `get_mask()` method exists only to support tests and should not be used in + production +- Synchronous operations are deprecated and violate the async architecture + principles +- Documented as "TODO: this should be asynchronous" + +### WebRTC Service Issues + +- 1-second sleep workaround in P2P WebRTC implementation for message loss +- Affects all P2P connections with unnecessary latency +- Root cause: Messages sent immediately after channel open are lost +- Requires proper fix to ensure channel readiness before sending + +### Resource Management in Services + +- P2P services: Missing upper bounds on buffer sizes (MIO service buffer + resizing) +- Missing timeout mechanisms for long-running service 
operations (e.g., ledger + operations, SNARK verification) + +### Service Implementation Patterns + +- Inconsistent service trait implementation locations (native vs web vs p2p) +- No unified logging or monitoring approach across services +- Service initialization scattered across NodeServiceCommonBuilder without clear + pattern + +## Prioritized Recommendations + +### Immediate (P0) + +1. **Make LedgerService async** - Convert synchronous `get_accounts()` to async + operation +2. **Complete EventSourceService error handling** - Replace `todo!()` in error + paths with proper error event dispatch +3. **Implement error sink service** - Complete PR #1097 to handle block proof + failures gracefully without node shutdown +4. **Service error propagation** - Add missing error event types and ensure all + service errors reach state machine + +### Short Term (P1) + +1. **P2P buffer bounds** - Add upper limits to MIO service buffer resizing +2. **Fix unwrap operations** - Replace panicking unwraps in P2P WebRTC service +3. **Clean up VRF evaluator** - Remove redundant functions marked for deletion +4. **Remove dead code** - Delete unused VerifyUserCommandsService trait + +### Medium Term (P2) + +1. **Service lifecycle framework** - Create unified initialization/shutdown + patterns for all services +2. **Resource cleanup** - Add timeout mechanisms for long-running service + operations +3. **Service monitoring** - Add health checks and metrics for service + availability +4. **WebRTC investigation** - Determine if message loss issue with sleep + workaround persists with C++ "datachannel" implementation + +### Long Term (P3) + +1. **Archive Service cleanup** - Replace raw async-rpc protocol implementation + with proper library +2. **Service trait consolidation** - Unify service implementation patterns + across native/web/p2p +3. **Comprehensive logging** - Implement consistent logging strategy across all + services +4. 
**Performance profiling** - Identify and optimize service bottlenecks +5. **Documentation** - Document service patterns and best practices + +## Conclusion + +The services layer is operational and stable but has accumulated technical debt +from rapid development. Key issues include synchronous operations that should be +async (LedgerService), incomplete error propagation between services and state +machines, and various TODOs marking deferred implementation decisions. The +intentional panic on block proof failures serves its purpose of making failures +highly visible but should be replaced with the error sink service for better +operational stability. While some issues like the WebRTC workaround have proven +stable in practice, addressing the high-priority items will improve system +reliability and maintainability. diff --git a/docs/handover/services.md b/docs/handover/services.md new file mode 100644 index 000000000..c9b54510c --- /dev/null +++ b/docs/handover/services.md @@ -0,0 +1,296 @@ +# OpenMina Services + +This document provides a roadmap to services in the OpenMina system. Services +handle external I/O, heavy computations, and asynchronous operations, keeping +the state machine deterministic and pure. + +## Architecture Overview + +Services isolate non-deterministic operations from the Redux state machine: + +- **State machines** handle business logic and dispatch effectful actions +- **Services** handle "outside world" interactions (I/O, crypto, networking) +- **Events** carry results back from services to state machines +- **Threading** enables CPU-intensive work without blocking state machine + +For architectural details, see +[`architecture-walkthrough.md`](architecture-walkthrough.md). 
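The loop described above can be sketched with plain channels. This is an
illustrative stand-in, not OpenMina's actual service traits: a hypothetical
`HashService` does heavy work off the state-machine thread and reports back
through a single event channel.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical event carried back from a service to the state machine.
#[derive(Debug, PartialEq)]
enum Event {
    HashComputed { request_id: u64, hash: u64 },
}

// Hypothetical service trait: the state machine only sees this interface;
// the implementation owns the I/O and the threading.
trait HashService {
    fn compute_hash(&self, request_id: u64, data: Vec<u8>);
}

struct NativeHashService {
    event_tx: mpsc::Sender<Event>,
}

impl HashService for NativeHashService {
    fn compute_hash(&self, request_id: u64, data: Vec<u8>) {
        let tx = self.event_tx.clone();
        // Heavy work happens off the state-machine thread...
        thread::spawn(move || {
            let hash = data
                .iter()
                .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u64));
            // ...and the result flows back as an event.
            let _ = tx.send(Event::HashComputed { request_id, hash });
        });
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let service = NativeHashService { event_tx: tx };
    // An effectful action handler would call into the service like this:
    service.compute_hash(1, b"block".to_vec());
    // The main loop stays synchronous and consumes events as they arrive.
    match rx.recv().unwrap() {
        Event::HashComputed { request_id, .. } => assert_eq!(request_id, 1),
    }
    println!("ok");
}
```

The key property is that the state machine never blocks on the service: it
dispatches, returns, and later handles the event like any other input.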
+ +## Service Organization + +Services follow a consistent pattern: + +- **Trait** - Interface defined in `*_effectful_service.rs` files +- **Implementation** - Platform-specific implementations in: + - `node/native/src/service/` - Native implementations + - `node/web/src/` - WASM implementations + - `p2p/src/service_impl/` - P2P implementations + +## Core System Services + +### EventSourceService + +**Trait**: `node/src/event_source/event_source_service.rs` +**Implementation**: `node/native/src/service/mod.rs` (part of `NodeService`) + +Central event aggregation service that collects events from all services, +batches them, and routes them to the state machine. Provides the bridge between +async service world and synchronous state machine. + +**Usage Pattern:** Services send events → EventSource batches them → Main loop +processes via `EventSourceAction::ProcessEvents` + +### TimeService + +**Location**: `node/common/src/service/service.rs` (part of `NodeService`) + +Provides time abstraction for the entire system, enabling deterministic replay. + +**Key Concepts:** + +- Abstracts system time for deterministic execution +- Normal mode returns actual system time, replay mode returns recorded + timestamps +- Critical for slot calculations, VRF evaluation, and block production timing +- All actions receive timestamps from TimeService + +**Why it matters:** Makes non-deterministic time access deterministic, enabling +perfect reproduction of execution sequences and debugging of time-sensitive +consensus issues. + +### LedgerService + +**Trait**: `node/src/ledger/ledger_service.rs` +**Implementation**: `node/src/ledger/ledger_manager.rs` (dedicated thread) + +Provides interface to the LedgerManager for all ledger operations. 
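A rough sketch of the dedicated-thread model, using hypothetical request types
(the real `LedgerManager` API differs, and in OpenMina results return to the
state machine as events rather than through a per-request reply channel):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request shape; illustrative only.
enum LedgerRequest {
    GetBalance { account: u32, reply: mpsc::Sender<Option<u64>> },
}

fn spawn_ledger_manager() -> mpsc::Sender<LedgerRequest> {
    let (tx, rx) = mpsc::channel::<LedgerRequest>();
    thread::spawn(move || {
        // Stand-in ledger state, owned exclusively by this thread, so no
        // locking is needed: all access is serialized through the channel.
        let balances = vec![1000u64, 250, 42];
        for req in rx {
            match req {
                LedgerRequest::GetBalance { account, reply } => {
                    let _ = reply.send(balances.get(account as usize).copied());
                }
            }
        }
    });
    tx
}

fn main() {
    let ledger = spawn_ledger_manager();
    let (reply_tx, reply_rx) = mpsc::channel();
    ledger
        .send(LedgerRequest::GetBalance { account: 1, reply: reply_tx })
        .unwrap();
    assert_eq!(reply_rx.recv().unwrap(), Some(250));
    println!("ok");
}
```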
+ +**Threading Model:** + +- Dedicated "ledger-manager" thread for all operations +- Worker threads spawned for heavy computations +- Async communication via event-based responses + +**Key Operations:** + +- **Read**: Account queries, merkle tree lookups, scan state info +- **Write**: Block application, staged ledger operations, commits +- **Storage**: Manages snarked ledgers, staged ledgers, and sync state + +**Note**: Contains deprecated synchronous methods (`get_accounts()`, +`get_mask()`) that should not be used in new code. + +## P2P Networking Services + +### P2pService + +**Trait**: `p2p/src/p2p_service.rs` +**Implementation**: `p2p/src/service_impl/` (multiple backends) + +Composite service managing all peer-to-peer networking operations. + +**Core Sub-services:** + +- **P2pConnectionService** - WebRTC connection establishment and authentication +- **P2pDisconnectionService** - Peer disconnection handling +- **P2pChannelsService** - Channel communication and message + encryption/decryption + +**Extended Sub-services (with libp2p):** + +- **P2pMioService** - Low-level network I/O and socket management +- **P2pCryptoService** - Cryptographic operations +- **P2pNetworkService** - Network utilities (DNS, IP detection) + +**Architecture Notes:** + +- Each peer runs in dedicated async task +- WebRTC: SDP exchange → authentication → data channels +- Trait-based composition enables different backends +- Services handle only I/O, never business logic + +## SNARK Verification Services + +All SNARK verification services delegate to the `ledger` crate for actual +cryptographic operations. + +### SnarkBlockVerifyService + +**Trait**: `snark/src/block_verify_effectful/snark_block_verify_service.rs` +**Implementation**: `node/common/src/service/snarks.rs` + +Verifies block proofs using dedicated "block_proof_verifier" thread. Called from +transition frontier when blocks need proof verification. 
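Verification services that process batches (the user-command and work
verifiers use a Rayon pool) follow a simple fan-out/join pattern. A std-only
sketch with a stand-in `verify` predicate, assuming nothing about the real
proof-checking code:

```rust
use std::thread;

// Hypothetical stand-in for an expensive proof/signature check.
fn verify(item: &u64) -> bool {
    *item % 2 == 0
}

// Verify a batch in parallel with scoped threads; a Rayon pool achieves
// the same fan-out without spawning a thread per chunk.
fn verify_batch(items: &[u64]) -> bool {
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(2)
            .map(|chunk| s.spawn(move || chunk.iter().all(verify)))
            .collect();
        // The whole batch succeeds only if every chunk verifies.
        handles.into_iter().all(|h| h.join().unwrap())
    })
}

fn main() {
    assert!(verify_batch(&[2, 4, 6, 8]));
    assert!(!verify_batch(&[2, 3, 6]));
    println!("ok");
}
```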
+ +### SnarkUserCommandVerifyService + +**Trait**: +`snark/src/user_command_verify_effectful/snark_user_command_verify_service.rs` +**Implementation**: `node/common/src/service/snarks.rs` + +Verifies user command signatures using Rayon thread pool for parallel +processing. Called from transaction pool for signature validation. + +### SnarkWorkVerifyService + +**Trait**: `snark/src/work_verify_effectful/snark_work_verify_service.rs` +**Implementation**: `node/common/src/service/snarks.rs` + +Verifies SNARK work submissions (transaction and zkApp proofs) using Rayon +thread pool. Called from SNARK pool when evaluating work submissions. + +### ExternalSnarkWorkerService + +**Trait**: +`node/src/external_snark_worker_effectful/external_snark_worker_service.rs` +**Implementation**: `node/common/src/service/snark_worker.rs` + +Manages external SNARK worker process for scan state SNARK work production. + +**Key Operations:** + +- Start/stop worker with fee configuration +- Submit work specifications for proof generation +- Generates transaction and zkApp proofs via dedicated "snark_worker" thread + +**Usage:** Called exclusively from SNARK pool for scan state work production. + +## Block Production Services + +### BlockProducerService + +**Trait**: +`node/src/block_producer_effectful/block_producer_effectful_service.rs` +**Implementation**: `node/common/src/service/block_producer/mod.rs` + +Provides block proof generation using dedicated "openmina_block_prover" thread. 
+ +**Key Operations:** + +- Returns cached block prover instances +- Generates block proofs with blockchain state input +- Provides secure access to producer's secret key +- Failed proofs dump encrypted debug data to disk + +### BlockProducerVrfEvaluatorService + +**Trait**: +`node/src/block_producer_effectful/vrf_evaluator_effectful/block_producer_vrf_evaluator_effectful_service.rs` +**Implementation**: +`node/common/src/service/block_producer/vrf_evaluator.rs` + +Evaluates VRF for slot leadership determination using dedicated +"openmina_vrf_evaluator" thread. Receives epoch seed, delegator table, and slot +info to determine if node won the slot based on stake distribution. + +## Pool Management Services + +### SnarkPoolService + +**Trait**: `node/src/snark_pool/snark_pool_service.rs` +**Implementation**: `node/common/src/service/service.rs` (part of `NodeService`) + +Provides randomization for SNARK job selection when using random snarker +strategy. Isolates non-deterministic random selection from the deterministic +state machine. + +### VerifyUserCommandsService (Dead Code) + +**Location**: `node/src/transaction_pool/transaction_pool_service.rs` + +Unused trait with no implementations. Transaction verification is actually +handled by `SnarkUserCommandVerifyService`. Should be removed. + +## State Synchronization Services + +### TransitionFrontierGenesisService + +**Trait**: +`node/src/transition_frontier/genesis_effectful/transition_frontier_genesis_service.rs` +**Implementation**: +`node/common/src/service/service.rs` (part of `NodeService`) + +Manages genesis configuration loading and ledger initialization. 
+ +### TransitionFrontierSyncLedgerSnarkedService + +**Trait**: +`node/src/transition_frontier/sync/ledger/snarked/transition_frontier_sync_ledger_snarked_service.rs` +**Implementation**: +Generic implementation delegating to `LedgerService` + +Handles snarked ledger operations during blockchain synchronization: + +- Merkle tree operations (child hash retrieval, hash computation) +- Ledger management (copying for sync, account population) +- All operations delegate to LedgerManager for thread-safe access + +## External Integration Services + +### RpcService + +**Trait**: `node/src/rpc_effectful/rpc_service.rs` +**Implementation**: `node/common/src/service/rpc/mod.rs` (part of `NodeService`) + +Manages RPC response delivery to external clients through channel-based +communication. Contains 30+ `respond_*` methods for different RPC operations. + +**Response Categories:** State, Status, P2P, SNARK Pool, Transactions, Ledger, +Consensus + +**Key Pattern:** Service only handles response delivery - all RPC logic is in +the state machine. Uses unique `RpcId` for request/response correlation. + +### ArchiveService + +**Trait**: `node/src/transition_frontier/archive/archive_service.rs` +**Implementation**: `node/common/src/service/archive/mod.rs` + +Provides asynchronous block persistence to external storage systems. + +**Storage Backends:** + +- AWS S3 - JSON storage with configurable bucket/region +- Google Cloud Platform - Cloud storage integration +- Local Filesystem - Direct file system storage +- Archive Process - RPC communication with external archiver + +Uses dedicated thread with Tokio runtime for async operations. Implements retry +logic (5 attempts) for failed uploads. Called exclusively during block +application. + +## Service Lifecycle + +Services are created via `NodeServiceCommonBuilder`, configured based on node +type, and connected via event channels. Dedicated threads are spawned for +CPU-intensive services. 
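This lifecycle — services built once, given dedicated threads when CPU-bound, and wired to the state machine through event channels — can be modeled in miniature with standard-library channels. All names below (`Service`, `Event`, `do_work`) are illustrative stand-ins, not the actual OpenMina types:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for OpenMina's request IDs used to correlate events.
type RpcId = u64;

#[derive(Debug, PartialEq)]
enum Event {
    WorkDone { id: RpcId, result: String },
}

// The "expensive" computation the service performs off the main thread.
fn do_work(input: &str) -> String {
    input.to_uppercase()
}

// The state machine side keeps only a request sender; results come back
// as events on a separate channel.
struct Service {
    requests: mpsc::Sender<(RpcId, String)>,
}

fn spawn_service(events: mpsc::Sender<Event>) -> Service {
    let (req_tx, req_rx) = mpsc::channel::<(RpcId, String)>();
    thread::spawn(move || {
        for (id, input) in req_rx {
            // Work happens on the dedicated thread...
            let result = do_work(&input);
            // ...and the result returns as an event tagged with the
            // request ID so the state machine can correlate it.
            let _ = events.send(Event::WorkDone { id, result });
        }
    });
    Service { requests: req_tx }
}

fn main() {
    let (event_tx, event_rx) = mpsc::channel();
    let service = spawn_service(event_tx);

    // State machine dispatches an effectful request...
    service.requests.send((1, "block".to_string())).unwrap();

    // ...and later processes the resulting event.
    let event = event_rx.recv().unwrap();
    assert_eq!(
        event,
        Event::WorkDone { id: 1, result: "BLOCK".to_string() }
    );
}
```

The request ID travelling with the result is what lets the deterministic state machine match events to pending requests.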
+ +**Runtime Flow:** State machine dispatches effectful actions → Effects call +service methods → Services perform async operations → Results sent back via +events + +## Service Implementation Patterns + +**Implementation Locations:** + +- **Node Services** (`node/native/src/service/`) - Most services, often + delegating to specialized crates +- **P2P Services** (`p2p/src/service_impl/`) - Multiple backends (libp2p, + WebRTC) +- **Web/WASM Services** (`node/web/src/`) - Browser-compatible implementations + +**Threading Patterns:** + +- Some services use dedicated threads (LedgerManager, VRF evaluator, SNARK + workers) +- Others use Rayon thread pool or run on main thread +- WASM uses web workers instead of threads + +**Communication Patterns:** + +- State Machine → Service: Via effectful actions +- Service → State Machine: Via events with request IDs +- Service → Service: Not allowed - all coordination through state machine + +This architecture provides clean separation between deterministic state +management and non-deterministic external operations. diff --git a/docs/handover/state-machine-debugging-guide.md b/docs/handover/state-machine-debugging-guide.md new file mode 100644 index 000000000..bb232e72e --- /dev/null +++ b/docs/handover/state-machine-debugging-guide.md @@ -0,0 +1,458 @@ +# State Machine Debugging Guide + +This guide provides comprehensive tools and techniques for troubleshooting and +investigating issues in OpenMina's state machine architecture. 
+ +## Prerequisites + +Before using this guide, understand: + +- [Architecture Walkthrough](architecture-walkthrough.md) - Core concepts and + patterns +- [State Machine Development Guide](state-machine-development-guide.md) - + Implementation basics +- [State Machine Structure](state-machine-structure.md) - System organization + +> **Related Guides**: [Testing Infrastructure](testing-infrastructure.md), +> [Services Technical Debt](services-technical-debt.md), +> [State Machine Technical Debt](state-machine-technical-debt.md) + +## Action Tracing and Logging + +### Understanding the ActionEvent Macro + +The `ActionEvent` derive macro generates structured logging for actions in +OpenMina. It automatically creates log events when actions are dispatched, +integrating with the tracing framework for efficient debugging. + +**Basic Usage (from actual codebase):** + +```rust +#[derive(Serialize, Deserialize, Debug, Clone, ActionEvent)] +pub enum BlockProducerAction { + VrfEvaluator(BlockProducerVrfEvaluatorAction), + BestTipUpdate { best_tip: ArcBlockWithHash }, +} +``` + +**Setting Default Log Levels:** + +```rust +#[derive(Serialize, Deserialize, Debug, Clone, ActionEvent)] +#[action_event(level = info)] // Default for all variants +pub enum BlockProducerAction { + WonSlotSearch, // Uses info level + + #[action_event(level = trace)] // Override for specific variant + BlockInject, +} +``` + +**Field Extraction (real examples from codebase):** + +```rust +// Simple field inclusion +#[action_event(level = info, fields(slot, current_time))] +WonSlot { + slot: u32, + current_time: Timestamp, +} + +// Complex field expressions (from VRF evaluator) +#[action_event( + level = info, + fields( + slot = won_slot.global_slot.slot_number.as_u32(), + slot_time = openmina_core::log::to_rfc_3339(won_slot.slot_time) + .unwrap_or_else(|_| "".to_owned()), + ) +)] +WonSlot { + won_slot: BlockProducerWonSlot, +} + +// Display formatting +#[action_event(level = info, 
fields(display(chain_id)))] +Initialize { chain_id: openmina_core::ChainId }, +``` + +**Automatic Level Assignment:** + +- Actions ending in `Error` or `Warn` automatically get `warn` level +- Default level is `debug` if not specified +- Enum-level `#[action_event(level = X)]` sets default for all variants + +**Documentation Integration:** + +```rust +/// Initializes p2p layer. +#[action_event(level = info)] +Initialize { chain_id: ChainId }, +``` + +Doc comments become `summary = "Initializes p2p layer"` in log events. + +### Using Log Levels for Debugging + +**Environment Variable Control:** + +```bash +# See everything (expensive, use sparingly) +OPENMINA_TRACING_LEVEL=trace cargo run --release -p cli node + +# Development debugging +OPENMINA_TRACING_LEVEL=debug cargo run --release -p cli node + +# Production logging +OPENMINA_TRACING_LEVEL=info cargo run --release -p cli node + +# Only warnings and errors +OPENMINA_TRACING_LEVEL=warn cargo run --release -p cli node +``` + +**Level Guidelines (based on actual usage):** + +- **trace** - Very frequent actions (use sparingly due to performance) +- **debug** - Regular operations during development +- **info** - Important business events suitable for production +- **warn** - Error conditions and anomalies + +**Debugging Strategy:** + +1. Start with `OPENMINA_TRACING_LEVEL=info` to see important events +2. Increase specific component actions to `debug` level in code +3. Use `OPENMINA_TRACING_LEVEL=debug` to see those specific actions +4. 
Only use `trace` level for short debugging sessions + +## Recording and Replay + +### Record Execution + +OpenMina supports recording execution for deterministic debugging: + +```bash +# Record input actions and initial state for replay +./target/release/openmina node --record state-with-input-actions + +# No recording (default) +./target/release/openmina node --record none +``` + +The recorded data is stored in `~/.openmina/recorder/` and includes: + +- Initial state snapshot with RNG seed and P2P secret key +- All input actions (timeouts and external events that drive state changes) + +### Replay Debugging Sessions + +Replay previously recorded sessions to reproduce issues deterministically: + +```bash +# Replay a recorded session +./target/release/openmina replay-state-with-input-actions --dir ~/.openmina/recorder + +# Replay with verbose output for debugging +./target/release/openmina replay-state-with-input-actions --verbosity debug --dir ~/.openmina/recorder + +# Ignore build environment mismatches if needed +./target/release/openmina replay-state-with-input-actions --ignore-mismatch --dir ~/.openmina/recorder +``` + +**Programmatic replay usage:** + +```rust +// Read recorded session +let reader = StateWithInputActionsReader::new("~/.openmina/recorder"); +let initial_state = reader.read_initial_state()?; + +// Replay actions step by step +for (path, actions) in reader.read_actions() { + // Process recorded actions to reproduce bug +} +``` + +## Network Analysis Tools + +### P2P Connection Analysis + +Network debugging in OpenMina is primarily done through structured logging and +the testing framework: + +**For P2P debugging, use:** + +- Action tracing with `OPENMINA_TRACING_LEVEL=debug` to see P2P events +- Component-specific logging for connection and message flow analysis +- Testing framework tools for controlled network scenario testing + +### Protocol Message Analysis + +**Available through logging:** + +- Connection lifecycle events via P2P action 
traces +- Kademlia DHT operations in debug logs +- Gossipsub message propagation events +- Stream multiplexing details in trace-level logs + +**For advanced network analysis:** + +- The testing framework in `node/testing/` includes network debugging + capabilities +- The `bpf-recorder` tool provides packet-level analysis in test scenarios +- See [Testing Infrastructure](testing-infrastructure.md) for testing-specific + debugging + +## Testing Framework + +OpenMina has comprehensive testing infrastructure for state machines. For +detailed information, see [Testing Infrastructure](testing-infrastructure.md). + +**Available Testing Approaches:** + +- **Unit Testing** - Test individual actions and reducers +- **Scenario-Based Testing** - Test component workflows +- **Multi-Node Simulation** - Test distributed behavior +- **Fuzzing** - Test with random inputs +- **Differential Testing** - Compare against OCaml implementation + +**Basic Unit Test Pattern:** + +```rust +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_enabling_conditions() { + let state = create_test_state(); + let action = YourComponentAction::Process { data: test_data() }; + assert!(action.is_enabled(&state, Timestamp::ZERO)); + } +} +``` + +## Common Error Patterns + +### Enabling Condition Mismatches + +**Problem:** `bug_condition!` triggers indicate enabling conditions don't match +reducer assumptions. 
+
**What `bug_condition!` Does:**

- Defensive programming macro for unreachable code paths
- In development (`OPENMINA_PANIC_ON_BUG=true`): Panics to catch bugs early
- In production (default): Logs error and continues gracefully
- Should only trigger if enabling conditions have bugs

**Example:**

```rust
// Enabling condition allows action
impl EnablingCondition<State> for MyAction {
    fn is_enabled(&self, state: &State, _time: Timestamp) -> bool {
        state.my_component.is_ready() // Returns true
    }
}

// But reducer expects different state
MyAction::Process { data } => {
    let Some(processor) = &state.processor else {
        bug_condition!("Process action enabled but no processor available");
        return;
    };
    // This bug_condition! indicates a mismatch
}
```

**Solution:** Align enabling condition with reducer expectations:

```rust
impl EnablingCondition<State> for MyAction {
    fn is_enabled(&self, state: &State, _time: Timestamp) -> bool {
        state.my_component.is_ready() && state.processor.is_some()
    }
}
```

### State Machine Initialization Issues

**Problem:** Actions dispatched before component is ready.

**Example:**

```rust
// Action reaches reducer before initialization
Action::P2p(p2p_action) => match &mut state.p2p {
    P2p::Pending(_) => {
        // This indicates premature action dispatch
        error!(meta.time(); "p2p not initialized", action = debug(p2p_action));
    }
    P2p::Ready(_) => {
        // Process action normally
    }
}
```

**Solution:** Use initialization state in enabling conditions:

```rust
impl EnablingCondition<State> for P2pAction {
    fn is_enabled(&self, state: &State, _time: Timestamp) -> bool {
        matches!(state.p2p, P2p::Ready(_))
    }
}
```

### Invalid State Transitions

**Problem:** Attempting impossible state transitions.
+
**Example:**

```rust
// Multiple initialization attempts
YourAction::Init { config } => {
    if state.is_initialized() {
        bug_condition!("Already initialized but Init action enabled");
        return;
    }
    state.initialize(config);
}
```

**Solution:** Use state enums to enforce valid transitions:

```rust
#[derive(Debug, Clone)]
pub enum YourComponentState {
    Uninitialized,
    Initializing { config: Config },
    Ready { data: ComponentData },
    Error { error: String },
}

// Enabling condition prevents invalid transitions
impl EnablingCondition<State> for YourAction {
    fn is_enabled(&self, state: &State, _time: Timestamp) -> bool {
        match self {
            YourAction::Init { .. } => {
                matches!(state.your_component, YourComponentState::Uninitialized)
            }
            YourAction::Process { .. } => {
                matches!(state.your_component, YourComponentState::Ready { .. })
            }
        }
    }
}
```

### Service Communication Errors

**Problem:** External services not responding or returning unexpected data.
+ +**Example:** + +```rust +// Handle service failures gracefully +YourAction::ServiceError { error } => { + warn!(meta.time(); "service call failed", error = display(error)); + + // Update state to reflect failure + state.status = YourComponentStatus::Error { + error: error.to_string() + }; + + // Dispatch retry or fallback action + let dispatcher = state_context.into_dispatcher(); + dispatcher.push(YourAction::RetryOrFallback); +} +``` + +**Solution:** Implement robust error handling: + +```rust +// In effectful actions +YourEffectfulAction::ExternalRequest { params } => { + // Service call with timeout and retry logic + store.service.call_with_retry(params, max_retries, timeout); +} + +// In service implementation +impl YourService for ServiceImpl { + fn call_with_retry(&self, params: Params, max_retries: u32, timeout: Duration) { + // Implement retry logic with exponential backoff + // Send success or failure events back to state machine + } +} +``` + +### Mixing Stateful and Effectful Logic + +**Problem:** Putting business logic in effects instead of reducers. 
+
**Wrong:**

```rust
// Business logic in effects - DON'T DO THIS
impl YourEffectfulAction {
    pub fn effects(&self, store: &mut Store) {
        if complex_business_condition {
            // Complex logic here violates architecture
            if some_other_condition {
                store.dispatch(SomeAction);
            }
        }
        store.service.call_external();
    }
}
```

**Right:**

```rust
// Business logic in reducers
YourAction::ProcessRequest { request_id, data } => {
    let Ok(state) = state_context.get_substate_mut() else {
        // TODO: log or propagate
        return;
    };

    // State updates (enabling condition already verified this is valid)
    state.status = Status::Processing { request_id };
    state.pending_requests.push_back(PendingRequest {
        id: request_id,
        data: data.clone(),
        timestamp: meta.time(),
    });

    // Prepare and dispatch effectful action
    let dispatcher = state_context.into_dispatcher();
    dispatcher.push(YourEffectfulAction::ExternalCall {
        request_id,
        params: data.into_params(),
    });
}

// Thin effects wrapper
impl YourEffectfulAction {
    pub fn effects(&self, store: &mut Store) {
        match self {
            YourEffectfulAction::ExternalCall { params, .. } => {
                // Only service calls, no business logic
                store.service.call_external(params);
            }
        }
    }
}
```

## Debugging Best Practices

1. **Start with `info` level logs** to understand the overall flow
2. **Use `debug` level selectively** for components under investigation
3. **Record execution** for reproducible debugging sessions
4. **Write enabling conditions** that match reducer logic exactly
5. **Use `bug_condition!`** for invariant checking in development
6. **Test with scenarios** that cover your component's edge cases
7. **Use structured logging and testing tools** for P2P communication issues
8.
**Check technical debt** in `summary.md` files for known issues diff --git a/docs/handover/state-machine-development-guide.md b/docs/handover/state-machine-development-guide.md new file mode 100644 index 000000000..042d7e238 --- /dev/null +++ b/docs/handover/state-machine-development-guide.md @@ -0,0 +1,454 @@ +# State Machine Development Guide + +This guide provides practical knowledge for developers working with OpenMina's +state machine architecture. It focuses on common development patterns and +workflows for implementing features within the Redux-style state management +system. + +## Prerequisites + +Before using this guide, read: + +- [Architecture Walkthrough](architecture-walkthrough.md) - Core concepts and + patterns +- [State Machine Structure](state-machine-structure.md) - Action/reducer + organization +- [State Machine Patterns](state-machine-patterns.md) - Common patterns and when + to use them +- [Project Organization](organization.md) - Codebase navigation +- Main [README](../../README.md) - Building and running the project + +> **Related Guides**: [Adding RPC Endpoints](adding-rpc-endpoints.md), +> [State Machine Debugging Guide](state-machine-debugging-guide.md), +> [Testing Infrastructure](testing-infrastructure.md) + +## Making Your First Changes + +### Choose the Right Pattern + +Before implementing a new state machine, determine which pattern fits your use +case: + +**For Async Operations** (most common): + +- Use **Pure Lifecycle Pattern** (Init → Pending → Success/Error) +- Examples: Network requests, proof generation, data loading +- See [State Machine Patterns](state-machine-patterns.md#pure-lifecycle-pattern) + for examples + +**For Multi-Phase Operations**: + +- Use **Sequential Lifecycle Pattern** or **Connection Lifecycle Pattern** +- Examples: Sync operations, P2P handshakes, protocol negotiations +- See + [State Machine Patterns](state-machine-patterns.md#sequential-lifecycle-pattern) + for examples + +**For Complex Workflows**: + +- 
Use **Hybrid Patterns** or **Iterative Process Pattern** +- Examples: Block production, VRF evaluation, long-running computations +- See + [State Machine Patterns](state-machine-patterns.md#hybrid-lifecycle--domain-specific-patterns) + for examples + +### Finding the Right Component + +When implementing a feature or fixing a bug, locate the relevant state machine: + +**By Feature Domain:** + +- **P2P networking** → `p2p/src/` +- **Block production** → `node/src/block_producer/` +- **Transaction processing** → `node/src/transaction_pool/` +- **Ledger operations** → `node/src/ledger/` +- **SNARK verification** → `snark/src/` +- **Consensus logic** → `node/src/transition_frontier/` + +**By Action Type:** + +1. **Search for existing actions** - Use `rg "SomeAction"` to find similar + functionality +2. **Follow state flow** - Look at state definitions to understand data flow +3. **Check `summary.md`** - Most components have purpose and technical debt + notes + +**Example: Adding transaction validation** + +```bash +# Find transaction-related actions +rg "TransactionPool.*Action" --type rust + +# Look at transaction pool state +cat node/src/transaction_pool/transaction_pool_state.rs + +# Check component documentation +cat node/src/transaction_pool/summary.md +``` + +### Code Change Workflow + +**1. Understand the Existing Pattern** + +- Find similar functionality in the same component +- Note the action → reducer → effect flow +- Check enabling conditions and state transitions + +**2. Follow Component Conventions** + +- Use existing naming patterns for actions and state +- Match the style of enabling conditions +- Follow the same error handling patterns + +**3. 
Use Existing Code as Templates** + +```rust +// Template for new stateful actions (most common pattern) +YourComponentAction::NewAction { data } => { + let Ok(state) = state_context.get_substate_mut() else { + // TODO: log or propagate + return; + }; + state.some_field = data.clone(); + + // Dispatch follow-up actions + let dispatcher = state_context.into_dispatcher(); + dispatcher.push(YourComponentAction::NextAction { ... }); +} + +// Template for new effectful actions +YourEffectfulAction::NewRequest { params } => { + store.service.call_external_method(params); +} +``` + +## Adding New State Machines + +### Directory Structure + +Follow the standard layout for new components: + +``` +your_component/ +├── your_component_state.rs # State definition +├── your_component_actions.rs # Stateful action types +├── your_component_reducer.rs # State transitions + dispatching +├── your_component_effectful/ # Effectful actions directory +│ ├── your_component_effectful_actions.rs # Effectful action types +│ ├── your_component_effectful_effects.rs # Effects implementations +│ └── your_component_service.rs # Service interface +└── summary.md # Component purpose and notes +``` + +**Alternative flat structure (also used):** + +``` +your_component/ +├── your_component_state.rs # State definition +├── your_component_actions.rs # Stateful action types +├── your_component_reducer.rs # State transitions + dispatching +├── your_component_effects.rs # Effectful actions and effects +├── your_component_service.rs # Service interface +└── summary.md # Component purpose and notes +``` + +**Architecture Migration Status:** The codebase is in transition from "old +style" (separate reducer/effects) to "new style" (unified reducers). The +transition frontier (`node/src/transition_frontier/`) still uses the old +pattern, while most other components use the new pattern described in this +guide. For detailed migration instructions, see +[ARCHITECTURE.md](../../ARCHITECTURE.md). 
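The difference between the two styles is easiest to see in a toy model. Below, a single "new style" reducer both mutates component state and produces the follow-up actions it wants dispatched; every name is invented for illustration, and the real code threads a `Substate` context and dispatcher instead of returning a `Vec`:

```rust
// Invented stand-ins for a component's state and actions.
#[derive(Debug, PartialEq)]
enum Status {
    Idle,
    Processing,
}

#[derive(Debug, PartialEq)]
enum Action {
    Start,
    Finished,
}

struct State {
    status: Status,
}

// A unified reducer: state transition and follow-up dispatch live in one
// function. Follow-ups are returned here only to keep the model
// self-contained.
fn reducer(state: &mut State, action: &Action) -> Vec<Action> {
    match action {
        Action::Start => {
            state.status = Status::Processing;
            // Queue the next step right where the transition happened.
            vec![Action::Finished]
        }
        Action::Finished => {
            state.status = Status::Idle;
            vec![]
        }
    }
}

fn main() {
    let mut state = State { status: Status::Idle };
    let follow_ups = reducer(&mut state, &Action::Start);
    assert_eq!(state.status, Status::Processing);
    assert_eq!(follow_ups, vec![Action::Finished]);
}
```

In the "old style", the `vec![Action::Finished]` part would live in a separate effects function, which is what the migration is removing.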
+
### Action Patterns

**1. Define State Structure**

```rust
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct YourComponentState {
    pub status: YourComponentStatus,
    pub data: BTreeMap<Id, Data>,
    pub pending_requests: VecDeque<PendingRequest>,
}

#[derive(Serialize, Deserialize, Debug, Clone)]
pub enum YourComponentStatus {
    Idle,
    Processing { request_id: Id },
    Error { error: String },
}
```

**2. Categorize Actions**

```rust
// Stateful actions - handled by reducers
#[derive(Serialize, Deserialize, Debug, Clone, ActionEvent)]
pub enum YourComponentAction {
    #[action_event(level = info)]
    Init { config: Config },

    #[action_event(level = debug)]
    ProcessData { data: Data },

    #[action_event(level = warn, fields(debug(error)))]
    Error { error: String },
}

// Effectful actions - handled by effects
#[derive(Serialize, Deserialize, Debug, Clone, ActionEvent)]
pub enum YourComponentEffectfulAction {
    #[action_event(level = debug)]
    ExternalRequest { params: Params },

    #[action_event(level = trace)]
    ServiceCall { input: Input },
}
```

**3. Write Enabling Conditions**

```rust
impl EnablingCondition<crate::State> for YourComponentAction {
    fn is_enabled(&self, state: &crate::State, time: Timestamp) -> bool {
        match self {
            YourComponentAction::Init { .. } => {
                // Only allow initialization when not already initialized
                matches!(state.your_component.status, YourComponentStatus::Idle)
            }
            YourComponentAction::ProcessData { data } => {
                // Only process when ready and data is valid
                matches!(state.your_component.status, YourComponentStatus::Idle)
                    && data.is_valid()
            }
            YourComponentAction::Error { .. } => {
                // Errors always allowed for defensive programming
                true
            }
        }
    }
}
```

### Integration Points

**1. Add to Main State**

```rust
// In node/src/state.rs
pub struct State {
    // ... existing fields
    pub your_component: YourComponentState,
}
```

**2.
Add to Main Action Enum** + +```rust +// In node/src/action.rs +pub enum Action { + // ... existing variants + YourComponent(YourComponentAction), + YourComponentEffectful(YourComponentEffectfulAction), +} +``` + +**Note on Action Type Generation:** The `node/src/action_kind.rs` file is +autogenerated by the build script. Your actions will be automatically picked up +when you follow the naming convention (`*_actions.rs` or `action.rs`). + +**3. Add to Main Reducer** + +```rust +// In node/src/reducer.rs +Action::YourComponent(action) => { + YourComponentState::reducer(state.substate(), action.with_meta(&meta)) +} +``` + +## Component Communication + +### Callback Pattern + +The codebase uses callbacks for decoupled component communication. The +`redux::callback!` macro enables components to specify how async operations +should respond without tight coupling. + +**When to use callbacks:** + +- Async operations that need custom response handling +- Cross-component communication without dependencies +- Operations where different callers need different completion behavior + +For detailed callback patterns and examples, see +[Architecture Walkthrough](architecture-walkthrough.md#callbacks-pattern). + +### Global State Access Pattern + +When reducers need to access multiple parts of the global state or coordinate +between different subsystems, use `into_dispatcher_and_state()` instead of +`into_dispatcher()`. + +**When to use `into_dispatcher_and_state()`:** + +- Generating request IDs from other subsystems +- Reading configuration from other state machines +- Coordinating between multiple components +- Accessing peer information or network state + +**Common use cases:** + +```rust +// Getting request IDs from other subsystems +let (dispatcher, global_state) = state_context.into_dispatcher_and_state(); +let req_id = global_state.snark.user_command_verify.next_req_id(); +dispatcher.push(SnarkUserCommandVerifyAction::Init { req_id, ... 
}); + +// Accessing peer information for P2P operations +let (dispatcher, global_state) = state_context.into_dispatcher_and_state(); +for peer_id in global_state.p2p.ready_peers() { + dispatcher.push(TransactionPoolAction::P2pSend { peer_id }); +} + +// Coordinating between multiple state machines +let (dispatcher, global_state) = state_context.into_dispatcher_and_state(); +let best_tip = global_state.transition_frontier.best_tip()?; +let cur_slot = global_state.cur_global_slot()?; +// Use both pieces of information for decision making +``` + +**Pattern structure:** + +```rust +let (dispatcher, global_state) = state_context.into_dispatcher_and_state(); +// Read from global state +let info = global_state.some_subsystem.get_info(); +// Dispatch actions using that information +dispatcher.push(SomeAction::WithInfo { info }); +``` + +**Important notes:** + +- Only use when you need to read from global state +- Prefer `into_dispatcher()` for simple action dispatching +- The global state reference is read-only +- Common in P2P, block producer, and transaction pool reducers + +## Basic Debugging + +For comprehensive debugging tools and troubleshooting, see +[State Machine Debugging Guide](state-machine-debugging-guide.md). 
+ +**Quick debugging tips:** + +```bash +# Basic logging control +OPENMINA_TRACING_LEVEL=debug cargo run --release -p cli node +``` + +```rust +// Add ActionEvent to your actions for automatic logging +#[derive(Serialize, Deserialize, Debug, Clone, ActionEvent)] +pub enum YourAction { + #[action_event(level = debug)] + ProcessData { data: Data }, +} +``` + +## Quick Reference Checklists + +### ✅ Adding a New Action to Existing Component + +**Before you start:** + +- [ ] Find similar actions in the same component +- [ ] Check the component's `summary.md` for known issues +- [ ] Understand the existing state flow + +**Implementation steps:** + +- [ ] Add action variant to `*_actions.rs` +- [ ] Add `#[action_event(level = debug)]` for logging +- [ ] Implement enabling condition in `EnablingCondition` trait +- [ ] Add handler in reducer with proper state access pattern + - Use `into_dispatcher()` for simple action dispatching + - Use `into_dispatcher_and_state()` when needing global state access +- [ ] Test enabling condition logic matches reducer expectations +- [ ] Add documentation comment explaining the action's purpose + +### ✅ Adding a New Service Call + +**Before you start:** + +- [ ] Check if similar service calls exist +- [ ] Identify if this should be effectful action or direct service call +- [ ] Understand the async result handling pattern + +**Implementation steps:** + +- [ ] Add effectful action variant to `*_effectful_actions.rs` +- [ ] Add thin effect handler that only calls service +- [ ] Ensure service sends result via events +- [ ] Add event handling in event source (if needed) +- [ ] Test the complete async flow + +### ✅ Adding a New State Machine Component + +**Planning:** + +- [ ] Identify the component's single responsibility +- [ ] Check if this fits better as part of existing component +- [ ] Plan the state structure (use enums for state flow) +- [ ] Identify what services this component will need + +**File structure:** + +- [ ] Create 
`component_state.rs` with state definition +- [ ] Create `component_actions.rs` with action types +- [ ] Create `component_reducer.rs` with unified reducer +- [ ] Create `component_effectful/` directory structure +- [ ] Create `summary.md` documenting purpose and any issues + +**Integration:** + +- [ ] Add to main `State` struct in `node/src/state.rs` +- [ ] Add to main `Action` enum in `node/src/action.rs` +- [ ] Add to main reducer in `node/src/reducer.rs` +- [ ] Add substate access with `impl_substate_access!` macro +- [ ] Add service integration if needed + +### ✅ Debugging Common Issues + +**Action not being processed:** + +- [ ] Check if action appears in logs with `OPENMINA_TRACING_LEVEL=debug` +- [ ] Verify enabling condition allows the action +- [ ] Check if action is added to main Action enum and reducer +- [ ] Verify component is initialized before action dispatch + +**Service call not returning results:** + +- [ ] Check service implementation sends events +- [ ] Verify event source processes the event type +- [ ] Check if event gets converted to correct action +- [ ] Look for service-specific logs or errors + +**State machine appears stuck:** + +- [ ] Check logs for panic messages or `bug_condition!` triggers +- [ ] Look for blocking operations (sync service calls) +- [ ] Verify events are being processed by event source +- [ ] Check for infinite loops in reducer logic + +## Best Practices + +1. **Use structured logging** with appropriate `ActionEvent` levels +2. **Write enabling conditions** that match reducer logic exactly +3. **Keep effects thin** - only service calls, no business logic +4. **Use `bug_condition!`** for invariant checking in development +5. **Test with scenarios** that cover your component's edge cases +6. **Document technical debt** in `summary.md` files +7. **Follow existing patterns** in the same component +8. **Use recording/replay** for reproducible debugging +9. 
**Error handling** - The codebase commonly uses `unwrap()` and `expect()` for + substate access, as enabling conditions should prevent invalid states diff --git a/docs/handover/state-machine-patterns.md b/docs/handover/state-machine-patterns.md new file mode 100644 index 000000000..f1daaba0c --- /dev/null +++ b/docs/handover/state-machine-patterns.md @@ -0,0 +1,537 @@ +# State Machine Patterns in OpenMina + +This guide describes the patterns we use for state machines in OpenMina and when +to apply each one. These patterns grew organically from different contributors +over time, and there may be opportunities for normalization. + +> **Prerequisites**: Read +> [Architecture Walkthrough](architecture-walkthrough.md) and +> [State Machine Structure](state-machine-structure.md) first. **Related**: See +> [State Machine Development Guide](state-machine-development-guide.md) for +> implementation details. + +## Our State Machine Patterns + +OpenMina contains dozens of state machines that use several distinct patterns: + +1. **Pure Lifecycle Pattern** - Simple async operations +2. **Sequential Lifecycle Pattern** - Multi-phase operations with state + accumulation +3. **Connection Lifecycle Pattern** - Complex protocol negotiations +4. **Iterative Process Pattern** - Long-running processes with stepping +5. **Worker State Machine Pattern** - Process management with operational loops +6. **Hybrid Patterns** - Complex domain workflows with embedded patterns + +**Important**: Many actions exist for **debugging and testing granularity** +rather than just async operations. This enables precise state tracking, better +simulator tests, and detailed logging. 
+ +## Quick Reference + +| Pattern | Use For | Naming Convention | Example | +| -------------------- | --------------------------- | ----------------------------------- | ------------------------ | +| Pure Lifecycle | Simple async operations | `Init/Pending/Success/Error` | SNARK verification | +| Sequential Lifecycle | Multi-phase sync operations | `Phase1Pending/Phase1Success` | Transition frontier sync | +| Connection Lifecycle | Network protocol handshakes | `Phase + Pending/Success` | P2P connections | +| Iterative Process | Long-running computations | `Begin/Continue/Finish/Interrupt` | VRF epoch evaluation | +| Worker State Machine | External process management | `Starting/Idle/Working/Ready/Error` | External SNARK worker | +| Hybrid Pattern | Complex domain workflows | Mixed patterns as appropriate | Block producer | + +## Common Patterns + +### 1. Pure Lifecycle Pattern (Init → Pending → Success/Error) + +**Use for**: Simple async operations that follow a clear lifecycle. + +#### SNARK Verification (`snark/src/block_verify/`) + +```rust +pub enum SnarkBlockVerifyAction { + Init { /* block data */ }, + Pending { /* verification progress */ }, + Success { /* verification result */ }, + Error { /* verification error */ }, + Finish { /* cleanup */ }, +} + +pub enum SnarkBlockVerifyState { + Init { /* ... */ }, + Pending { /* ... */ }, + Success { /* ... */ }, + Error { /* ... */ }, +} +``` + +**Pattern**: Single async operation, clear linear flow, simple error handling. 
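A reducer for this pattern can stay very small. The sketch below is a self-contained, simplified model (the `FetchState`/`FetchAction` names are hypothetical, not real OpenMina types) showing how an enabling condition mirrors the reducer's accepted transitions, so out-of-order actions are rejected instead of corrupting state:

```rust
// Hypothetical pure-lifecycle state machine: Init -> Pending -> Success/Error.
#[derive(Debug, PartialEq)]
enum FetchState {
    Idle,
    Pending { attempt: u32 },
    Success { data: String },
    Error { error: String },
}

enum FetchAction {
    Init,
    Success { data: String },
    Error { error: String },
}

// Enabling condition: mirrors exactly what the reducer accepts, so invalid
// transitions are filtered out before the reducer runs.
fn is_enabled(state: &FetchState, action: &FetchAction) -> bool {
    match (state, action) {
        (FetchState::Idle, FetchAction::Init) => true,
        (FetchState::Pending { .. }, FetchAction::Success { .. }) => true,
        (FetchState::Pending { .. }, FetchAction::Error { .. }) => true,
        _ => false,
    }
}

fn reduce(state: &mut FetchState, action: FetchAction) {
    if !is_enabled(state, &action) {
        return; // in the real system this mismatch would be logged as a bug
    }
    *state = match action {
        FetchAction::Init => FetchState::Pending { attempt: 1 },
        FetchAction::Success { data } => FetchState::Success { data },
        FetchAction::Error { error } => FetchState::Error { error },
    };
}

fn main() {
    let mut state = FetchState::Idle;
    reduce(&mut state, FetchAction::Init);
    assert_eq!(state, FetchState::Pending { attempt: 1 });
    reduce(&mut state, FetchAction::Success { data: "block".into() });
    assert_eq!(state, FetchState::Success { data: "block".into() });
    // Init is not enabled from Success, so the reducer ignores it.
    reduce(&mut state, FetchAction::Init);
    assert_eq!(state, FetchState::Success { data: "block".into() });
}
```

Because the enabling condition and the reducer agree line for line, a stray `Success` arriving after an `Error` is simply dropped, which is the behavior the checklist items above ("enabling conditions should match reducer logic exactly") are protecting.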
+ +#### Transaction Pool Candidate (`node/src/transaction_pool/candidate/`) + +```rust +pub enum TransactionPoolCandidateAction { + InfoReceived { /* transaction info */ }, + FetchInit { /* start fetching */ }, + FetchPending { /* fetch progress */ }, + FetchError { /* fetch failed */ }, + FetchSuccess { /* transaction fetched */ }, + VerifyPending { /* verification progress */ }, + VerifyError { /* verification failed */ }, + VerifySuccess { /* verification complete */ }, +} + +pub enum TransactionPoolCandidateState { + InfoReceived { /* ... */ }, + FetchPending { /* ... */ }, + Received { /* ... */ }, + VerifyPending { /* ... */ }, + VerifyError { /* ... */ }, + VerifySuccess { /* ... */ }, +} +``` + +**Pattern**: Multiple phases, each following lifecycle (fetch phase, then verify +phase). + +### 2. Sequential Lifecycle Pattern + +**Complex sync operations** with multiple sequential lifecycle phases. + +#### Transition Frontier Sync (`node/src/transition_frontier/sync/`) + +```rust +pub enum TransitionFrontierSyncState { + Idle, + Init { /* sync start */ }, + + // Phase 1: Staking ledger sync + StakingLedgerPending { /* ... */ }, + StakingLedgerSuccess { /* ... */ }, + + // Phase 2: Next epoch ledger sync + NextEpochLedgerPending { /* ... */ }, + NextEpochLedgerSuccess { /* ... */ }, + + // Phase 3: Root ledger sync + RootLedgerPending { /* ... */ }, + RootLedgerSuccess { /* ... */ }, + + // Phase 4: Block sync + BlocksPending { /* ... */ }, + BlocksSuccess { /* ... */ }, + + // Phase 5: Commit + CommitPending { /* ... */ }, + CommitSuccess { /* ... */ }, + + Synced { /* final state */ }, +} +``` + +**Pattern**: Multiple sequential phases, each with Pending → Success lifecycle. +Each Success state carries forward the data needed for the next phase. + +### 3. Connection Lifecycle Pattern + +**Network connections** that need complex handshake flows. 
+ +#### P2P Outgoing Connection (`p2p/src/connection/outgoing/`) + +```rust +pub enum P2pConnectionOutgoingState { + Init { /* connection parameters */ }, + + // SDP creation phase + OfferSdpCreatePending { /* ... */ }, + OfferSdpCreateSuccess { /* SDP created */ }, + + // Offer phase + OfferReady { /* offer ready */ }, + OfferSendSuccess { /* offer sent */ }, + + // Answer phase + AnswerRecvPending { /* waiting for answer */ }, + AnswerRecvSuccess { /* answer received */ }, + + // Finalization phase + FinalizePending { /* finalizing connection */ }, + FinalizeSuccess { /* connection established */ }, + + // Terminal states + Success { /* connected */ }, + Error { /* connection failed */ }, +} +``` + +**Pattern**: Multiple phases with detailed intermediate states. Each phase has +its own success state that flows to the next phase. + +### 4. Iterative Process Pattern + +**Long-running processes** that execute in steps over time. + +#### VRF Epoch Evaluation (`node/src/block_producer/vrf_evaluator/`) + +```rust +pub enum BlockProducerVrfEvaluatorAction { + // Process control + BeginEpochEvaluation { /* start parameters */ }, + ContinueEpochEvaluation { /* step parameters */ }, + FinishEpochEvaluation { /* completion data */ }, + + // Process interruption + InterruptEpochEvaluation { reason: InterruptReason }, + + // Sub-processes (also iterative) + BeginDelegatorTableConstruction, + FinalizeDelegatorTableConstruction { /* table data */ }, + + // Individual evaluations + EvaluateSlot { /* slot data */ }, + ProcessSlotEvaluationSuccess { /* evaluation result */ }, +} +``` + +**Pattern**: Long-running process that: + +- **Begins** with initialization +- **Continues** through multiple steps/iterations +- **Finishes** when complete or interrupted +- Can be **interrupted** and potentially resumed + +**Key Insight**: Not all Begin/Finish pairs are lifecycle patterns - some +represent iterative processes. + +### 5. 
Worker State Machine Pattern + +**Worker processes** with lifecycle + operational states. + +#### External SNARK Worker (`node/src/external_snark_worker/`) + +```rust +pub enum ExternalSnarkWorkerState { + None, + Starting, // Lifecycle: init + + // Operational states + Idle, + Working(SnarkWorkId, JobSummary), + WorkReady(SnarkWorkId, SnarkWorkResult), + WorkError(SnarkWorkId, ExternalSnarkWorkerWorkError), + + // Cancellation lifecycle + Cancelling(SnarkWorkId), + Cancelled(SnarkWorkId), + + // Shutdown lifecycle + Killing, + Error(ExternalSnarkWorkerError, bool), +} +``` + +**Pattern**: Initialization lifecycle → Operational loop (Idle → Working → +Ready/Error) → Shutdown lifecycle. Worker can be cancelled during operation. + +### 6. Hybrid Lifecycle + Domain-Specific Patterns + +**Complex business workflows** that embed lifecycle patterns within +domain-specific flows. + +#### Block Producer (`node/src/block_producer/`) + +```rust +pub enum BlockProducerAction { + // Domain-specific workflow + VrfEvaluator { /* VRF evaluation */ }, + WonSlotSearch { /* slot winning check */ }, + WonSlot { /* slot won */ }, + WonSlotWait { /* waiting for slot */ }, + + // Lifecycle pattern for staged ledger diff creation + StagedLedgerDiffCreateInit { /* diff creation start */ }, + StagedLedgerDiffCreatePending { /* diff creation progress */ }, + StagedLedgerDiffCreateSuccess { /* diff created */ }, + + // Lifecycle pattern for block proving + BlockProveInit { /* proof generation start */ }, + BlockProvePending { /* proof generation progress */ }, + BlockProveSuccess { /* proof generated */ }, + + // Domain-specific completion + BlockProduced { /* block completed */ }, + BlockInject { /* inject into network */ }, + BlockInjected { /* injection completed */ }, +} + +pub enum BlockProducerCurrentState { + Idle { /* ... */ }, + WonSlot { /* ... */ }, + WonSlotWait { /* ... */ }, + StagedLedgerDiffCreatePending { /* ... 
*/ }, // Lifecycle state + StagedLedgerDiffCreateSuccess { /* ... */ }, // Lifecycle state + BlockProvePending { /* ... */ }, // Lifecycle state + BlockProveSuccess { /* ... */ }, // Lifecycle state + Produced { /* ... */ }, + Injected { /* ... */ }, +} +``` + +**Pattern**: Domain-specific workflow orchestrates multiple async operations, +each using lifecycle patterns internally. + +## When to Use Each Pattern & Implementation Guidelines + +### Use Pure Lifecycle Pattern When: + +- ✅ **Single async operation** (SNARK verification, simple fetches) +- ✅ **Linear flow** with clear start → progress → completion +- ✅ **Simple error handling** (retry or abort) +- ✅ **No complex business logic** between states + +**Implementation**: + +```rust +// Good - consistent lifecycle naming +pub enum MyAction { + FetchInit { /* ... */ }, + FetchPending { /* ... */ }, + FetchSuccess { /* ... */ }, + FetchError { /* ... */ }, // Always include error handling +} +``` + +### Use Sequential Lifecycle Pattern When: + +- ✅ **Multiple phases** that must complete in order +- ✅ **Each phase** is an async operation with its own lifecycle +- ✅ **State accumulation** (each phase builds on previous results) +- ✅ **Complex sync operations** (ledger sync, blockchain sync) + +### Use Connection Lifecycle Pattern When: + +- ✅ **Network protocols** with handshake flows +- ✅ **Multiple negotiation phases** (SDP, offer, answer, finalize) +- ✅ **Detailed intermediate states** needed for debugging +- ✅ **Connection establishment** processes + +### Use Iterative Process Pattern When: + +- ✅ **Long-running computations** that need to be stepped +- ✅ **Interruptible processes** that can be paused/resumed +- ✅ **Progress tracking** through multiple iterations +- ✅ **Examples**: VRF evaluation, epoch processing, large computations + +**Implementation**: + +```rust +// Good - iterative process naming +pub enum MyAction { + BeginComputation { /* ... */ }, + ContinueComputation { /* ... 
*/ }, + FinishComputation { /* ... */ }, + InterruptComputation { /* ... */ }, +} +``` + +### Use Worker State Machine Pattern When: + +- ✅ **Worker processes** with start/stop lifecycle +- ✅ **Operational loop** (idle → working → result) +- ✅ **Cancellation support** during operation +- ✅ **External process management** (SNARK workers, external services) + +### Use Hybrid Lifecycle + Domain Pattern When: + +- ✅ **Complex business workflows** with embedded async operations +- ✅ **Domain-specific states** mixed with lifecycle operations +- ✅ **Multiple concerns** in one state machine +- ✅ **Examples**: Block production, transaction pool processing + +**Implementation**: + +```rust +// Group related lifecycle operations together +pub enum BlockProducerAction { + // Domain workflow + WonSlot { /* ... */ }, + WonSlotWait { /* ... */ }, + + // Lifecycle group 1: Diff creation + StagedLedgerDiffCreateInit { /* ... */ }, + StagedLedgerDiffCreatePending { /* ... */ }, + StagedLedgerDiffCreateSuccess { /* ... */ }, + + // Lifecycle group 2: Block proving + BlockProveInit { /* ... */ }, + BlockProvePending { /* ... */ }, + BlockProveSuccess { /* ... */ }, + + // Domain completion + BlockProduced { /* ... */ }, +} +``` + +## Common Anti-Patterns to Avoid + +### 1. Missing Error Handling + +```rust +// Bad - no error action +pub enum MyAction { + Init, + Pending, + Success, // What happens if this fails? +} + +// Good - complete error handling +pub enum MyAction { + Init, + Pending, + Success, + Error { error: String, should_retry: bool }, +} +``` + +### 2. Inconsistent Naming + +```rust +// Bad - mixing patterns +pub enum MyAction { + BeginFetch, // Iterative style + FetchPending, // Lifecycle style + FinalizeFetch, // Inconsistent +} + +// Good - consistent lifecycle +pub enum MyAction { + FetchInit, + FetchPending, + FetchSuccess, + FetchError, +} +``` + +### 3. 
Overly Complex State Hierarchies + +```rust +// Avoid - unnecessarily complex +pub enum BadAction { + PreInit { /* ... */ }, + Init { /* ... */ }, + PostInit { /* ... */ }, + PrePending { /* ... */ }, + Pending { /* ... */ }, + PostPending { /* ... */ }, +} +``` + +### 4. Unclear State Purpose + +```rust +// Bad - unclear state purpose +pub enum MyState { + StateA { /* what does this do? */ }, + StateB { /* when does this happen? */ }, +} + +// Good - clear state meaning +pub enum MyState { + Init { /* initialization data */ }, + Pending { /* async operation in progress */ }, + Success { /* operation completed */ }, +} +``` + +## Action Granularity for Debugging + +**Many actions exist for debugging/testing granularity, not async necessity:** + +```rust +// Block Producer - granular actions for debugging +pub enum BlockProducerAction { + // These provide debugging visibility into async operation + StagedLedgerDiffCreateInit, // Debugging: marks start + StagedLedgerDiffCreatePending, // Debugging: shows progress + StagedLedgerDiffCreateSuccess, // Debugging: marks completion + + // Only the Init triggers actual async work + // Pending/Success are for state tracking and logging +} +``` + +**Benefits:** + +- **Simulator tests** can verify exact state transitions +- **Invariant checker** can validate state at each step +- **Logging** shows detailed progress for debugging +- **Monitoring** can track operation phases + +## Known Issues and Improvement Opportunities + +### Missing Error Handling + +**Block Producer lacks error actions for critical operations:** + +```rust +// Current - missing error handling +pub enum BlockProducerAction { + BlockProveInit, + BlockProvePending, + BlockProveSuccess { proof: Arc }, + // MISSING: BlockProveError - what happens when proof fails? 
+} +``` + +**Should add:** + +```rust +pub enum BlockProducerAction { + BlockProveInit, + BlockProvePending, + BlockProveSuccess { proof: Arc }, + BlockProveError { + error: String, + retry_count: u32, + should_retry: bool, + }, +} +``` + +### Consider Normalization When: + +- ⚠️ **Inconsistent naming** for similar operations +- ⚠️ **Missing Error actions** for operations that can fail +- ⚠️ **Begin/Finalize** used for simple async instead of **Init/Success** +- ⚠️ **Mix of patterns** within same state machine without clear reason + +**Note**: `Begin/Continue/Finish` is correct for iterative processes, but +`Init/Pending/Success` is better for simple async operations. + +## Best Practices for New State Machines + +1. **Choose the Right Pattern**: Match pattern to problem complexity +2. **Always Include Error Handling**: Every async operation needs Error actions +3. **Design for Debugging**: Use granular actions for state visibility +4. **Use Consistent Naming**: Follow established patterns within your domain +5. **Carry State Forward**: Each phase should build on previous results +6. **Group Related Operations**: Keep lifecycle operations together + +### Naming Conventions: + +- **Lifecycle**: `Init/Pending/Success/Error` +- **Iterative**: `Begin/Continue/Finish/Interrupt` +- **Worker**: `Starting/Idle/Working/Ready/Error` +- Don't mix patterns without clear reason + +## Conclusion + +These state machine patterns have evolved organically to serve different needs, +from simple async operations to complex domain workflows. The diversity enables +appropriate pattern selection for each problem domain, while granular actions +provide excellent debugging capabilities. + +When implementing new state machines, follow these established patterns to +maintain consistency and leverage the debugging infrastructure. The key is +matching the pattern to the problem complexity rather than forcing simple +problems into complex patterns. 
+ +For implementation details and migration guidance, see +[ARCHITECTURE.md](../../ARCHITECTURE.md). diff --git a/docs/handover/state-machine-structure.md b/docs/handover/state-machine-structure.md new file mode 100644 index 000000000..fd66984d7 --- /dev/null +++ b/docs/handover/state-machine-structure.md @@ -0,0 +1,330 @@ +# OpenMina State Machine Structure + +This document maps out the complete hierarchy and organization of state machines +in OpenMina, showing how dozens of state machines are structured and their +relationships. + +> **Prerequisites**: Read +> [Architecture Walkthrough](architecture-walkthrough.md) first to understand +> the Redux pattern and core concepts. **Next Steps**: After understanding the +> structure, see [State Machine Patterns](state-machine-patterns.md) for common +> patterns, then [Project Organization](organization.md) to navigate the +> codebase. + +## Architectural Layers + +### 1. Top-Level Orchestration + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Main Node State Machine │ +│ (node/src/) │ +├─────────────────────────────────────────────────────────────┤ +│ • Orchestrates all subsystems │ +│ • Routes actions between components │ +│ • Manages node lifecycle │ +│ • Coordinates P2P, consensus, storage, and RPC │ +└─────────────────────────────────────────────────────────────┘ + │ + ┌──────────────────────┼──────────────────────┐ + ▼ ▼ ▼ +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ P2P System │ │ SNARK System │ │ Node Subsystems │ +│ (p2p/src/) │ │ (snark/src/) │ │ (node/src/*) │ +└─────────────────┘ └─────────────────┘ └─────────────────┘ +``` + +### 2. Core Systems + +#### P2P State Machine (`p2p/src/`) + +Manages all peer-to-peer networking functionality through two distinct network +layers: + +- **Connection Management**: Handles incoming/outgoing peer connections +- **Dual Network Support**: + - libp2p protocols (Kademlia, PubSub, etc.) 
for native nodes + - WebRTC-based network for webnodes with different design patterns +- **Message Routing**: Routes protocol messages between peers via channel + abstractions +- **Peer Discovery**: Maintains peer registry and discovery mechanisms + +#### SNARK State Machine (`snark/src/`) + +Handles proof verification: + +- **Block Verification**: Verifies block proofs +- **Work Verification**: Verifies SNARK work proofs from workers +- **User Command Verification**: Verifies transaction proofs and zkApp proofs + +### 3. Node Subsystems + +#### Block Producer (`node/src/block_producer/`) + +Manages block production for validator nodes: + +- **VRF Evaluator**: Determines slot leadership eligibility +- **Block Construction**: Assembles blocks when node wins slots +- **Transaction Selection**: Chooses transactions from mempool + +#### Transition Frontier (`node/src/transition_frontier/`) + +Core consensus and blockchain state management: + +- **Block Processing**: Validates and accepts new blocks +- **Chain Selection**: Handles reorganizations and best tip selection +- **Synchronization**: Downloads missing blocks and ledger data +- **Genesis**: Initializes blockchain from genesis configuration + +_Note: This subsystem still uses the old-style state machine pattern and is +scheduled for migration._ + +#### Transaction Pool (`node/src/transaction_pool/`) + +Maintains mempool of pending transactions: + +- **Validation**: Pre-validates transactions before inclusion +- **Prioritization**: Orders by fee for block inclusion +- **Eviction**: Removes invalid/expired transactions +- **Propagation**: Shares transactions with peers + +#### SNARK Pool (`node/src/snark_pool/`) + +Manages pool of SNARK work proofs: + +- **Work Collection**: Receives proofs from external workers +- **Validation**: Verifies work correctness +- **Pricing**: Manages work fee market +- **Distribution**: Provides work for block production + +#### Ledger (`node/src/ledger/`) + +Manages blockchain account 
state: + +- **Read Operations**: Concurrent account queries +- **Write Operations**: Atomic state updates from blocks +- **Merkle Proofs**: Generates cryptographic proofs + +#### RPC (`node/src/rpc/`) + +External API interface: + +- **Request Handling**: Processes GraphQL/REST queries +- **Response Formatting**: Serializes node data +- **Client Management**: Handles WebSocket subscriptions + +#### External SNARK Worker (`node/src/external_snark_worker/`) + +Manages external proof computation: + +- **Process Management**: Spawns/monitors worker processes +- **Work Distribution**: Assigns proof tasks +- **Result Collection**: Gathers completed proofs + +#### Watched Accounts (`node/src/watched_accounts/`) + +Account monitoring system: + +- **Registration**: Tracks specific accounts +- **Event Detection**: Monitors balance changes +- **Notifications**: Emits events for account updates + +## Key Interaction Flows + +### 1. Block Production Flow + +``` +Transaction Pool ──┐ + ├──> Block Producer ──> New Block ──> P2P Broadcast +SNARK Pool ────────┘ │ + ▼ + Transition Frontier +``` + +1. **Block Producer checks eligibility** → VRF evaluation determines slot + leadership +2. **Gather transactions** → Pulls from Transaction Pool based on fees +3. **Include SNARK work** → Selects required proofs from SNARK Pool +4. **Construct block** → Assembles block with transactions and proofs +5. **Broadcast block** → P2P system propagates to peers +6. **Update local state** → Transition Frontier processes own block + +### 2. Block Reception Flow + +``` +P2P Network ──> Block Reception ──> SNARK Verification ──┐ + ├──> Transition Frontier + │ │ + │ ▼ + └──> Ledger Updates +``` + +1. **P2P receives block** → Channels process incoming data +2. **Transition frontier notified** → `BlockReceived` action dispatched +3. **Reducer updates state** → Stores block, dispatches verification +4. **SNARK verification initiated** → Effectful action calls service +5. 
**Service processes proof** → Async verification runs +6. **Event returned** → Success/failure event dispatched +7. **Callback triggered** → Original callback action executed +8. **Block applied if valid** → State machine updates blockchain + +### 3. Transaction Flow + +``` +RPC/P2P ──> Transaction Pool ──> Validation ──┐ + ├──> P2P Propagation + └──> Block Inclusion +``` + +1. **P2P broadcasts** → Via Gossipsub protocol +2. **Pool receives transaction** → Stateful action triggered +3. **Validation requested** → Effectful action to ledger service +4. **Service validates** → Checks against current state +5. **Result event** → Valid/invalid status returned +6. **Pool updated** → Transaction added if valid +7. **Propagate to peers** → Valid transactions shared via P2P + +### 4. SNARK Work Flow + +``` +External Worker ──> SNARK Pool ──> Validation ──┐ + ├──> P2P Distribution + └──> Block Production +``` + +1. **Work received via P2P** → Gossip message processed +2. **Candidate created** → State machine tracks verification +3. **Batch verification** → Multiple proofs queued together +4. **Service verifies** → Async proof checking +5. **Results returned** → Events for each verification +6. **Pool updated** → Valid work added to pool +7. 
**Distribution** → Share with peers and make available for blocks + +### P2P Subsystem Architecture + +The P2P system contains multiple specialized state machines: + +#### Channel State Machines (`p2p/src/channels/`) + +Protocol-specific communication handlers: + +- **Best Tip**: Propagates chain head updates +- **Transaction**: Gossips pending transactions +- **SNARK**: Distributes SNARK work +- **RPC**: Handles peer-to-peer queries +- **Streaming RPC**: Manages long-lived data streams +- **Signaling**: WebRTC connection establishment + +#### Network State Machines (`p2p/src/network/`) + +Low-level protocol implementations: + +- **Kademlia**: DHT for peer discovery + - Bootstrap: Initial network joining + - Request: Query processing + - Stream: Protocol communication +- **PubSub**: Gossip protocol for broadcasts +- **Identify**: Peer information exchange +- **Yamux**: Stream multiplexing +- **Noise**: Encryption protocol +- **Select**: Protocol negotiation + +#### Connection Management (`p2p/src/connection/`) + +- **Incoming**: Handles inbound connections +- **Outgoing**: Initiates outbound connections +- **Disconnection**: Graceful disconnect handling + +### 5. Peer Connection Flow + +Establishing peer connections differs by network type: + +**libp2p-based connections (native and OCaml nodes):** + +1. **Connection initiated** → Outgoing or incoming TCP connection +2. **Security established** → Noise handshake effectful actions +3. **Multiplexing setup** → Yamux stream creation +4. **Identify exchanged** → Peer capabilities shared +5. **Ready state reached** → Peer available for messaging + +**WebRTC-based connections (webnodes):** + +1. **Signaling initiated** → WebRTC connection establishment +2. **Direct connection** → Peer-to-peer WebRTC channel +3. **Channel ready** → Direct communication available +4. 
**Ready state reached** → Peer available for messaging + +## State Access Control: Substate System + +OpenMina uses a substate system (`core/src/substate.rs`) to decouple reducers +from the global state representation. This abstraction ensures components don't +depend on the exact structure of the global state: + +### How It Works + +Reducers receive `Substate` values that provide access to specific state +portions: + +- **Decoupling**: Components work with their own state types without knowing the + global state structure +- **Modularity**: State machine components can be moved to their own crates in + the future if necessary +- **Type safety**: Access boundaries are enforced at compile time +- **Flexibility**: Global state structure can change without updating all + reducers + +### Phase Separation Enforcement + +The `Substate` system enforces the two-phase reducer pattern through its API +design: + +1. **Phase 1 - State Updates**: `state_context.get_substate_mut()` provides + mutable access to state +2. **Phase 2 - Action Dispatching**: `state_context.into_dispatcher()` consumes + the context and returns a dispatcher + - Use `into_dispatcher()` for simple action dispatching + - Use `into_dispatcher_and_state()` when you need read-only access to global + state for coordination + +This design makes it impossible to mix state updates and action dispatching: + +- Once you call `into_dispatcher()`, you can no longer access mutable state +- The dispatcher can only dispatch actions, not modify state +- The type system enforces this separation at compile time + +```rust +// Phase 1: State updates only +let Ok(state) = state_context.get_substate_mut() else { return }; +state.field = new_value; // ✓ Allowed + +// Phase 2: Action dispatching only +let dispatcher = state_context.into_dispatcher(); +// Or for global state access: +// let (dispatcher, global_state) = state_context.into_dispatcher_and_state(); +dispatcher.push(SomeAction { ... 
}); // ✓ Allowed +// state.field = other_value; // ✗ Compiler error - state no longer accessible +``` + +### Implementation + +Substate accesses are defined in `node/src/state.rs` using the +`impl_substate_access!` macro. For example: + +- `impl_substate_access!(State, SnarkState, snark)` - Access to SNARK subsystem + state +- `impl_substate_access!(State, TransitionFrontierState, transition_frontier)` - + Access to blockchain state +- Custom implementations for conditional access (e.g., P2P state only available + when initialized) + +This pattern is fundamental to maintaining modularity in the Redux-style +architecture and enables future refactoring without breaking existing +components. + +## Migration Note + +The Transition Frontier subsystem (`node/src/transition_frontier/`) currently +uses an older state machine pattern and is scheduled for migration to the new +architecture style. + +For migration instructions, see [ARCHITECTURE.md](../../ARCHITECTURE.md). diff --git a/docs/handover/state-machine-technical-debt.md b/docs/handover/state-machine-technical-debt.md new file mode 100644 index 000000000..dfb9b0305 --- /dev/null +++ b/docs/handover/state-machine-technical-debt.md @@ -0,0 +1,385 @@ +# State Machine Technical Debt + +This document covers architectural issues with the state machine implementation +across OpenMina. It focuses on patterns and consistency problems that affect the +overall design, rather than service-layer issues (covered in +`services-technical-debt.md`) or component-specific problems (covered in +individual `summary.md` files). + +## Architecture Migration Issues + +### Critical: Incomplete New-Style Migration + +Several components still use the old-style state machine pattern with separate +reducer and effects files, creating inconsistency and maintenance burden. 
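To make the contrast concrete, here is a small, self-contained model (all names are hypothetical stand-ins, not the real OpenMina types) of the new-style split: the unified reducer makes every decision and queues thin effectful actions, while the effect layer only forwards them to the service without reading global state.

```rust
// Hypothetical model of the new-style pattern: decisions in the reducer,
// thin effects that only call the service.
#[derive(Debug, Clone, PartialEq)]
enum Effect {
    VerifyBlock { hash: String },
}

#[derive(Default)]
struct ComponentState {
    pending: Vec<String>,
    queued_effects: Vec<Effect>,
}

impl ComponentState {
    // Unified reducer: updates state, then queues the effectful action.
    fn on_block_received(&mut self, hash: String) {
        self.pending.push(hash.clone());
        self.queued_effects.push(Effect::VerifyBlock { hash });
    }
}

trait VerifierService {
    fn verify(&mut self, hash: &str);
}

// The effect layer stays thin: it forwards queued effects to the service
// and never touches state or makes business decisions.
fn run_effects(state: &mut ComponentState, service: &mut impl VerifierService) {
    for effect in state.queued_effects.drain(..) {
        match effect {
            Effect::VerifyBlock { hash } => service.verify(&hash),
        }
    }
}

fn main() {
    struct MockService(Vec<String>);
    impl VerifierService for MockService {
        fn verify(&mut self, hash: &str) {
            self.0.push(hash.to_string());
        }
    }
    let mut state = ComponentState::default();
    let mut service = MockService(Vec::new());
    state.on_block_received("abc".into());
    run_effects(&mut state, &mut service);
    assert_eq!(service.0, vec!["abc".to_string()]);
    assert!(state.queued_effects.is_empty());
}
```

The old style inverts this: effects read global state (via calls like `state.get()` / `store.state()`) and contain decision logic themselves, which is exactly the coupling the migration is meant to remove.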
+ +#### Transition Frontier (Medium Priority) + +- **Issue**: Implemented using old-style state machine pattern instead of + new-style architecture +- **Impact**: Doesn't follow unified reducer pattern, effects directly access + state via `state.get()` and `store.state()` +- **Solution**: Migrate to new-style unified reducers with thin effectful + actions +- **Complexity**: High - large component with extensive sync logic + +#### Transaction Pool (High Priority) + +- **Issue**: Uses non-standard patterns that violate core architectural + principles +- **Specific Problems**: + - **Pending Actions Pattern**: Stores actions in `pending_actions` and + retrieves later + - **Blocking Service Calls**: Synchronous `get_accounts()` calls that block + state machine + - **Direct Global State Access**: Uses `unsafe_get_state()` bypassing proper + state management +- **Impact**: Violates Redux principles, creates blocking behavior, breaks state + encapsulation +- **Solution**: Complete refactoring to standard patterns (see + `transaction_pool_refactoring.md`) +- **Complexity**: High - requires significant architectural changes + +## State Machine Design Anti-patterns + +### Complex Logic in Reducers and Enabling Conditions + +A common anti-pattern is placing complex business logic directly in reducers and +enabling conditions instead of extracting it to state methods. 
+ +#### Issue + +- **Reducers**: Complex state update logic embedded directly in reducer match + arms +- **Enabling Conditions**: Heavy business logic in condition checks instead of + simple boolean evaluations +- **Impact**: Reducers become hard to read, test, and maintain; enabling + conditions become complex and difficult to understand + +#### Solution + +- **Extract State Methods**: Move complex logic to helper methods on the state + struct +- **Thin Reducers**: Keep reducers focused on orchestrating state changes, not + implementing them +- **Lightweight Enabling Conditions**: Use simple boolean checks that delegate + to state methods when needed +- **Benefits**: Improved readability, testability, and maintainability; clearer + separation of concerns + +#### Pattern + +```rust +// Bad: Complex logic in reducer +ComponentAction::ComplexUpdate { data } => { + // 50+ lines of complex state update logic + state.field1 = complex_calculation(data); + state.field2.update_with_validation(data); + // ... many more lines +} + +// Good: Logic extracted to state method +ComponentAction::ComplexUpdate { data } => { + state.handle_complex_update(data); +} + +impl ComponentState { + fn handle_complex_update(&mut self, data: Data) { + // Complex logic lives here, easily testable + self.field1 = self.calculate_field1(data); + self.field2.update_with_validation(data); + } +} +``` + +### Monolithic Reducers + +Large, complex reducers that handle multiple concerns and should be decomposed +using state methods. 
+ +#### PubSub (963 lines) + +- **Issue**: Single file handling caching, peer management, and protocol logic +- **Impact**: Difficult to maintain, mixed responsibilities, O(n) performance + issues +- **Solution**: Move message handling logic to state methods, extract separate + managers +- **Reference**: `p2p/src/network/pubsub/summary.md` for detailed analysis + +#### Yamux (387 lines) + +- **Issue**: Deep nesting (4-5 levels), complex buffer management mixing + performance and correctness +- **Impact**: Hard to reason about, complex protocol-required flag combinations +- **Solution**: Extract state methods, improve documentation of flag + combinations +- **Ongoing Work**: PR #1085 (`tweaks/yamux` branch) contains significant + refactoring addressing these issues: + - Action splitting: Broke down incoming frame handling into multiple focused + actions + - State method extraction: Moved state update logic from reducer to state + methods + - Reducer simplification: Reduced complexity and improved readability + - Comprehensive testing: Added 574 lines of tests for better coverage +- **Reference**: `p2p/src/network/yamux/summary.md` for detailed refactoring + plan + +#### Scheduler (650 lines) + +- **Issue**: Handles connection management, protocol selection, and error + handling in single file +- **Impact**: Mixed responsibilities, difficult to maintain +- **Solution**: Break down into focused handlers, extract state methods +- **Note**: Component naming also needs addressing ("scheduler" manages + connections, not scheduling) + +## Enabling Conditions Issues + +### Missing Implementations + +- **Issue**: Some components lack proper enabling conditions, allowing invalid + state transitions +- **Impact**: State machine can enter invalid states, debugging becomes + difficult +- **Solution**: Implement comprehensive enabling conditions for all actions +- **Priority**: Medium - improves state machine correctness + +### Misplaced Logic + +- **Issue**: Complex business 
logic in enabling conditions instead of state + methods +- **Impact**: Enabling conditions become hard to understand and maintain +- **Solution**: Move complex logic to state methods, keep enabling conditions + simple +- **Pattern**: Enabling conditions should be lightweight boolean checks + +## Service Integration Issues + +### Blocking Operations + +#### Transaction Pool Ledger Calls + +- **Issue**: Synchronous `get_accounts()` calls block the state machine thread +- **Impact**: Violates async architecture, can freeze state machine progression +- **Solution**: Convert to async pattern with proper state management +- **Priority**: Critical - blocking operations are architectural violations + +#### Missing Async Patterns + +- **Issue**: Operations that should be async are implemented synchronously +- **Impact**: State machine becomes unresponsive during heavy operations +- **Solution**: Audit all service calls, ensure proper async patterns + +## Communication and Error Handling + +### Centralized Event Handling + +- **Issue**: Event source centralizes all event handling instead of distributing + to relevant effectful state machines +- **Impact**: Creates unnecessary coupling between unrelated components +- **Solution**: Forward domain-specific events to respective effectful state + machines +- **Reference**: `node/src/event_source/summary.md` for detailed plan + +### Missing Error Actions + +#### Block Producer Error Paths + +- **Issue**: Only `BlockProveSuccess` exists, no `BlockProveError` action +- **Impact**: Error paths use `todo!()` panics instead of proper error handling +- **Solution**: Implement error actions and proper error propagation +- **Priority**: Critical - affects system stability + +### Inconsistent Callback Usage + +- **Issue**: Some components don't use callbacks for decoupled communication +- **Impact**: Creates tight coupling, makes components hard to test +- **Solution**: Standardize callback usage across all components + +### Panic-based 
Error Handling + +- **Issue**: `todo!()` macros in production code paths (block proof failures, + genesis load failures) +- **Impact**: System crashes instead of graceful error handling +- **Solution**: Implement proper error actions and integrate with error sink + service (partially implemented in PR #1097) +- **Priority**: Critical - affects system stability + +## Action System Technical Debt + +### Action Type Generation System + +**Current Implementation**: `node/src/action_kind.rs` is autogenerated by +`node/build.rs`: + +- Scans all files ending in `_actions.rs` or `action.rs` +- Extracts all action types (structs/enums ending with `Action`) +- Generates a unified `ActionKind` enum consolidating all action types +- Implements `ActionKindGet` trait for all actions + +**Benefits**: + +- Eliminates macro overhead compared to using the `enum-kinds` crate +- Helps avoid recompiling all action-related code when a single action changes + +**Technical Debt**: + +- Build-time code generation adds complexity to the build process +- Creates dependency on build script for core type system functionality +- Temporary solution that requires manual maintenance of naming conventions + +**Future Solutions**: + +- Migrate to multiple disjoint action types resolved at runtime +- Explore trait-based approaches that don't require code generation +- See https://github.com/openmina/state_machine_exp for experimental approaches +- Consider compile-time solutions that don't require build scripts + +**Priority**: Medium - works but creates maintenance burden and architectural +complexity + +## P2P Layer Technical Debt + +### Security Hardening Opportunities + +- **Noise Session Key Cleanup**: Ephemeral session keys not zeroized + (defense-in-depth improvement) + - Reference: `p2p/src/network/noise/summary.md` + +### Major Performance Issues + +- **PubSub**: O(n) message lookups, 963-line monolithic reducer + - Reference: `p2p/src/network/pubsub/summary.md` + +### Architectural Issues + 
+- **Kad Internals**: 912-line file mixing multiple concerns + - Reference: `p2p/src/network/kad/summary.md` +- **Select Protocol Registry**: Hardcoded protocols limiting extensibility + - Reference: `p2p/src/network/select/summary.md` + +## Implementation Quality Issues + +### Extensive TODOs + +- **Issue**: Widespread TODO comments indicating incomplete functionality +- **Examples**: + - VRF Evaluator: `todo!()` for `EpochContext::Waiting` state + - User Command Verify: Missing error callback dispatch + - Various components: Error handling improvements needed +- **Impact**: Indicates incomplete implementations and deferred decisions +- **Solution**: Systematic TODO resolution with proper prioritization + +### Safety and Linting Improvements + +#### Clippy Lints for Array Access and Arithmetic Safety + +- **Issue**: Currently using `#[allow(clippy::arithmetic_side_effects)]` and + `#[allow(clippy::indexing_slicing)]` in workspace configuration +- **Impact**: Allows potentially unsafe arithmetic operations and unchecked + array access that could panic +- **Current State**: PR #1115 enables these lints as warnings but issues need to + be fixed project-wide +- **Solution**: Fix all clippy warnings for these lints and enable them as + errors +- **Priority**: High - affects runtime safety and reliability +- **Benefits**: + - Prevents integer overflow/underflow in production + - Eliminates potential panic points from array bounds violations + - Improves overall code robustness + +### Testing Constraints + +- **Issue**: Architecture compromised by testing limitations +- **Example**: Ledger mask leak warnings that should be bug conditions but can't + be due to testing +- **Impact**: Production code quality affected by testing constraints +- **Solution**: Improve testing infrastructure to remove architectural + compromises + +### Testing Framework Time Control + +- **Issue**: Random time advancement in simulations causes unintentional + timeouts (Issue #1140) +- **Current 
Problem**: + - Time is advanced randomly during cluster simulations + - Can cause unwanted RPC timeouts when time advances between request/response + - Requires careful tuning of time ranges, making tests less deterministic +- **Impact**: Tests are slower than necessary and less reliable +- **Proposed Solution**: Event-based time advancement - pause execution until + all async events are ready, then decide whether to deliver, drop, or delay + them +- **Priority**: Medium - not required for mainnet but valuable for test + reliability and speed + +### Hard-coded Values + +- **Issue**: Configuration values embedded in code instead of being configurable +- **Examples**: + - PubSub: 5s, 300s timeouts, magic numbers (3, 10, 50, 100) + - Signaling Discovery: 60-second rate limiting interval + - RPC Channel: 5 concurrent requests limit +- **Impact**: Reduces flexibility, makes system harder to tune +- **Solution**: Extract configuration to proper configuration management system + +## Recommendations and Priorities + +### Phase 1: Critical Architecture Issues + +1. **Fix Blocking Operations**: Convert Transaction Pool to async patterns +2. **Implement Missing Error Actions**: Add error paths for Block Producer and + other components +3. **Remove Panic-based Error Handling**: Replace `todo!()` with proper error + handling and complete error sink service integration (building on PR #1097) + +### Phase 2: High-Priority Refactoring + +1. **Break Down Monolithic Reducers**: Move logic to state methods for PubSub, + Scheduler; complete Yamux refactoring (building on PR #1085) +2. **Standardize Communication Patterns**: Consistent callback usage across + components +3. **Distribute Event Handling**: Move domain-specific events to respective + state machines + +### Phase 3: Medium-Priority Improvements + +1. **Complete Architecture Migrations**: Migrate Transition Frontier to + new-style patterns +2. 
**Implement Missing Enabling Conditions**: Ensure all actions have proper + validation +3. **Security Hardening**: Implement secure key zeroization for P2P Noise + session keys +4. **Improve Protocol Documentation**: Better documentation of complex protocol + implementations like yamux +5. **Improve Service Boundaries**: Remove business logic from transport layers +6. **Resolve Extensive TODOs**: Systematic completion of deferred + implementations +7. **Standardize Error Handling**: Consistent error propagation patterns + +### Phase 4: Long-term Quality Improvements + +1. **Extract Configuration**: Make hard-coded values configurable +2. **Improve Testing Infrastructure**: Remove architectural compromises +3. **Documentation**: Ensure all patterns are documented and consistent +4. **Performance Optimization**: Address O(n) lookups and other performance + issues + +## Cross-references + +- **Service-layer technical debt**: See `services-technical-debt.md` +- **Component-specific issues**: See `summary.md` files in respective component + directories +- **P2P component technical debt**: See individual summaries in + `p2p/src/network/*/summary.md` +- **Architecture guidelines**: See `state-machine-structure.md` and + `state-machine-development-guide.md` +- **Specific refactoring plans**: See `*_refactoring.md` files in component + directories + +## Conclusion + +The state machine architecture is solid but needs work to achieve consistency +and maintainability. The biggest problems are blocking operations, missing error +handling, and panic-based error handling, which hurt system stability. +Completing the architectural migrations and reducer refactoring will make the +codebase easier to work with. 
diff --git a/docs/handover/testing-infrastructure.md b/docs/handover/testing-infrastructure.md new file mode 100644 index 000000000..e554af601 --- /dev/null +++ b/docs/handover/testing-infrastructure.md @@ -0,0 +1,526 @@ +# OpenMina Testing Infrastructure Handover Document + +## Overview + +The OpenMina testing infrastructure provides scenario-based testing for +multi-node blockchain scenarios. Tests are structured as sequences of steps that +can be recorded, saved, and replayed deterministically. + +## Architecture + +### Core Design Principles + +1. **Scenario-Based Testing**: Tests are structured as scenarios - sequences of + steps that can be recorded, saved, and replayed deterministically +2. **State Machine Architecture**: Follows the Redux-style pattern used + throughout OpenMina +3. **Multi-Implementation Support**: Tests both Rust (OpenMina) and OCaml + (original Mina) nodes +4. **Deterministic Replay**: All tests can be replayed exactly using recorded + scenarios + +### Key Components + +#### 1. Test Library (`node/testing/src/lib.rs`) + +- Provides test runtime setup and synchronization +- Manages global test gates to ensure sequential execution +- Initializes tracing and thread pools + +#### 2. Test Runner (`node/testing/src/main.rs`) + +The testing binary provides three commands, though the `server` command +currently has a clap configuration bug: + +```bash +# Generate new test scenarios (requires scenario-generators feature) +cargo run --bin openmina-node-testing --features=scenario-generators -- scenarios-generate + +# Replay recorded scenarios +cargo run --bin openmina-node-testing -- scenarios-run --name "ScenarioName" + +# Server command exists but has a clap argument conflict bug +``` + +Note: Most testing is done via standard `cargo test` commands rather than the +binary. + +#### 3. 
Scenario Framework (`node/testing/src/scenarios/mod.rs`)

Contains extensive predefined test scenarios organized into categories:

- **Solo Node Tests**: Single node sync and bootstrap tests
- **Multi-Node Tests**: Network connectivity and consensus tests
- **P2P Tests**: Connection handling and peer discovery
- **Simulation Tests**: Long-running network simulations
- **Record/Replay Tests**: Replaying recorded scenarios

## Scenario System

### Scenario Structure

Each scenario consists of:

- **ScenarioInfo**: Metadata (ID, description, parent scenario)
- **ScenarioSteps**: Ordered list of actions

### Common Step Types

```text
AddNode { config }                    // Add a new node to the cluster
ConnectNodes { dialer, listener }     // Connect two nodes
AdvanceTime { by_nanos }              // Advance monotonic time
Event { node_id, event }              // Dispatch specific events
CheckTimeouts { node_id }             // Process timeout events
AdvanceNodeTime { node_id, by_nanos } // Advance time for a specific node
```

### Scenario Inheritance

Scenarios can have parent scenarios, allowing test composition:

```
ParentScenario (sets up base state)
    └── ChildScenario (builds on parent state)
```

## Testing Infrastructure

### NodeTestingService (`node/testing/src/service/mod.rs`)

Wraps the real `NodeService` with testing capabilities:

- **Event Management**: Tracks events with IDs for replay
- **Time Control**: Allows precise time advancement
- **Proof Mocking**: Supports dummy proofs for faster tests (`ProofKind::Dummy`)
- **Service Mocking**: Mock block production, SNARK verification, etc.
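The event management described above is what makes deterministic replay possible: every event is tagged with a monotonically increasing ID when it is recorded, and replay consumes events strictly in that order. The following is a minimal illustrative model only — `EventLog` and `RecordedEvent` are hypothetical names, not the real `NodeTestingService` types:

```rust
// Hypothetical, simplified model of ID-tagged event recording.
// The real service records richer event types and node IDs.
#[derive(Clone, Debug, PartialEq)]
struct RecordedEvent {
    id: u64,
    payload: String,
}

#[derive(Default)]
struct EventLog {
    next_id: u64,
    events: Vec<RecordedEvent>,
}

impl EventLog {
    /// Assign the next ID and append the event; returns the assigned ID.
    fn record(&mut self, payload: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.events.push(RecordedEvent {
            id,
            payload: payload.to_string(),
        });
        id
    }

    /// Replay yields events in exactly the order they were recorded.
    fn replay(&self) -> impl Iterator<Item = &RecordedEvent> {
        self.events.iter()
    }
}
```

Because replay order is fixed by the recorded IDs rather than by wall-clock arrival, two runs over the same log observe the same sequence of inputs.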
+ +### Cluster Management (`node/testing/src/cluster/mod.rs`) + +Manages multiple nodes in a test environment: + +```rust +let mut cluster = Cluster::new(cluster_config); +let mut runner = ClusterRunner::new(&mut cluster, |_step| {}); +let rust_node_id = runner.add_rust_node(config); +let ocaml_node_id = runner.add_ocaml_node(config); +``` + +Features: + +- Automatic port allocation +- Account key management +- Network topology control +- Debugger integration + +### OCaml Node Limitations + +When including OCaml nodes in test scenarios, there are several important +limitations compared to Rust nodes: + +**Time Control:** + +- OCaml nodes use real wall-clock time and cannot be controlled via + `AdvanceTime` or `AdvanceNodeTime` steps +- Only Rust nodes support deterministic time advancement +- This can cause timing-dependent test failures when OCaml and Rust nodes get + out of sync + +**Visibility and Debugging:** + +- OCaml nodes are "black boxes" - we cannot inspect their internal state like we + can with Rust nodes +- No access to OCaml node's internal execution, state changes, or data + structures +- Limited logging and debugging capabilities compared to Rust nodes +- Cannot use invariant checking on OCaml node state + +**Network Control:** + +- Cannot manually disconnect OCaml peers using test framework commands +- Network topology changes must be done externally or through OCaml node's own + mechanisms +- Limited control over OCaml node's P2P behavior and connection management + +**Behavioral Control:** + +- No control over OCaml node's internal execution flow or decision-making +- Cannot trigger specific OCaml node behaviors on demand +- Cannot guarantee that expected operations will be executed at all +- This limits the determinism of tests involving OCaml nodes + +**Testing Implications:** + +- Tests with OCaml nodes are inherently less deterministic +- Focus should be on testing interoperability rather than detailed protocol + behavior +- Use OCaml nodes 
primarily for cross-implementation validation +- Consider using Rust-only scenarios when precise control is needed + +### Potential OCaml Node Testing Improvements + +To improve OCaml node testing capabilities, the following changes could be made +to the OCaml implementation: + +**Deterministic Time Control:** + +- Add support for controllable time advancement instead of wall-clock time +- Implement time mocking or virtual time system that can be controlled by the + test framework +- This would enable synchronization between OCaml and Rust nodes in tests + +**Testing API:** + +- Expose internal state inspection endpoints for testing purposes +- Add hooks or callbacks for test frameworks to monitor internal execution +- Implement test-specific logging and debugging interfaces + +**Network Control:** + +- Add testing APIs to manually control P2P connections +- Implement test hooks for network topology manipulation +- Provide mechanisms to trigger specific network behaviors on demand + +**Behavioral Control:** + +- Add test-specific triggers for protocol operations +- Implement deterministic execution modes for testing +- Provide APIs to control or observe internal decision-making processes + +**Implementation Notes:** + +- These improvements would require coordination with the OCaml Mina team +- Changes should be designed to not affect production behavior +- Testing improvements could be implemented as optional testing-only features + +## Invariant Checking System + +### What are Invariants? + +Invariants are properties that must always hold true during execution. The +testing framework continuously checks these invariants to catch bugs early. 
+ +### Invariant Interface (`node/invariants/src/lib.rs`) + +```rust +pub trait Invariant { + type InternalState: 'static + Send + Default; + + fn is_global(&self) -> bool { false } + fn triggers(&self) -> &[ActionKind]; + + fn check( + self, + internal_state: &mut Self::InternalState, + store: &Store, + action: &ActionWithMeta, + ) -> InvariantResult; +} +``` + +### Built-in Invariants + +1. **NoRecursion**: Prevents recursive action dispatching +2. **P2pStatesAreConsistent**: Ensures P2P state consistency across nodes +3. **TransitionFrontierOnlySyncsToBetterBlocks**: Validates blockchain + synchronization logic + +### Creating Custom Invariants + +```rust +impl Invariant for MyInvariant { + type InternalState = (); // Or custom state type + + fn triggers(&self) -> &[ActionKind] { + // Return actions that should trigger this check + &[ActionKind::SomeAction] + } + + fn check( + self, + _internal_state: &mut Self::InternalState, + store: &Store, + action: &ActionWithMeta, + ) -> InvariantResult { + // Check your invariant condition using store.state() + if condition_violated { + InvariantResult::Violated("Description of violation") + } else { + InvariantResult::Ok + } + } +} +``` + +## Test Patterns and Examples + +### 1. Single Node Bootstrap Test + +**File**: +[`solo_node/bootstrap.rs`](../../../node/testing/src/scenarios/solo_node/bootstrap.rs) + +**Testing Pattern**: Validates that a single Rust node can bootstrap against +real blockchain data from a replayer. + +**Key Techniques**: + +- **Replayer Integration**: Uses a host replayer with actual blockchain data + rather than synthetic test data +- **Multi-phase Validation**: Separately checks staking ledger sync, next epoch + ledger sync, and transition frontier sync +- **Time Coordination**: Carefully manages timestamp alignment with recorded + blockchain data to avoid validation failures + +### 2. 
Multi-Node Network Test + +**File**: +[`multi_node/sync_4_block_producers.rs`](../../../node/testing/src/scenarios/multi_node/sync_4_block_producers.rs) + +**Testing Pattern**: Tests consensus participation and synchronization across +multiple block-producing nodes. + +**Key Techniques**: + +- **Block Producer Configuration**: Creates nodes with actual block producer + keys and configs +- **Topology Control**: Explicitly connects nodes in controlled patterns rather + than full mesh +- **Consensus Validation**: Verifies that all nodes reach the same blockchain + state through consensus participation + +### 3. Cross-Implementation Test + +**File**: +[`multi_node/connection_discovery.rs`](../../../node/testing/src/scenarios/multi_node/connection_discovery.rs) + +**Testing Pattern**: Validates interoperability between Rust (OpenMina) and +OCaml (original Mina) implementations. + +**Key Techniques**: + +- **Implementation Bridging**: Tests communication between different protocol + implementations +- **Peer Discovery**: Validates Kademlia-based peer discovery across + implementation boundaries +- **Bidirectional Validation**: Ensures both implementations can discover and + communicate with each other + +## Advanced Testing Examples + +### P2P Connection Race Condition Testing + +**File**: +[`p2p/basic_connection_handling.rs`](../../../node/testing/src/scenarios/p2p/basic_connection_handling.rs) - +`SimultaneousConnections` + +**Testing Pattern**: Race conditions in P2P connection establishment where both +nodes initiate connections simultaneously. 
+ +**Key Techniques**: + +- Tests **race conditions** in distributed systems +- Uses **proper async testing** with timeout handling +- Validates **connection deduplication** - system handles simultaneous + connections gracefully +- Employs **steady state verification** - waits for system to settle before + assertions + +### Deterministic Replay Testing + +**File**: +[`record_replay/block_production.rs`](../../../node/testing/src/scenarios/record_replay/block_production.rs) - +`RecordReplayBlockProduction` + +**Testing Pattern**: Deterministic execution validation for blockchain consensus +logic. + +**Key Techniques**: + +- **Determinism validation** - critical for blockchain consensus +- **Record-replay pattern** - captures and reproduces exact execution sequences +- **State comparison** - verifies identical outcomes across runs +- **Non-determinism detection** - catches sources of randomness that could break + consensus + +### VRF Epoch Boundary Testing + +**File**: +[`multi_node/vrf_epoch_bounds_evaluation.rs`](../../../node/testing/src/scenarios/multi_node/vrf_epoch_bounds_evaluation.rs) - +`MultiNodeVrfEpochBoundsEvaluation` + +**Testing Pattern**: VRF (Verifiable Random Function) evaluation across epoch +boundaries in blockchain consensus. + +**Key Techniques**: + +- **Time-dependent testing** - validates VRF evaluation across epoch boundaries +- **Block production control** - uses `produce_blocks_until` with custom + predicates +- **State inspection** - accesses actual `vrf_evaluator().latest_evaluated_slot` + state +- **Epoch transition validation** - tests critical blockchain timing at slot + boundaries + +### Large-Scale Network Testing + +**File**: +[`multi_node/pubsub_advanced.rs`](../../../node/testing/src/scenarios/multi_node/pubsub_advanced.rs) - +`MultiNodePubsubPropagateBlock` + +**Testing Pattern**: Block propagation through gossip networks at scale. 
+ +**Key Techniques**: + +- **Scale testing** - validates behavior with 10+ nodes using Simulator +- **Action monitoring** - tracks P2P message propagation in real-time +- **Graph visualization** - generates DOT format network graphs for debugging +- **Deterministic recording** - captures all state transitions for replay +- **Blockchain simulation** - tests actual block production and gossip + propagation + +## Running Tests + +### Scenario Generation and Replay + +```bash +# Generate specific scenarios (requires scenario-generators feature) +cargo run --release --features scenario-generators --bin openmina-node-testing -- scenarios-generate --name record_replay_block_production + +# Generate WebRTC scenarios (requires additional p2p-webrtc feature) +cargo run --release --features scenario-generators,p2p-webrtc --bin openmina-node-testing -- scenarios-generate --name webrtc_p2p_signaling + +# Generate multi-node scenarios +cargo run --release --features scenario-generators --bin openmina-node-testing -- scenarios-generate --name multi-node-pubsub-propagate-block + +# Replay existing scenarios +cargo run --release --bin openmina-node-testing -- scenarios-run --name "ScenarioName" +``` + +## Best Practices + +### 1. Writing New Tests + +- Start with existing scenario as template +- Use parent scenarios for common setup +- Record all non-deterministic inputs +- Add invariants for critical properties + +### 2. Debugging Failed Tests + +- Use `--nocapture` to see all logs +- Enable network debugger for visualization +- Replay scenarios with added logging +- Check invariant violations first + +### 3. Performance Considerations + +- Use `ProofKind::Dummy` for logic tests +- Minimize time advancement steps +- Batch similar operations +- Clean up resources properly + +## Common Issues and Solutions + +### 1. 
Non-Deterministic Failures + +- **Issue**: Test passes sometimes but fails others +- **Solution**: Ensure all randomness uses fixed seeds, check for timing + dependencies + +### 2. Port Conflicts + +- **Issue**: "Address already in use" errors +- **Solution**: Use unique port ranges per test, ensure proper cleanup + +### 3. Slow Test Execution + +- **Issue**: Tests take too long +- **Solution**: Use dummy proofs, reduce node count, optimize wait conditions + +### 4. Invariant Violations + +- **Issue**: Invariant check fails during test +- **Solution**: Check logs for violation details, add debugging, fix state + inconsistency + +## Advanced Features + +### Proof Configuration + +Control proof generation for faster testing: + +```rust +// From cluster configuration - use dummy proofs to speed up tests +let mut cluster_config = ClusterConfig::new(None)?; +cluster_config.set_proof_kind(ProofKind::Dummy); // Fastest - no proof verification +// cluster_config.set_proof_kind(ProofKind::ConstraintsChecked); // Medium - check constraints only +// cluster_config.set_proof_kind(ProofKind::Full); // Slowest - full proof generation/verification +``` + +### Custom Action Monitoring + +Track specific network events during tests: + +```rust +// From pubsub_advanced.rs - monitor gossip message propagation +let factory = || { + move |_id, state: &node::State, _service: &NodeTestingService, action: &ActionWithMeta| { + match action.action() { + Action::P2p(P2pAction::Network(P2pNetworkAction::Pubsub( + P2pNetworkPubsubAction::OutgoingMessage { peer_id }, + ))) => { + // Track block propagation for visualization + let pubsub_state = &state.p2p.ready().unwrap().network.scheduler.broadcast_state; + // Process gossip messages... 
                false
            }
            _ => false,
        }
    }
};
```

## Notes

- Tests are scenarios that can be recorded and replayed
- Invariants continuously validate system properties
- Multi-node and cross-implementation testing is well-supported
- Debugging tools help diagnose complex issues

For additional examples and patterns, refer to:

- Test scenarios: `node/testing/src/scenarios/`
- Testing documentation: [`docs/testing/`](../../testing/) - Contains detailed
  test descriptions, troubleshooting guides, and testing methodology
  documentation

## Appendix: Block Replayer Tool

**Current Status**: An unfinished prototype exists in the
`tools/block-replayer/` directory on the `feat/basic-block-replayer` branch but
was never integrated into the testing workflow.

**Purpose**: Sequential block replay from genesis to validate that the node's
transaction logic, ledger operations, and block processing are correct against
real blockchain data. It could also be used to test proof verification (blocks,
completed works) and, on devnet, to reproduce proofs for blocks produced by
accounts whose private keys are available, confirming that the prover is
capable of generating every proof those accounts actually produced.

**Value Proposition**:

- **Transaction Logic Validation**: Ensures transaction processing matches
  expected behavior from historical blocks
- **Ledger Operation Testing**: Validates ledger state transitions and account
  updates
- **Block Processing Verification**: Tests the complete block application
  pipeline
- **Real-World Coverage**: Uses actual blockchain data rather than synthetic
  test scenarios
- **Regression Testing**: Catches regressions in core blockchain logic

**Recommendation**: The new team should complete or rewrite this tool as a
standalone testing utility. This would provide validation of core blockchain
logic that is complementary to but separate from the scenario-based testing
framework.
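The sequential replay loop described above could look roughly like the following sketch. `Block`, `Ledger`, and the state comparison are hypothetical stand-ins for the real ledger types, and the actual prototype on the `feat/basic-block-replayer` branch may differ substantially:

```rust
// Illustrative sketch of sequential block replay with divergence
// detection. All types here are simplified stand-ins, not the real
// ledger implementation.
#[derive(Debug)]
struct Block {
    height: u64,
    // Expected ledger state after this block, from recorded chain data.
    expected_state: u64,
    // Simplified stand-in for the block's transactions.
    delta: u64,
}

#[derive(Default)]
struct Ledger {
    state: u64,
}

impl Ledger {
    fn apply(&mut self, block: &Block) {
        self.state += block.delta;
    }
}

/// Replay blocks from genesis, failing on the first divergence from
/// the recorded chain state.
fn replay(blocks: &[Block]) -> Result<u64, String> {
    let mut ledger = Ledger::default();
    for block in blocks {
        ledger.apply(block);
        if ledger.state != block.expected_state {
            return Err(format!("divergence at height {}", block.height));
        }
    }
    Ok(ledger.state)
}
```

Failing on the first divergence pins a regression to a specific block height, which is the main diagnostic value of replaying real chain data rather than synthetic scenarios.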
diff --git a/docs/handover/webnode.md b/docs/handover/webnode.md new file mode 100644 index 000000000..7295f2ab3 --- /dev/null +++ b/docs/handover/webnode.md @@ -0,0 +1,188 @@ +# OpenMina Webnode Implementation + +This document covers the WebAssembly (WASM) build target of OpenMina located in +`node/web/`. + +## Overview + +The webnode compiles the full OpenMina node to WebAssembly for browser +execution. It includes block production, transaction processing, SNARK +verification, and WebRTC-based P2P networking. + +### Design Goals + +- Run the full node stack in browsers without plugins +- Maintain compatibility with the main OpenMina implementation +- Support block production with browser-based proving +- Provide JavaScript API for web applications + +## Architecture + +### WASM Target + +Builds as both `cdylib` and `rlib` crate types. Code is conditionally compiled +with `#[cfg(target_family = "wasm")]` guards. + +#### Build Process + +```bash +cd node/web +cargo +nightly build --release --target wasm32-unknown-unknown +wasm-bindgen --keep-debug --web --out-dir ../../frontend/src/assets/webnode/pkg ../../target/wasm32-unknown-unknown/release/openmina_node_web.wasm +``` + +Requires nightly toolchain and generates bindings for +`frontend/src/assets/webnode/pkg/`. + +For complete setup instructions including circuit downloads and frontend +configuration, see [local-webnode.md](../local-webnode.md). + +### Threading + +Browser threading constraints require specific adaptations: + +#### Rayon Setup + +`init_rayon()` in `rayon.rs` configures the thread pool using +`num_cpus.max(2) - 1` threads. Must be called before SNARK verification. 
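The sizing formula can be reproduced with only the standard library. This is a sketch, not the actual `init_rayon()` implementation: `rayon_thread_count` is a hypothetical helper name, and `available_parallelism` stands in for whatever core count `num_cpus` reports in the browser:

```rust
use std::thread;

// Mirrors the webnode's rayon pool sizing: `num_cpus.max(2) - 1`
// guarantees at least one worker thread while leaving a core free
// for the main thread whenever more than one is available.
fn rayon_thread_count(num_cpus: usize) -> usize {
    num_cpus.max(2) - 1
}

fn detected_thread_count() -> usize {
    let num_cpus = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    rayon_thread_count(num_cpus)
}
```

On a single-core environment this still yields one worker thread, which is why the `max(2)` clamp precedes the subtraction.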
+ +#### Task Spawning + +- `P2pTaskSpawner`: Uses `wasm_bindgen_futures::spawn_local()` +- `P2pTaskRemoteSpawner`: Routes tasks to main thread via + `thread::start_task_in_main_thread()` because WebRTC APIs are main-thread only + +## Features + +Provides the same functionality as the native node: + +- Transaction validation and application +- Ledger state management +- SNARK verification using browser-compiled circuits +- Consensus participation +- RPC interface for web applications + +### Block Production + +Supports block production with: + +- Plain text or encrypted private keys (parsed in `parse_bp_key()`) +- Custom coinbase receivers +- Browser-based SNARK proving via `BlockProver::make()` + +### Networking + +#### WebRTC P2P Layer + +- **Transport**: WebRTC DataChannels for browser-to-browser communication +- **Protocol**: Pull-based networking (see [P2P README](../p2p/readme.md)) +- **Default Peer**: + `/2bjYBqn45MmtismsAYP9rZ6Xns9snCcNsN1eDgQZB5s6AzY2CR2/https/webrtc3.webnode.openmina.com/443` +- **Channels**: 8 distinct DataChannels for different protocol types (see + [P2P README](../p2p/readme.md#channels)) + +#### Network Configuration + +```rust +initial_peers: Vec +peer_discovery: !self.p2p_no_discovery +max_peers: Some(100) +``` + +## Implementation Details + +### Key Files + +#### `lib.rs` - Main Entry Point + +- **`main()`**: Automatic WASM initialization +- **`run()`**: Primary node startup function +- **`build_env()`**: Build information export +- **`parse_bp_key()`**: Block producer key parsing + +#### `node/builder.rs` - Node Construction + +- **`NodeBuilder`**: Node configuration +- **Configuration Methods**: P2P setup, block production, verification +- **Default Peers**: Single hardcoded WebRTC peer for bootstrap + +#### `node/mod.rs` - Type Definitions + +- **Type Aliases**: `Node = openmina_node_common::Node` +- **Task Spawners**: P2P-specific spawning implementations for browser + constraints + +#### `rayon.rs` - Threading Setup + +- 
**`init_rayon()`**: Required initialization for multi-threading +- **CPU Detection**: Automatic core count with minimum guarantees + +### JavaScript Interface + +#### Main Entry Point + +```javascript +const rpcSender = await run(blockProducerKey, seedNodesUrl, genesisConfigUrl); +``` + +- `blockProducerKey`: Optional string or `[encrypted, password]` array +- `seedNodesUrl`: Optional URL returning newline-separated peer addresses +- `genesisConfigUrl`: Optional URL returning binary genesis config (defaults to + `DEVNET_CONFIG`) + +#### Setup + +- `console_error_panic_hook` enables panic traces in browser console +- `keep_worker_alive_cursed_hack()` prevents worker termination (wasm-bindgen + issue #2945) + +### Performance + +- Parallel SNARK verification using `num_cpus.max(2) - 1` threads +- Circuit reuse for verification operations +- 100 peer connection limit configured in `P2pLimits` +- Statistics collection via `gather_stats()` when enabled + +## Dependencies + +**WASM-specific**: `wasm-bindgen`, `wasm-bindgen-futures`, `js-sys`, +`console_error_panic_hook`, `gloo-utils` + +**Core**: Standard OpenMina workspace crates plus `rayon` for threading + +## Known Issues + +- Worker lifecycle requires `keep_worker_alive_cursed_hack()` due to + wasm-bindgen issue #2945 +- WebRTC operations restricted to main thread +- Careful initialization ordering required + +## Technical Debt + +- TODO in `setup_node()` for seed nodes refactoring +- Single hardcoded default peer +- Commented HTTP client and peer loading code in `builder.rs` + +## Usage + +```javascript +// Basic startup +const rpc = await run(); + +// With block producer +const rpc = await run("private-key"); +const rpc = await run([encryptedKey, "password"]); + +// With custom configuration +const rpc = await run(key, peersUrl, genesisUrl); + +// RPC access +const peers = await rpc.state().peers(); +const stats = await rpc.stats().sync(); +``` + +## Future Work + +- Split prover to its own WASM heap + 
(https://github.com/openmina/openmina/issues/1128)
+- API for zkApp integration
diff --git a/ledger/src/proofs/summary.md b/ledger/src/proofs/summary.md
new file mode 100644
index 000000000..06b638f67
--- /dev/null
+++ b/ledger/src/proofs/summary.md
@@ -0,0 +1,69 @@
+# Proofs Module Summary
+
+The proofs module handles zero-knowledge proof generation and verification for
+the Mina protocol using the Kimchi proof system. The implementation has proven
+to work reliably on devnet but contains many TODOs and likely needs cleanup.
+
+## Quick Reference
+
+**Core Proof Types**
+
+- `transaction.rs` - Transaction verification with witness generation
+- `block.rs` - Blockchain state transitions with consensus validation
+- `merge.rs` - Combining multiple transaction proofs
+- `zkapp.rs` - Smart contract execution proofs with authorization types
+
+**Infrastructure**
+
+- `witness.rs` - Witness data management (primary/auxiliary)
+- `caching.rs` - Verifier index and SRS caching
+- `constants.rs` - Circuit sizes and domain configurations
+- `step.rs`/`wrap.rs` - Step/wrap proof pattern for recursion
+
+## Implementation
+
+**Kimchi Integration**
+
+- Type aliases in `mod.rs` directly use Kimchi types for verifier/prover indices
+  and proofs
+- Circuit constraints and field operations built on Kimchi foundations
+- Maintains compatibility with Mina protocol proof formats
+- Uses a fork of proof-systems based on an older version than the one currently
+  used by the OCaml implementation
+- Uses a fork of arkworks to considerably speed up field operations on WASM
+- See [#1106](https://github.com/o1-labs/openmina/issues/1106) for details on
+  these forks
+
+**Performance Features**
+
+- Caching system stores verifier indices and SRS data in `$HOME/.cache/openmina`
+- Circuit blobs for external circuit data fetching
+- Precomputed verification indices for devnet/mainnet in `data/` directory
+
+**Pickles Recursive Proof System**
+
+- The entire proofs module implements Pickles (recursive
proof composition + system) +- `step.rs`/`wrap.rs` provide the fundamental step/wrap recursion pattern +- `public_input/prepared_statement.rs` handles different recursion levels (N0, + N1, N2) + +**Witness vs Circuit Split** + +- Witness generation handled in `witness.rs` with comparison functionality for + OCaml compatibility testing +- Circuit logic implemented but lacks constraint declarations and + compilation/evaluation functionality +- Uses precomputed verification indices from `data/` directory + +For details on missing constraint functionality and circuit management, see +[`docs/handover/circuits.md`](../../../docs/handover/circuits.md). + +## Known Issues + +**Incomplete Functionality** + +- Joint combiner functionality has TODO items in `step.rs` +- Feature flag handling incomplete in some verification paths +- ZkApp call stack hash computation needs completion +- Some field type conversions marked as temporary hacks diff --git a/ledger/summary.md b/ledger/summary.md new file mode 100644 index 000000000..4e2ead7b8 --- /dev/null +++ b/ledger/summary.md @@ -0,0 +1,97 @@ +# Ledger Crate Summary + +The ledger crate is the most complex component in the codebase. For architecture +overview and design details, see +[docs/handover/ledger-crate.md](../docs/handover/ledger-crate.md). 
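The layered storage listed under Quick Reference below (`src/database/` as the in-memory base, `src/mask/` as Arc-shared layered views) can be sketched as a copy-on-write mask over a shared base. `Base` and `Mask` here are simplified illustrations, not the crate's actual types:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Minimal stand-in for the in-memory account store (`src/database/`).
#[derive(Default)]
pub struct Base {
    pub accounts: BTreeMap<u64, i64>, // account index -> balance (simplified)
}

/// Layered view over a shared base, mirroring the idea of Arc-based mask
/// sharing (`src/mask/`): reads fall through to the parent, writes stay
/// local to the overlay until the mask is committed.
pub struct Mask {
    parent: Arc<Base>,
    overlay: BTreeMap<u64, i64>,
}

impl Mask {
    pub fn new(parent: Arc<Base>) -> Self {
        Self { parent, overlay: BTreeMap::new() }
    }

    pub fn get(&self, idx: u64) -> Option<i64> {
        self.overlay
            .get(&idx)
            .copied()
            .or_else(|| self.parent.accounts.get(&idx).copied())
    }

    pub fn set(&mut self, idx: u64, balance: i64) {
        self.overlay.insert(idx, balance);
    }
}
```

Sharing the parent behind `Arc` keeps the base immutable while many masks diverge from it, which is the property the real mask layering relies on.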
+ +## Quick Reference + +**Core Ledger** + +- `src/base.rs` - BaseLedger trait (fundamental interface) +- `src/database/` - In-memory account storage +- `src/mask/` - Layered ledger views with Arc-based sharing +- `src/tree.rs` - Merkle tree operations + +**Transaction Processing** + +- `src/transaction_pool.rs` - Mempool with fee-based ordering +- `src/staged_ledger/` - Block validation and transaction application +- `src/scan_state/` - SNARK work coordination and parallel scan + +**Proof System** + +- `src/proofs/` - Transaction, block, and zkApp proof generation/verification +- `src/sparse_ledger/` - Minimal ledger representation for proofs +- `src/zkapps/` - zkApp transaction processing + +**Account Management** + +- `src/account/` - Account structures, balances, permissions + +## Status + +The ledger components have proven reliable on devnet despite technical debt +patterns. The implementation maintains the same battle-tested logic that runs +the Mina network. + +## Issues for Improvement + +**Error Handling** + +- Extensive use of `.unwrap()` and `.expect()` calls in code paths, particularly + in `scan_state/transaction_logic.rs`, `staged_ledger/staged_ledger.rs`, and + `transaction_pool.rs` +- These calls are generally in code paths with well-understood preconditions but + could benefit from explicit error propagation +- Inconsistent error handling patterns across modules +- Verification key lookup bug fix from upstream Mina Protocol needs to be ported + (https://github.com/MinaProtocol/mina/pull/16699) + +**Monolithic Structure** + +- Large files like `scan_state/transaction_logic.rs` and + `staged_ledger/staged_ledger.rs` mirror OCaml's structure and are difficult to + navigate +- Files contain embedded tests that are hard to discover +- When modifying these files, prefer small targeted changes over major + restructuring + +**Performance** + +- Excessive cloning of large structures in hot paths: + - `SparseLedger::of_ledger_subset_exn()` calls 
`oledger.copy()` creating + unnecessary deep clones for sparse ledger construction + - Transaction pool operations clone transaction objects with acknowledged TODO + comments about performance +- Performance monitoring infrastructure exists but is disabled +- No memory pooling or reuse strategies (could help with memory fragmentation in + WASM) + +**Memory Management** + +- Memory-only implementation, no persistence for production +- There's an unused `ondisk` implementation but we were planning a more + comprehensive global solution (see persistence.md) +- Thread-local caching holds memory indefinitely + +**Code Organization** + +- Multiple TODO/FIXME items throughout the codebase requiring attention +- Incomplete implementations in `sparse_ledger/mod.rs` with unimplemented trait + methods + +## Refactoring Plan + +**Phase 1: Safety** + +- Replace `.unwrap()` with proper error propagation in production code +- Reduce cloning in hot paths +- Standardize error types + +**Phase 2: Decomposition** Break into focused crates: `mina-account`, +`mina-ledger`, `mina-transaction-logic`, `mina-scan-state`, +`mina-transaction-pool`, `mina-proofs` + +Changes must maintain strict OCaml compatibility while improving performance for +production. diff --git a/node/src/block_producer/summary.md b/node/src/block_producer/summary.md new file mode 100644 index 000000000..1fed59c60 --- /dev/null +++ b/node/src/block_producer/summary.md @@ -0,0 +1,113 @@ +# Block Producer State Machine + +Orchestrates the complete block production pipeline from VRF evaluation to proof +generation and network broadcast. 
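The multi-phase pipeline shown in the state flow below can be modeled as a plain enum with a happy-path `advance` helper. This is an illustrative sketch using the flow's state names, not the node's actual state types:

```rust
/// Happy-path phases of block production, named after the state flow
/// documented here; a simplified sketch, not the real node types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProductionPhase {
    Idle,
    WonSlot,
    WonSlotWait,
    WonSlotProduceInit,
    WonSlotTransactionsGet,
    WonSlotTransactionsSuccess,
    StagedLedgerDiffCreatePending,
    StagedLedgerDiffCreateSuccess,
    BlockUnprovenBuilt,
    BlockProvePending,
    BlockProveSuccess,
    BlockProduced,
    BlockInjected,
}

impl ProductionPhase {
    /// Next phase on success; `BlockInjected` wraps back to `Idle`.
    pub fn advance(self) -> ProductionPhase {
        use ProductionPhase::*;
        match self {
            Idle => WonSlot,
            WonSlot => WonSlotWait,
            WonSlotWait => WonSlotProduceInit,
            WonSlotProduceInit => WonSlotTransactionsGet,
            WonSlotTransactionsGet => WonSlotTransactionsSuccess,
            WonSlotTransactionsSuccess => StagedLedgerDiffCreatePending,
            StagedLedgerDiffCreatePending => StagedLedgerDiffCreateSuccess,
            StagedLedgerDiffCreateSuccess => BlockUnprovenBuilt,
            BlockUnprovenBuilt => BlockProvePending,
            BlockProvePending => BlockProveSuccess,
            BlockProveSuccess => BlockProduced,
            BlockProduced => BlockInjected,
            BlockInjected => Idle,
        }
    }
}
```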
+ +## Purpose + +- **Block Production Pipeline**: Coordinates the multi-phase process of + creating, proving, and broadcasting blocks +- **VRF Integration**: Uses VRF evaluator subcomponent to determine slot + leadership eligibility +- **Transaction Selection**: Retrieves and includes transactions from the + transaction pool based on fee priority +- **Proof Coordination**: Requests block proofs from external SNARK services and + handles proof completion +- **Network Broadcasting**: Injects completed blocks into the P2P network via + transition frontier + +## Architecture + +### State Structure + +- **BlockProducerState**: Optional wrapper enabling/disabling block production +- **BlockProducerEnabled**: Active state containing configuration, VRF + evaluator, current state, and injected block tracking +- **VRF Evaluator**: Subcomponent handling slot leadership determination (see + `vrf_evaluator/summary.md`) +- **Injected Blocks Tracking**: `BTreeSet` maintaining blocks pending + best tip transitions + +### Multi-Phase Block Production State Flow + +``` +Idle → WonSlot → WonSlotWait → WonSlotProduceInit → +WonSlotTransactionsGet → WonSlotTransactionsSuccess → +StagedLedgerDiffCreatePending → StagedLedgerDiffCreateSuccess → +BlockUnprovenBuilt → BlockProvePending → BlockProveSuccess → +BlockProduced → BlockInjected → Idle +``` + +### Action Types + +- **VRF Actions**: `VrfEvaluator(BlockProducerVrfEvaluatorAction)` for slot + leadership +- **Timing Actions**: `WonSlotSearch`, `WonSlot`, `WonSlotWait` for slot + coordination +- **Transaction Actions**: `WonSlotTransactionsGet/Success` for mempool + integration +- **Ledger Actions**: `StagedLedgerDiffCreate*` for state transition + construction +- **Proof Actions**: `BlockProve*` for external SNARK proof coordination +- **Network Actions**: `BlockInject/Injected` for P2P broadcast + +## Key Algorithms + +### Block Production Coordination + +1. **VRF Evaluation**: Delegates to VRF evaluator subcomponent for slot wins +2. 
**Transaction Collection**: Requests transactions from pool sorted by fee +3. **Staged Ledger Diff**: Creates valid state transitions including coinbase + and fees +4. **Block Construction**: Builds unproven blocks with all required components +5. **Proof Generation**: Coordinates with external services for block SNARK + proofs +6. **Network Injection**: Broadcasts completed blocks via transition frontier + +### Timing Management + +- **Slot Boundaries**: Ensures production occurs within 3-minute slot windows +- **Best Tip Tracking**: Maintains won slots valid for current blockchain state +- **Production Delays**: 1-second broadcast delay for network time + synchronization +- **Sync Awareness**: Pauses production during frontier synchronization + +## Service Integration + +### VRF Evaluator Subcomponent + +- **Slot Leadership**: Determines eligibility based on stake and VRF evaluation +- **Epoch Management**: Handles epoch transitions and staking ledger switches +- **Won Slot Caching**: Provides future slot wins for production scheduling + +### Ledger Service + +- **Transaction Retrieval**: Gets pending transactions sorted by fee priority +- **Staged Ledger Diff**: Creates valid state transitions with transaction + inclusion +- **State Validation**: Ensures proper coinbase, fee, and proof handling + +### External Proof Services + +- **Block Proving**: Requests SNARK proofs for complete block validation +- **Proof Integration**: Incorporates proofs into final block structure + +### P2P Network + +- **Block Broadcasting**: Injects blocks into gossip network via transition + frontier +- **Network Timing**: Coordinates broadcast timing for consistent propagation + +## Technical Debt + +### Critical Issues + +- **Block Proof Failure Handling**: `BlockProducerEvent::BlockProve` includes + error handling via `Result, String>`, but error + cases trigger `todo!()` panics in event source processing rather than proper + error sink service integration for graceful failure 
handling +- **Currency Overflow Handling**: `todo!("total_currency overflowed")` in block + production when currency calculations overflow, indicating incomplete edge + case handling +- **Missing Error Recovery**: No fallback mechanisms for proof failures - system + should integrate with error sink service instead of panicking diff --git a/node/src/block_producer/vrf_evaluator/summary.md b/node/src/block_producer/vrf_evaluator/summary.md new file mode 100644 index 000000000..179754f39 --- /dev/null +++ b/node/src/block_producer/vrf_evaluator/summary.md @@ -0,0 +1,106 @@ +# VRF Evaluator State Machine + +Evaluates Verifiable Random Function outputs to determine slot leadership +eligibility through epoch-aware evaluation processes. + +## Purpose + +- **VRF Slot Leadership**: Calculates VRF outputs to determine block production + eligibility based on stake delegation +- **Epoch Management**: Handles 7,140-slot epochs (≈3-minute slots) with + Current/Next/Waiting epoch context transitions +- **Stake Calculations**: Implements threshold calculations using delegated + stake percentages and epoch-specific staking ledgers +- **Won Slot Tracking**: Maintains `BTreeMap` for + efficient slot lookup and history retention + +## Architecture + +### Core State Structure + +- **BlockProducerVrfEvaluatorState**: Contains evaluation status, won slots + cache, epoch tracking, and context management +- **Won Slots Storage**: `BTreeMap` providing O(log n) + slot lookup with range querying +- **Epoch Context**: `EpochContext` enum handling Current/Next/Waiting states +- **Evaluation Tracking**: `latest_evaluated_slot` and `last_evaluated_epoch` + for incremental processing + +### Status State Flow + +``` +Idle → InitialisationPending → InitialisationComplete → ReadinessCheck → +ReadyToEvaluate → EpochDataPending → InitialSlotSelection → +DelegatorTableBuilding → SlotEvaluating → SlotEvalComplete +``` + +### Epoch Context Management + +- **Current(EpochData)**: Evaluating current epoch 
with current staking ledger +- **Next(EpochData)**: Evaluating next epoch with next staking ledger (requires + finalized epoch seed) +- **Waiting**: Waiting for epoch data availability or seed finalization + +## Key Algorithms + +### Epoch Boundary Detection (`evaluate_epoch_bounds`) + +```rust +if global_slot % SLOTS_PER_EPOCH == 0 { + SlotPositionInEpoch::Beginning +} else if (global_slot + 1) % SLOTS_PER_EPOCH == 0 { + SlotPositionInEpoch::End +} else { + SlotPositionInEpoch::Within +} +``` + +### Won Slot Lookup (`next_won_slot`) + +- **Range Query**: `won_slots.range(cur_global_slot..)` for future slots +- **Chain Validation**: Filters slots valid for extending current best tip +- **Genesis Timestamp**: Converts VRF slots to blockchain timestamps + +## Service Integration + +### Ledger Service + +- **Staking Data**: Reads delegation tables and account balances for VRF + calculations +- **Epoch Data**: Fetches epoch seeds, stake distributions, and delegation + mappings +- **Delegator Tables**: Builds stake lookup structures for VRF evaluation + +### External VRF Service + +- **Input Construction**: Creates `VrfEvaluatorInput` with epoch data and slot + parameters +- **Cryptographic Evaluation**: Delegates VRF computation to external proof + services +- **Result Processing**: Handles `VrfEvaluationOutput` to determine wins and + update caches + +## Implementation Details + +### Multi-Phase Evaluation + +1. **Readiness Check**: Validates epoch data and seed finalization +2. **Context Selection**: Chooses Current/Next/Waiting based on evaluation state +3. **Delegator Table Building**: Constructs stake lookup structures +4. **Incremental Evaluation**: Processes slots from `latest_evaluated_slot` +5. 
**Won Slot Validation**: Filters based on chain context and timing + +### Resource Management + +- **Incremental Processing**: Evaluates slots progressively rather than batch +- **Cache Maintenance**: Retains won slots across epochs with automatic cleanup +- **Context Switching**: Transitions between epoch evaluation contexts +- **Memory Bounds**: Prunes stale slots based on epoch transitions + +## Technical Debt + +### Critical Issues + +- **Unimplemented Epoch Context**: `todo!()` for `EpochContext::Waiting` state + in `SelectInitialSlot` action handler, indicating incomplete epoch transition + handling when epoch data is not yet available diff --git a/node/src/event_source/summary.md b/node/src/event_source/summary.md new file mode 100644 index 000000000..c6bf6a52a --- /dev/null +++ b/node/src/event_source/summary.md @@ -0,0 +1,166 @@ +# Event Source State Machine + +Central event aggregation and dispatch hub that bridges the asynchronous service +layer and synchronous Redux-style state machine using batch processing and +event-to-action translation. 
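The batch-processing behavior described below — drain up to 1024 events, translate each into an action, then inject a timeout check (or wait when the queue is empty) — can be sketched as follows; the `Action` variants and queue type are illustrative stand-ins, not the node's actual types:

```rust
use std::collections::VecDeque;

const EVENT_BATCH_LIMIT: usize = 1024;

#[derive(Debug, PartialEq)]
pub enum Action {
    Domain(u32),   // stand-in for a translated domain action
    CheckTimeouts, // injected after each full batch to keep the node responsive
    WaitForEvents, // queue drained: apply natural backpressure
}

/// Drain at most `EVENT_BATCH_LIMIT` events, translating each into a
/// domain action, then append the follow-up control action.
pub fn process_batch(queue: &mut VecDeque<u32>) -> Vec<Action> {
    let mut actions = Vec::new();
    for _ in 0..EVENT_BATCH_LIMIT {
        match queue.pop_front() {
            Some(ev) => actions.push(Action::Domain(ev)),
            None => {
                actions.push(Action::WaitForEvents);
                return actions;
            }
        }
    }
    actions.push(Action::CheckTimeouts);
    actions
}
```

The fixed batch limit is what guarantees a `CheckTimeouts` action is dispatched regularly even under sustained event load.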
+ +## Purpose + +- **Event System Backbone**: Acts as the central nervous system driving + OpenMina's state machine through continuous event processing +- **Async-Sync Bridge**: Converts asynchronous service events into synchronous + state machine actions with deterministic ordering +- **Batch Processing Hub**: Processes up to 1024 events per cycle with + integrated timeout management to prevent system hangs +- **System Orchestrator**: Maintains the main event loop that keeps the entire + node responsive and processing external inputs + +## Architecture and Implementation + +### Core Data Structures + +- **Event Enum** (`event.rs:13-22`): Contains all service events from P2P, + Ledger, SNARK, RPC, ExternalSnarkWorker, BlockProducer, and Genesis +- **EventSourceAction** (`event_source_actions.rs:6-24`): Defines the event + processing workflow with ProcessEvents, NewEvent, WaitForEvents, and + WaitTimeout actions +- **Event Processing Logic** (`event_source_effects.rs:36-50`): Central batch + processing with 1024-event limit and timeout injection + +### Multi-Phase Event Processing Algorithm + +``` +Service Queues → Batch Retrieval (1024 limit) → Event Translation → Action Dispatch → Timeout Check + ↓ ↓ ↓ ↓ ↓ +[Async World] → [Event Source] → [State Machine Actions] → [Domain State Machines] → [System Health] +``` + +### Processing Patterns + +- **Controlled Batching**: Processes exactly 1024 events before injecting + `CheckTimeoutsAction` to maintain system responsiveness + (`event_source_effects.rs:47-55`) +- **Event-to-Action Translation**: Each event type mapped to specific domain + actions with specialized error handling (`event_source_effects.rs:58+`) +- **Flow Control**: `WaitForEvents`/`WaitTimeout` states provide natural + backpressure and prevent resource exhaustion +- **Deterministic Ordering**: FIFO event processing ensures reproducible state + machine execution + +## Integration Points and Service Coordination + +### Multi-Service Event Aggregation + +- 
**P2P Events**: Connection lifecycle, channel management, peer communications, + network scheduling +- **Ledger Events**: Read/write operations, block applications, account state + changes +- **SNARK Events**: Block verification, work verification, user command + verification with specialized error handling +- **RPC Events**: All API requests with request IDs for complete request + lifecycle tracking +- **External Worker Events**: SNARK worker lifecycle, computation results, + capacity management +- **Block Producer Events**: VRF evaluation, block proving, consensus + participation + +### Cross-Component Communication Patterns + +- **Service → State Machine**: Async service results flow back through event + translation +- **External → Internal**: RPC requests and P2P messages converted to internal + actions +- **Background → Foreground**: Long-running processes (block production, SNARK + work) communicate results + +### System-Wide Coordination + +- **Main Event Loop Driver**: Continuously triggered from main effects dispatch + to maintain system activity +- **Timeout Management**: Regular `CheckTimeoutsAction` injection ensures + responsive behavior under high event load +- **Resource Monitoring**: Event queue monitoring provides system health + visibility and performance metrics + +## Technical Debt + +The event source currently centralizes **all domain-specific event handling +logic** in `event_source_effects.rs:58+`, creating significant architectural and +maintenance issues: + +### Current Centralization Problems + +- **Massive Effects File**: `event_source_effects.rs` contains hundreds of lines + of domain-specific event-to-action translations that should be distributed +- **Cross-Domain Coupling**: Changes to P2P event handling require touching the + same file as SNARK or Ledger event changes, creating unnecessary coupling +- **Import Pollution**: The effects file imports from every domain + (`p2p::channels::*`, `snark::*`, `ledger::*`, `rpc::*`, etc.) 
violating + separation of concerns +- **Single Point of Failure**: All event processing logic concentrated in one + location makes the system fragile and hard to maintain +- **Scalability Bottleneck**: Adding new service event types requires modifying + the central effects file instead of isolated domain modules + +### Specific Implementation Issues + +1. **Event Match Explosion** (`event_source_effects.rs:58-200+`): Giant match + statement handling: + - 30+ P2P event types with WebRTC/libp2p conditionals + - 10+ SNARK event types with specialized error handling + - 15+ RPC request types with individual dispatch logic + - Multiple service lifecycle events with different patterns + +2. **Domain Logic Leakage**: Event source knows intimate details of: + - P2P connection states and error types + - SNARK verification error classifications + - RPC request parameter structures + - Block producer VRF evaluation flows + +3. **Maintenance Complexity**: Any domain evolution requires: + - Updating the central Event enum + - Modifying the massive effects match statement + - Testing cross-domain impact from single file changes + +### Target Architecture (Distributed Event Handling) + +The intended architecture would **distribute domain expertise** while +maintaining central coordination: + +1. **Retain Core Event Source Responsibilities**: + - **Event Aggregation**: Batch processing (1024 events) and queue management + - **Flow Control**: `ProcessEvents`/`WaitForEvents`/`WaitTimeout` state + management + - **System Orchestration**: `CheckTimeoutsAction` injection and main loop + coordination + - **Generic Event Routing**: Forward events to appropriate domain handlers + +2. 
**Distribute Domain-Specific Logic**: + + ```rust + // Instead of central match in event_source_effects.rs: + Event::P2p(event) => p2p_effects::handle_event(store, event), + Event::Snark(event) => snark_effects::handle_event(store, event), + Event::Ledger(event) => ledger_effects::handle_event(store, event), + Event::Rpc(id, req) => rpc_effects::handle_event(store, id, req), + ``` + +3. **Domain Handler Pattern**: Each effectful state machine implements: + - **Action Handler**: Processes domain actions and makes service calls + (existing) + - **Event Handler**: Processes domain events and dispatches actions (NEW - + replaces central logic) + +4. **Benefits of Distribution**: + - **Modular Development**: Domain teams can modify event handling + independently + - **Reduced Coupling**: Changes isolated to relevant domain modules + - **Cleaner Abstractions**: Event source focuses on coordination, not domain + specifics + - **Easier Testing**: Domain event handling can be unit tested in isolation + - **Scalable Architecture**: New services add handlers without touching + central code + +This refactoring would transform the event source from a **monolithic event +processor** into a **lightweight coordination hub**, aligning with OpenMina's +Redux-style architecture principles of modular, predictable state management. diff --git a/node/src/external_snark_worker/summary.md b/node/src/external_snark_worker/summary.md new file mode 100644 index 000000000..6d313fd3f --- /dev/null +++ b/node/src/external_snark_worker/summary.md @@ -0,0 +1,42 @@ +# External SNARK Worker State Machine + +Manages a single external process that computes SNARK proofs for the node. 
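The worker lifecycle listed under the state machine section below can be captured as a state enum plus a predicate for when new work may be assigned. A simplified sketch with illustrative names, not the node's actual types:

```rust
/// Lifecycle of the single external worker process, following the
/// states documented here (simplified sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkerState {
    None,
    Starting,   // startup, bounded by a 120s timeout
    Idle,       // ready to accept new work
    Working,    // job in flight, bounded by an estimated-duration timeout
    WorkReady,  // result awaiting integration into the SNARK pool
    WorkError,
    Cancelling,
    Cancelled,
    Killing,
    Error,
}

impl WorkerState {
    /// Only an idle worker may be handed a new job specification.
    pub fn can_accept_work(self) -> bool {
        matches!(self, WorkerState::Idle)
    }
}
```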
+ +## Purpose + +- Manages lifecycle of one external SNARK worker process +- Converts available SNARK jobs to worker specifications +- Handles work submission, cancellation, and timeout management +- Integrates worker results back into the SNARK pool + +## Worker State Machine + +- **None/Starting**: Initial states for worker startup with 120s timeout +- **Idle**: Ready to accept new work assignments +- **Working**: Processing a specific job with estimated duration timeout +- **WorkReady/WorkError**: Completed states awaiting result processing +- **Cancelling/Cancelled**: Work cancellation states +- **Killing/Error**: Shutdown and error states + +## Key Operations + +- **Work Specification**: Converts AvailableJobMessage to Mina snark worker + format +- **Base Jobs**: Transaction proofs with witness data and protocol state +- **Merge Jobs**: Combines two existing proofs into a single proof +- **Timeout Management**: Handles worker startup and work timeouts +- **Result Integration**: Adds completed SNARKs directly to snark pool + +## Interactions + +- **SNARK Pool**: Receives job assignments and submits completed work +- **Transition Frontier**: Provides protocol state data for work specifications +- **Config**: Uses snarker public key and fee configuration +- **P2P**: Integrates with network for SNARK propagation + +## Technical Notes + +- Single worker design with potential for future expansion +- Work cancellation supported but may not be immediately effective +- Completed SNARKs added directly to pool as trusted local work +- Protocol state lookup required for base transaction jobs diff --git a/node/src/ledger/read/summary.md b/node/src/ledger/read/summary.md new file mode 100644 index 000000000..8f39db0de --- /dev/null +++ b/node/src/ledger/read/summary.md @@ -0,0 +1,33 @@ +# Ledger Read State Machine + +Manages concurrent read operations on ledgers with cost-based throttling to +prevent service overload. 
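The cost-based throttling described below (`MAX_TOTAL_COST = 256`) can be sketched as a counter that admits a request only while headroom remains. Costs are modeled as plain integers and the struct and method names are illustrative:

```rust
/// Upper bound on the summed cost of in-flight read requests,
/// mirroring the MAX_TOTAL_COST = 256 limit described here.
pub const MAX_TOTAL_COST: u32 = 256;

#[derive(Default)]
pub struct ReadThrottle {
    in_flight_cost: u32,
}

impl ReadThrottle {
    /// Admit a request only if its cost fits under the global cap.
    pub fn try_admit(&mut self, cost: u32) -> bool {
        if self.in_flight_cost.saturating_add(cost) <= MAX_TOTAL_COST {
            self.in_flight_cost += cost;
            true
        } else {
            false
        }
    }

    /// Release cost when a request completes, freeing headroom.
    pub fn complete(&mut self, cost: u32) {
        self.in_flight_cost = self.in_flight_cost.saturating_sub(cost);
    }
}
```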
+ +## Purpose + +- Processes ledger read requests with cost limiting (MAX_TOTAL_COST = 256) +- Manages concurrent read access through PendingRequests system +- Provides account lookups and merkle proofs +- Handles request deduplication to avoid redundant work + +## Key Operations + +- Account queries and delegator tables +- Merkle tree traversal (num accounts, child hashes, account contents) +- Staged ledger auxiliary data and pending coinbases +- Scan state summaries for RPC + +## Interactions + +- **RPC Integration**: Serves account queries, ledger status, and scan state + summaries +- **P2P Integration**: Responds to ledger sync queries and staged ledger part + requests +- **VRF Integration**: Constructs delegator tables for block production +- **Service Integration**: Routes requests through LedgerManager with cost + tracking + +## Technical Debt + +- Request cost calculation not well documented +- Complex integration patterns with multiple callback types diff --git a/node/src/ledger/summary.md b/node/src/ledger/summary.md new file mode 100644 index 000000000..eb4b17640 --- /dev/null +++ b/node/src/ledger/summary.md @@ -0,0 +1,91 @@ +# Ledger State Machine + +Manages the blockchain's account ledger, balances, and ledger synchronization +through coordinated read/write operations. 
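The storage split described below — snarked ledgers keyed by merkle root hash, staged ledgers keyed by staged ledger hash, and temporary sync ledgers — can be sketched as three keyed maps. Hashes are plain strings here and all names are illustrative, not the node's actual ledger context:

```rust
use std::collections::BTreeMap;

/// Placeholder ledger value; the real node stores mask handles.
pub type Ledger = String;

/// Simplified stand-in for the storage architecture described here.
#[derive(Default)]
pub struct LedgerStore {
    pub snarked: BTreeMap<String, Ledger>, // keyed by merkle root hash
    pub staged: BTreeMap<String, Ledger>,  // keyed by staged ledger hash
    pub sync: BTreeMap<String, Ledger>,    // temporary storage during sync
}

impl LedgerStore {
    /// Promote a sync ledger into the snarked set once sync completes,
    /// returning false when no such sync ledger exists.
    pub fn promote_sync(&mut self, root: &str) -> bool {
        match self.sync.remove(root) {
            Some(ledger) => {
                self.snarked.insert(root.to_string(), ledger);
                true
            }
            None => false,
        }
    }
}
```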
+ +## Purpose + +- Maintains account states and balances in snarked and staged ledgers +- Applies transactions and blocks to update ledger state +- Provides merkle proofs and account lookups for various consumers +- Manages ledger synchronization with isolated sync state +- Tracks mask lifecycle and prevents memory leaks + +## Key Components + +- **Read Substate**: Cost-limited concurrent ledger queries with request + deduplication +- **Write Substate**: Sequential ledger operations (block apply, diff creation, + reconstruction, commits) +- **LedgerManager**: Thread-based service that handles actual ledger operations + asynchronously +- **Sync State**: Separate ledger storage for synchronization operations +- **Archive Support**: Additional data collection for archive nodes + +## Service Architecture + +- **LedgerManager**: Spawns dedicated "ledger-manager" thread with message + passing interface +- **Request Types**: Unified request enum covering read/write operations, + account management, and sync operations +- **Async/Sync Modes**: Supports both fire-and-forget calls and synchronous + blocking calls +- **State Machine Integration**: Routes responses back through event system to + update state machines + +## Storage Architecture + +- **Snarked Ledgers**: Finalized ledger states indexed by merkle root hash + (includes disk-loaded ledgers) +- **Staged Ledgers**: Working ledgers with pending transactions, indexed by + staged ledger hash +- **Sync Ledgers**: Temporary storage during ledger synchronization + +## Interactions + +- **Transition Frontier**: Block application and synchronization coordination +- **Block Producer**: Staged ledger diff creation for new blocks +- **RPC/P2P**: Account queries, ledger sync, and proof generation +- **VRF Evaluator**: Delegator table construction for consensus + +## Technical Debt + +The ledger implementation has several areas of technical debt: + +### Code Organization + +- **Large service files**: LedgerManager and 
LedgerService are complex and need + simplification/reorganization +- **Documentation gaps**: Core components like LedgerCtx need better + documentation of their responsibilities + +### Integration and Testing Issues + +- **Heavy coupling**: Deep integration with transition frontier, block producer, + and P2P makes isolated testing difficult +- **Mask leak detection**: Unreliable during testing scenarios (alive_masks + tracking) +- **Threading complexity**: Staged ledger reconstruction spawns additional + threads with callback patterns +- **Ad-hoc threading**: Manual thread spawning for reconstruction instead of + async patterns +- **Workaround patterns**: TODO comments about making services async to remove + threading workarounds + +### Error Handling and Reliability + +- **Inconsistent error handling**: Mix of panics, unwraps, and proper error + propagation throughout +- **Type safety gaps**: Request/response relationships "can't be expressed in + the Rust type system" +- **Debugging infrastructure**: Specialized dump-to-file functions suggest + frequent debugging needs +- **Hash mismatch handling**: Panics on staged ledger hash mismatches instead of + graceful recovery +- **Silent failures**: Some operations like commit fail silently with TODO + comments questioning this behavior + +### Configuration Issues + +- **Hardcoded constants**: Ledger depth tied to mainnet constants - will break + if networks use different depths diff --git a/node/src/ledger/write/summary.md b/node/src/ledger/write/summary.md new file mode 100644 index 000000000..594221aac --- /dev/null +++ b/node/src/ledger/write/summary.md @@ -0,0 +1,31 @@ +# Ledger Write State Machine + +Manages sequential write operations and state updates to the ledger. 
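The sequential discipline described below can be sketched as a FIFO of write requests with at most one in flight at a time. The request variants follow the operations listed in this document; everything else is an illustrative sketch:

```rust
use std::collections::VecDeque;

/// Write operations documented here, as a simplified enum.
#[derive(Debug, Clone, PartialEq)]
pub enum WriteRequest {
    BlockApply { block: String },
    DiffCreate,
    Reconstruct,
    Commit,
}

/// Sequential writer: at most one request in flight, the rest queue.
#[derive(Default)]
pub struct WriteQueue {
    pending: VecDeque<WriteRequest>,
    in_flight: Option<WriteRequest>,
}

impl WriteQueue {
    pub fn push(&mut self, req: WriteRequest) {
        self.pending.push_back(req);
    }

    /// Start the next request only when none is currently in flight;
    /// otherwise return the request already being processed.
    pub fn start_next(&mut self) -> Option<&WriteRequest> {
        if self.in_flight.is_none() {
            self.in_flight = self.pending.pop_front();
        }
        self.in_flight.as_ref()
    }

    /// Mark the in-flight request as done, allowing the next to start.
    pub fn finish(&mut self) {
        self.in_flight = None;
    }
}
```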
+ +## Purpose + +- Applies blocks to staged ledgers with transaction validation +- Creates staged ledger diffs for block production +- Reconstructs staged ledgers from auxiliary data during sync +- Commits ledger state and manages mask lifecycle + +## Key Operations + +- **Block Application**: Applies transactions and updates account states +- **Diff Creation**: Generates staged ledger diffs for block production +- **Reconstruction**: Rebuilds staged ledgers from scan state and pending + coinbases +- **Commit**: Finalizes ledger state and prunes old masks + +## Interactions + +- **Transition Frontier**: Coordinates block application and sync operations +- **Block Producer**: Provides staged ledger diffs for new blocks +- **Service Layer**: Routes operations through LedgerManager for actual ledger + manipulation +- **Archive Integration**: Provides additional data for archive nodes + +## Technical Debt + +- Heavy coupling with transition frontier sync makes testing difficult +- Mask leak detection is unreliable during testing scenarios diff --git a/node/src/rpc/summary.md b/node/src/rpc/summary.md new file mode 100644 index 000000000..312c26995 --- /dev/null +++ b/node/src/rpc/summary.md @@ -0,0 +1,73 @@ +# RPC State Machine + +Provides JSON-RPC over HTTP interface exposing node functionality for external +clients. 
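The request lifecycle described below (Init → Pending → Success/Error, keyed by unique `RpcId`) can be sketched as a map of request phases, where terminal phases remove the entry to mirror the automatic cleanup after response delivery. `RpcId` as `u64` and all names here are illustrative:

```rust
use std::collections::BTreeMap;

pub type RpcId = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Init,
    Pending,
    Success,
    Error,
}

/// Tracks active requests by id, mirroring the lifecycle documented here.
#[derive(Default)]
pub struct RpcState {
    requests: BTreeMap<RpcId, Phase>,
    next_id: RpcId,
}

impl RpcState {
    /// Register a new request and return its unique correlation id.
    pub fn init(&mut self) -> RpcId {
        let id = self.next_id;
        self.next_id += 1;
        self.requests.insert(id, Phase::Init);
        id
    }

    pub fn set_pending(&mut self, id: RpcId) {
        if let Some(phase) = self.requests.get_mut(&id) {
            *phase = Phase::Pending;
        }
    }

    /// Terminal phases remove the entry (automatic state cleanup);
    /// returns None for unknown or already-finished ids.
    pub fn finish(&mut self, id: RpcId, ok: bool) -> Option<Phase> {
        self.requests.remove(&id)?;
        Some(if ok { Phase::Success } else { Phase::Error })
    }

    pub fn active(&self) -> usize {
        self.requests.len()
    }
}
```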
+ +## Purpose + +- **External API Gateway**: Exposes blockchain state, transactions, and network + information via JSON-RPC over HTTP +- **Request Lifecycle Management**: Tracks requests through 4-phase lifecycle + with unique IDs and timestamps +- **State Query Interface**: Provides filtered access to node state and + blockchain data +- **Service Coordination**: Routes complex operations to appropriate backend + services + +## Architecture + +### Core State Management + +- **RpcState**: `BTreeMap` tracking active requests +- **Request Lifecycle**: Init → Pending → Success/Error states with timestamps +- **Request Correlation**: Unique `RpcId` for matching requests to responses +- **Extra Data Storage**: Optional request-specific data for complex operations + +### Request Processing Patterns + +- **Direct State Access**: Simple queries read directly from Redux state +- **Service Delegation**: Complex operations delegated to ledger, SNARK, or P2P + services +- **Async Coordination**: Callback system handles asynchronous service responses +- **Request Cleanup**: Automatic state cleanup after response delivery + +### API Categories + +- **Node Information**: Status, heartbeat, health checks +- **Blockchain Data**: Blocks, chains, genesis information, consensus parameters +- **Transaction Operations**: Injection, status queries, pool monitoring +- **Account/Ledger Queries**: Account information, balances, delegators +- **Statistics**: Action stats, sync progress, performance metrics +- **Network Operations**: Peer management, connection handling +- **SNARK Operations**: Proof jobs, worker coordination + +## Service Integration + +### Request Routing + +- **Ledger Service**: Account queries, scan state operations +- **Transaction Pool**: Transaction injection and pool queries +- **SNARK Pool**: Proof job management and worker coordination +- **P2P Network**: Peer information and connection management + +### HTTP Server Integration + +- **RESTful Endpoints**: Standard 
HTTP API for common operations +- **WebRTC Signaling**: P2P connection establishment support +- **JSON Serialization**: Comprehensive data type serialization +- **Error Handling**: Structured error responses with proper HTTP status codes + +## Technical Debt + +### Data Type Improvements + +- **Type System**: Several fields use `String` instead of proper typed + representations for hashes and memos +- **Command Handling**: Incomplete enum handling for all user command types +- **Hash Operations**: Missing error handling for hash conversion failures + +### Code Cleanup + +- **Legacy Code**: Commented transaction injection code marked for removal +- **TODO Items**: Various type refinements and error handling improvements + needed diff --git a/node/src/snark_pool/candidate/summary.md b/node/src/snark_pool/candidate/summary.md new file mode 100644 index 000000000..56a4da9e8 --- /dev/null +++ b/node/src/snark_pool/candidate/summary.md @@ -0,0 +1,86 @@ +# SNARK Pool Candidate State Machine + +Manages incoming SNARK work from peers through a multi-stage validation pipeline +before promoting verified work to the main pool. + +## Purpose + +- Coordinates P2P discovery and fetching of SNARK work from network peers +- Manages per-peer candidate work state tracking and progression +- Batches SNARK work for efficient verification processing +- Promotes verified work to main pool while removing inferior candidates +- Maintains quality control through fee-based prioritization + +## Multi-Stage Validation Pipeline + +``` +InfoReceived → WorkFetchPending → WorkReceived → WorkVerifyPending → WorkVerifySuccess/Error +``` + +1. **InfoReceived** - peer announces available SNARK work with job ID and fee + information +2. **WorkFetchPending** - requests full SNARK work from peer via P2P RPC +3. **WorkReceived** - complete SNARK work received and ready for verification +4. **WorkVerifyPending** - SNARK work submitted to verification service in + batches +5. 
**WorkVerifySuccess/Error** - verification completed with success or failure + result + +## Per-Peer State Tracking + +- **Dual indexing** - maintains work by peer (`by_peer`) and by job ID + (`by_job_id`) +- **Consistency checking** - validates index consistency with `check()` method +- **Peer lifecycle** - handles peer connections and disconnections gracefully +- **Work comparison** - only accepts better work (higher fees) for same job + +## Priority-Based Work Fetching + +- **Order-based prioritization** - fetches work based on job order (priority) +- **Fee-based secondary ordering** - higher fees take precedence within same + order +- **Deduplication** - only fetches one work per job order to avoid redundancy +- **Available peer filtering** - only requests from peers with RPC capacity + +## Batch Verification Processing + +- **Job ID ordering** - processes verification in priority order +- **Per-peer batching** - groups work from same peer for efficient verification +- **State coordination** - tracks verification requests and results across + batches +- **Error handling** - manages verification failures without affecting other + work + +## Quality Control Features + +- **Superior work filtering** - `remove_inferior_snarks()` removes lower-fee + work for same job +- **Fee validation** - ensures only competitive work progresses through pipeline +- **Work comparison** - implements `SnarkCmp` for consistent quality assessment +- **Retention policies** - supports custom filtering for stale or invalid + candidates + +## Integration Points + +- **P2P RPC system** - fetches complete SNARK work from peer announcements +- **SNARK verification service** - batches work for proof validation +- **Main pool coordination** - promotes verified work and removes inferior + candidates +- **Peer management** - integrates with P2P lifecycle for connection handling + +## State Management + +- **Deterministic progression** - clear state transitions with enabling + conditions +- 
**Concurrent peer handling** - manages work from multiple peers independently +- **Memory efficiency** - cleans up completed/failed candidates automatically +- **Verification coordination** - tracks verification IDs to correlate results + with requests + +## Key Features + +- **Two-way indexing** - efficient lookups by peer or job ID +- **Fee-based competition** - ensures highest quality work reaches main pool +- **Batch processing** - optimizes verification throughput through batching +- **Peer resilience** - handles peer disconnections without losing valid work +- **Priority ordering** - respects job priorities for systematic work processing diff --git a/node/src/snark_pool/summary.md b/node/src/snark_pool/summary.md new file mode 100644 index 000000000..dbc3d542a --- /dev/null +++ b/node/src/snark_pool/summary.md @@ -0,0 +1,153 @@ +# SNARK Pool State Machine + +Manages the distributed pool of SNARK work jobs required for blockchain +compression through coordination of local workers, P2P networking, and +competitive work selection. 
+ +## Purpose + +- Maintains distributed pool of available SNARK computation jobs from scan state +- Coordinates commitment-based work assignment with external SNARK workers +- Manages P2P sharing of completed SNARK work across network peers +- Provides quality-controlled SNARK proofs for block production +- Implements competitive fee-based work selection and timeout management + +## Architecture Overview + +### Core Components + +- **Distributed Pool**: `DistributedPool` for indexed job + management and P2P synchronization +- **Candidate System**: Multi-stage validation pipeline for incoming work from + peers +- **Commitment System**: Time-bound work assignments with automatic timeout + handling +- **Priority Management**: Order-based job prioritization with external worker + coordination + +### Job Lifecycle Management + +``` +JobsUpdate → Available → Committed → Completed/TimedOut + ↘ WorkReceived → Verified → Promoted to Pool +``` + +1. **JobsUpdate** - receives available jobs from scan state, manages job + retention and ordering +2. **Available** - jobs ready for commitment by local or remote workers +3. **Committed** - jobs assigned to workers with timeout tracking +4. 
**Completed** - verified SNARK work ready for block production use + +## Features + +### Commitment-Based Work Assignment + +- **Auto-commitment creation** - automatically assigns jobs to available + external workers +- **Competitive commitments** - accepts better commitments (higher fees) from + network peers +- **Timeout management** - removes stale commitments that fail to deliver work +- **Strategy support** - configurable sequential vs random job selection + strategies + +### Distributed Pool Synchronization + +- **Indexed messaging** - uses DistributedPool indices for efficient P2P range + queries +- **Best tip validation** - only shares work with peers on compatible blockchain + state +- **Orphaned work handling** - reintegrates valuable SNARK work when jobs change +- **Range-based fetching** - enables efficient bulk synchronization with peers + +### Quality Control and Competition + +- **Fee-based selection** - prioritizes higher-fee work for same jobs +- **Work comparison** - implements SNARK quality assessment +- **Candidate validation** - multi-stage pipeline ensures only verified work + enters pool +- **Inferior work removal** - automatically removes lower-quality work for same + jobs + +## Integration Points + +### External SNARK Worker Coordination + +- **Automatic work assignment** - dispatches highest priority jobs to available + workers +- **Work cancellation** - cancels work when jobs become obsolete +- **Availability tracking** - coordinates with worker capacity management +- **Strategy implementation** - supports different work selection approaches + +### P2P Network Integration + +- **Work announcements** - broadcasts completed work to network peers +- **Commitment sharing** - announces work commitments to establish priority +- **Candidate fetching** - retrieves and validates work from peer announcements +- **Synchronization** - maintains pool consistency across network participants + +### Blockchain Integration + +- **Scan state coordination** 
- receives available jobs from blockchain state + changes +- **Block production support** - provides verified SNARK work for block creation +- **Transition frontier sync** - ensures work sharing only with synchronized + peers +- **Job prioritization** - maintains work order based on blockchain requirements + +## State Management + +### Job State Tracking + +- **Multi-field state** - tracks time, commitment, completed work, and priority + order +- **Duration estimation** - calculates expected completion time based on job + complexity +- **Status monitoring** - provides comprehensive job lifecycle visibility +- **Resource reporting** - tracks pool size, candidate status, and consistency + metrics + +### Timeout and Cleanup + +- **Periodic timeout checking** - regularly validates commitment freshness +- **Automatic cleanup** - removes expired commitments and completed jobs +- **Orphaned work recovery** - preserves valuable work across job updates +- **Candidate pruning** - removes obsolete candidate work based on pool state + +## Technical Implementation + +### Distributed Pool Implementation + +Uses `DistributedPool` for: + +- **Index-based synchronization** - enables efficient P2P range queries +- **Deterministic ordering** - maintains consistent job sequence across nodes +- **Update tracking** - supports incremental synchronization with peers +- **Message generation** - provides standardized commitment and work + announcements + +### Unified Reducer Implementation + +- **Single reducer pattern** - handles both state updates and action dispatching +- **Substate delegation** - coordinates with candidate subsystem via compatible + substates +- **Effect coordination** - minimal effectful actions for service integration + only +- **Deterministic execution** - ensures reproducible state transitions for + debugging + +## Technical Debt + +This component mostly follows new patterns but has minor issues: + +- **Incomplete Migration**: Still has a minimal 
`snark_pool_effects.rs` file + with one effectful action +- **Error Handling**: TODO comments indicate missing error propagation + (`// TODO: log or propagate`) +- **Pattern Consistency**: The presence of effects file suggests incomplete + adoption of new unified reducer pattern + +The remaining cleanup involves: + +1. Moving the single effectful action to follow the thin effects pattern +2. Implementing proper error handling and propagation +3. Removing the separate effects file once migration is complete diff --git a/node/src/summary.md b/node/src/summary.md new file mode 100644 index 000000000..6d0a25fe4 --- /dev/null +++ b/node/src/summary.md @@ -0,0 +1,17 @@ +# Main Node State Machine + +The top-level state machine that orchestrates all node operations. + +## Purpose + +- Coordinates all subsystems (P2P, consensus, storage, RPC) +- Manages node lifecycle and configuration +- Routes actions between state machine components +- Handles global node events and transitions + +## Key Interactions + +- Dispatches actions to all sub-state machines +- Aggregates state from all components +- Manages service initialization and shutdown +- Coordinates cross-component effects diff --git a/node/src/transaction_pool/candidate/summary.md b/node/src/transaction_pool/candidate/summary.md new file mode 100644 index 000000000..a76dfd023 --- /dev/null +++ b/node/src/transaction_pool/candidate/summary.md @@ -0,0 +1,45 @@ +# Transaction Pool Candidate State Machine + +Coordinates P2P transaction discovery and fetching before forwarding to main +transaction pool for validation. + +## Purpose + +- Manages per-peer transaction discovery and state tracking +- Coordinates fetching full transactions from transaction info received from + peers +- Collects and batches transactions for verification by main pool +- Prioritizes pubsub messages over direct peer requests + +## Transaction Flow + +1. **Info Received** - peer sends transaction info (hash + fee) +2. 
**Fetch Pending** - requests full transaction via RPC +3. **Received** - full transaction received from peer +4. **Verify Pending** - forwards batch to main pool via `StartVerify` action +5. **Verify Success/Error** - cleans up candidate state based on result + +## Key Features + +- **Per-Peer State Tracking** - maintains transaction states for each peer + independently +- **Priority Ordering** - orders transaction fetching by fee and arrival order +- **Batch Processing** - collects transactions and forwards batches rather than + individual transactions +- **Pubsub Priority** - processes pubsub messages before peer-specific requests +- **State Coordination** - integrates with main transaction pool without + duplicating validation logic + +## Interactions + +- Receives transaction info from P2P peers +- Fetches full transactions via P2P RPC requests +- Forwards transaction batches to main pool for validation +- Manages peer connection lifecycle (prunes disconnected peers) +- Coordinates with ledger service availability for verification timing + +## Note + +This component does NOT perform transaction validation itself - it only +coordinates P2P discovery and fetching. All validation (signatures, nonces, +balances, spam filtering) happens in the main transaction pool layer. diff --git a/node/src/transaction_pool/summary.md b/node/src/transaction_pool/summary.md new file mode 100644 index 000000000..cbb406ef7 --- /dev/null +++ b/node/src/transaction_pool/summary.md @@ -0,0 +1,79 @@ +# Transaction Pool State Machine + +Manages the mempool of pending transactions using a two-layer architecture with +significant technical debt. 
+ +## Purpose + +- Collects user transactions from P2P network and RPC +- Validates transaction signatures and balances via SNARK verification +- Maintains transaction ordering and priorities by fee +- Provides ordered transactions for block production +- Handles transaction propagation across the network + +## Architecture + +### Two-Layer Design + +1. **Candidates Layer** (`TransactionPoolCandidatesState`) - manages incoming + transactions from P2P peers + - Tracks per-peer transaction states (info received → fetch pending → + received → verify pending) + - Prioritizes pubsub messages over direct peer requests + - Coordinates transaction fetching from peers +2. **Main Pool Layer** (`TransactionPoolState`) - contains the actual + transaction pool + - Uses `ledger::transaction_pool::TransactionPool` for core logic + - Maintains `DistributedPool` for P2P propagation + - Handles multi-step verification and application flows + +### Multi-Step Transaction Flows + +Complex flows requiring account fetching from ledger service: + +- **Verification**: `StartVerify` → fetch accounts → `StartVerifyWithAccounts` → + SNARK verification → `VerifySuccess` → `ApplyVerifiedDiff` +- **Application**: `ApplyVerifiedDiff` → fetch accounts → + `ApplyVerifiedDiffWithAccounts` → apply to pool +- **Best Tip Changes**: `BestTipChanged` → fetch accounts → revalidate existing + transactions +- **Transition Frontier**: Handle blockchain state changes affecting transaction + validity + +## Implementation Note + +The core transaction pool logic (validation, ordering, diff application) is +implemented in the `ledger` crate. This state machine wraps that functionality +and integrates it with the node's event-driven architecture, but uses +non-standard patterns that complicate the integration. 
+ +## Interactions + +- Receives transactions from P2P network (pubsub/direct peer) and RPC +- Fetches account states from ledger service for all validation operations +- Requests SNARK verification for transaction signatures +- Provides fee-ordered transactions to block producer via + `CollectTransactionsByFee` +- Handles best tip changes and transition frontier diffs from blockchain state +- Broadcasts valid transactions to network peers +- Manages transaction rebroadcasting for locally generated transactions + +## Technical Debt + +### Major Issues Requiring Refactoring + +See [transaction_pool_refactoring.md](./transaction_pool_refactoring.md) for +details: + +1. **Pending Actions Anti-Pattern** - Stores actions in state instead of using + proper state transitions, violating Redux principles +2. **Blocking Service Calls** - Synchronous ledger service calls block the state + machine thread +3. **Global State Access** - Uses `unsafe_get_state()` to access global slot + information +4. **Complex Multi-Step Flows** - Implicit state transitions that are hard to + follow and test + +These patterns make the component difficult to test, debug, and maintain +compared to other OpenMina components that follow standard state machine +patterns. diff --git a/node/src/transaction_pool/transaction_pool_refactoring.md b/node/src/transaction_pool/transaction_pool_refactoring.md new file mode 100644 index 000000000..789f9bf5b --- /dev/null +++ b/node/src/transaction_pool/transaction_pool_refactoring.md @@ -0,0 +1,199 @@ +# Transaction Pool Refactoring Notes + +This document outlines architectural improvements needed to align the +transaction pool component with the standard state machine patterns used +throughout the OpenMina codebase. + +## Current Implementation Issues + +### 1. 
Pending Actions Pattern + +The component uses an unconventional pattern where actions are stored in +`pending_actions` and retrieved later: + +```rust +// Current pattern +let pending_id = substate.make_action_pending(action); +// ... later ... +let action = substate.pending_actions.remove(pending_id).unwrap() +``` + +This pattern appears in: + +- `StartVerify` → `StartVerifyWithAccounts` +- `ApplyVerifiedDiff` → `ApplyVerifiedDiffWithAccounts` +- `ApplyTransitionFrontierDiff` → `ApplyTransitionFrontierDiffWithAccounts` +- `BestTipChanged` → `BestTipChangedWithAccounts` + +**Issue**: This breaks the standard Redux pattern where state should represent +the current state, not store actions. + +### 2. Blocking Service Call + +The ledger service call in `transaction_pool_effects.rs` is synchronous: + +```rust +let accounts = match store + .service() + .ledger_manager() + .get_accounts(&ledger_hash, account_ids.iter().cloned().collect()) +``` + +**Issue**: This blocks the state machine thread, violating the principle of +async service interactions. + +### 3. Direct Global State Access + +Uses `unsafe_get_state()` to access global state: + +```rust +Self::global_slots(state.unsafe_get_state()) +``` + +**Issue**: Components should receive necessary data through actions or maintain +it in their local state. + +### 4. Complex Multi-Step Flows + +The current implementation has implicit multi-step flows that are hard to follow +and test. + +## Proposed Solution + +### 1. 
Replace Pending Actions with Explicit State Machine
+
+Model the verification flow as explicit states:
+
+```rust
+pub enum VerificationState {
+    Idle,
+    FetchingAccounts {
+        // Element/key types are reconstructed for illustration; the exact
+        // command and account types come from the ledger crate.
+        commands: Vec<UserCommand>,
+        from_source: TransactionPoolMessageSource,
+        request_id: LedgerRequestId,
+    },
+    Verifying {
+        commands: Vec<UserCommand>,
+        accounts: BTreeMap<AccountId, Account>,
+        from_source: TransactionPoolMessageSource,
+        verify_id: SnarkUserCommandVerifyId,
+    },
+}
+
+pub enum DiffApplicationState {
+    Idle,
+    FetchingAccounts {
+        diff: DiffVerified,
+        best_tip_hash: LedgerHash,
+        from_source: TransactionPoolMessageSource,
+        request_id: LedgerRequestId,
+    },
+    Applying {
+        diff: DiffVerified,
+        accounts: BTreeMap<AccountId, Account>,
+        from_source: TransactionPoolMessageSource,
+    },
+}
+```
+
+### 2. Implement Async Ledger Service Pattern
+
+Convert to an event-based pattern:
+
+```rust
+// In effects:
+TransactionPoolEffectfulAction::FetchAccounts {
+    request_id,
+    account_ids,
+    ledger_hash
+} => {
+    store.service().ledger_fetch_accounts(
+        request_id,
+        ledger_hash,
+        account_ids,
+    );
+}
+
+// In event source (or future distributed event handling):
+Event::Ledger(LedgerEvent::AccountsFetched { request_id, accounts }) => {
+    store.dispatch(TransactionPoolAction::AccountsFetched {
+        request_id,
+        accounts
+    });
+}
+```
+
+### 3. 
Update Reducer to Handle State Transitions + +Example for verification flow: + +```rust +TransactionPoolAction::StartVerify { commands, from_source } => { + let request_id = LedgerRequestId::new(); + + // Set state + substate.verification_state = VerificationState::FetchingAccounts { + commands: commands.clone(), + from_source: *from_source, + request_id, + }; + + // Dispatch async request + let account_ids = /* extract account ids */; + dispatcher.push(TransactionPoolEffectfulAction::FetchAccounts { + request_id, + account_ids, + ledger_hash: substate.best_tip_hash.clone().unwrap(), + }); +} + +TransactionPoolAction::AccountsFetched { request_id, accounts } => { + match &substate.verification_state { + VerificationState::FetchingAccounts { + commands, + from_source, + request_id: expected_id + } if request_id == expected_id => { + // Transition to verifying state + // Dispatch SNARK verification + } + _ => {} // Ignore if not expecting this response + } +} +``` + +### 4. Maintain Required State Locally + +```rust +pub struct TransactionPoolState { + // ... existing fields ... + current_global_slot: Option<(u32, u32)>, +} + +// Update via action when global slot changes +TransactionPoolAction::GlobalSlotChanged { slot, slot_since_genesis } => { + substate.current_global_slot = Some((*slot, *slot_since_genesis)); +} +``` + +## Benefits + +1. **Predictable State**: State represents what's happening, not stored actions +2. **Non-blocking**: Async ledger calls don't block the state machine +3. **Testable**: Each state transition can be tested independently +4. **Standard Pattern**: Aligns with other OpenMina components +5. **Clear Flow**: State machine makes the flow explicit and debuggable + +## Migration Strategy + +1. Start by converting the ledger service to async pattern +2. Introduce state enums alongside existing pending_actions +3. Gradually migrate each flow to use state transitions +4. Remove pending_actions once all flows are migrated +5. 
Remove unsafe_get_state usage + +## Related Files + +- `transaction_pool_reducer.rs` - Main reducer implementation +- `transaction_pool_effects.rs` - Effectful actions +- `transaction_pool_state.rs` - State definition diff --git a/node/src/transition_frontier/candidate/summary.md b/node/src/transition_frontier/candidate/summary.md new file mode 100644 index 000000000..d8ccf303c --- /dev/null +++ b/node/src/transition_frontier/candidate/summary.md @@ -0,0 +1,79 @@ +# Transition Frontier Candidate State Machine + +Manages incoming block candidates through multi-stage validation and +consensus-based ordering to determine the best verified block for transition +frontier updates. + +## Purpose + +- Receives and validates incoming candidate blocks from P2P network +- Orders candidates using consensus rules (worst to best) in priority queue +- Manages multi-stage validation pipeline for block verification +- Tracks chain proofs for fork validation and consensus decisions +- Maintains invalid block blacklist to prevent reprocessing + +## Multi-Stage Validation Pipeline + +``` +BlockReceived → Prevalidated → SnarkVerifyPending → SnarkVerifySuccess +``` + +1. **Received** - initial candidate block received from network +2. **Prevalidated** - basic block structure and consensus validation passed +3. **SnarkVerifyPending** - block SNARK proof verification in progress +4. 
**SnarkVerifySuccess** - SNARK proof verified successfully + +## Consensus-Based Ordering + +- **Priority queue** - maintains `BTreeSet` ordered by consensus rules (worst to + best) +- **Best candidate selection** - identifies highest priority verified candidate +- **Fork decision support** - provides ordered candidates for consensus + evaluation +- **Pruning** - removes candidates worse than best verified candidate + +## Chain Proof Management + +- **Chain proof collection** - gathers proof chains for fork validation +- **Automatic chain proof derivation** - constructs proofs from existing + transition frontier +- **Fork validation support** - provides necessary proofs for consensus fork + decisions + +## Invalid Block Tracking + +- **Blacklist maintenance** - tracks blocks that failed validation permanently +- **Memory optimization** - moves failed blocks to lightweight invalid tracking +- **Reprocessing prevention** - avoids re-validating known invalid blocks +- **Slot-based pruning** - removes old invalid blocks based on finality + +## Key Features + +- **Consensus ordering** - uses `consensus_take` function for accurate priority + ordering +- **Memory efficient** - prunes worse candidates and optimizes invalid block + storage +- **Fork decision ready** - provides best verified candidate for transition + frontier sync +- **Multi-peer resilience** - handles candidates from multiple peers + independently +- **Chain proof optimization** - derives chain proofs automatically when + possible + +## Integration Points + +- **SNARK verification** - coordinates with SNARK block verify service +- **Transition frontier sync** - triggers sync when better candidate is found +- **Block prevalidation** - integrates with block prevalidation logic +- **Consensus evaluation** - provides candidates for fork decision algorithms + +## State Management + +- **Deterministic ordering** - maintains consistent candidate priority across + restarts +- **Status tracking** - tracks 
validation progress for each candidate + independently +- **Chain proof caching** - stores and reuses chain proofs across validation + stages +- **Best candidate identification** - efficiently identifies best verified block + for sync initiation diff --git a/node/src/transition_frontier/genesis/summary.md b/node/src/transition_frontier/genesis/summary.md new file mode 100644 index 000000000..63f406102 --- /dev/null +++ b/node/src/transition_frontier/genesis/summary.md @@ -0,0 +1,78 @@ +# Transition Frontier Genesis State Machine + +Manages genesis block creation, proving, and chain initialization for fresh +blockchain startup. + +## Purpose + +- Loads genesis configuration data from external sources +- Produces genesis block with proper protocol state structure +- Generates real proofs for genesis block (block producers only) +- Provides genesis blocks for transition frontier initialization +- Supports both block producer and non-block-producer node configurations + +## Genesis Block Creation Flow + +``` +Idle → LedgerLoadPending → LedgerLoadSuccess → Produced → ProvePending → ProveSuccess +``` + +1. **LedgerLoadPending** - loads genesis configuration from genesis effectful + service +2. **LedgerLoadSuccess** - genesis configuration loaded successfully with ledger + data +3. **Produced** - genesis block structure created with protocol state and stake + proof +4. **ProvePending** - generating real blockchain proof for genesis block (block + producers only) +5. 
**ProveSuccess** - genesis block with real proof completed + +## Dual Proof Support + +- **Real proofs** - full blockchain proofs generated for block producers +- **Dummy proofs** - placeholder proofs for non-block-producer nodes +- **Flexible access** - `block_with_real_or_dummy_proof()` provides appropriate + block type + +## Genesis Block Structure + +- **Protocol state creation** - constructs negative-one and genesis protocol + states +- **Stake proof generation** - creates producer stake proof for consensus + validation +- **Empty block body** - genesis blocks contain no transactions (empty staged + ledger diff) +- **Chain proof setup** - establishes genesis block as root for future chain + proofs + +## Key Features + +- **Service integration** - delegates heavy genesis computation to effectful + service +- **Block producer awareness** - only generates real proofs when node is + producing blocks +- **Memory efficiency** - provides lightweight dummy proofs for non-producing + nodes +- **Protocol state management** - handles negative-one and genesis protocol + state creation +- **Stake proof support** - generates appropriate stake proofs for consensus + operation + +## Integration Points + +- **Genesis effectful service** - loads configuration and performs heavy genesis + computation +- **Transition frontier** - provides genesis block for chain initialization +- **Block production** - supplies real genesis proofs when node becomes block + producer +- **Consensus** - provides stake proofs and protocol states for consensus + validation + +## Configuration Support + +- **Genesis ledger loading** - loads initial account distribution and + configuration +- **Protocol state setup** - establishes initial consensus parameters +- **Epoch configuration** - sets up staking and next epoch ledger information +- **Chain initialization** - provides foundation for transition frontier + bootstrap diff --git a/node/src/transition_frontier/summary.md 
b/node/src/transition_frontier/summary.md new file mode 100644 index 000000000..a7bd18d6b --- /dev/null +++ b/node/src/transition_frontier/summary.md @@ -0,0 +1,136 @@ +# Transition Frontier State Machine + +Manages the blockchain's transition frontier through multi-component +coordination for genesis initialization, candidate evaluation, synchronization, +and chain maintenance. + +## Purpose + +- Maintains the active blockchain state from root to best tip +- Orchestrates complex synchronization with network peers through multi-phase + process +- Evaluates incoming block candidates using consensus-based ordering +- Handles chain reorganizations and fork decisions +- Manages genesis block creation and chain initialization +- Provides chain diffs for transaction pool updates + +## Architecture Overview + +The transition frontier coordinates several components in a hierarchical state +machine: + +### Core State Components + +- **Best Chain**: Vector of applied blocks from transition frontier root to + current best tip +- **Protocol States**: Cached protocol states needed for scan state operations +- **Chain Diff**: Transaction pool updates when chain changes +- **Blacklist**: Invalid blocks that failed application after SNARK verification + +### Sub-Component State Machines + +#### 1. Genesis Component + +Handles genesis block creation and proving: + +- Loads genesis configuration from external sources +- Produces genesis blocks with protocol state structure +- Generates real proofs for block producers, dummy proofs for others +- Provides foundation for chain initialization + +#### 2. Candidates Component + +Manages incoming block candidates with consensus-based ordering: + +- Multi-stage validation pipeline (Received → Prevalidated → SnarkVerifyPending + → SnarkVerifySuccess) +- Consensus-ordered priority queue using `consensus_take` function +- Chain proof management for fork validation +- Invalid block blacklisting and memory optimization + +#### 3. 
Sync Component + +Orchestrates blockchain synchronization: + +- **Phase 1 (Bootstrap)**: Sequential ledger sync (Staking → NextEpoch → Root) +- **Phase 2 (Catchup)**: Block fetching and sequential application +- **Phase 3 (Commit)**: Chain commitment and finalization +- Multi-peer resilience with retry logic and error recovery + +#### 4. Ledger Sync Sub-Component (within Sync) + +Coordinates snarked and staged ledger synchronization: + +- **Snarked Sync**: BFS Merkle tree reconstruction with optimized account + fetching +- **Staged Sync**: Parts fetching and reconstruction with empty ledger + optimization +- Sequential coordination ensuring snarked completion before staged begins + +## Synchronization Process + +### Bootstrap Phase (Ledger Synchronization) + +``` +StakingLedgerSync → NextEpochLedgerSync → RootLedgerSync +``` + +Each ledger sync uses multi-phase algorithms: + +- **Snarked ledgers**: NumAccounts query → BFS Merkle tree sync → Success +- **Staged ledgers**: Parts fetching → Multi-peer validation → Reconstruction → + Success + +### Catchup Phase (Block Synchronization) + +``` +BlocksPending → BlocksSuccess → CommitPending → CommitSuccess → Synced +``` + +- Parallel multi-peer block fetching with retry logic +- Sequential block application maintaining order +- Root snarked ledger update tracking for proper reconstruction + +## Key Features + +- **Multi-phase sync strategy** - bootstrap → catchup → commit process +- **Consensus-based candidate ordering** - maintains worst-to-best candidate + priority queue +- **Chain proof management** - supports fork validation and consensus decisions +- **Protocol state caching** - optimizes scan state operations with needed + protocol states +- **Transaction pool integration** - provides chain diffs for efficient pool + updates +- **Memory optimization** - prunes candidates and invalid blocks based on + consensus rules + +## Integration Points + +- **P2P Network**: Receives blocks and handles multi-peer 
synchronization +- **SNARK Verification**: Coordinates block and transaction proof verification +- **Transaction Pool**: Provides chain diffs for pool updates and revalidation +- **Block Production**: Supplies genesis proofs and current chain state +- **Ledger Service**: Delegates heavy computation for ledger operations + +## Technical Debt + +This component uses the old-style state machine pattern and requires significant +refactoring: + +- **Old Architecture**: Uses separate reducer and effects files instead of + unified reducers +- **Direct State Access**: Effects directly access state via `state.get()` and + `store.state()` +- **Service Interactions**: Service calls not properly abstracted through thin + effectful actions +- **Error Handling**: Multiple TODO comments indicate missing error propagation +- **Refactoring TODOs**: Several inline comments indicate code that needs to be + moved to proper locations + +The migration to new-style patterns would involve: + +1. Merging effects logic into unified reducers +2. Using proper substate contexts for state access +3. Converting service interactions to thin effectful actions +4. Implementing proper error handling throughout +5. Moving misplaced logic to appropriate modules (especially sync-related code) diff --git a/node/src/transition_frontier/sync/ledger/snarked/summary.md b/node/src/transition_frontier/sync/ledger/snarked/summary.md new file mode 100644 index 000000000..873f1cef7 --- /dev/null +++ b/node/src/transition_frontier/sync/ledger/snarked/summary.md @@ -0,0 +1,53 @@ +# Snarked Ledger Sync State Machine + +Synchronizes the fully verified (snarked) ledger using a two-phase BFS Merkle +tree reconstruction algorithm. 
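The depth-based query decision in the two-phase process described below can be sketched as follows. `ACCOUNT_SUBTREE_HEIGHT = 6` matches this summary; `LEDGER_DEPTH = 35` and all type names are illustrative assumptions, not the actual OpenMina definitions:

```rust
// Hypothetical sketch of the Phase 2 query selection. Above the account-subtree
// boundary the sync asks peers for child hashes; at the boundary it fetches
// whole subtrees of up to 2^ACCOUNT_SUBTREE_HEIGHT = 64 accounts per request.
const LEDGER_DEPTH: usize = 35; // assumed value; the real constant lives in the ledger crate
const ACCOUNT_SUBTREE_HEIGHT: usize = 6; // per this summary

#[derive(Debug, PartialEq)]
enum PeerQuery {
    /// Ask for the two child hashes of an internal node.
    ChildHashes,
    /// Ask for all accounts under this node (at most `max`).
    Accounts { max: usize },
}

fn query_for_depth(depth: usize) -> PeerQuery {
    if depth < LEDGER_DEPTH - ACCOUNT_SUBTREE_HEIGHT {
        PeerQuery::ChildHashes
    } else {
        PeerQuery::Accounts { max: 1usize << ACCOUNT_SUBTREE_HEIGHT }
    }
}

fn main() {
    assert_eq!(query_for_depth(0), PeerQuery::ChildHashes);
    assert_eq!(query_for_depth(29), PeerQuery::Accounts { max: 64 });
}
```

The BFS traversal would issue `ChildHashes` queries level by level until reaching depth 29 (under the assumed depth), then switch to bulk account fetches.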
+ +## Purpose + +- Downloads snarked ledger from peers using optimized multi-phase approach +- Verifies Merkle tree integrity with hash consistency checks +- Reconstructs verified ledger state from fetched components +- Provides progress tracking and error recovery with peer retry logic + +## Two-Phase Synchronization Process + +### Phase 1: NumAccounts Query + +- Queries peers for total account count and content hash +- Validates responses from multiple peers for consistency +- Establishes the scope and root hash for Merkle tree sync + +### Phase 2: BFS Merkle Tree Sync + +- Uses breadth-first search to traverse the Merkle tree +- Fetches child hashes for internal nodes (depth < + `LEDGER_DEPTH - ACCOUNT_SUBTREE_HEIGHT`) +- Optimized account fetching at subtree level (`ACCOUNT_SUBTREE_HEIGHT = 6`) +- Fetches up to 64 accounts per request when reaching account subtrees + +## Key Features + +- **Multi-peer retry logic** - tracks per-peer RPC states with error recovery +- **Progress estimation** - provides detailed sync progress based on tree + structure +- **Address-based querying** - systematically fetches tree components by ledger + address +- **Peer availability checking** - validates peers have required ledger data + before querying +- **Hash validation** - verifies all received hashes match expected Merkle tree + structure + +## State Flow + +``` +NumAccountsPending → NumAccountsSuccess → MerkleTreeSyncPending → MerkleTreeSyncSuccess → Success +``` + +## Interactions + +- Requests account counts and Merkle tree data via P2P RPC +- Validates all received hashes against expected tree structure +- Coordinates with ledger service for tree reconstruction +- Provides progress updates for UI/monitoring +- Integrates with peer management for retry logic diff --git a/node/src/transition_frontier/sync/ledger/staged/summary.md b/node/src/transition_frontier/sync/ledger/staged/summary.md new file mode 100644 index 000000000..2370f934c --- /dev/null +++ 
b/node/src/transition_frontier/sync/ledger/staged/summary.md @@ -0,0 +1,61 @@ +# Staged Ledger Sync State Machine + +Synchronizes the staged ledger containing pending transactions and scan state +through parts fetching and reconstruction. + +## Purpose + +- Downloads staged ledger auxiliary data and pending coinbases from peers +- Reconstructs staged ledger from snarked ledger base plus fetched components +- Validates scan state integrity and transaction ordering +- Handles both empty and non-empty staged ledger cases + +## Two-Path Reconstruction Process + +### Path 1: Empty Staged Ledger (`ReconstructEmpty`) + +- Detects when staged ledger is empty (aux_hash and pending_coinbase_aux are + zero) +- Directly uses snarked ledger as the staged ledger +- Bypasses parts fetching for efficiency + +### Path 2: Non-Empty Staged Ledger (`ReconstructPending`) + +- Fetches `StagedLedgerAuxAndPendingCoinbases` from peers via RPC +- Validates fetched parts against expected hashes +- Delegates heavy reconstruction to staged ledger service +- Collects needed protocol states during reconstruction process + +## Multi-Peer Validation + +- **Parts fetching** - requests auxiliary data from multiple peers +- **Validation** - verifies fetched parts match expected structure and hashes +- **Error recovery** - retries with different peers on validation failures +- **Consensus** - requires valid parts from at least one peer to proceed + +## State Flow + +``` +PartsFetchPending → PartsFetchSuccess → ReconstructPending → ReconstructSuccess → Success + ↘ ReconstructEmpty ↗ +``` + +## Key Features + +- **Selective reconstruction** - optimizes empty staged ledger case +- **Service delegation** - uses specialized service for complex reconstruction + work +- **Hash validation** - ensures reconstructed ledger matches expected target + hash +- **Protocol state collection** - gathers protocol states needed for transaction + validation +- **Multi-peer resilience** - validates parts from multiple 
sources for + reliability + +## Interactions + +- Fetches staged ledger parts via P2P RPC from multiple peers +- Validates auxiliary data and pending coinbase structures +- Delegates reconstruction to staged ledger service for heavy computation +- Coordinates with snarked ledger sync (requires snarked completion first) +- Provides reconstructed staged ledger for transition frontier use diff --git a/node/src/transition_frontier/sync/ledger/summary.md b/node/src/transition_frontier/sync/ledger/summary.md new file mode 100644 index 000000000..6b2f9dcad --- /dev/null +++ b/node/src/transition_frontier/sync/ledger/summary.md @@ -0,0 +1,64 @@ +# Ledger Sync State Machine + +Coordinates sequential synchronization of snarked and staged ledgers during +transition frontier sync. + +## Purpose + +- Orchestrates multi-phase ledger synchronization process +- Manages sequential flow from snarked to staged ledger sync +- Handles target updates and sync interruptions gracefully +- Collects protocol states needed for transaction validation + +## Sequential Synchronization Flow + +### Phase 1: Snarked Ledger Sync + +- Delegates to snarked ledger sync component for Merkle tree reconstruction +- Synchronizes the fully verified base ledger state +- Required foundation for staged ledger reconstruction + +### Phase 2: Staged Ledger Sync (if needed) + +- Only triggered if target includes staged ledger data +- Builds upon completed snarked ledger to reconstruct pending transactions +- Delegates to staged ledger sync component for parts fetching and + reconstruction + +## State Flow + +``` +Init → Snarked(NumAccountsPending → ... → Success) → Staged(PartsFetchPending → ... 
→ Success) → Success + ↘ (direct to Success if no staged ledger required) +``` + +## Target Management + +- **Flexible targeting** - supports different ledger sync scenarios (staking, + next epoch, root) +- **Target updates** - handles best tip changes during sync with intelligent + restart logic +- **Compatibility checking** - validates that ledger hashes are compatible for + incremental sync + +## Key Features + +- **Sequential coordination** - ensures snarked completes before staged begins +- **Smart restarts** - only restarts from beginning if target hashes change + incompatibly +- **Protocol state aggregation** - collects needed protocol states from staged + sync +- **Target validation** - ensures sync targets match expected ledger structure + +## Sub-Components + +- **Snarked Sync** - BFS Merkle tree reconstruction with multi-peer validation +- **Staged Sync** - Parts fetching and reconstruction with empty ledger + optimization + +## Interactions + +- Coordinates between snarked and staged sync sub-components +- Reports aggregate sync progress to parent transition frontier sync +- Handles target updates from parent when best tip changes during sync +- Provides completed ledger state and protocol states to transition frontier diff --git a/node/src/transition_frontier/sync/summary.md b/node/src/transition_frontier/sync/summary.md new file mode 100644 index 000000000..2aad2cd83 --- /dev/null +++ b/node/src/transition_frontier/sync/summary.md @@ -0,0 +1,90 @@ +# Transition Frontier Sync State Machine + +Orchestrates the complete transition frontier synchronization process through +sequential ledger sync phases followed by block fetching, application, and +commitment. 
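The sequential phases documented in this summary can be modeled as a linear progression. Variant names mirror the documented state flow; the real states carry peers, ledgers, and blocks, so this is only a structural sketch:

```rust
// Minimal sketch of the documented sync state flow; the actual state machine
// attaches much more data (peers, ledgers, fetched blocks) to each state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyncPhase {
    Idle,
    Init,
    StakingLedgerPending,
    StakingLedgerSuccess,
    NextEpochLedgerPending,
    NextEpochLedgerSuccess,
    RootLedgerPending,
    RootLedgerSuccess,
    BlocksPending,
    BlocksSuccess,
    CommitPending,
    CommitSuccess,
    Synced,
}

impl SyncPhase {
    /// Advance to the next documented phase; `Synced` is terminal.
    fn next(self) -> SyncPhase {
        use SyncPhase::*;
        match self {
            Idle => Init,
            Init => StakingLedgerPending,
            StakingLedgerPending => StakingLedgerSuccess,
            StakingLedgerSuccess => NextEpochLedgerPending,
            NextEpochLedgerPending => NextEpochLedgerSuccess,
            NextEpochLedgerSuccess => RootLedgerPending,
            RootLedgerPending => RootLedgerSuccess,
            RootLedgerSuccess => BlocksPending,
            BlocksPending => BlocksSuccess,
            BlocksSuccess => CommitPending,
            CommitPending => CommitSuccess,
            CommitSuccess => Synced,
            Synced => Synced,
        }
    }

    /// Bootstrap covers the ledger phases; catchup covers blocks and commit.
    fn is_bootstrap(self) -> bool {
        use SyncPhase::*;
        matches!(
            self,
            StakingLedgerPending
                | StakingLedgerSuccess
                | NextEpochLedgerPending
                | NextEpochLedgerSuccess
                | RootLedgerPending
                | RootLedgerSuccess
        )
    }
}

fn main() {
    let mut s = SyncPhase::Idle;
    let mut steps = 0;
    while s != SyncPhase::Synced {
        s = s.next();
        steps += 1;
    }
    assert_eq!(steps, 12); // twelve transitions from Idle to Synced
    assert!(SyncPhase::RootLedgerPending.is_bootstrap());
}
```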
+ +## Purpose + +- Synchronizes transition frontier to a new best tip through multi-phase process +- Downloads and reconstructs all required ledger states (staking, next epoch, + root) +- Fetches missing blocks between current root and new best tip +- Applies fetched blocks sequentially to build new chain +- Commits synchronized state to become the new transition frontier + +## Sequential Synchronization Phases + +### Phase 1: Bootstrap (Ledger Synchronization) + +Sequential ledger synchronization for consensus operation: + +1. **Staking Ledger Sync** - synchronizes staking epoch ledger (snarked only) +2. **Next Epoch Ledger Sync** - synchronizes next epoch ledger if different + (snarked only) +3. **Root Ledger Sync** - synchronizes ledger at transition frontier root + (snarked + staged) + +Each ledger sync delegates to the ledger sync coordinator which handles snarked +and staged components. + +### Phase 2: Catchup (Block Synchronization) + +``` +BlocksPending → BlocksSuccess → CommitPending → CommitSuccess → Synced +``` + +- **BlocksPending**: Fetches missing blocks from peers and applies them + sequentially +- **BlocksSuccess**: All blocks fetched and applied successfully +- **CommitPending**: Commits the synchronized chain to become new transition + frontier +- **CommitSuccess**: Commitment completed successfully +- **Synced**: Synchronization complete, transition frontier updated + +## Multi-Peer Block Fetching + +- **Parallel fetching** - requests blocks from multiple peers simultaneously +- **Retry logic** - retries failed block fetches with different peers +- **Sequential application** - applies blocks in order even if fetched out of + order +- **Error recovery** - handles block application errors gracefully + +## Root Snarked Ledger Updates + +Tracks snarked ledger transitions that occur during sync to enable proper ledger +reconstruction: + +- Maps new snarked ledger hashes to parent ledger and staged ledger information +- Enables reconstruction of 
intermediate snarked ledger states +- Required when the root block changes during synchronization + +## State Flow + +``` +Idle → Init → StakingLedgerPending → StakingLedgerSuccess + → NextEpochLedgerPending → NextEpochLedgerSuccess + → RootLedgerPending → RootLedgerSuccess + → BlocksPending → BlocksSuccess + → CommitPending → CommitSuccess → Synced +``` + +## Key Features + +- **Three-phase sync strategy** - bootstrap (ledgers) → catchup (blocks) → + commit +- **Multi-peer resilience** - fetches from multiple peers with error recovery +- **Sequential application** - maintains block order during the application + process +- **Protocol state collection** - gathers needed protocol states throughout sync +- **Root update tracking** - handles snarked ledger changes during sync +- **Sync phase identification** - distinguishes bootstrap vs catchup vs synced + states + +## Interactions + +- Delegates ledger synchronization to the ledger sync coordinator +- Fetches missing blocks via P2P RPC from multiple peers +- Applies blocks sequentially using transition frontier application logic +- Commits synchronized state to update the transition frontier +- Collects and manages protocol states needed for consensus validation +- Coordinates with the block application service for heavy computation diff --git a/node/src/watched_accounts/summary.md b/node/src/watched_accounts/summary.md new file mode 100644 index 000000000..99c641cc6 --- /dev/null +++ b/node/src/watched_accounts/summary.md @@ -0,0 +1,55 @@ +# Watched Accounts State Machine + +**Originally designed for when the web node was a light client without full node +capabilities. The component is fully integrated into the state machine but +currently has no entry points to trigger its functionality.** + +Tracks specific account public keys to monitor their transaction activity and +ledger state changes. 
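As a rough sketch of the transaction detection described below: a command in a block's diff is relevant when the watched public key appears as fee payer or receiver, and matches are stored ordered by nonce. All types here are simplified stand-ins, not the real ledger types:

```rust
// Illustrative stand-ins; the real types come from the ledger and transaction
// crates and carry signatures, fees, and full command payloads.
#[derive(Clone, PartialEq)]
struct PubKey(&'static str);

struct UserCommand {
    fee_payer: PubKey,
    receiver: PubKey,
    nonce: u32,
}

/// Commands in a block diff that touch the watched account, ordered by nonce
/// as the summary describes.
fn relevant_commands<'a>(diff: &'a [UserCommand], watched: &PubKey) -> Vec<&'a UserCommand> {
    let mut txs: Vec<_> = diff
        .iter()
        .filter(|c| &c.fee_payer == watched || &c.receiver == watched)
        .collect();
    txs.sort_by_key(|c| c.nonce);
    txs
}

fn main() {
    let alice = PubKey("alice");
    let bob = PubKey("bob");
    let diff = vec![
        UserCommand { fee_payer: bob.clone(), receiver: alice.clone(), nonce: 7 },
        UserCommand { fee_payer: bob.clone(), receiver: bob.clone(), nonce: 3 },
        UserCommand { fee_payer: alice.clone(), receiver: bob.clone(), nonce: 2 },
    ];
    let hits = relevant_commands(&diff, &alice);
    assert_eq!(hits.len(), 2); // the bob-to-bob command is filtered out
    assert_eq!(hits[0].nonce, 2); // sorted by nonce
}
```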
+ +## Purpose + +- Maintains a registry of accounts to monitor by public key +- Detects when watched accounts appear in block transactions +- Retrieves account state from ledger after transaction inclusion +- Provides transaction history and account state tracking per block + +## Account State Tracking + +- **Initial State**: Fetches account data from current best tip ledger +- **Block Updates**: Tracks relevant transactions in each new block +- **Ledger Queries**: Retrieves updated account state after block inclusion +- **State Transitions**: Idle → Pending → Success/Error for both initial and + per-block queries + +## Transaction Detection + +- **Diff Analysis**: Scans staged ledger diffs for account references +- **Account Matching**: Matches transactions by public key in fee payer or + receiver roles +- **Transaction Storage**: Stores relevant transactions ordered by nonce + +## Data Structure + +- **WatchedAccountsState**: Map of public keys to account monitoring state +- **WatchedAccountState**: Per-account initial state + block history queue +- **WatchedAccountBlockState**: Per-block transaction list + ledger account data +- **Transaction Filtering**: Extracts UserCommands affecting the watched account + +## Interactions + +- **Transition Frontier**: Monitors new blocks for relevant transactions +- **P2P Network**: Queries peers for account data (currently TODO/disabled) +- **Ledger System**: Retrieves account state from specific ledger hashes + +## Current Status + +- **Historical context**: Built for light client use case before the web node + had full node capabilities +- **Fully integrated**: Complete state machine implementation with proper + reducers +- **No entry points**: No RPC endpoints, CLI commands, or triggers to activate + functionality +- **P2P queries disabled**: Ledger query logic marked TODO in reducer +- **Ready for activation**: Could be enabled by adding appropriate triggers +- **Limited scope**: Only handles UserCommand transactions, no 
ZkApp support diff --git a/p2p/src/channels/best_tip/summary.md b/p2p/src/channels/best_tip/summary.md new file mode 100644 index 000000000..871e78ced --- /dev/null +++ b/p2p/src/channels/best_tip/summary.md @@ -0,0 +1,67 @@ +# Best Tip Channel State Machine + +Transport-agnostic best blockchain tip exchange channel that abstracts over both +libp2p gossip and WebRTC request protocols. + +## Purpose + +- **Transport abstraction** - Provides unified interface for best tip + propagation over libp2p (gossip/pubsub) and WebRTC (request/response) +- **Protocol adaptation** - Handles push-based broadcasting for libp2p and + pull-based requests for WebRTC +- **Consensus coordination** - Helps network converge on canonical best chain + through tip sharing +- **Sync decision support** - Enables informed sync decisions based on peer tip + quality + +## State Flow + +``` +Disabled/Enabled → Init → Pending → Ready → (WaitingForRequest ↔ Requested → Responded) +``` + +## Key Features + +- **Dual transport support** - Seamlessly operates over libp2p gossip and WebRTC + connections +- **Simple request protocol** - GetNext/BestTip message flow for single block + exchange +- **Tip tracking** - Maintains last sent/received tip information per peer +- **Bidirectional state management** - Separate local/remote state machines + within Ready state + +## Integration Points + +- **libp2p pubsub** - Broadcasts best tips via gossip protocol for libp2p peers +- **WebRTC data channels** - Sends tip requests/responses over WebRTC + connections +- **P2pChannelsEffectfulAction** - Transport-agnostic channel initialization and + message sending +- **Transition frontier coordination** - Sources and evaluates tips from/to + blockchain state + +## Technical Implementation + +- **Transport detection** - Adapts behavior based on peer connection type + (libp2p vs WebRTC) +- **Block exchange** - Handles `ArcBlock` serialization and transmission +- **State synchronization** - Coordinates tip propagation 
across heterogeneous + network +- **Channel abstraction** - Encapsulates transport-specific details behind + unified interface + +## Technical Debt + +### Minor Issues + +- **TODO Comments**: Some incomplete functionality noted in comments regarding + tip comparison and consensus enforcement logic +- **State Methods**: Could benefit from additional helper methods to reduce + pattern matching in reducer + +### Note on Architecture + +This channel provides an abstraction layer between inner logic components and +libp2p/WebRTC transports. The channel is functioning correctly within its +intended scope - validation and consensus logic properly belong in the inner +logic components that use this abstraction. diff --git a/p2p/src/channels/rpc/summary.md b/p2p/src/channels/rpc/summary.md new file mode 100644 index 000000000..601fc23b0 --- /dev/null +++ b/p2p/src/channels/rpc/summary.md @@ -0,0 +1,74 @@ +# RPC Channel State Machine + +Transport-agnostic RPC communication channel that abstracts blockchain data +requests/responses over both libp2p and WebRTC. + +## Purpose + +- **Transport abstraction** - Provides unified RPC interface over libp2p (native + RPC) and WebRTC (data channels) +- **Request/response coordination** - Manages bidirectional blockchain data + exchanges with ID tracking +- **Transport-aware routing** - Routes requests based on transport capabilities + (some RPCs WebRTC-only) +- **Timeout management** - Implements configurable timeouts per RPC type across + transports + +## State Flow + +``` +Disabled/Enabled → Init → Pending → Ready → (WaitingForRequest ↔ Requested → Responded) +``` + +## Key Features + +- **Dual transport support** - Seamlessly operates over libp2p RPC streams and + WebRTC data channels +- **RPC type awareness** - Different RPC types (BestTip, Ledger, Block, Snark, + etc.) 
with transport-specific support +- **ID-based correlation** - Matches requests to responses using `P2pRpcId` + across async operations +- **Concurrent request handling** - Manages multiple pending requests per peer + with flow control +- **Timeout coordination** - Per-RPC-type timeouts adapted to transport + characteristics + +## Integration Points + +- **libp2p RPC streams** - Native request/response over libp2p protocol streams +- **WebRTC data channels** - RPC message serialization over WebRTC connections +- **Blockchain services** - Routes to ledger, block store, SNARK pool, + transaction pool +- **P2pChannelsEffectfulAction** - Transport-agnostic RPC initialization and + message routing + +## Technical Implementation + +- **Transport detection** - Uses `supported_by_libp2p()` to route requests + appropriately +- **Request queuing** - `VecDeque` for managing concurrent remote requests per + peer +- **Response correlation** - ID-based matching of async responses to original + requests +- **Channel abstraction** - Encapsulates libp2p vs WebRTC RPC differences behind + unified interface + +## Technical Debt + +### Moderate Issues + +- **Incomplete Request/Response Validation**: TODO comments indicate missing + validation for matching requests to responses + +### Minor Issues + +- **Hard-coded Concurrency Limits**: Maximum of 5 concurrent remote requests per + peer is not configurable +- **No Remote Request Cleanup**: Remote requests lack timeout mechanism and are + only removed on ResponseSend. 
However, this is not a significant issue since: + - Each peer has its own isolated channel + - Maximum 5 requests at ~160 bytes each = ~800 bytes per peer + - A peer that doesn't respond only blocks their own channel, not affecting + other peers + - **Suggested improvement**: Add a static assertion to verify request size + remains small, and document in code why cleanup isn't critical diff --git a/p2p/src/channels/signaling/discovery/summary.md b/p2p/src/channels/signaling/discovery/summary.md new file mode 100644 index 000000000..6921eb67d --- /dev/null +++ b/p2p/src/channels/signaling/discovery/summary.md @@ -0,0 +1,66 @@ +# Signaling Discovery Channel State Machine + +Facilitates WebRTC peer discovery by exchanging information about available +WebRTC-capable peers in the network. + +## Purpose + +- **Peer capability discovery** - Requests and shares information about + WebRTC-capable peers +- **Connection bootstrapping** - Enables discovery of peers that can accept + WebRTC connections +- **Rate-limited discovery** - Manages discovery request frequency to prevent + spam +- **Bidirectional state tracking** - Tracks both local (outgoing) and remote + (incoming) discovery flows + +## State Flow + +``` +Disabled/Enabled → Init → Pending → Ready → (RequestSend ↔ DiscoveryRequestReceived) + ↘ (DiscoveredSend → DiscoveredAccepted/Rejected) +``` + +## Key Features + +- **GetNext message protocol** - Sends `GetNext` requests to discover + WebRTC-capable peers +- **Discovered message responses** - Responds with `Discovered` messages + containing peer public keys +- **Rate limiting** - Enforces 60-second minimum interval between discovery + requests +- **Accept/reject handling** - Manages responses to discovery offers from other + peers + +## Integration Points + +- **P2pChannelsEffectfulAction::InitChannel** - Initializes the discovery + channel +- **P2pChannelsEffectfulAction::MessageSend** - Sends GetNext and Discovered + messages +- 
**webrtc_discovery_respond_with_availble_peers** - Generates responses with + available WebRTC peers +- **P2pChannelsSignalingExchangeAction** - Coordinates with exchange channel for + connection setup + +## Technical Implementation + +- **Dual state tracking** - Separate local/remote state machines within Ready + state +- **Message-based protocol** - Uses `SignalingDiscoveryChannelMsg::GetNext` and + `::Discovered` +- **Time-based rate limiting** - Checks elapsed time since last request in + enabling conditions +- **Public key exchange** - Shares target peer public keys for WebRTC connection + establishment + +## Technical Debt + +- TODO: Make 60-second discovery interval configurable in `RequestSend` enabling + conditions +- TODO: Add interval constraints between incoming discovery requests to prevent + spam +- TODO: Implement custom error handling instead of generic errors in discovery + response handling +- TODO: Address potential enabling condition issues in discovery message + processing diff --git a/p2p/src/channels/signaling/exchange/summary.md b/p2p/src/channels/signaling/exchange/summary.md new file mode 100644 index 000000000..a8a13cbcf --- /dev/null +++ b/p2p/src/channels/signaling/exchange/summary.md @@ -0,0 +1,53 @@ +# Signaling Exchange Channel State Machine + +Relays encrypted WebRTC signaling messages between peers to establish direct +connections through intermediary signaling servers. 
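The three-message relay flow can be sketched with a small message enum; type and variant names are illustrative stand-ins for the real OpenMina definitions:

```rust
// Sketch of the GetNext → OfferToYou → Answer flow. Encrypted payloads are
// just byte vectors here; the real types carry proper encryption structures.
type EncryptedOffer = Vec<u8>;
type EncryptedAnswer = Vec<u8>;

enum ExchangeMsg {
    /// Ask the relay for the next pending connection offer.
    GetNext,
    /// The relay forwards an encrypted offer from the named peer.
    OfferToYou { offerer_pub_key: String, offer: EncryptedOffer },
    /// `None` signals rejection of the offer, as the summary notes.
    Answer(Option<EncryptedAnswer>),
}

fn describe(msg: &ExchangeMsg) -> &'static str {
    match msg {
        ExchangeMsg::GetNext => "requesting next pending offer",
        ExchangeMsg::OfferToYou { .. } => "relaying encrypted offer",
        ExchangeMsg::Answer(Some(_)) => "accepting with encrypted answer",
        ExchangeMsg::Answer(None) => "rejecting the offer",
    }
}

fn main() {
    assert_eq!(describe(&ExchangeMsg::Answer(None)), "rejecting the offer");
    let offer = ExchangeMsg::OfferToYou {
        offerer_pub_key: "peer-a".into(),
        offer: vec![0u8; 16],
    };
    assert_eq!(describe(&offer), "relaying encrypted offer");
}
```

The optional answer mirrors the documented rejection path: a relay can deliver `Answer(None)` without ever producing SDP material.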
+ +## Purpose + +- **Encrypted offer/answer relay** - Routes encrypted SDP offers and answers + between connecting peers +- **Signaling server coordination** - Uses intermediate peers as signaling + relays for WebRTC connection setup +- **Connection request management** - Handles GetNext requests to receive + pending connection offers +- **Bidirectional state tracking** - Manages both local (outgoing) and remote + (incoming) signaling flows + +## State Flow + +``` +Disabled/Enabled → Init → Pending → Ready → (RequestSend ↔ RequestReceived) + ↘ (OfferSend → Offered → Answered) +``` + +## Key Features + +- **GetNext protocol** - Requests pending connection offers from signaling + relays +- **Encrypted message handling** - Processes `EncryptedOffer` and + `EncryptedAnswer` messages +- **Offer relay service** - Acts as signaling server for other peers attempting + connections +- **Answer coordination** - Handles optional encrypted answers (can be None for + rejection) + +## Integration Points + +- **P2pChannelsEffectfulAction::InitChannel** - Initializes the signaling + exchange channel +- **P2pChannelsEffectfulAction::MessageSend** - Sends GetNext, OfferToYou, and + Answer messages +- **P2pConnectionIncomingAction::Init** - Initiates incoming WebRTC connections + from received offers +- **P2pChannelsSignalingDiscoveryAction** - Coordinates with discovery channel + for peer advertisement + +## Technical Implementation + +- **Dual state tracking** - Separate local/remote state machines within Ready + state +- **Three-message protocol** - GetNext → OfferToYou → Answer message flow +- **Public key tracking** - Associates offers with offerer public keys for + authentication +- **Optional answer handling** - Supports None answers for connection rejection diff --git a/p2p/src/channels/snark/summary.md b/p2p/src/channels/snark/summary.md new file mode 100644 index 000000000..bc4606326 --- /dev/null +++ b/p2p/src/channels/snark/summary.md @@ -0,0 +1,46 @@ +# SNARK Channel 
State Machine + +Transport-agnostic SNARK work distribution channel that abstracts over both +libp2p gossip and WebRTC pull-based protocols. + +## Purpose + +- **Transport abstraction** - Provides unified interface for SNARK work + propagation over libp2p (gossip/pubsub) and WebRTC (request/response) +- **Protocol adaptation** - Handles push-based broadcasting for libp2p and + pull-based requests for WebRTC +- **Flow control** - Implements request/response protocol for WebRTC peers with + count-based limits +- **Network-wide distribution** - Ensures SNARK work reaches all network + participants regardless of transport + +## Key Components + +- **Request Handler**: Manages GetNext/WillSend/Snark protocol flow +- **Distribution Manager**: Tracks work distribution state per peer +- **Broadcast Handler**: Manages work propagation via gossip +- **State Tracker**: Maintains request/response state for local and remote peers + +## Interactions + +- Broadcasts new SNARK work when available in pool +- Forwards work received from other peers +- Deduplicates work for libp2p transport (but not WebRTC) +- Integrates with local SNARK pool for work availability +- Handles both push (broadcast) and pull (request) distribution models + +## Technical Debt + +### Minor Issues + +- **Inconsistent Deduplication**: WebRTC path lacks duplicate check while libp2p + has it (different transport requirements may explain this difference) +- **State Methods**: Could benefit from additional helper methods to reduce + pattern matching in reducer + +### Note on Architecture + +This channel provides a transport abstraction layer between inner logic +components and libp2p/WebRTC transports for SNARK work distribution. Validation +and security concerns are properly handled by the SNARK pool and other inner +logic components that use this abstraction. 
diff --git a/p2p/src/channels/snark_job_commitment/summary.md b/p2p/src/channels/snark_job_commitment/summary.md new file mode 100644 index 000000000..cb7cb3a43 --- /dev/null +++ b/p2p/src/channels/snark_job_commitment/summary.md @@ -0,0 +1,12 @@ +# SNARK Job Commitment Channel State Machine + +**Legacy Component**: This channel dates from a time before OpenMina was a +complete node. OpenMina has since implemented a more advanced SNARK worker +synchronization mechanism, making this commitment-based approach obsolete. + +## Purpose + +- Originally designed to broadcast SNARK job commitments to coordinate work + assignments +- Intended to prevent duplicate work among distributed SNARK workers diff --git a/p2p/src/channels/streaming_rpc/summary.md b/p2p/src/channels/streaming_rpc/summary.md new file mode 100644 index 000000000..05c951a4c --- /dev/null +++ b/p2p/src/channels/streaming_rpc/summary.md @@ -0,0 +1,58 @@ +# Streaming RPC Channel State Machine + +A streaming channel specific to the pull-based P2P layer, used for progressive +staged ledger synchronization in web nodes. 
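The Request/Next flow-control loop described below can be sketched as follows; the `Download` type and its names are hypothetical, not actual OpenMina types:

```rust
// Sketch of pull-based chunked transfer: after each received chunk the
// receiver either sends `Next` to request more data or finishes. The real
// implementation tracks richer send/receive progress states.
struct Download {
    total_chunks: usize,
    received: usize,
}

impl Download {
    fn new(total_chunks: usize) -> Self {
        Self { total_chunks, received: 0 }
    }

    /// Record a received chunk; `Some("Next")` means request the next one.
    fn on_chunk(&mut self) -> Option<&'static str> {
        self.received += 1;
        (self.received < self.total_chunks).then_some("Next")
    }

    /// Percentage progress for UI feedback, as the summary mentions.
    fn progress_percent(&self) -> usize {
        self.received * 100 / self.total_chunks
    }
}

fn main() {
    let mut dl = Download::new(4);
    assert_eq!(dl.on_chunk(), Some("Next"));
    assert_eq!(dl.progress_percent(), 25);
    dl.on_chunk();
    dl.on_chunk();
    assert_eq!(dl.on_chunk(), None); // all chunks received
    assert_eq!(dl.progress_percent(), 100);
}
```

Keeping the receiver in charge of pacing is what lets a resource-constrained browser node avoid being overwhelmed by a fast sender.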
+ +## Purpose + +- **Pull-based P2P only** - Designed specifically for pull-based P2P layer, not + implemented for libp2p +- **Staged ledger synchronization** - Enables web nodes to progressively + download staged ledger data +- **Progressive data transfer** - Handles large ledger state through incremental + streaming with flow control +- **Web node support** - Specialized for browser-based nodes that need efficient + ledger sync + +## State Flow + +``` +Disabled/Enabled → Init → Pending → Ready → (WaitingForRequest ↔ Requested → Responded) +``` + +## Key Features + +- **Progressive streaming** - Request/Response/Next message flow for incremental + data transfer +- **Staged ledger specialization** - Dedicated support for ledger parts + streaming during sync +- **Progress monitoring** - Tracks upload/download progress with receive/send + progress states +- **Flow control** - Uses Next messages to control data flow rate and prevent + overwhelming +- **Web node optimization** - Tailored for browser environments with limited + resources + +## Integration Points + +- **Pull-based P2P data channels** - Chunked streaming over pull-based P2P + connections only +- **Staged ledger sync** - Primary integration with ledger synchronization for + web nodes +- **Progress tracking** - Provides sync progress feedback for user interfaces +- **P2pChannelsEffectfulAction** - Channel initialization and message routing + +## Technical Implementation + +- **Pull-based P2P specific** - Only operates over pull-based P2P connections, + not libp2p +- **Chunked streaming** - Uses `Next` messages to request subsequent data chunks +- **Progress state tracking** - Maintains receive/send progress for long-running + transfers +- **Staged ledger focus** - Specialized for efficient ledger synchronization in + resource-constrained environments + +## Technical Debt + +- TODO: Use configuration system instead of hard-coded values in RPC handling +- TODO: Complete error handling implementations for some 
error paths diff --git a/p2p/src/channels/summary.md b/p2p/src/channels/summary.md new file mode 100644 index 000000000..6b9e71167 --- /dev/null +++ b/p2p/src/channels/summary.md @@ -0,0 +1,65 @@ +# P2P Channels State Machine + +Provides transport abstraction layer for Mina-specific protocols over dual P2P +transports. + +## Purpose + +- Abstracts communication over both libp2p and WebRTC transports +- Implements Mina-specific protocols with transport-agnostic interfaces +- Handles protocol adaptation between push-based (libp2p) and pull-based + (WebRTC) paradigms +- Manages message serialization, routing, and validation across transport layers + +## Transport Abstraction + +- **Unified Interface**: Single API for both libp2p gossip/pubsub and WebRTC + request/response +- **Protocol Adaptation**: Automatically adapts between push-based broadcasting + and pull-based requests +- **Transport Detection**: Routes messages based on peer connection type and + capabilities +- **Message Size Management**: Handles different size limits per channel (1KB to + 256MB) + +## Key Channels + +- **Best Tip**: Propagates chain head information and blockchain state updates +- **RPC**: Handles peer-to-peer RPC calls with request correlation +- **SNARK**: Distributes SNARK work assignments and proof submissions +- **Transaction**: Propagates pending transactions across the network +- **SNARK Job Commitment**: Manages SNARK work commitments (legacy, being phased + out) +- **Signaling Discovery/Exchange**: WebRTC connection establishment and peer + discovery +- **Streaming RPC**: Long-lived data streams for large responses (ledger sync, + etc.) 
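The per-channel size limits mentioned above (1KB to 256MB) amount to a simple lookup applied before deserialization; the concrete byte values below are illustrative assumptions, not OpenMina's actual constants:

```rust
// Hypothetical per-channel message size limits; values are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChannelId {
    BestTip,
    Rpc,
    Snark,
    Transaction,
    StreamingRpc,
}

impl ChannelId {
    /// Maximum accepted message size in bytes for this channel.
    fn max_msg_size(self) -> usize {
        match self {
            // Small, fixed-shape messages.
            ChannelId::Snark => 1024,
            ChannelId::Transaction => 1024,
            // Blocks and RPC responses are larger.
            ChannelId::BestTip => 1024 * 1024,
            ChannelId::Rpc => 16 * 1024 * 1024,
            // Large streamed responses such as staged ledger parts.
            ChannelId::StreamingRpc => 256 * 1024 * 1024,
        }
    }

    /// Reject oversized messages before attempting deserialization.
    fn accepts(self, len: usize) -> bool {
        len <= self.max_msg_size()
    }
}

fn main() {
    assert!(ChannelId::Transaction.accepts(512));
    assert!(!ChannelId::Transaction.accepts(2048));
    assert!(ChannelId::StreamingRpc.accepts(100 * 1024 * 1024));
}
```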
+ +## Architecture + +- **State Machine Pattern**: Consistent Disabled → Enabled → Init → Pending → + Ready flow +- **Effectful Actions**: Transport-agnostic operations dispatched to appropriate + services +- **Bidirectional Tracking**: Separate local/remote state management within + channels +- **Service Abstraction**: Clean separation between channel logic and transport + implementation + +## Interactions + +- Multiplexes over P2P connections with channel identification +- Routes messages to appropriate business logic handlers (transaction pool, + SNARK pool, etc.) +- Coordinates with network layer for connection management and transport + capabilities +- Manages protocol versioning and backward compatibility +- Handles message validation, rate limiting, and error recovery + +## Integration Points + +- **Business Logic**: Connects to transaction pool, SNARK pool, blockchain state + management +- **Transport Layer**: Interfaces with libp2p pubsub and WebRTC data channels +- **Connection Management**: Coordinates with P2P connection lifecycle and peer + discovery diff --git a/p2p/src/channels/transaction/summary.md b/p2p/src/channels/transaction/summary.md new file mode 100644 index 000000000..cd41bffe4 --- /dev/null +++ b/p2p/src/channels/transaction/summary.md @@ -0,0 +1,61 @@ +# Transaction Channel State Machine + +Transport-agnostic transaction propagation channel that abstracts over both +libp2p gossip and WebRTC pull-based protocols. 
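A minimal sketch of the count-based flow control used on the WebRTC path of this channel. The `GetNext`/`WillSend`/`Transaction` message names follow the channel summaries; the payload representation and quota type are simplified placeholders:

```rust
/// Messages of the pull-based transaction channel (WebRTC path);
/// payloads are opaque bytes in this sketch.
#[derive(Debug, Clone, PartialEq)]
enum TransactionChannelMsg {
    /// Receiver asks the sender for up to `limit` more transactions.
    GetNext { limit: u8 },
    /// Sender announces how many transactions it is about to push.
    WillSend { count: u8 },
    /// A single propagated transaction.
    Transaction(Vec<u8>),
}

/// Sender-side helper enforcing the count-based limit.
struct TxSender {
    /// How many transactions the peer still allows us to send.
    remaining_quota: u8,
}

impl TxSender {
    fn on_get_next(&mut self, limit: u8) {
        self.remaining_quota = limit;
    }

    /// Returns the message to send next, if quota allows.
    fn try_send(&mut self, tx: Vec<u8>) -> Option<TransactionChannelMsg> {
        if self.remaining_quota == 0 {
            return None; // hold back until the peer's next GetNext
        }
        self.remaining_quota -= 1;
        Some(TransactionChannelMsg::Transaction(tx))
    }
}
```

The receiver drives the exchange: until it grants quota with `GetNext`, the sender holds transactions back — the inverse of libp2p's push-based gossip.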
+ +## Purpose + +- **Transport abstraction** - Provides unified interface for transaction + propagation over libp2p (gossip/pubsub) and WebRTC (request/response) +- **Protocol adaptation** - Handles push-based broadcasting for libp2p and + pull-based requests for WebRTC +- **Flow control** - Implements request/response protocol for WebRTC peers with + count-based limits +- **Network-wide propagation** - Ensures transactions reach all network + participants regardless of transport + +## State Flow + +``` +Disabled/Enabled → Init → Pending → Ready → (Transport-specific propagation patterns) +``` + +## Key Features + +- **Dual transport support** - Seamlessly operates over libp2p gossip and WebRTC + connections +- **WebRTC request protocol** - GetNext/WillSend/Transaction message flow for + pull-based propagation +- **libp2p gossip integration** - Broadcast/subscription model for push-based + propagation +- **Index tracking** - Maintains propagation state across different transport + mechanisms +- **Unified channel interface** - Same API for transaction propagation + regardless of underlying transport + +## Integration Points + +- **libp2p pubsub** - Broadcasts transactions via gossip protocol for libp2p + peers +- **WebRTC data channels** - Sends transaction requests/responses over WebRTC + connections +- **P2pChannelsEffectfulAction** - Transport-agnostic channel initialization and + message sending +- **Transaction pool coordination** - Sources and deposits transactions from/to + local transaction pool + +## Technical Implementation + +- **Transport detection** - Adapts behavior based on peer connection type + (libp2p vs WebRTC) +- **Protocol multiplexing** - Handles both push (gossip) and pull + (request/response) paradigms +- **State synchronization** - Coordinates transaction propagation across + heterogeneous network +- **Channel abstraction** - Encapsulates transport-specific details behind + unified interface + +## Technical Debt + +- TODO: Propagate 
transaction info received to transaction pool for proper + integration diff --git a/p2p/src/connection/incoming/summary.md b/p2p/src/connection/incoming/summary.md new file mode 100644 index 000000000..762b6fd1b --- /dev/null +++ b/p2p/src/connection/incoming/summary.md @@ -0,0 +1,74 @@ +# Incoming Connection State Machine + +Manages incoming WebRTC and libp2p connection establishment from offer receipt +through connection finalization. + +## Purpose + +- **WebRTC connection handling** - Processes incoming WebRTC offers and + generates encrypted SDP answers +- **Dual transport support** - Handles both WebRTC (browser-based) and libp2p + (backend) incoming connections +- **Answer generation workflow** - Creates SDP answers, encrypts them, and sends + via signaling channels +- **Connection finalization** - Completes connection setup and transitions to + ready state + +## State Flow + +``` +Init → AnswerSdpCreatePending → AnswerSdpCreateSuccess → AnswerReady → AnswerSendSuccess → FinalizePending → Success + → FinalizePendingLibp2p → Libp2pReceived (libp2p path) + → Error (failure cases) +``` + +## Key Features + +- **SDP answer creation** - Generates WebRTC Session Description Protocol + answers for incoming offers +- **Signaling method support** - Handles HTTP signaling and P2P signaling + channel routing +- **Duplicate connection handling** - Manages close_duplicates for libp2p + connections +- **Timeout management** - Configurable timeouts for incoming connection + establishment +- **RPC request tracking** - Associates connections with optional RPC request + IDs + +## Integration Points + +- **P2pConnectionIncomingEffectfulAction::Init** - Initiates WebRTC answer + creation process +- **P2pChannelsSignalingExchangeAction** - Coordinates with signaling exchange + for answer transmission +- **P2pNetworkSchedulerAction** - Integrates with libp2p connection scheduler +- **P2pDisconnectionAction** - Handles connection failures and cleanup + +## Technical Implementation 
+ +- **Encrypted answer handling** - Uses `Box` for encrypted SDP + answers +- **Signaling method abstraction** - Supports `IncomingSignalingMethod::Http` + and `::P2p` +- **Peer state coordination** - Creates and updates `P2pPeerState` with + connection information +- **Error state management** - Comprehensive error handling with typed error + enums + +## Technical Debt + +### Major Issues + +- **Missing Resource Management**: Basic capacity checks but no comprehensive + resource cleanup or bounded collections + +### Moderate Issues + +- **Code Duplication**: Repetitive state validation logic across actions in + enabling conditions - opportunity to extract common patterns to state methods +- **Scattered Feature Flags**: libp2p conditional compilation logic scattered + throughout reducer makes maintenance difficult +- **Poor Error Context**: Error handling loses context by early string + conversion in error handling paths +- TODO: Move `IncomingSignalingMethod` to `crate::webrtc` module for better + organization diff --git a/p2p/src/connection/outgoing/summary.md b/p2p/src/connection/outgoing/summary.md new file mode 100644 index 000000000..d2dfada15 --- /dev/null +++ b/p2p/src/connection/outgoing/summary.md @@ -0,0 +1,88 @@ +# Outgoing Connection State Machine + +Manages outgoing WebRTC and libp2p connection establishment from offer creation +through connection finalization. 
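The offer/answer lifecycle this summary describes can be sketched as a flat state enum with an explicit transition check. State names follow the State Flow diagram; the real states carry per-state data (SDP payloads, timestamps), which is omitted here:

```rust
// Sketch: happy-path states of the outgoing connection machine,
// without the per-state data the real implementation carries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutgoingState {
    Init,
    OfferSdpCreatePending,
    OfferSdpCreateSuccess,
    OfferReady,
    OfferSendSuccess,
    AnswerRecvPending,
    AnswerRecvSuccess,
    FinalizePending,
    Success,
    Error,
}

impl OutgoingState {
    /// Whether `next` is a legal successor of `self` on the happy path;
    /// any state may fall through to `Error`.
    fn can_transition_to(self, next: OutgoingState) -> bool {
        use OutgoingState::*;
        if next == Error {
            return true;
        }
        matches!(
            (self, next),
            (Init, OfferSdpCreatePending)
                | (OfferSdpCreatePending, OfferSdpCreateSuccess)
                | (OfferSdpCreateSuccess, OfferReady)
                | (OfferReady, OfferSendSuccess)
                | (OfferSendSuccess, AnswerRecvPending)
                | (AnswerRecvPending, AnswerRecvSuccess)
                | (AnswerRecvSuccess, FinalizePending)
                | (FinalizePending, Success)
        )
    }
}
```

Centralizing the legality check like this is one way to address the "large monolithic reducer" debt noted below: enabling conditions can ask the state instead of re-deriving the rules per action.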
+ +## Purpose + +- **WebRTC connection initiation** - Creates and sends encrypted SDP offers to + target peers +- **Dual transport support** - Handles both WebRTC (browser-based) and libp2p + (backend) outgoing connections +- **Offer creation workflow** - Generates SDP offers, encrypts them, and sends + via signaling channels +- **Connection completion** - Processes answers and finalizes connection + establishment + +## State Flow + +``` +Init → OfferSdpCreatePending → OfferSdpCreateSuccess → OfferReady → OfferSendSuccess → AnswerRecvPending → AnswerRecvSuccess → FinalizePending → Success + → Error (failure cases) +``` + +## Key Features + +- **SDP offer creation** - Generates WebRTC Session Description Protocol offers + for outgoing connections +- **Signaling method coordination** - Routes offers through HTTP signaling or + P2P signaling channels +- **Answer processing** - Receives and processes encrypted SDP answers from + target peers +- **Callback support** - Executes success callbacks with peer ID and RPC ID upon + connection establishment +- **Timeout management** - Configurable timeouts for outgoing connection + attempts + +## Integration Points + +- **P2pConnectionOutgoingEffectfulAction::Init** - Initiates WebRTC offer + creation process +- **P2pNetworkSchedulerAction::OutgoingConnect** - Integrates with libp2p + connection scheduler +- **Signaling channels** - Coordinates with discovery and exchange channels for + offer/answer routing +- **P2pPeerAction** - Updates peer state upon successful connection + establishment + +## Technical Implementation + +- **Encrypted offer handling** - Uses `Box` for encrypted SDP + offers +- **Callback mechanism** - Redux callbacks for decoupled success notification +- **Connection options** - Supports `P2pConnectionOutgoingInitOpts::WebRTC` and + `::LibP2P` +- **Error state management** - Comprehensive error handling with rejection + reasons + +## Technical Debt + +### Critical Issues + +- **Missing DNS Resolution**: 
libp2p connections lack DNS resolution + implementation in `Init` and `Reconnect` actions, causing connection failures + for domain-based addresses + +### Major Issues + +- **Complex State Machine**: Large monolithic reducer with multiple + responsibilities makes maintenance difficult +- **Insufficient Input Validation**: String parsing for peer addresses lacks + robust validation in address conversion + +### Moderate Issues + +- **Resource Management**: Extensive use of `Box<>` allocations without clear + cleanup patterns may cause memory leaks +- **Incomplete Timeout Logic**: Timeout handling only applies to non-error + states in timeout checking +- **Basic Error Mapping**: Simplistic error conversion loses debugging detail in + error handling +- TODO: Replace hard-coded signaling server host with actual address (WebRTC + offers currently use 127.0.0.1) +- TODO: Remove host field from offers and use ICE candidates instead for + signaling server identification +- TODO: Rename `Init` and `Reconnect` actions to `New` and `Connect` for clearer + semantics +- TODO: Move outgoing connection types to `crate::webrtc` module for better + organization diff --git a/p2p/src/connection/summary.md b/p2p/src/connection/summary.md new file mode 100644 index 000000000..d5f744b54 --- /dev/null +++ b/p2p/src/connection/summary.md @@ -0,0 +1,53 @@ +# Connection State Machine + +Coordinates peer connection establishment for both WebRTC and libp2p transports +through incoming and outgoing connection management. 
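The enum-based delegation described in this summary reduces to a few lines. The directional states here are simplified stand-ins for the real incoming/outgoing machines, showing only the shared accessors:

```rust
// Simplified stand-ins for the directional state machines.
#[derive(Debug, Clone, Copy)]
enum OutgoingState { Pending, Success, Error }
#[derive(Debug, Clone, Copy)]
enum IncomingState { Pending, Success, Error }

/// Unified connection state delegating to the directional machines,
/// mirroring the enum-based routing this summary describes.
enum P2pConnectionState {
    Outgoing(OutgoingState),
    Incoming(IncomingState),
}

impl P2pConnectionState {
    fn is_success(&self) -> bool {
        match self {
            P2pConnectionState::Outgoing(s) => matches!(s, OutgoingState::Success),
            P2pConnectionState::Incoming(s) => matches!(s, IncomingState::Success),
        }
    }

    fn is_error(&self) -> bool {
        match self {
            P2pConnectionState::Outgoing(s) => matches!(s, OutgoingState::Error),
            P2pConnectionState::Incoming(s) => matches!(s, IncomingState::Error),
        }
    }
}
```

Callers query the composed enum and never need to know which direction established the connection; timeout and `rpc_id()` accessors delegate the same way.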
+ +## Purpose + +- **Connection lifecycle coordination** - Manages complete connection + establishment process from initiation to finalization +- **Dual transport abstraction** - Provides unified interface for WebRTC and + libp2p connection types +- **State delegation** - Routes connection actions to appropriate incoming or + outgoing state machines +- **Connection validation** - Enforces connection acceptance rules and capacity + limits + +## State Flow + +``` +P2pConnectionState::Outgoing(OutgoingState) +P2pConnectionState::Incoming(IncomingState) +``` + +## Key Features + +- **Transport abstraction** - Unified connection state enum supporting both + WebRTC and libp2p connections +- **Directional state management** - Delegates to specialized incoming/outgoing + state machines +- **Timeout handling** - Configurable timeouts for both connection types +- **RPC integration** - Associates connections with optional RPC request + tracking +- **Success/error detection** - Common interface for checking connection status + +## Integration Points + +- **P2pConnectionOutgoingAction** - Delegates to outgoing connection state + machine +- **P2pConnectionIncomingAction** - Delegates to incoming connection state + machine +- **P2pPeerState** - Updates peer connection status and metadata +- **P2pTimeouts** - Applies configurable timeout policies + +## Technical Implementation + +- **Enum-based state delegation** - Uses `P2pConnectionState` enum to route to + appropriate handler +- **Common interface methods** - Provides `is_success()`, `is_error()`, + `rpc_id()`, and `time()` accessors +- **Timeout coordination** - Delegates timeout checking to specific connection + types +- **State machine composition** - Composes incoming and outgoing state machines + into unified interface diff --git a/p2p/src/disconnection/summary.md b/p2p/src/disconnection/summary.md new file mode 100644 index 000000000..ab20cbf7e --- /dev/null +++ b/p2p/src/disconnection/summary.md @@ -0,0 +1,46 @@ +# 
Disconnection State Machine + +Manages peer disconnection, cleanup, and automated peer space management. + +## Purpose + +- Handles graceful peer disconnections with comprehensive reason tracking +- Implements automated peer space management when connection limits are exceeded +- Manages cleanup for both libp2p and WebRTC transport layers +- Coordinates system-wide disconnection notifications via callbacks + +## Key Components + +- **Automated Space Management**: Randomly selects and disconnects peers when + exceeding `max_stable_peers` +- **Stability Protection**: Prevents disconnection of peers connected for less + than 90 seconds +- **Reason Categorization**: Tracks disconnection causes (timeouts, protocol + violations, space management, etc.) +- **Dual Transport Handling**: Separate cleanup logic for libp2p vs WebRTC + connections +- **Memory Management**: Removes oldest disconnected peer entries to prevent + unbounded growth + +## State Flow + +1. **RandomTry**: Periodic check for peer space management (every 10 seconds) +2. **Init**: Begin disconnection with specific reason and peer identification +3. **PeerClosed**: Handle peer-initiated disconnections +4. **FailedCleanup**: Recovery for failed disconnection attempts +5. **Finish**: Complete disconnection with cleanup and system notifications + +## Interactions + +- Processes disconnect events from various P2P components (channels, network + protocols, etc.) 
+- Cleans up protocol states and connection resources +- Notifies dependent systems through callback mechanism +- Updates peer registry and connection status +- Integrates with transport services for actual I/O operations + +## Important Notes + +- Does **not** trigger reconnections - only handles disconnection and cleanup +- Critical for preventing memory leaks and maintaining connection limits +- Used extensively across the P2P layer (13+ components depend on it) diff --git a/p2p/src/network/identify/stream/summary.md b/p2p/src/network/identify/stream/summary.md new file mode 100644 index 000000000..ea58e65d0 --- /dev/null +++ b/p2p/src/network/identify/stream/summary.md @@ -0,0 +1,61 @@ +# Identify Stream State Machine + +Manages individual libp2p identify protocol streams for peer capability +discovery and address exchange. + +## Purpose + +- **Bidirectional stream management** - Handles both incoming (send identify) + and outgoing (receive identify) streams +- **Protocol message exchange** - Serializes and deserializes protobuf identify + messages with length prefixing +- **Chunked data handling** - Reassembles partial messages across multiple data + frames +- **Peer information propagation** - Updates peer metadata based on received + identify information + +## State Flow + +``` +Default → RecvIdentify → IdentifyReceived (outgoing streams) +Default → SendIdentify → closed (incoming streams) +Default → IncomingPartialData → IdentifyReceived (chunked messages) +``` + +## Key Features + +- **Message Chunking** - Handles partial data reception via + `IncomingPartialData` state with incremental reassembly +- **Size Validation** - Enforces identify message size limits to prevent + resource exhaustion attacks +- **Stream Direction Logic** - Incoming streams send identify info, outgoing + streams receive it +- **Automatic Cleanup** - Dispatches stream closure and pruning actions after + successful exchange + +## Integration Points + +- 
**P2pIdentifyAction::UpdatePeerInformation** - Propagates peer capabilities + and addresses to identify component +- **P2pNetworkYamuxAction::OutgoingData** - Sends serialized identify messages + via yamux multiplexer +- **P2pNetworkSchedulerAction::Error** - Reports stream errors to connection + scheduler +- **P2pNetworkIdentifyStreamEffectfulAction::GetListenAddresses** - Retrieves + local addresses for outbound identify messages + +## Technical Implementation + +- **Length-delimited protobuf encoding** - Uses varint32 length prefix for + message framing +- **Memory-safe chunking** - Buffers partial data in `Vec` until complete + message received +- **Error propagation** - Converts protobuf decode errors to + `P2pNetworkStreamProtobufError` + +## Technical Debt + +- TODO: Enabling conditions not implemented (`is_enabled` always returns `true`) +- TODO: Error state handling incomplete +- TODO: External address configuration hardcoded +- TODO: Observed address reporting not implemented diff --git a/p2p/src/network/identify/summary.md b/p2p/src/network/identify/summary.md new file mode 100644 index 000000000..c00872288 --- /dev/null +++ b/p2p/src/network/identify/summary.md @@ -0,0 +1,42 @@ +# Identify State Machine + +Implements libp2p identify protocol for peer information exchange. 
+ +## Purpose + +- Exchanges peer identity information using libp2p identify protocol +- Shares supported protocols and capabilities +- Discovers peer addresses and network information +- Maintains peer metadata and version information + +## Key Components + +- **Stream**: Manages identify protocol streams and state transitions +- **Protocol Handler**: Processes protobuf messages and peer information +- **Version Management**: Handles protocol version compatibility + +## Interactions + +- Sends identify requests to newly connected peers +- Processes incoming peer information and capabilities +- Updates peer registry with discovered protocols +- Shares local protocol support and agent information +- Handles protocol version negotiation + +## Technical Debt + +This component is well-implemented but has several incomplete features: + +- **TODO Comments**: Multiple incomplete implementations (enabling conditions, + configuration options, error handling) +- **Hard-coded Values**: Protocol version "ipfs/0.1.0" and agent version + "openmina" should be configurable +- **Missing Features**: Observed address reporting always returns None, build + information not included +- **Large Stream Reducer**: 443-line reducer with some code duplication in state + handling +- **Configuration**: Message size limits and other parameters are not + configurable + +These are minor maintainability issues that should be addressed over time to +improve flexibility and completeness. diff --git a/p2p/src/network/kad/bootstrap/summary.md b/p2p/src/network/kad/bootstrap/summary.md new file mode 100644 index 000000000..1a345cf82 --- /dev/null +++ b/p2p/src/network/kad/bootstrap/summary.md @@ -0,0 +1,63 @@ +# Kademlia Bootstrap State Machine + +Manages the iterative FIND_NODE process to discover peers closest to the local +node's key for initial DHT integration. 
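The closest-peer selection at the heart of the bootstrap loop can be sketched with XOR distance. The real code derives the 256-bit key as SHA-256 of the peer id and tracks processed peers in a `BTreeSet`; in this sketch the key bytes are taken as given:

```rust
use std::collections::BTreeSet;

/// 256-bit Kademlia key; the real code derives it as SHA-256 of the
/// peer id (this sketch takes the key bytes as given).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct KadKey([u8; 32]);

impl KadKey {
    /// XOR distance to another key; comparing the resulting byte arrays
    /// lexicographically orders peers by closeness.
    fn distance(&self, other: &KadKey) -> [u8; 32] {
        let mut d = [0u8; 32];
        for i in 0..32 {
            d[i] = self.0[i] ^ other.0[i];
        }
        d
    }
}

/// Pick up to `batch` unprocessed peers closest to `local`, mirroring
/// the batched FIND_NODE request creation described above.
fn closest_unprocessed(
    local: &KadKey,
    peers: &[KadKey],
    processed: &BTreeSet<KadKey>,
    batch: usize,
) -> Vec<KadKey> {
    let mut candidates: Vec<KadKey> = peers
        .iter()
        .filter(|p| !processed.contains(p))
        .copied()
        .collect();
    candidates.sort_by_key(|p| local.distance(p));
    candidates.truncate(batch); // at most 3 concurrent requests in the real code
    candidates
}
```

Each bootstrap cycle repeats this selection against the updated routing table until the success threshold is reached or no unprocessed peers remain.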
+ +## Purpose + +- **Iterative peer discovery** - Executes FIND_NODE requests to discover peers + closest to local node key +- **Concurrent request management** - Maintains up to 3 concurrent requests with + rate limiting +- **Statistics collection** - Tracks success/failure rates and timing for + bootstrap requests +- **Routing table population** - Processes discovered peers to populate Kademlia + routing table + +## State Flow + +``` +CreateRequests → AppendRequest (×3) → FinalizeRequests → RequestDone/RequestError → CreateRequests +``` + +## Key Features + +- **Batched request processing** - Groups up to 3 requests per batch with + completion synchronization +- **Closest peer selection** - Uses routing table to select unprocessed peers + closest to local Kademlia key +- **Request deduplication** - Tracks processed peers in `BTreeSet` to avoid + redundant requests +- **Success threshold** - Continues until 20 successful requests or peer + exhaustion +- **Fallback address handling** - Stores backup addresses for connection retry + logic + +## Integration Points + +- **P2pNetworkKadEffectfulAction::MakeRequest** - Initiates connection attempts + to discovered peers +- **P2pNetworkKadRequestAction::New** - Creates FIND_NODE requests for connected + peers +- **P2pNetworkKademliaAction::BootstrapFinished** - Signals completion when no + more requests available +- **Routing table access** - Queries closest peers and updates with discovered + nodes + +## Technical Implementation + +- **Kademlia key mapping** - Converts PeerId to SHA256-based Kademlia key for + distance calculations +- **Multi-phase processing** - CreateRequests → AppendRequest → FinalizeRequests + cycle +- **Concurrent limiting** - Enabling conditions enforce maximum 3 concurrent + requests +- **Statistics tracking** - Records ongoing, successful, and failed request + metrics with timestamps + +## Technical Debt + +- TODO: Replace BTreeMap-based request tracking with lightweight alternative for + 3 
concurrent requests +- TODO: Generalize to DNS addresses instead of just SocketAddr +- TODO: Use Multiaddr instead of SocketAddr for address handling diff --git a/p2p/src/network/kad/request/summary.md b/p2p/src/network/kad/request/summary.md new file mode 100644 index 000000000..ce22bd739 --- /dev/null +++ b/p2p/src/network/kad/request/summary.md @@ -0,0 +1,62 @@ +# Kademlia Request State Machine + +Manages individual FIND_NODE request lifecycle from connection establishment +through response processing. + +## Purpose + +- **Connection-aware request handling** - Manages connection state before + issuing FIND_NODE queries +- **Multi-phase state tracking** - Tracks progression from connection to stream + creation to response +- **Protobuf message serialization** - Handles encoding/decoding of Kademlia + protocol messages +- **Peer discovery integration** - Processes responses to populate routing table + and bootstrap + +## State Flow + +``` +New → WaitingForConnection → WaitingForKadStream → Request → WaitingForReply → Reply + ↘ Disconnected ↘ Error +``` + +## Key Features + +- **Connection lifecycle management** - Handles peer connection establishment + before request dispatch +- **Stream multiplexing coordination** - Waits for yamux multiplexing before + opening Kademlia streams +- **Callback-based integration** - Uses Redux callbacks for decoupled response + handling +- **Bootstrap integration** - Automatically notifies bootstrap component of + request completion +- **Automatic cleanup** - Prunes completed requests and closes streams + +## Integration Points + +- **P2pConnectionOutgoingAction::Init** - Initiates connections to target peers + with success callbacks +- **P2pNetworkYamuxAction::OpenStream** - Opens Kademlia protocol streams over + established connections +- **P2pNetworkKadBootstrapAction::RequestDone/RequestError** - Reports bootstrap + request results +- **P2pNetworkKadEffectfulAction::Discovered** - Processes discovered peer + addresses from 
responses +- **P2pNetworkKademliaStreamAction::Close** - Closes streams after successful + response processing + +## Technical Implementation + +- **State-driven connection handling** - Different logic based on peer + connection state (none, connecting, ready) +- **Stream ID management** - Coordinates with yamux for stream allocation and + lifecycle +- **Protobuf serialization** - Uses quick_protobuf for message encoding with + error handling +- **Peer filtering** - Supports local address filtering for discovered peers + +## Technical Debt + +- TODO: Add callbacks for stream operations +- TODO: Error handling for invalid request keys needs improvement diff --git a/p2p/src/network/kad/stream/summary.md b/p2p/src/network/kad/stream/summary.md new file mode 100644 index 000000000..91eec1de9 --- /dev/null +++ b/p2p/src/network/kad/stream/summary.md @@ -0,0 +1,66 @@ +# Kademlia Stream State Machine + +Manages bidirectional Kademlia protocol streams for FIND_NODE request/response +message exchange. 
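Both the identify and Kademlia stream summaries describe the same framing: a protobuf varint32 length prefix followed by the message bytes, reassembled incrementally across data frames. A std-only sketch of that reassembly (the string error here mirrors the string-based error handling flagged as technical debt; the real code decodes protobuf payloads and uses typed errors):

```rust
/// Incremental reassembly of varint32 length-prefixed messages,
/// as used by the identify and Kademlia streams (sketch only).
struct FrameBuffer {
    buf: Vec<u8>,
    max_len: usize,
}

impl FrameBuffer {
    fn new(max_len: usize) -> Self {
        FrameBuffer { buf: Vec::new(), max_len }
    }

    /// Append a chunk; return any now-complete message payloads.
    fn push(&mut self, chunk: &[u8]) -> Result<Vec<Vec<u8>>, String> {
        self.buf.extend_from_slice(chunk);
        let mut out = Vec::new();
        loop {
            // Decode the varint length prefix, if it is complete.
            let (len, prefix) = match decode_varint32(&self.buf) {
                Some(v) => v,
                None => break, // prefix itself is still partial
            };
            let len = len as usize;
            if len > self.max_len {
                return Err(format!("message of {len} bytes exceeds limit"));
            }
            if self.buf.len() < prefix + len {
                break; // wait for more data frames
            }
            out.push(self.buf[prefix..prefix + len].to_vec());
            self.buf.drain(..prefix + len);
        }
        Ok(out)
    }
}

/// Decode a protobuf-style varint32; returns (value, bytes consumed).
fn decode_varint32(buf: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &b) in buf.iter().enumerate().take(5) {
        value |= u32::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // prefix incomplete (overflow handling omitted in this sketch)
}
```

The size check before buffering is what enforces the resource-exhaustion limits both stream summaries mention.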
+ +## Purpose + +- **Bidirectional stream management** - Handles both incoming (server) and + outgoing (client) Kademlia streams +- **Length-delimited message handling** - Processes varint32-prefixed protobuf + messages with chunking support +- **FIND_NODE protocol implementation** - Handles key lookup requests and + closest peer responses +- **Stream lifecycle coordination** - Manages stream states from creation + through closure + +## State Flow + +**Incoming Streams (Server):** + +``` +Default → WaitingForRequest → PartialRequestReceived → RequestIsReady → WaitingForReply → ResponseBytesAreReady → Closing → Closed +``` + +**Outgoing Streams (Client):** + +``` +Default → WaitingForRequest → RequestBytesAreReady → WaitingForReply → PartialReplyReceived → ResponseIsReady → Closing → Closed +``` + +## Key Features + +- **Message chunking** - Handles partial message reception via dedicated partial + states with incremental reassembly +- **Size validation** - Enforces Kademlia message size limits to prevent + resource exhaustion +- **Protobuf serialization** - Uses quick_protobuf for message encoding/decoding + with error handling +- **Directional state separation** - Distinct state machines for incoming vs + outgoing stream handling + +## Integration Points + +- **P2pNetworkKademliaStreamAction::WaitOutgoing** - Triggers response + generation for incoming FIND_NODE requests +- **P2pNetworkKadRequestAction::ReplyReceived** - Processes FIND_NODE responses + from outgoing streams +- **P2pNetworkYamuxAction::OutgoingData** - Sends serialized messages and FIN + flags via yamux +- **P2pNetworkSchedulerAction::Error** - Reports stream errors to connection + scheduler + +## Technical Implementation + +- **Varint32 length prefixing** - Uses protobuf varint32 encoding for message + length headers +- **Incremental parsing** - Buffers partial data until complete messages are + received +- **Error state handling** - Converts protobuf and parsing errors to stream + error states +- 
**Stream direction awareness** - Different message flow patterns for client vs + server streams + +## Technical Debt + +- TODO: Use enum for errors instead of string-based error handling diff --git a/p2p/src/network/kad/summary.md b/p2p/src/network/kad/summary.md new file mode 100644 index 000000000..2a7aef359 --- /dev/null +++ b/p2p/src/network/kad/summary.md @@ -0,0 +1,55 @@ +# Kademlia State Machine + +Implements Kademlia DHT for peer discovery and routing. + +## Purpose + +- Maintains distributed hash table +- Discovers network peers +- Routes queries through network +- Manages k-buckets and routing table + +## Key Components + +- **Bootstrap**: Initial network join +- **Request**: Query handling +- **Stream**: Kademlia protocol streams + +## Interactions + +- Finds peers by ID +- Stores peer addresses +- Handles DHT queries +- Maintains network topology + +## Technical Debt + +### Major Issues + +- **Large Internals File (912 lines)**: `p2p_network_kad_internals.rs` mixes + routing table, distance calculations, K-buckets, and iterators - should be + split into separate modules for maintainability +- **Large Stream Reducer (633 lines)**: Complex state transitions handling both + incoming/outgoing streams could benefit from moving logic to state methods +- **Missing Error Reporting**: Silently ignores multiaddr parsing errors + (kad_effectful_effects.rs:87) making debugging difficult + +### Moderate Issues + +- **Incomplete Functionality**: Missing callbacks for stream operations + (request_reducer.rs:94,159) and incomplete error handling with string-based + errors (stream_state.rs:45) +- **Suboptimal Data Structures**: Bootstrap uses heavy `BTreeMap` for request + tracking (bootstrap_state.rs:26) and inconsistent address handling between + `SocketAddr` and `Multiaddr` +- **Hard-coded Values**: Magic numbers for bootstrap thresholds (20) and batch + sizes (3) should be configurable + +### Refactoring Plan + +1. 
**Extract modules** from internals file: routing_table.rs, distance.rs, + bucket.rs +2. **Move complex logic to state methods** to simplify reducers +3. **Implement structured error types** instead of string errors +4. **Add error reporting** for failed operations +5. **Standardize on Multiaddr** for consistent address handling diff --git a/p2p/src/network/noise/p2p_network_noise_refactoring.md b/p2p/src/network/noise/p2p_network_noise_refactoring.md new file mode 100644 index 000000000..50bd02395 --- /dev/null +++ b/p2p/src/network/noise/p2p_network_noise_refactoring.md @@ -0,0 +1,280 @@ +# P2P Network Noise Refactoring Notes + +This document outlines security and maintainability issues in the Noise +cryptographic handshake component that require careful attention due to their +security-critical nature. + +## Current Implementation Issues + +### 1. Complex Security-Critical State Machine + +The Noise handshake state machine has concerning complexity for a security +component: + +```rust +// Complex nested state structure +pub enum P2pNetworkNoiseStateInner { + Initiator { + static_key: StaticKey, + ephemeral_key: EphemeralKey, + state: Option, + // Complex nested logic + }, + Responder { /* similar complexity */ }, + Done { /* encryption state */ }, + Error(NoiseError), +} +``` + +**Security Concerns**: + +- Complex state transitions increase attack surface +- Hard-to-audit handshake logic (494-line reducer) +- Potential for invalid state transitions +- Multiple places where sensitive data could leak + +### 2. 
Incomplete Implementation (Critical TODOs)
+
+Two critical TODO comments in security-sensitive code:
+
+```rust
+// Line 455: In handshake parsing
+// TODO: refactor obscure arithmetics
+let payload_len = u16::from_be_bytes([buf[1], buf[2]]) as usize;
+
+// Line 392: In error handling
+// TODO: report error
+return Err(NoiseError::ParseError("failed to parse noise message".to_owned()));
+```
+
+**Issues**:
+
+- "Obscure arithmetics" in cryptographic parsing suggests unclear/potentially
+  unsafe code
+- Missing error reporting could hide security issues
+- Incomplete implementation in production security code
+
+### 3. Memory Safety for Cryptographic Material
+
+Mixed handling of sensitive data:
+
+```rust
+// Good: Proper zeroization
+impl Drop for StaticKey {
+    fn drop(&mut self) {
+        self.0.zeroize();
+    }
+}
+
+// Problematic: DataSized<32> doesn't implement Zeroize
+#[derive(Clone, Serialize, Deserialize)]
+pub struct DataSized<const N: usize>(pub [u8; N]);
+
+// Concerning: Multiple clone operations on sensitive state
+let state = noise_state.clone();
+```
+
+**Security Risks**:
+
+- Keys in `DataSized<32>` not securely erased from memory
+- `clone()` operations create temporary copies of sensitive data
+- Serializable keys could accidentally persist
+
+### 4. Deprecated Cryptographic Functions
+
+Use of deprecated crypto APIs:
+
+```rust
+#[allow(deprecated)]
+let scalar = Scalar::from_bits(*static_key.as_bytes());
+```
+
+**Risks**:
+
+- Deprecated functions may have known security vulnerabilities
+- Future removal could break compilation
+- Security patches may not be applied to deprecated APIs
+
+### 5. 
Error Handling Security Issues
+
+Inconsistent error handling patterns:
+
+```rust
+// Information leakage via debug output
+dbg!("failed to decrypt noise message");
+
+// Generic error messages that hide important security information
+Err(NoiseError::ParseError("failed to parse noise message".to_owned()))
+```
+
+**Problems**:
+
+- Debug output could leak sensitive information to logs
+- Generic error messages make security debugging difficult
+- Inconsistent error reporting across the component
+
+## Security-Focused Refactoring Plan
+
+### Phase 1: Critical Security Fixes
+
+1. **Complete TODO Items**:
+
+   ```rust
+   // Replace "obscure arithmetics" with clear, auditable parsing
+   fn parse_noise_message_length(buf: &[u8]) -> Result<usize, NoiseError> {
+       if buf.len() < 3 {
+           return Err(NoiseError::InsufficientData);
+       }
+       let length = u16::from_be_bytes([buf[1], buf[2]]) as usize;
+       if length > MAX_NOISE_MESSAGE_SIZE {
+           return Err(NoiseError::MessageTooLarge(length));
+       }
+       Ok(length)
+   }
+   ```
+
+2. **Fix Memory Safety for Keys**:
+
+   ```rust
+   #[derive(Clone)]
+   pub struct SecureDataSized<const N: usize>([u8; N]);
+
+   impl<const N: usize> Zeroize for SecureDataSized<N> {
+       fn zeroize(&mut self) {
+           self.0.zeroize();
+       }
+   }
+
+   impl<const N: usize> Drop for SecureDataSized<N> {
+       fn drop(&mut self) {
+           self.zeroize();
+       }
+   }
+   ```
+
+3. **Remove Deprecated Crypto Usage**:
+   ```rust
+   // Replace deprecated Scalar::from_bits with recommended alternative
+   let scalar = Scalar::from_bytes_mod_order(*static_key.as_bytes());
+   ```
+
+### Phase 2: State Machine Simplification
+
+1. **Extract Handshake Logic**:
+
+   ```rust
+   struct NoiseHandshake {
+       state: HandshakeState,
+       role: HandshakeRole,
+   }
+
+   impl NoiseHandshake {
+       fn process_message(&mut self, message: &[u8]) -> Result<Vec<u8>, NoiseError> {
+           // Clear, auditable handshake logic
+       }
+   }
+   ```
+
+2. **Simplify State Enum**:
+   ```rust
+   pub enum NoiseState {
+       Handshaking(NoiseHandshake),
+       Connected(NoiseTransport),
+       Failed(NoiseError),
+   }
+   ```
+
+### Phase 3: Secure Error Handling
+
+1. 
**Create Security-Aware Logging**:

   ```rust
   fn log_security_error(error: &NoiseError) {
       // Log error without leaking sensitive information
       match error {
           NoiseError::AuthenticationFailed => {
               warn!("Noise authentication failed - no sensitive data logged");
           }
           // ... other secure logging patterns
       }
   }
   ```

2. **Implement Proper Error Reporting**:
   ```rust
   pub enum NoiseError {
       AuthenticationFailed,
       HandshakeFailed { stage: HandshakeStage },
       MessageTooLarge(usize),
       InsufficientData,
       CryptographicError, // Generic for internal crypto errors
   }
   ```

### Phase 4: Memory Management Improvements

1. **Reduce Clone Operations**:

   ```rust
   // Use references instead of cloning sensitive state
   fn process_handshake_message(&mut self, message: &[u8]) -> Result<Vec<u8>, NoiseError> {
       // Work with references to avoid copying sensitive data
   }
   ```

2. **Explicit Key Lifecycle Management**:

   ```rust
   struct KeyManager {
       static_key: Option<StaticKey>,
       ephemeral_key: Option<EphemeralKey>,
   }

   impl KeyManager {
       fn clear_ephemeral_key(&mut self) {
           if let Some(mut key) = self.ephemeral_key.take() {
               key.zeroize();
           }
       }
   }
   ```

## Security Testing Requirements

1. **Memory Safety Tests**:
   - Verify sensitive data is zeroized after use
   - Test for memory leaks of cryptographic material
   - Validate no sensitive data in debug output

2. **State Machine Security Tests**:
   - Test invalid state transitions are rejected
   - Verify error conditions don't leak information
   - Test handshake failure scenarios

3. **Cryptographic Correctness Tests**:
   - Verify compatibility with standard Noise implementations
   - Test edge cases in message parsing
   - Validate authentication failures are handled correctly

## Performance Considerations

Security is the primary concern, but the refactoring should also address:

1. **Reduce Allocations**: Minimize temporary allocations for sensitive data
2. 
**Buffer Reuse**: Implement secure buffer reuse patterns +3. **Constant-Time Operations**: Ensure cryptographic comparisons are + constant-time + +## Migration Strategy + +1. **Security Audit**: Review all changes with security experts +2. **Incremental Updates**: Make changes in small, auditable chunks +3. **Comprehensive Testing**: Test against known-good Noise implementations +4. **Documentation**: Document security invariants and assumptions + +## Conclusion + +The Noise component implements security-critical functionality. While the +current implementation uses good cryptographic libraries, the complex state +machine and incomplete features create security risks. The refactoring should +prioritize security and auditability over performance or complexity. diff --git a/p2p/src/network/noise/summary.md b/p2p/src/network/noise/summary.md new file mode 100644 index 000000000..0c827b7b8 --- /dev/null +++ b/p2p/src/network/noise/summary.md @@ -0,0 +1,62 @@ +# Noise State Machine + +Implements Noise protocol for encrypted P2P communication using +ChaCha20Poly1305. 
+ +## Purpose + +- Establishes encrypted channels between peers +- Performs cryptographic handshake with key exchange +- Manages ephemeral and static session keys +- Encrypts and decrypts all P2P communication +- Provides forward secrecy and authentication + +## Key Components + +- **Handshake Manager**: Handles Noise protocol handshake state machine +- **Key Manager**: Manages static and ephemeral cryptographic keys +- **Transport Cipher**: Encrypts/decrypts messages after handshake +- **Message Parser**: Parses Noise protocol messages securely + +## Interactions + +- Initiates and responds to cryptographic handshakes +- Exchanges ephemeral keys for forward secrecy +- Authenticates peers using static keys +- Establishes secure transport layer for all P2P communication +- Manages key rotation and session lifecycle + +## Technical Debt + +### Security Issues + +- **Session Key Cleanup**: Ephemeral session keys in `DataSized<32>` not + securely zeroized after use + - **Risk**: Limited to active session compromise via memory forensics + - **Context**: These are per-connection transport keys, not long-term identity + keys + - **Impact**: Provides forward secrecy, but current session could be + compromised if memory is accessed + - **Solution**: Implement `Zeroize` trait for `DataSized` or create secure + key wrapper +- **Information Leakage**: Debug output in error paths could leak timing + information or internal state + +### Other Issues + +- **Deprecated Crypto**: Usage of deprecated `Scalar::from_bits` function needs + migration to current curve25519-dalek API +- **Code Clarity**: "Obscure arithmetics" TODO refers to standard Noise protocol + parsing that could benefit from better documentation + +### Additional Issues + +- **Missing Error Reporting**: TODO for error reporting in edge cases + +### Implementation Notes + +The 494-line reducer and nested state machine structure implement the full Noise +XX handshake protocol. 
The crypto protocol complexity is inherent to the Noise
specification. See
[p2p_network_noise_refactoring.md](./p2p_network_noise_refactoring.md) for
detailed analysis.

diff --git a/p2p/src/network/pnet/p2p_network_pnet_refactoring.md b/p2p/src/network/pnet/p2p_network_pnet_refactoring.md
new file mode 100644
index 000000000..bedaacd10
--- /dev/null
+++ b/p2p/src/network/pnet/p2p_network_pnet_refactoring.md
@@ -0,0 +1,290 @@

# P2P Network PNet Refactoring Notes

This document outlines technical debt and implementation issues in the PNet
(Private Network) component that require attention for code quality and
maintainability improvements.

## Current Implementation Analysis

### Protocol Understanding

The PNet component implements libp2p's Private Network protocol for Mina, where:

- **PSK (Pre-Shared Key) reuse is by design**: All nodes on the same network
  share the same PSK derived from the chain ID
- **Network isolation**: The PSK prevents unauthorized nodes from joining the
  network
- **XSalsa20 encryption**: Provides stream encryption after handshake with
  per-connection nonces

### Legitimate Technical Debt

### 1. State Machine Architecture Issues

**Mixed Concerns in Half State**:

```rust
pub enum Half {
    Buffering { buffer: [u8; 24], offset: usize },
    Done { cipher: XSalsa20, to_send: Vec<u8> },
}
```

**Issues**:

- Single enum handles both buffering and encryption concerns
- Complex state transitions increase cognitive load
- Could benefit from separate buffer and cipher management

**Complex Reducer Logic**: The reducer in `Half::reduce()` handles multiple
concerns:

- Buffer management during nonce collection
- Cipher initialization after receiving 24-byte nonce
- Data encryption/decryption
- State transitions

This makes the logic dense and harder to maintain.

### 2. 
Buffer Management Complexity

**Buffer Handling Logic**:

```rust
fn reduce(&mut self, shared_secret: &[u8; 32], data: &[u8]) {
    match self {
        Half::Buffering { buffer, offset } => {
            if *offset + data.len() < 24 {
                buffer[*offset..(*offset + data.len())].clone_from_slice(data);
                *offset += data.len();
            } else {
                if *offset < 24 {
                    buffer[*offset..24].clone_from_slice(&data[..(24 - *offset)]);
                }
                let nonce = *buffer;
                let remaining = data[(24 - *offset)..].to_vec().into_boxed_slice();
                // ... transition to Done state
            }
        }
    }
}
```

**Issues**:

- Complex arithmetic for buffer management
- Multiple array indexing operations in single function
- Mixed concerns: buffer management and state transitions
- Could benefit from helper methods to improve readability

### 3. Code Organization and Maintainability

**Large Reducer Function**: The reducer function in
`p2p_network_pnet_reducer.rs` is substantial and could benefit from:

- Moving more logic to state methods (as noted in CLAUDE.md)
- Breaking down complex operations into smaller, focused functions
- Clearer separation of concerns

**Hard-coded Values**:

- 24-byte nonce size is hard-coded throughout
- Could benefit from named constants for magic numbers

## Refactoring Plan

### Phase 1: Code Organization Improvements

**1. Move Logic to State Methods**:

```rust
impl Half {
    fn append_data(&mut self, data: &[u8]) -> Result<Option<Vec<u8>>, PNetError> {
        // Move buffer management logic here
    }

    fn is_ready(&self) -> bool {
        matches!(self, Half::Done { .. })
    }

    fn encrypt_data(&mut self, data: &[u8]) -> Result<Vec<u8>, PNetError> {
        // Move encryption logic here
    }
}
```

**2. Extract Constants**:

```rust
const NONCE_SIZE: usize = 24;
const PNET_PROTOCOL_PREFIX: &[u8] = b"/coda/0.0.1/";
```

**3. 
Add Helper Methods**:

```rust
impl P2pNetworkPnetState {
    fn process_nonce_data(&mut self, data: &[u8], incoming: bool) -> Result<(), String> {
        // Extract nonce processing logic
    }

    fn setup_cipher(&mut self, nonce: [u8; 24]) -> Result<(), String> {
        // Extract cipher setup logic
    }
}
```

### Phase 2: State Machine Clarity

**1. Separate Buffer and Cipher State**:

```rust
pub struct Half {
    state: HalfState,
}

enum HalfState {
    CollectingNonce {
        buffer: [u8; NONCE_SIZE],
        bytes_received: usize,
    },
    Ready {
        cipher: XSalsa20,
        pending_data: Vec<u8>,
    },
}
```

**2. Cleaner State Transitions**:

```rust
impl Half {
    fn process_data(&mut self, shared_secret: &[u8; 32], data: &[u8]) -> Result<Vec<u8>, PNetError> {
        match &mut self.state {
            HalfState::CollectingNonce { buffer, bytes_received } => {
                self.append_to_nonce_buffer(buffer, bytes_received, data, shared_secret)
            }
            HalfState::Ready { cipher, pending_data } => {
                self.encrypt_decrypt_data(cipher, pending_data, data)
            }
        }
    }
}
```

### Phase 3: Error Handling and Robustness

**1. Proper Error Types**:

```rust
#[derive(Debug, thiserror::Error)]
pub enum PNetError {
    #[error("Buffer overflow: attempted to write {attempted} bytes, {available} available")]
    BufferOverflow { attempted: usize, available: usize },

    #[error("Invalid nonce length: expected {expected}, got {actual}")]
    InvalidNonceLength { expected: usize, actual: usize },

    #[error("Cipher initialization failed: {details}")]
    CipherInitializationFailed { details: String },
}
```

**2. 
Bounds Checking**:

```rust
fn safe_buffer_append(buffer: &mut [u8], offset: &mut usize, data: &[u8]) -> Result<(), PNetError> {
    if *offset + data.len() > buffer.len() {
        return Err(PNetError::BufferOverflow {
            attempted: data.len(),
            available: buffer.len() - *offset,
        });
    }
    buffer[*offset..(*offset + data.len())].copy_from_slice(data);
    *offset += data.len();
    Ok(())
}
```

### Phase 4: Testing and Documentation

**1. Unit Tests for Buffer Management**:

```rust
#[cfg(test)]
mod tests {
    #[test]
    fn test_nonce_buffer_management() {
        // Test various scenarios of nonce data reception
    }

    #[test]
    fn test_partial_nonce_reception() {
        // Test receiving nonce in multiple chunks
    }

    #[test]
    fn test_buffer_overflow_protection() {
        // Test bounds checking
    }
}
```

**2. Property-Based Testing**:

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn test_buffer_safety(data in prop::collection::vec(any::<u8>(), 0..100)) {
        let mut half = Half::new();
        let result = half.process_data(&[0u8; 32], &data);
        // Should never panic or corrupt memory
    }
}
```

## Migration Strategy

### Immediate Actions (Code Quality)

1. **Extract Constants**: Replace magic numbers with named constants
2. **Add Helper Methods**: Break down complex reducer logic
3. **Improve Error Handling**: Add proper bounds checking

### Short Term (1-2 weeks)

1. **Refactor State Machine**: Separate buffer and cipher concerns
2. **Move Logic to State Methods**: Follow architectural guidelines
3. **Add Unit Tests**: Verify refactored code works correctly

### Medium Term (1-2 months)

1. **Performance Optimization**: Profile and optimize crypto operations
2. **Documentation**: Add comprehensive code documentation
3. 
**Integration Testing**: Add tests for complete handshake scenarios + +## Important Notes + +**What This Document Does NOT Address**: + +- PSK reuse (this is by design for network isolation) +- `bug_condition!` usage (correct usage for unreachable code paths) +- Security vulnerabilities (the current implementation follows the protocol + correctly) + +**Focus Areas**: + +- Code organization and maintainability +- State machine clarity +- Buffer management safety +- Following OpenMina architectural patterns + +## Conclusion + +The PNet component implements the libp2p Private Network protocol correctly but +needs improvement in code organization and maintainability. The refactoring +should focus on making the code more readable, testable, and aligned with the +project's architectural guidelines while preserving the correct protocol +behavior. diff --git a/p2p/src/network/pnet/summary.md b/p2p/src/network/pnet/summary.md new file mode 100644 index 000000000..6b34eb057 --- /dev/null +++ b/p2p/src/network/pnet/summary.md @@ -0,0 +1,48 @@ +# Private Network State Machine + +Implements private network support with pre-shared key authentication using +XSalsa20 encryption. 
+ +## Purpose + +- Restricts network access to authorized peers only +- Validates pre-shared keys during connection setup +- Creates isolated private networks using chain ID derivation +- Encrypts all communication with XSalsa20 stream cipher +- Filters and drops unauthorized connection attempts + +## Key Components + +- **Key Manager**: Uses pre-computed PSK (derived from chain ID elsewhere in the + system) +- **Nonce Handler**: Manages 24-byte nonce exchange for cipher setup +- **Buffer Manager**: Handles nonce data buffering during handshake +- **Cipher Manager**: Establishes XSalsa20 encryption after authentication + +## Interactions + +- Uses shared secrets derived from blockchain chain ID (derivation happens + outside this component) +- Exchanges nonces during connection establishment +- Validates PSK authentication on handshake +- Establishes encrypted communication channel +- Drops unauthorized connections that fail PSK validation + +## Technical Debt + +This component has implementation issues. See +[p2p_network_pnet_refactoring.md](./p2p_network_pnet_refactoring.md) for details +on: + +- **State Machine Organization**: Mixed buffering and encryption concerns in + single state machine +- **Code Structure**: Complex reducer logic that needs helper methods and better + organization +- **Buffer Management**: Complex arithmetic and array indexing that needs + simplification +- **Constants**: Hard-coded values that should be extracted as named constants +- **Architecture Alignment**: Opportunity to move more logic to state methods + per project guidelines + +These improvements are needed for better code readability, testability, and +maintainability while preserving correct protocol behavior. 
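The nonce buffering described above is essentially a two-phase state machine: collect exactly 24 bytes, then switch to streaming. A minimal sketch of the collection phase (an illustration only, not the actual `Half` type — the real implementation also derives the XSalsa20 cipher from the shared secret once the nonce is complete):

```rust
const NONCE_SIZE: usize = 24;

/// Simplified sketch of nonce collection during the PNet handshake.
enum NonceCollector {
    Buffering { buffer: [u8; NONCE_SIZE], offset: usize },
    Done { nonce: [u8; NONCE_SIZE] },
}

impl NonceCollector {
    fn new() -> Self {
        NonceCollector::Buffering { buffer: [0; NONCE_SIZE], offset: 0 }
    }

    /// Feed incoming bytes; returns the bytes left over once the nonce is
    /// complete (those already belong to the encrypted stream).
    fn feed<'d>(&mut self, data: &'d [u8]) -> &'d [u8] {
        if let NonceCollector::Buffering { buffer, offset } = self {
            let take = (NONCE_SIZE - *offset).min(data.len());
            buffer[*offset..*offset + take].copy_from_slice(&data[..take]);
            *offset += take;
            if *offset == NONCE_SIZE {
                let nonce = *buffer;
                *self = NonceCollector::Done { nonce };
            }
            &data[take..]
        } else {
            // Nonce already collected: everything is stream data.
            data
        }
    }
}
```

Keeping the "how many bytes do I still need" arithmetic in one small method like `feed` is the kind of simplification the refactoring notes propose for the current reducer logic.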
diff --git a/p2p/src/network/pubsub/p2p_network_pubsub_refactoring.md b/p2p/src/network/pubsub/p2p_network_pubsub_refactoring.md
new file mode 100644
index 000000000..690ce8db0
--- /dev/null
+++ b/p2p/src/network/pubsub/p2p_network_pubsub_refactoring.md
@@ -0,0 +1,291 @@

# P2P Network PubSub Refactoring Notes

This document outlines significant complexity and maintainability issues in the
PubSub (gossip protocol) component that require systematic refactoring.

## Current Implementation Issues

### 1. Massive Reducer Complexity

The main reducer is **963 lines** with excessive complexity:

```rust
// Current: Single massive function handling all concerns
pub fn reducer(mut state_context: crate::Substate<Action, State, P2pNetworkPubsubState>, action: P2pNetworkPubsubActionWithMetaRef<'_>) {
    // 963 lines of complex logic
    match action {
        P2pNetworkPubsubAction::IncomingData { .. } => { /* 45 lines */ }
        P2pNetworkPubsubAction::IncomingValidatedMessage { .. } => { /* 180 lines */ }
        P2pNetworkPubsubAction::BroadcastMessage { .. } => { /* 75 lines */ }
        // ... many more complex handlers
    }
}
```

**Issues**:

- Each action handler is doing multiple unrelated things
- Message validation, peer management, and routing all mixed together
- Hard to test individual components
- High cognitive load for understanding any single flow

### 2. Complex Message Pipeline

The message validation and routing pipeline is unclear:

```rust
// Multiple validation states create confusion
enum P2pNetworkPubsubMessageCacheMessage {
    Init(PubsubMessage),
    PreValidated { ... },
    PreValidatedBlockMessage { ... },
    PreValidatedSnark { ... },
    Validated { ... },
}
```

**Issues**:

- Complex state transitions between validation stages
- Unclear when messages transition between states
- Mixed concerns between message content and validation status
- No clear documentation of the pipeline flow

### 3. 
Error Handling Problems

Excessive use of `bug_condition!` macro (18 instances):

```rust
// Examples of problematic error handling
bug_condition!("IncomingMessage, incoming data: invalid peer");
bug_condition!("Cannot deserialize message from pubsub peer");
bug_condition!("Cannot find peer for graft: {peer_id}");
```

**Issues**:

- `bug_condition!` should be used for truly impossible conditions
- Many of these are recoverable errors that should be handled gracefully
- Makes debugging difficult when errors are buried in logs
- Indicates defensive programming where proper error types should be used

### 4. Performance and Scalability Issues

**Linear Search Problems**:

```rust
// O(n) search through all cached messages
pub fn get_message_from_raw_message_id(&self, raw_message_id: &RawMessageId) -> Option<&PubsubMessage> {
    for message in self.mcache.values() {
        // Linear iteration through all messages
    }
}
```

**Memory Growth**:

```rust
// Growing collections without proper bounds
pub struct P2pNetworkPubsubState {
    pub mcache: BTreeMap<MessageId, PubsubMessage>,
    pub seen: BTreeMap<MessageId, Timestamp>,
    pub iwant: BTreeMap<MessageId, (Timestamp, Vec<PeerId>)>,
    // No clear cleanup strategy
}
```

### 5. Hard-coded Constants

Magic numbers scattered throughout:

```rust
// Non-configurable constants
const IWANT_TIMEOUT_DURATION: Duration = Duration::from_secs(5);
const MAX_MESSAGE_KEEP_DURATION: Duration = Duration::from_secs(300);

// Buffer management with magic numbers
if self.buffers.len() < 50 {
    // Hard-coded buffer limits
}
```

### 6. Mixed Responsibilities

**State Struct Doing Too Much**:

```rust
pub struct P2pNetworkPubsubState {
    // Message caching
    pub mcache: BTreeMap<MessageId, PubsubMessage>,
    pub seen: BTreeMap<MessageId, Timestamp>,

    // Peer management
    pub peers: BTreeMap<PeerId, PubsubPeerState>,
    pub mesh: BTreeMap<TopicId, BTreeSet<PeerId>>,

    // Protocol state
    pub subscriptions: BTreeSet<TopicId>,
    pub iwant: BTreeMap<MessageId, (Timestamp, Vec<PeerId>)>,

    // Buffer management
    pub buffers: VecDeque<Vec<u8>>,
    // ... more fields
}
```

## Architectural Improvements

### 1. 
Decompose the Massive Reducer

```rust
impl P2pNetworkPubsubState {
    fn handle_incoming_message(&mut self, msg: IncomingMessage) -> Vec<Action> { }
    fn handle_message_validation(&mut self, validation: ValidationResult) -> Vec<Action> { }
    fn handle_peer_management(&mut self, peer_action: PeerAction) -> Vec<Action> { }
    fn handle_subscription_change(&mut self, sub: SubscriptionChange) -> Vec<Action> { }
}
```

### 2. Separate Message Pipeline Concerns

```rust
// Clear separation of concerns
struct MessageCache {
    // Pure message storage and retrieval
}

struct MessageValidator {
    // Validation logic only
}

struct MeshManager {
    // Peer topology management
}

struct SubscriptionManager {
    // Topic subscription handling
}
```

### 3. Proper Error Handling

```rust
#[derive(Debug, thiserror::Error)]
pub enum PubsubError {
    #[error("Invalid peer {peer_id} for operation")]
    InvalidPeer { peer_id: PeerId },

    #[error("Message deserialization failed: {reason}")]
    DeserializationError { reason: String },

    #[error("Rate limit exceeded for peer {peer_id}")]
    RateLimitExceeded { peer_id: PeerId },
}

// Use Result types instead of bug_condition!
fn validate_message(&self, msg: &Message) -> Result<ValidatedMessage, PubsubError> { }
```

### 4. Performance Optimizations

```rust
// Indexed message cache
struct MessageCache {
    messages: BTreeMap<MessageId, PubsubMessage>,
    raw_id_index: HashMap<RawMessageId, MessageId>, // O(1) lookup
    expiry_queue: BTreeMap<Timestamp, Vec<MessageId>>, // Efficient cleanup
}

// Rate limiting
struct RateLimiter {
    peer_limits: HashMap<PeerId, TokenBucket>,
    global_limit: TokenBucket,
}
```

### 5. Configuration Management

```rust
#[derive(Debug, Clone)]
pub struct PubsubConfig {
    pub iwant_timeout: Duration,
    pub message_keep_duration: Duration,
    pub max_cache_size: usize,
    pub max_buffer_count: usize,
    pub rate_limit_per_peer: u32,
}
```

### 6. 
Clear Message Pipeline

```rust
// Explicit pipeline stages
enum MessageStage {
    Received(RawMessage),
    Deserialized(PubsubMessage),
    Validated(ValidatedMessage),
    Routed(RoutedMessage),
}

struct MessagePipeline;

impl MessagePipeline {
    fn process_stage(&mut self, stage: MessageStage) -> Result<MessageStage, PubsubError> { }
}
```

## Refactoring Strategy

### Phase 1: Extract Message Handling

1. Move message validation logic to separate module
2. Create proper error types
3. Replace `bug_condition!` calls with proper error handling

### Phase 2: Split the Reducer

1. Extract peer management logic
2. Separate subscription handling
3. Create focused action handlers

### Phase 3: Performance Improvements

1. Add indexing for message lookups
2. Implement proper rate limiting
3. Add bounded collections with cleanup

### Phase 4: Configuration

1. Make constants configurable
2. Add runtime configuration support
3. Create sensible defaults

### Phase 5: Testing and Validation

1. Add unit tests for each extracted component
2. Create integration tests for message pipeline
3. Performance testing for scalability

## Benefits

1. **Maintainability**: Smaller, focused modules are easier to understand and
   modify
2. **Performance**: Proper indexing and rate limiting prevent scalability issues
3. **Reliability**: Proper error handling instead of panic-prone defensive
   programming
4. **Testability**: Individual components can be tested in isolation
5. **Configurability**: Runtime configuration for different deployment scenarios

## TODO Comments to Address

Current TODO comments indicate known issues:

- Message cache organization needs improvement
- Unresolved bugs need investigation
- Missing source tracking for transaction proofs
- Platform compatibility concerns (wasm32)

## Conclusion

The PubSub component is a critical part of the P2P network but has accumulated
technical debt. 
The 963-line reducer and complex state management make it +difficult to maintain and extend. Refactoring will improve performance, +reliability, and maintainability while preserving the existing functionality. diff --git a/p2p/src/network/pubsub/summary.md b/p2p/src/network/pubsub/summary.md new file mode 100644 index 000000000..0e44edf79 --- /dev/null +++ b/p2p/src/network/pubsub/summary.md @@ -0,0 +1,56 @@ +# PubSub State Machine + +Implements gossip protocol for message broadcasting across the P2P network. + +## Purpose + +- Manages topic subscriptions for blockchain data +- Routes messages to subscribers using mesh topology +- Implements flood-fill gossip with deduplication +- Handles message validation and caching + +## Key Components + +- **Message Cache**: Stores and manages message lifecycle and validation states +- **Mesh Manager**: Maintains peer topology for efficient gossip propagation +- **Subscription Manager**: Handles topic subscriptions and peer interests +- **Message Validator**: Validates incoming messages and handles routing + +## Interactions + +- Subscribes to blockchain topics (blocks, transactions, SNARKs) +- Broadcasts blocks and transactions to network +- Forwards messages from peers based on subscriptions +- Manages gossip mesh topology and peer relationships +- Handles message deduplication and validation + +## Technical Debt + +This component has significant complexity and performance issues. See +[p2p_network_pubsub_refactoring.md](./p2p_network_pubsub_refactoring.md) for +detailed analysis. 
+ +### Major Issues + +- **Massive Reducer (963 lines)**: Single file handling multiple concerns that + should be moved to state methods for better maintainability +- **Performance Problems**: O(n) message lookups (state.rs:395-406) and + unbounded memory growth +- **Mixed Responsibilities**: State struct handles caching, peer management, and + protocol logic simultaneously + +### Moderate Issues + +- **Incomplete Functionality**: Missing source tracking for messages + (reducer.rs:300), platform compatibility concerns (state.rs:253) +- **Hard-coded Constants**: Non-configurable timeouts (5s, 300s) and magic + numbers (3, 10, 50, 100) scattered throughout +- **Suboptimal Data Structures**: TODO to separate storage by message type + (state.rs:214) would improve efficiency + +### Refactoring Plan + +1. **Move message handling logic to state methods** to reduce reducer complexity +2. **Implement proper indexing** to eliminate O(n) lookups +3. **Extract separate managers** for caching, peer management, and subscriptions +4. **Make constants configurable** through P2P configuration system diff --git a/p2p/src/network/rpc/summary.md b/p2p/src/network/rpc/summary.md new file mode 100644 index 000000000..eeff5b599 --- /dev/null +++ b/p2p/src/network/rpc/summary.md @@ -0,0 +1,44 @@ +# Network RPC State Machine + +Low-level RPC protocol implementation for P2P communication using binprot +serialization. 
+ +## Purpose + +- Implements RPC wire protocol with length-prefixed framing +- Manages request/response flow and correlation via query IDs +- Handles protocol framing and message parsing from byte streams +- Tracks RPC sessions with heartbeat and timeout mechanisms +- Provides foundation for higher-level channel RPC functionality + +## Key Components + +- **Message Parser**: Handles binprot deserialization and length framing +- **Request Tracker**: Manages pending queries and response correlation +- **Heartbeat Manager**: Implements keepalive mechanism for RPC sessions +- **Protocol Handler**: Routes messages between network layer and RPC channels + +## Interactions + +- Receives raw byte streams and parses RPC protocol messages +- Encodes and decodes RPC messages using binprot serialization +- Routes queries and responses to appropriate RPC channel handlers +- Manages session timeouts and heartbeat intervals +- Handles protocol errors and connection state recovery + +## Technical Debt + +This component is well-architected but has some minor maintainability issues: + +- **TODO Comments**: Known limitations around heartbeat queueing behavior and + multiple message assumptions +- **Buffer Management**: Complex parsing logic with manual offset tracking that + could be simplified +- **Protocol Coupling**: Hard-coded protocol versions and type conversions + throughout the code +- **Large Dispatch Functions**: `dispatch_rpc_query` and `dispatch_rpc_response` + are lengthy and could be broken down +- **Memory Calculation**: Unimplemented malloc size calculation for monitoring + +These are minor issues that don't affect functionality but could improve code +maintainability over time. 
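As an illustration of the length-prefixed framing mentioned above, a minimal frame extractor could look like the following. This is a sketch only: the 8-byte little-endian length prefix is an assumption for the example, and the real parser decodes binprot-encoded payloads with its own header layout.

```rust
/// Try to split one length-prefixed frame off the front of `buf`.
/// Returns (payload, remaining bytes) once a complete frame is buffered,
/// or None if more data is needed. Assumes an 8-byte LE length prefix.
fn try_take_frame(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    // Need the full length prefix first.
    let prefix = buf.get(..8)?;
    let len = u64::from_le_bytes(prefix.try_into().unwrap()) as usize;
    let rest = &buf[8..];
    if rest.len() < len {
        return None; // frame not fully received yet
    }
    Some((&rest[..len], &rest[len..]))
}
```

The caller keeps appending received bytes to its buffer and repeatedly calls `try_take_frame` until it returns `None` — which is exactly the loop where the manual offset tracking noted above tends to accumulate.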
diff --git a/p2p/src/network/scheduler/summary.md b/p2p/src/network/scheduler/summary.md new file mode 100644 index 000000000..be2828ff5 --- /dev/null +++ b/p2p/src/network/scheduler/summary.md @@ -0,0 +1,47 @@ +# Network Scheduler State Machine + +Manages network connections and protocol negotiation (despite its name, it +doesn't actually schedule tasks). + +## Purpose + +- Manages P2P connection lifecycle and state transitions +- Coordinates protocol selection and negotiation between peers +- Handles connection establishment, maintenance, and cleanup +- Routes messages between different network protocols +- Manages connection limits and resource allocation + +## Key Components + +- **Connection Manager**: Tracks connection states and peer relationships +- **Protocol Coordinator**: Handles protocol selection and handshakes +- **Stream Manager**: Manages individual protocol streams within connections +- **Resource Manager**: Enforces connection limits and handles cleanup + +## Interactions + +- Establishes and tears down peer connections +- Coordinates protocol negotiations (Noise, Yamux, Kademlia, etc.) 
+- Routes messages between network protocols and higher-level channels +- Manages connection state transitions and error recovery +- Enforces network-level resource limits and quotas + +## Technical Debt + +This component has significant naming and architectural issues: + +- **Identity Crisis**: Named "scheduler" but actually manages connections, not + task scheduling +- **Missing Features**: Summary claims bandwidth allocation, rate limiting, and + task scheduling but none are implemented +- **Large Reducer**: 650-line monolithic reducer handling multiple concerns + (connection management, protocol selection, error handling) +- **Mixed Responsibilities**: Single component handles too many different + network concerns +- **TODO Comments**: Multiple incomplete features (connection state handling, + error logging, async DNS resolution) +- **Documentation Mismatch**: Summary describes functionality that doesn't exist + +The component should be renamed to "Network Connection Manager" and either +implement the promised scheduling features or update documentation to reflect +actual functionality. diff --git a/p2p/src/network/select/summary.md b/p2p/src/network/select/summary.md new file mode 100644 index 000000000..70a9fc3e7 --- /dev/null +++ b/p2p/src/network/select/summary.md @@ -0,0 +1,48 @@ +# Protocol Select State Machine + +Implements multistream-select protocol for P2P protocol negotiation and +selection. 
+ +## Purpose + +- Negotiates protocols with peers using multistream-select protocol +- Handles protocol compatibility checking and selection +- Manages initiator vs responder negotiation flows +- Provides protocol selection for establishing P2P streams +- Handles simultaneous connection scenarios + +## Key Components + +- **Token Parser**: Parses protocol negotiation tokens from byte streams +- **Protocol Registry**: Manages supported protocol definitions (hardcoded) +- **Negotiation State Machine**: Handles initiator/responder negotiation flows +- **Selection Logic**: Determines protocol compatibility and selection + +## Interactions + +- Exchanges protocol lists with connected peers +- Negotiates protocol selection through multistream-select handshake +- Handles protocol version compatibility checking +- Manages negotiation timeouts and error scenarios +- Resolves simultaneous connection attempts + +## Technical Debt + +This component has moderate maintainability and extensibility issues: + +- **Hardcoded Protocol Registry**: `Token::ALL` array must be manually updated + for new protocols, limiting extensibility +- **TODO Comments**: Incomplete implementations for alternative protocol + proposals and simultaneous connection handling +- **Complex State Machine**: Mixed error and negotiation states making + transitions unclear +- **String-based Errors**: Simple string errors instead of structured error + types limit debugging and recovery +- **Buffer Management**: Raw buffer manipulation in parsing logic is error-prone +- **Limited Protocol Flexibility**: No systematic approach to protocol + versioning or fallback mechanisms +- **Performance Issues**: Linear search through protocol list and frequent + memory allocations + +These issues make adding new protocols cumbersome and limit the robustness of +protocol negotiation. 
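For context, the multistream-select wire format this component parses is simple: each message is a protocol path plus a trailing newline, prefixed with an unsigned-varint length. A minimal encoder sketch (illustrative only, not the component's actual code):

```rust
/// Encode one multistream-select message: varint(len) ++ protocol ++ '\n'.
fn encode_msg(protocol: &str) -> Vec<u8> {
    let mut body = protocol.as_bytes().to_vec();
    body.push(b'\n');
    let mut out = Vec::new();
    // Unsigned LEB128 varint length prefix.
    let mut len = body.len();
    loop {
        let byte = (len & 0x7f) as u8;
        len >>= 7;
        if len == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
    out.extend_from_slice(&body);
    out
}
```

A responder accepts a proposal by echoing the same message back, or rejects it with `na\n`; the negotiation state machine described above is essentially the bookkeeping around this exchange.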
diff --git a/p2p/src/network/summary.md b/p2p/src/network/summary.md new file mode 100644 index 000000000..ae80e5a0b --- /dev/null +++ b/p2p/src/network/summary.md @@ -0,0 +1,48 @@ +# P2P Network State Machine + +Low-level networking protocols and transport management. + +## Purpose + +- Implements libp2p protocol stack for server-to-server communication +- Manages transport layer (TCP, WebRTC) with dual transport support +- Handles protocol negotiation and stream multiplexing +- Provides encrypted networking primitives for higher-level channels + +## Key Components + +- **Scheduler**: Connection orchestrator and protocol coordinator (misnamed - + manages connections, not scheduling) +- **Select**: Protocol negotiation using multistream-select (hardcoded protocol + registry limits extensibility) +- **Kad**: Kademlia DHT for peer discovery (complex bootstrap and routing table + issues) +- **Pubsub**: Gossip protocol for broadcasts (963-line monolithic reducer with + performance issues) +- **Identify**: Peer identification and capability exchange +- **Yamux**: Stream multiplexing over connections +- **Noise**: Encryption protocol (security hardening opportunities - session + keys not zeroized, debug output leaks, deprecated crypto functions) +- **Pnet**: Private network support with pre-shared key authentication +- **RPC**: Low-level request-response protocol with binprot serialization + +## Technical Debt + +- **Performance**: Pubsub has O(n) message lookups and unbounded memory growth +- **Architecture**: Large monolithic reducers across multiple components +- **Extensibility**: Hardcoded protocol registry in Select component + +## Interactions + +- Establishes encrypted connections with peer authentication +- Discovers peers via Kademlia DHT with bootstrap coordination +- Multiplexes multiple protocol streams over single connections +- Handles protocol negotiation with version compatibility +- Coordinates connection lifecycle and cleanup +- Provides transport 
abstraction for higher-level channels
+
+## Additional Issues
+
+- Several components need refactoring (Pubsub, Kad internals, Scheduler naming)
+- Hard-coded values throughout that should be configurable
+- Ongoing refactoring work in Yamux component (PR #1085)
diff --git a/p2p/src/network/yamux/p2p_network_yamux_refactoring.md b/p2p/src/network/yamux/p2p_network_yamux_refactoring.md
new file mode 100644
index 000000000..fc39ffbda
--- /dev/null
+++ b/p2p/src/network/yamux/p2p_network_yamux_refactoring.md
@@ -0,0 +1,239 @@
+# P2P Network Yamux Refactoring Notes
+
+This document outlines the complexity issues in the Yamux component and tracks
+ongoing refactoring efforts.
+
+## Current Implementation Issues
+
+### 1. Reducer Complexity
+
+The main reducer is a **387-line function** with complexity issues:
+
+- **Deep Nesting**: 4-5 levels of nesting in match statements
+- **Large Action Handlers**: `IncomingFrame` handler spans 172 lines
+- **Mixed Concerns**: Frame parsing, state management, and dispatching all mixed
+  together
+
+Example of deep nesting:
+
+```rust
+match &frame.inner {
+    YamuxFrameInner::Data(_) => {
+        if let Some(stream) = yamux_state.streams.get_mut(&frame.stream_id) {
+            if stream.window_ours < stream.max_window_size / 2 {
+                if frame.flags.contains(YamuxFlags::FIN) {
+                    // Complex logic buried here
+                }
+            }
+        }
+    }
+}
+```
+
+### 2. State Management Complexity
+
+**Boolean Flag Explosion**:
+
+```rust
+struct YamuxStreamState {
+    pub incoming: bool,
+    pub syn_sent: bool,
+    pub established: bool,
+    pub readable: bool,
+    pub writable: bool,
+    // Multiple flags create implicit state combinations
+}
+```
+
+**Issue**: These boolean combinations create an implicit state machine that's
+hard to reason about.
+
+**Nested Error Types**:
+
+```rust
+pub terminated: Option<Result<Result<(), YamuxSessionError>, YamuxFrameParseError>>
+```
+
+**Issue**: Triple-nested types make error handling complex and error-prone.
+
+### 3. 
Buffer Management Complexity
+
+The buffer management includes complex optimization logic:
+
+```rust
+fn shift_and_compact_buffer(&mut self, offset: usize) {
+    let new_len = self.buffer.len() - offset;
+    if self.buffer.capacity() > INITIAL_RECV_BUFFER_CAPACITY * 2
+        && new_len < INITIAL_RECV_BUFFER_CAPACITY / 2
+    {
+        // Reallocate and copy
+        let mut new_buffer = Vec::with_capacity(INITIAL_RECV_BUFFER_CAPACITY);
+        new_buffer.extend_from_slice(&self.buffer[offset..]);
+        self.buffer = new_buffer;
+    } else {
+        // In-place shift
+        self.buffer.copy_within(offset.., 0);
+        self.buffer.truncate(new_len);
+    }
+}
+```
+
+**Issue**: Performance optimizations have made the code difficult to understand
+and maintain.
+
+### 4. Flow Control Complexity
+
+Window management uses saturating arithmetic throughout:
+
+```rust
+stream.window_theirs = stream.window_theirs.saturating_add(*difference);
+stream.window_ours = stream.window_ours.saturating_sub(frame.len_as_u32());
+```
+
+**Issue**: Scattered window management logic makes it hard to verify
+correctness.
+
+### 5. 
Frame Processing Pipeline
+
+The frame parsing function is 88 lines with deep nesting:
+
+```rust
+pub fn try_parse_frame(&mut self, offset: usize) -> Option<YamuxFrame> {
+    let buf = &self.buffer[offset..];
+    match buf[1] {
+        0 => { /* Data frame - 17 lines */ }
+        1 => { /* Window Update - 8 lines */ }
+        2 => { /* Ping - 8 lines */ }
+        3 => { /* GoAway - 16 lines */ }
+        unknown => { /* Error handling */ }
+    }
+}
+```
+
+## Recent Improvements
+
+### Main Branch Fixes
+
+Recent commits have addressed specific issues:
+
+- **9d07084a**: Fixed pending queue overflow vulnerabilities
+- **ef1868f1**: Abstracted incoming state reduction, managed recv buffer size
+  growth
+- **d297e059**: Implemented buffer reuse
+- **3afc60b8**: Refactored window size update to prevent underflow
+- **6024078c**: Updated types from `i32` to `u32` for safety
+- **9de67703**: Removed unnecessary frame cloning
+
+### Ongoing Refactoring (PR #1085)
+
+The `tweaks/yamux` branch contains significant refactoring work (9 commits,
++933/-182 lines):
+
+1. **6bd36e8f**: Simplified reducer
+2. **3e05cdae**: Further reducer simplification
+3. **6cdc357b**: Split incoming frame handling into multiple actions
+4. **9955f49e**: Added comprehensive tests (592 lines)
+5. **d0366e9e**: Moved state update logic to state methods
+6. **328fa371**: Fixed tests
+7. **90cdc883**: Additional refactoring
+8. **2af4a09a**: Fixed clippy warnings
+
+## Proposed Architecture Improvements
+
+### 1. Replace Boolean Flags with Explicit State Machine
+
+```rust
+enum StreamState {
+    Closed,
+    SynSent,
+    SynReceived,
+    Established,
+    FinWait,
+    CloseWait,
+    Closing,
+    TimeWait,
+}
+
+struct YamuxStreamState {
+    state: StreamState,
+    flow_control: FlowController,
+    // Other non-state fields
+}
+```
+
+### 2. 
Extract Specialized Frame Handlers
+
+```rust
+impl P2pNetworkYamuxState {
+    fn handle_data_frame(&mut self, stream_id: StreamId, frame: DataFrame) -> Vec<YamuxAction> { }
+    fn handle_window_update(&mut self, stream_id: StreamId, update: WindowUpdate) -> Vec<YamuxAction> { }
+    fn handle_ping_frame(&mut self, frame: PingFrame) -> Vec<YamuxAction> { }
+    fn handle_goaway_frame(&mut self, frame: GoAwayFrame) -> Vec<YamuxAction> { }
+}
+```
+
+### 3. Create Buffer Management Abstraction
+
+```rust
+struct FrameBuffer {
+    buffer: Vec<u8>,
+    read_position: usize,
+    capacity_policy: CapacityPolicy,
+}
+
+impl FrameBuffer {
+    fn parse_next_frame(&mut self) -> Result<Option<YamuxFrame>, ParseError> { }
+    fn compact(&mut self) { }
+    fn append(&mut self, data: &[u8]) { }
+}
+```
+
+### 4. Encapsulate Flow Control
+
+```rust
+struct FlowController {
+    window_size: u32,
+    max_window_size: u32,
+    pending_frames: VecDeque<YamuxFrame>,
+}
+
+impl FlowController {
+    fn can_send(&self, size: u32) -> bool { }
+    fn consume_window(&mut self, size: u32) { }
+    fn update_window(&mut self, delta: u32) { }
+}
+```
+
+### 5. Simplify Error Handling
+
+```rust
+enum YamuxError {
+    ParseError(YamuxFrameParseError),
+    SessionError(YamuxSessionError),
+    FlowControlError { stream_id: StreamId, reason: String },
+}
+
+// Single result type
+type YamuxResult<T> = Result<T, YamuxError>;
+```
+
+## Benefits of Refactoring
+
+1. **Readability**: Explicit state machines are easier to understand than
+   boolean combinations
+2. **Maintainability**: Specialized handlers isolate concerns
+3. **Testability**: Smaller, focused functions are easier to test
+4. **Performance**: Better abstractions don't sacrifice performance
+5. **Correctness**: Clearer flow control logic reduces bugs
+
+## Migration Strategy
+
+1. **Phase 1**: Complete PR #1085 work (action splitting, state method
+   extraction)
+2. **Phase 2**: Introduce state enum alongside boolean flags
+3. **Phase 3**: Extract buffer and flow control abstractions
+4. **Phase 4**: Migrate to specialized frame handlers
+5. 
**Phase 5**: Remove legacy boolean flags + +## Conclusion + +The Yamux component has accidental complexity where performance optimizations +and edge case handling have obscured the core multiplexing logic. The ongoing +refactoring in PR #1085 is a good start, but further architectural improvements +are needed to make the component more maintainable and understandable. diff --git a/p2p/src/network/yamux/summary.md b/p2p/src/network/yamux/summary.md new file mode 100644 index 000000000..a100f164a --- /dev/null +++ b/p2p/src/network/yamux/summary.md @@ -0,0 +1,45 @@ +# Yamux State Machine + +Implements Yamux stream multiplexing protocol for the P2P network layer. + +## Purpose + +- Multiplexes multiple streams over a single connection +- Manages stream lifecycle (creation, establishment, closure) +- Handles flow control with window-based backpressure +- Provides stream isolation and data routing +- Implements frame parsing and buffering + +## Key Components + +- **Frame Parser**: Parses incoming Yamux protocol frames +- **Stream Manager**: Tracks per-stream state and flow control +- **Buffer Management**: Handles incoming data buffering and optimization +- **Flow Controller**: Manages window sizes and backpressure + +## Interactions + +- Receives raw data from transport layer +- Parses Yamux frames from buffered data +- Creates and manages substreams +- Routes stream data to appropriate handlers +- Manages window sizes and flow control +- Handles stream closure and cleanup + +## Technical Debt + +This component has significant complexity issues. 
See
+[p2p_network_yamux_refactoring.md](./p2p_network_yamux_refactoring.md) for
+details on:
+
+- **Reducer Complexity**: 387-line reducer with deep nesting (4-5 levels)
+- **State Management**: Complex boolean flag combinations that encode the
+  Yamux protocol state but need better documentation for clarity
+- **Buffer Management**: Complex optimization logic mixing performance and
+  correctness concerns
+- **Flow Control**: Scattered window management using saturating arithmetic
+- **Error Handling**: Nested error types making error handling complex
+
+**Ongoing Work**: PR #1085 (`tweaks/yamux` branch) contains 9 commits with
+refactoring to address these issues, including action splitting, state method
+extraction, and comprehensive testing.
diff --git a/p2p/src/summary.md b/p2p/src/summary.md
new file mode 100644
index 000000000..a1f2b4fb4
--- /dev/null
+++ b/p2p/src/summary.md
@@ -0,0 +1,76 @@
+# P2P State Machine
+
+Core peer-to-peer networking layer providing dual transport abstraction for node
+communication.
+
+## Purpose
+
+- Provides unified networking interface over libp2p and WebRTC transports
+- Manages peer connections, discovery, and lifecycle across dual transports
+- Implements transport abstraction for Mina-specific blockchain protocols
+- Handles network topology maintenance and peer space management
+- Coordinates encrypted communication and protocol negotiation
+
+## Architecture Layers
+
+- **Network Layer**: Low-level protocol implementations (Noise encryption, Yamux
+  multiplexing, Kademlia DHT, etc.)
+- **Channels Layer**: Transport abstraction for Mina-specific protocols + (transactions, SNARKs, RPC, signaling) +- **Connection Layer**: Peer connection lifecycle management for + incoming/outgoing connections +- **P2P Orchestration**: Top-level coordination, peer management, and system + integration + +## Dual Transport Support + +- **libp2p Backend**: Server-to-server communication with gossip/pubsub + broadcasting +- **WebRTC Browser**: Direct peer connections with request/response patterns +- **Transport Abstraction**: Unified API that adapts between push-based and + pull-based paradigms +- **Protocol Adaptation**: Automatic routing based on peer capabilities and + connection types + +## Key Components + +- **Connection Management**: Handles incoming/outgoing connection state machines + and lifecycle +- **Channels**: Transport-agnostic communication for transactions, SNARKs, RPC, + and blockchain data +- **Network Protocols**: Encryption (Noise), multiplexing (Yamux), discovery + (Kademlia), messaging (Pubsub) +- **Disconnection**: Automated peer space management with stability protection + and cleanup +- **Peer Management**: Individual peer state tracking across multiple transport + connections + +## Interactions + +- Connects to bootstrap and peer nodes via libp2p and WebRTC +- Propagates blocks, transactions, and SNARK data across the network +- Handles peer discovery via Kademlia DHT with bootstrap coordination +- Manages connection limits with automated peer space management +- Routes protocol messages to appropriate business logic handlers +- Provides callbacks for decoupled integration with node components + +## Integration + +- **Business Logic**: Interfaces with transaction pool, SNARK pool, blockchain + state, and block production +- **Transport Services**: Abstracts libp2p and WebRTC operations through service + traits +- **Configuration**: Supports private networks, connection limits, and protocol + versioning + +## Technical Debt + +- 
**Security**: Noise encryption component needs session key zeroization for + defense-in-depth +- **Performance**: Pubsub component has O(n) lookups and monolithic 963-line + reducer +- **Architecture**: Several components have large monolithic reducers that need + refactoring +- **Partial Migration**: P2P components are partially migrated to new patterns - + some components still have effects files with business logic beyond just + service invocations diff --git a/snark/src/block_verify/summary.md b/snark/src/block_verify/summary.md new file mode 100644 index 000000000..ca386bb3c --- /dev/null +++ b/snark/src/block_verify/summary.md @@ -0,0 +1,18 @@ +# Block Verify State Machine + +Manages block proof verification workflows. Does not perform actual +cryptographic verification - that's handled by services using the ledger crate. + +## Purpose + +- Orchestrates block proof verification requests +- Tracks verification job lifecycle (Init → Pending → Success/Error) +- Manages verification queue with callbacks +- Coordinates with block verification services + +## Interactions + +- Receives block verification requests from other components +- Dispatches effectful actions to verification services +- Tracks pending verification jobs +- Executes callbacks when verification completes or fails diff --git a/snark/src/summary.md b/snark/src/summary.md new file mode 100644 index 000000000..d57a30363 --- /dev/null +++ b/snark/src/summary.md @@ -0,0 +1,33 @@ +# SNARK State Machine + +Orchestrates proof verification workflows for the Mina protocol. This crate +contains state machine logic only - actual cryptographic verification is +performed by the ledger crate. 
+ +## Purpose + +- Manages proof verification workflows and job queuing +- Tracks verification request lifecycle (Init → Pending → Success/Error) +- Coordinates with verification services +- Handles verification results and callbacks + +## Key Components + +- **Block Verify**: Manages block proof verification workflows +- **User Command Verify**: Manages transaction and zkApp proof verification + workflows +- **Work Verify**: Manages SNARK work proof verification workflows + +## Technical Details + +- Imports actual verifiers from `ledger::proofs::verifiers` +- Effectful actions call verification services via thin service layer +- State machines track pending requests with callbacks for async results +- No cryptographic verification logic - pure workflow orchestration + +## Interactions + +- Receives verification requests from other components +- Dispatches effectful actions to verification services +- Tracks verification job status and results +- Executes callbacks when verification completes diff --git a/snark/src/user_command_verify/summary.md b/snark/src/user_command_verify/summary.md new file mode 100644 index 000000000..832065e33 --- /dev/null +++ b/snark/src/user_command_verify/summary.md @@ -0,0 +1,37 @@ +# User Command Verify State Machine + +Coordinates SNARK verification for zkApp transactions. 
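+
+The Init → Pending → Success/Error → Finish status progression this state
+machine tracks can be sketched as a minimal forward-only lifecycle. The types
+below are illustrative stand-ins, not the crate's actual state definitions:
+
+```rust
+// Simplified sketch of a verification job's status lifecycle.
+#[derive(Debug, Clone, PartialEq)]
+enum VerifyStatus {
+    Init,
+    Pending,
+    Success,
+    Error(String),
+    Finish,
+}
+
+impl VerifyStatus {
+    /// Jobs may only move forward through the lifecycle; rejecting anything
+    /// else keeps stale service results from corrupting state.
+    fn advance(&self, next: VerifyStatus) -> Result<VerifyStatus, String> {
+        use VerifyStatus::*;
+        let allowed = matches!(
+            (self, &next),
+            (Init, Pending)
+                | (Pending, Success)
+                | (Pending, Error(_))
+                | (Success, Finish)
+                | (Error(_), Finish)
+        );
+        if allowed {
+            Ok(next)
+        } else {
+            Err(format!("invalid transition: {self:?} -> {next:?}"))
+        }
+    }
+}
+```
+
+An explicit transition check like this is one way callbacks for success and
+error can be dispatched exactly once per job.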
+ +## Purpose + +- Orchestrates SNARK verification requests for user commands +- Manages verification job queue and lifecycle +- Tracks verification status through state transitions +- Coordinates callbacks for verification results + +## Key Components + +- **Request Queue**: Manages pending verification jobs using `PendingRequests` +- **Status Tracking**: Tracks jobs through Init → Pending → Success/Error → + Finish states +- **Callback System**: Handles success/error callbacks for decoupled + communication +- **Verifier Resources**: Maintains references to `TransactionVerifier` and + `VerifierSRS` + +## Interactions + +- Receives zkApp transaction verification requests +- Dispatches to effectful actions for actual SNARK verification work +- Manages verification job lifecycle and cleanup +- Executes callbacks to report verification results +- Integrates with transaction pool for validated transactions + +## Technical Debt + +### Minor Issues + +- **Missing Error Callback**: TODO to dispatch error callbacks + (snark_user_command_verify_reducer.rs:95) +- **Debug Display**: TODO to display hashes instead of full content + (snark_user_command_verify_state.rs:37) diff --git a/snark/src/work_verify/summary.md b/snark/src/work_verify/summary.md new file mode 100644 index 000000000..b3ea69af3 --- /dev/null +++ b/snark/src/work_verify/summary.md @@ -0,0 +1,18 @@ +# Work Verify State Machine + +Manages SNARK work proof verification workflows. Does not perform actual +cryptographic verification - that's handled by services using the ledger crate. 
+ +## Purpose + +- Orchestrates SNARK work verification requests +- Tracks verification job lifecycle (Init → Pending → Success/Error) +- Manages verification queue with batch processing +- Coordinates with work verification services + +## Interactions + +- Receives work verification requests from SNARK pool +- Dispatches effectful actions to verification services +- Tracks pending verification jobs with batch support +- Executes callbacks when verification completes or fails
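+
+The batch-oriented queue described above can be sketched as follows. `WorkId`,
+the queue shape, and the batch size are illustrative assumptions, not the
+crate's actual types:
+
+```rust
+use std::collections::VecDeque;
+
+// Hypothetical identifier for a SNARK work job.
+type WorkId = u64;
+
+struct WorkVerifyQueue {
+    pending: VecDeque<WorkId>,
+    batch_size: usize,
+}
+
+impl WorkVerifyQueue {
+    fn new(batch_size: usize) -> Self {
+        Self { pending: VecDeque::new(), batch_size }
+    }
+
+    fn enqueue(&mut self, id: WorkId) {
+        self.pending.push_back(id);
+    }
+
+    /// Drain up to `batch_size` jobs for a single verification request, so
+    /// one effectful action can verify several proofs at once.
+    fn next_batch(&mut self) -> Vec<WorkId> {
+        let n = self.batch_size.min(self.pending.len());
+        self.pending.drain(..n).collect()
+    }
+}
+```
+
+Batching amortizes the fixed cost of dispatching to the verification service
+across multiple proofs, which is why the queue tracks pending jobs with batch
+support rather than verifying one at a time.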