diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 000000000..15010d532 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,200 @@ + +# Fabric-X Committer Architecture + +## Table of Contents + +1. [Overview](#1-overview) +2. [Architecture Diagram](#2-architecture-diagram) +3. [Service Components](#3-service-components) +4. [State Management](#4-state-management) +5. [Failure Recovery](#5-failure-recovery) + +## 1. Overview + +The Fabric-X Committer is built around six core services, each designed with specific responsibilities in the transaction processing pipeline. The services are categorized into stateful and stateless components, with careful consideration given to scalability and fault tolerance. + +These six services are: **Sidecar**, **Coordinator**, **Verifier**, **Validator-Committer (VC)**, **Query Service**, and **Database Cluster**. + +The architecture achieves high throughput through a sophisticated pipelined design that enables parallel processing of conflict-free transactions while maintaining deterministic outcomes and strong consistency guarantees. Each service plays a specific role: the Sidecar acts as the entry point for blocks from the Ordering Service, the Coordinator orchestrates the validation flow using dependency graph analysis, Verifier services perform parallel signature verification, Validator-Committer (VC) services execute final MVCC validation and commit operations against the database, and the Query Service provides read-only access to the committed world state for clients and endorsers. + +## 2. Architecture Diagram + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Ordering Service │ +│ (External Component) │ +└────────────────────────────────┬────────────────────────────────┘ + │ + │ Fetch Blocks + ▼ + ┌────────────────┐ + │ Sidecar │ (1 instance) + │ (Stateful) │ + │ - Block Store │ + │ - Query API │ + └────────┬───────┘ + │ + │ Relay Blocks + ▼ + ┌────────────────┐ + │ Coordinator │ (1 instance) + │ (Stateless) │ + │ - Dep Graph │ + │ - Dispatcher │ + └───┬────────┬───┘ + │ │ + ┌────────────┘ └────────────┐ + │ │ + │ Verify Signatures │ Validate & Commit + ▼ ▼ + ┌─────────────┐ ┌─────────────┐ + │ Verifier │ (2-3 instances) │ VC │ (3-6 instances) + │ (Stateless) │ │ (Stateless) │ + │ - Policy │ │ - Prepare │ + │ Cache │ │ - Validate │ + └─────────────┘ │ - Commit │ + └──────┬──────┘ + │ + Clients/Endorsers │ Read/Write + │ │ + │ State Queries │ + ▼ │ + ┌──────────────────┐ │ + │ Query Service │ (2+ inst.) │ + │ (Stateless) │ │ + └────────┬─────────┘ │ + │ │ + │ Read Only │ + ▼ ▼ + ┌─────────────────────────────────────┐ + │ Database Cluster │ (6-9 nodes) + │ (Stateful) │ + │ - World State - Tx Status │ + │ - Policies - Configuration │ + └─────────────────────────────────────┘ +``` + +### Component Interactions + +1. **Sidecar** fetches blocks from the Ordering Service and relays them to the Coordinator. +2. **Coordinator** dispatches transactions to Verifiers (signature verification) and VCs (MVCC validation and commit) in a pipelined fashion — verification and commit of independent transaction sets proceed concurrently. +3. **VC services** read/write the database cluster with client-side load balancing, then return statuses to the Coordinator. +4. **Coordinator** relays statuses back to the Sidecar, which delivers committed blocks and notifications to clients. + +The **Query Service** operates independently of the commit pipeline. Clients and endorsers query world state directly through the Query Service, which reads from the database. + +## 3. Service Components + +### 3.1 Sidecar Middleware + +The Sidecar acts as the critical bridge between the Hyperledger Fabric-X Ordering Service and the Committer's internal processing pipeline. It is responsible for fetching blocks sequentially from the Ordering Service, performing initial transaction validation to filter out malformed transactions, and maintaining a durable local block store on the file system. + +Beyond block ingestion, the Sidecar serves as the delivery endpoint for clients, providing committed blocks to registered applications and offering a notification service for transaction status updates. It also exposes query APIs that allow clients to fetch historical blocks and transactions directly from the block store. + +**Key Characteristics:** + +- **State**: Stateful - maintains a local block store on the file system +- **Cardinality**: Single instance per deployment (critical single point) +- **Scalability**: Vertical scaling only + +**Reference**: See [sidecar.md](sidecar.md) for detailed documentation. + +### 3.2 Coordinator Service + +The Coordinator serves as the central orchestrator of the entire validation and commit pipeline. Its primary task is the use of a transaction dependency graph to identify conflict-free transactions that can be processed in parallel, dramatically improving throughput while maintaining deterministic outcomes. + +The Coordinator receives blocks from the Sidecar, constructs and maintains the dependency graph, and intelligently dispatches transactions to downstream services. It manages connections to all Verifier and VC service instances, performing load balancing and automatic failover when services become unavailable. As transactions complete validation and commit, the Coordinator aggregates their final statuses and relays them back to the Sidecar for client delivery. + +**Key Characteristics:** + +- **State**: Stateless - all persistent state stored in the database via VC service +- **Cardinality**: Single instance per deployment (critical single point) +- **Scalability**: Vertical scaling only + +**Reference**: See [coordinator.md](coordinator.md) for detailed documentation. + +### 3.3 Verifier Service (Signature Verification) + +The Verifier service specializes in validating transaction signatures against namespace endorsement policies. It maintains an in-memory cache of policies for fast lookup, with the database serving as the authoritative source. The service performs parallel signature verification across multiple worker threads, making it highly efficient for high-throughput scenarios. Note that transaction structure validation is performed by the Sidecar. + +**Key Characteristics:** + +- **State**: Stateless - policies cached in memory, source of truth in database +- **Cardinality**: Multiple instances (typically 2-3 for redundancy and load distribution) +- **Scalability**: Horizontal and vertical scaling + +**Reference**: See [verification-service.md](verification-service.md) for detailed documentation. + +### 3.4 Validator-Committer (VC) Service + +The VC service executes the final stages of transaction processing through a sophisticated three-stage pipeline: Prepare, Validate, and Commit. During the Prepare stage, transactions are structured for efficient validation. The Validate stage performs Multi-Version Concurrency Control (MVCC) checks against the database to detect read-write conflicts. Finally, the Commit stage writes valid transactions to the database and persists their final status. + +This service is stateless, with all state maintained in the database. The three-stage pipeline operates entirely on transient in-memory data structures. VC services are designed to be horizontally scalable, with typical deployments running 3-6 instances. These instances are often co-located with database nodes to minimize network latency for database operations. + +Each VC instance connects to all database nodes in the cluster and performs client-side load balancing, ensuring efficient utilization of database resources. + +**Key Characteristics:** + +- **State**: Stateless - all state persisted in the database +- **Cardinality**: Multiple instances (typically 3-6, often co-located with database nodes) +- **Scalability**: Horizontal and vertical scaling + +**Reference**: See [validator-committer.md](validator-committer.md) for detailed documentation. + +### 3.5 Query Service + +The Query Service provides read-only access to the committed world state stored in the database. It serves key-value queries for each namespace, enabling clients and endorsers to read the current state without going through the commit pipeline. This is distinct from the Sidecar's query APIs, which serve block and transaction data from the local block store. + +The Query Service connects directly to the database cluster to read namespace tables. It does not need to be co-located with database nodes since it performs read-only operations and should not impact the performance of the commit path. + +**Key Characteristics:** + +- **State**: Stateless - reads world state from the database +- **Cardinality**: Multiple instances (minimum 2, scale based on endorser count and query throughput) +- **Scalability**: Horizontal and vertical scaling + +**Reference**: See [query-service.md](query-service.md) for detailed documentation. + +### 3.6 Database Cluster + +The database cluster serves as the system's persistent storage layer, maintaining the world state for all namespaces and the final status of every transaction. The system supports two database options, each optimized for different deployment scenarios. + +**YugabyteDB** is recommended for production deployments on commodity hardware. It is a distributed SQL database built on top of PostgreSQL, designed specifically for cloud-native deployments. YugabyteDB can achieve throughput exceeding 100,000 transactions per second on standard server hardware through its distributed architecture and automatic sharding capabilities. + +**PostgreSQL** is also supported for production use, but requires high-performance flash storage arrays to achieve comparable throughput. Without flash storage, PostgreSQL cannot match the 100k+ TPS performance of YugabyteDB due to I/O limitations. However, for organizations with existing PostgreSQL infrastructure and flash storage investments, it remains a viable option. + +**Key Characteristics:** + +- **State**: Stateful - stores all committed world state and transaction statuses +- **Cardinality**: Multiple nodes (typically 6-9 for high availability) +- **Scalability**: Horizontal and vertical scaling +- **Consistency**: Raft consensus (YugabyteDB) or configured replication mode (PostgreSQL) + +## 4. State Management + +State management in Fabric-X Committer is carefully designed to minimize persistent state while ensuring correctness and enabling efficient recovery. The architecture distinguishes between persistent state that must survive service restarts and transient state that can be safely discarded. + +**Persistent State** is maintained in only two locations. The Sidecar maintains a block store on the local file system, which serves as a durable record of all blocks received from the Ordering Service. While this block store can be reconstructed from the Ordering Service if lost, maintaining it locally enables fast query responses and reduces load on the Ordering Service. The Database cluster stores all committed world state (key-value pairs for each namespace), transaction statuses, namespace endorsement policies, and system configuration. This makes the database the authoritative source of truth for the entire system. + +**Transient State** exists only in memory and is reconstructed on service restart. The Coordinator maintains an in-memory dependency graph for the current block being processed and tracks pending transactions across Verifier and VC services. This state is rebuilt from scratch as new blocks arrive after restart. Verifier services cache namespace policies in memory for fast lookup—policies are reloaded from the database on restart. VC services maintain in-memory pipeline buffers for the Prepare, Validate, and Commit stages, which hold transactions temporarily as they flow through the pipeline. These buffers are emptied and rebuilt during normal operation. + +## 5. Failure Recovery + +The system handles failures in any service and ensures correctness. + +**Verifier Failure.** The Coordinator tracks transactions per Verifier instance. If a Verifier fails, the Coordinator detects this and resubmits its pending transactions to an available replica. To prevent duplicate processing from transient issues, the Coordinator ignores late responses for transactions that have already been reassigned. See [Verifier Error Handling and Recovery](verification-service.md#7-error-handling-and-recovery) for more details. + +**Validator-Committer Failure.** Similarly, the Coordinator resubmits pending transactions from a failed VC service to another replica. To prevent incorrect validation or duplicate writes from this resubmission, the commit operation is idempotent. When a VC service attempts to store a transaction's status, it first checks if a record matching the txID, block number, and transaction index already exists. If a match is found, the service reuses the existing committed status instead of reprocessing the transaction. See [VC Failure and Recovery](validator-committer.md#7-failure-and-recovery) for more details. + +**Coordinator Failure.** The Sidecar periodically stores the last fully committed block number in the database. After a restart, the Coordinator reads this number to determine the next expected block from the Sidecar, which then begins fetching blocks from that point. Because transactions can be committed across blocks, it is possible some transactions in a fetched block are already committed. The idempotent design of the VC services handles this: they detect existing transaction statuses (via txID, block number, and index) and return the stored status without recommitting. See [Coordinator Failure and Recovery](coordinator.md#7-failure-and-recovery) for more details. + +**Sidecar Failure.** When the Sidecar restarts, it gets the next expected block number from the Coordinator and compares it to the last block in its local store. If there is a gap, the Sidecar fetches the missing blocks from the Ordering Service and their transaction statuses from the state database to update its store. It then resumes fetching new blocks from the Ordering Service. See [Sidecar Failure and Recovery](sidecar.md#7-failure-and-recovery) for more details. + +**Query Service Failure.** The Query Service is stateless and holds no persistent state. On restart, it reconnects to the database and resumes serving queries immediately. Clients experience a brief interruption and can retry against another Query Service instance. See [Query Service Error Handling and Recovery](query-service.md#error-handling-and-recovery) for more details. + +**Database Node Failure.** Database node failures are handled by shard replication; if a shard fails, its replicas remain available. A catastrophic failure or corruption of the entire state database requires rebuilding the state by reprocessing all blocks from the Ordering Service. +**Multiple Service Failures.** When multiple services fail simultaneously, recovery follows the same per-service procedures described above. The idempotent design of the VC services and the checkpoint-based recovery of the Coordinator and Sidecar ensure that even concurrent failures do not lead to data corruption or duplicate commits. Each service recovers independently upon restart, and the system converges to a consistent state as services come back online. diff --git a/docs/deployment-guide.md b/docs/deployment-guide.md new file mode 100644 index 000000000..608950da2 --- /dev/null +++ b/docs/deployment-guide.md @@ -0,0 +1,176 @@ + +# Deployment Guide for Fabric-X Committer + +## Table of Contents + +1. [Hardware Requirements](#1-hardware-requirements) +2. [Reference Deployment](#2-reference-deployment) +3. [Scaling Guidance](#3-scaling-guidance) +4. [Startup and Shutdown Order](#4-startup-and-shutdown-order) + +## 1. Hardware Requirements + +Resource requirements depend on the target throughput, transaction complexity, and workload characteristics. The specifications below are sized for achieving a minimum of 50,000 transactions per second with a simple workload of read-write transactions, each involving 2 key-value pairs of 32 bytes key and 32 bytes value. More complex workloads (larger key-value sizes, higher read/write set counts, complex endorsement policies) will require additional resources. + +Recommended specifications: + +- **Sidecar Node** (1 node): 32 CPU cores, 8 GB RAM, high-performance NVMe SSD (for block store) +- **Coordinator Node** (1 node): 32 CPU cores, 8 GB RAM +- **Verifier Nodes** (3 nodes): 32 CPU cores, 8 GB RAM +- **VC + Database Nodes** (6-9 nodes): 32 CPU cores, 32 GB RAM, high-performance NVMe SSD (for database storage) +- **Query Service Nodes** (2+ nodes): 32 CPU cores, 8 GB RAM +- **Network**: 10 Gbps links between all nodes + +The Sidecar requires high-performance NVMe storage because block store operations are I/O-intensive — blocks must be written durably and read back with minimal latency to keep the pipeline fed. Database nodes similarly require NVMe SSDs to sustain the write throughput demanded by concurrent MVCC validation and commit operations, where multiple VC instances drive parallel writes across distributed tablets. For the stateless services (Coordinator, Verifier, VC), 8 GB RAM is sufficient since they hold only in-flight request state. Database nodes need 32 GB RAM to maintain effective in-memory caching of frequently accessed world state data and tablet metadata, reducing disk I/O for read-heavy MVCC validation. + +Actual requirements should be validated through benchmarking with your workload on the target hardware and topology. + +## 2. Reference Deployment + +The following topology is a reference starting point based on micro-benchmarks, not a prescription. Adjust based on benchmarking with your actual workload. + +``` +┌──────────────────────────────────────────────────────────────┐ +│ Ordering Service │ +│ (External Component) │ +└──────────────────────────┬───────────────────────────────────┘ + │ + │ Fetch Blocks + ▼ + ┌──────────────────────────────┐ + │ Sidecar │ + │ (Stateful) │ + │ - Block Store │ + │ - Query API │ + └──────────────┬───────────────┘ + │ + │ Relay Blocks + ▼ + ┌──────────────────────────────┐ + │ Coordinator │ + │ (Stateless) │ + │ - Dependency Graph │ + │ - Load Balancer │ + └──────────────┬───────────────┘ + │ + ┌────────────┴────────────┐ + │ │ + │ Verify Signatures │ Validate & Commit + ▼ ▼ + ┌───────────────────────┐ ┌──────────────────────────────────┐ + │ Verifier-1 │ │ VC-1 (co-located with DB-1) │ + │ (Stateless) │ │ (Stateless) │ + ├───────────────────────┤ ├──────────────────────────────────┤ + │ Verifier-2 │ │ VC-2 (co-located with DB-2) │ + │ (Stateless) │ │ (Stateless) │ + ├───────────────────────┤ ├──────────────────────────────────┤ + │ Verifier-3 │ │ VC-3 (co-located with DB-3) │ + │ (Stateless) │ │ (Stateless) │ + └───────────────────────┘ ├──────────────────────────────────┤ + │ VC-4 (co-located with DB-4) │ + │ (Stateless) │ + ├──────────────────────────────────┤ + │ VC-5 (co-located with DB-5) │ + │ (Stateless) │ + ├──────────────────────────────────┤ + │ VC-6 (co-located with DB-6) │ + │ (Stateless) │ + └────────────┬─────────────────────┘ + │ + │ Read/Write + │ (All VCs connect to all DB nodes) + ▼ + ┌─────────────────────────────────────────────┐ + │ Database Cluster (9 nodes) │ + │ (Stateful) │ + ├─────────────────────────────────────────────┤ + │ DB-1 │ DB-2 │ DB-3 │ DB-4 │ DB-5 │ + │ DB-6 │ DB-7 │ DB-8 │ DB-9 │ + ├─────────────────────────────────────────────┤ + │ - World State (all namespaces) │ + │ - Transaction Status │ + │ - Namespace Policies │ + │ - System Configuration │ + └─────────────────────────────────────────────┘ +``` + +**Why these numbers:** + +- **Sidecar (1)**: Single instance by design. The Sidecar processes blocks sequentially from the Ordering Service — there is no benefit to multiple instances. Scale vertically if block ingestion rate is insufficient. +- **Coordinator (1)**: Single instance by design. The Coordinator maintains an in-memory dependency graph that must be consistent, so only one instance can manage it. Scale vertically if dependency graph management becomes a bottleneck. +- **Verifier (3)**: Provides parallel signature verification with redundancy. Three instances tolerate 1-2 failures while continuing to process transactions. Adjust the count based on endorsement policy complexity — more complex policies with more signatures per transaction benefit from additional instances. +- **VC (6, co-located with DB)**: Co-location of VC instances with database nodes minimizes network latency for MVCC operations, reducing round-trip times from milliseconds to microseconds. Each VC instance connects to ALL database nodes via client-side load balancing, so co-location with a subset of nodes is sufficient to gain the latency benefit. +- **Database (9, RF=3)**: A 9-node cluster with replication factor 3 tolerates up to 2 simultaneous node failures without data loss or availability impact. The node count enables distributed query processing across many tablets for high write throughput. +- **Query Service (2+)**: Provides read-only access to committed world state for clients and endorsers. Minimum 2 instances for redundancy. Does not need to be co-located with database nodes since it performs read-only operations and should not impact the commit path performance. Scale based on the number of endorsers and query throughput requirements. + +For detailed service descriptions and architecture, see the [Architecture Guide](architecture.md). + +## 3. Scaling Guidance + +All service endpoints must be pre-configured before deployment. To scale up, start additional pre-configured instances. To scale down, stop the instances no longer needed. + +### 3.1 Scaling Verifier Instances + +Verifier services are CPU-bound — their primary work is cryptographic signature verification. Add Verifier instances when CPU utilization on existing instances is consistently saturated. To scale up, start additional pre-configured Verifier instances. To scale down, stop the instances no longer needed. + +### 3.2 Scaling VC and Database Instances + +VC instances may be co-located with database nodes but do not need to be scaled together. When adding VC capacity, start additional pre-configured VC instances. When adding database capacity, add new database nodes. After adding database nodes, YugabyteDB requires tablet rebalancing to distribute data across the new nodes. + +### 3.3 Scaling Database Cluster + +When adding database nodes, add them in pairs to maintain balanced data distribution. Configure `table-pre-split-tablets` to match the new tablet server count so that newly created tables are distributed evenly. See [Validator-Committer Configuration](validator-committer.md) for database configuration details. + +### 3.4 Scaling Query Service Instances + +Query Service instances are stateless and perform read-only queries against the database. Add instances based on the number of endorsers and query throughput requirements. Query Service instances do not need to be co-located with database nodes. + +### 3.5 Sidecar and Coordinator + +Both the Sidecar and Coordinator are single-instance services. They cannot be scaled horizontally. If either becomes a bottleneck, scale vertically by allocating more CPU cores and memory to the node. + +See [Performance Tuning](performance-tuning.md) for configuration parameters and their impact on throughput and latency. + +## 4. Startup and Shutdown Order + +### Startup Sequence + +Start the database cluster first and wait for it to be healthy before starting any services. + +1. **Start Database Cluster** + - Start all database nodes + - Wait for the cluster to form and achieve quorum + - Verify cluster health and connectivity + - For YugabyteDB: ensure all tablet servers are online and the tablet leader election is complete + - For PostgreSQL: ensure replication is established between primary and replicas + +2. **Start Services** (any order after database is healthy) + - VC Service — connects to database on startup + - Query Service — connects to database on startup + - Verifier Service — stateless, loads policies on first request + - Coordinator Service — connects to VC and Verifier services + - Sidecar Service — connects to Coordinator and Ordering Service + +### Service Dependencies + +| Service | Dependencies | +|---------|-------------| +| Database | No dependencies | +| VC Service | Requires Database | +| Query Service | Requires Database | +| Verifier | No dependencies (policies loaded on first request) | +| Coordinator | Requires VC Service and Verifier | +| Sidecar | Requires Coordinator and Ordering Service | + +### Graceful Shutdown + +Shut down in reverse order to drain work through the pipeline: + +1. **Sidecar** — stops ingesting new blocks +2. **Coordinator** — stops dispatching transactions +3. **Verifier and VC Services** — completes in-flight work +4. **Database** — shut down last to ensure all data is persisted diff --git a/docs/performance-tuning.md b/docs/performance-tuning.md new file mode 100644 index 000000000..6b56b137c --- /dev/null +++ b/docs/performance-tuning.md @@ -0,0 +1,379 @@ + +# Performance Tuning Guide + +This guide explains how each configuration parameter affects system performance — throughput, latency, memory, and pipeline flow. All parameters are documented with their sample values in `cmd/config/samples/`. + +## Table of Contents + +1. [Pipeline Flow Control](#1-pipeline-flow-control) +2. [Identifying the Bottleneck](#2-identifying-the-bottleneck) +3. [Sidecar](#3-sidecar) +4. [Coordinator](#4-coordinator) +5. [Verifier](#5-verifier) +6. [Validator-Committer (VC)](#6-validator-committer-vc) +7. [Query Service](#7-query-service) +8. [Database](#8-database) +9. [Co-location Impact](#9-co-location-impact) +10. [Benchmarking](#10-benchmarking) + +## 1. Pipeline Flow Control + +The commit pipeline processes transactions through a sequence of stages connected by bounded channels and slot-based limits. These flow controls don't just protect memory — they directly control how much work flows through the pipeline. + +``` +Orderer → Sidecar → Coordinator → Dependency Graph → Verifier → VC → Database + ↑ ↑ | + └── status ──└──────────── status ──────────────────┘ +``` + +Each `→` is a bounded channel or gRPC stream. The system uses three types of flow control: + +- **Slot-based limits** act as per-transaction semaphores. Slots are acquired before processing and released only when transactions complete downstream. When exhausted, channels fill up and the Sidecar stops pulling blocks. +- **Channel buffers** connect adjacent stages within a process. When full, the producer blocks. +- **gRPC flow control** operates at the transport layer between services. gRPC uses HTTP/2 flow control windows — when a receiver is slow, its window fills up and the sender is blocked from writing more data. This prevents a fast producer (e.g., Sidecar) from overwhelming a slow consumer (e.g., Coordinator) at the network level, independent of the application-level slot and channel limits. + +Setting any limit too low starves the pipeline — stages run in lock-step rather than streaming, and throughput drops. Setting limits too high increases memory usage and queuing latency. The goal is finding the balance where the pipeline has enough in-flight work to sustain throughput without excessive queuing. + +## 2. Identifying the Bottleneck + +Monitor queue length gauges to find the bottleneck. A growing queue means the downstream stage cannot keep up. Tune that stage first. The full list of queue metrics and other observability metrics can be found in the [Metrics Reference](metrics_reference.md). Key queues to watch: + +| Queue Metric | Stage | Growing Queue Means | +|-------------|-------|---------------------| +| `sidecar_relay_input_block_queue_size` | Block Ingestion | Coordinator not consuming blocks fast enough | +| `sidecar_relay_waiting_transactions_queue_size` | Relay | Transactions waiting for commit statuses to return | +| `sidecar_relay_output_committed_block_queue_size` | Committed Blocks | Committed blocks backing up; downstream consumers slow | +| `coordinator_sigverifier_input_tx_batch_queue_size` | Signature Verification | Verifiers cannot keep up; add instances or CPU | +| `coordinator_sigverifier_output_validated_tx_batch_queue_size` | Verified → VC | VC services not consuming verified transactions fast enough | +| `coordinator_vcservice_output_validated_tx_batch_queue_size` | VC → Dep Graph | Dependency graph not processing validated results fast enough | +| `coordinator_vcservice_output_tx_status_batch_queue_size` | Status Response | Status responses backing up between VC and Coordinator | +| `vcservice_preparer_input_queue_size` | VC Preparation | Preparer workers saturated | +| `vcservice_validator_input_queue_size` | VC Validation | DB validation queries too slow; check connections or co-location | +| `vcservice_committer_input_queue_size` | VC Commit | DB commit throughput is the bottleneck; most common | +| `vcservice_txstatus_output_queue_size` | VC Status Output | Status responses backing up; Coordinator not consuming fast enough | + +## 3. Sidecar + +### `waiting-txs-limit` + +Maximum number of transactions the Sidecar has sent to the Coordinator and is awaiting status for. The Sidecar acquires one slot per transaction before sending a block. Slots are released only when the Coordinator returns the transaction's final status. When all slots are occupied, the Sidecar blocks on slot acquisition, which causes the internal block channel to fill up, and eventually the Sidecar stops pulling blocks from the ordering service. + +This value directly controls how many transactions can be in-flight across the entire pipeline. With a low value (e.g., 100), the Sidecar sends 100 transactions and blocks until all complete — the pipeline runs in lock-step and throughput drops dramatically. With a very high value, more transactions queue across the pipeline, increasing memory and queuing latency. Config blocks trigger a full drain regardless of this value. Default: 20,000,000. + +### `ledger.sync-interval` + +How often the block store calls `fsync` to durable storage. Every Nth block triggers a full sync; intermediate blocks are written without fsync. Config blocks and file rollovers always sync. + +Each fsync is an expensive I/O operation. A low value (e.g., 1) syncs every block, which can bottleneck the block ingestion path. Higher values (100, 500+) significantly improve block append throughput by amortizing I/O cost. The tradeoff is durability — blocks lost on crash are recoverable from the ordering service. Default: 100. + +### `channel-buffer-size` + +Buffer size for internal Go channels in the Sidecar — block delivery, committed blocks, and status updates. When a channel is full, the producing goroutine blocks until the consumer reads. + +A small buffer (e.g., 1-10) tightly couples the Sidecar's internal stages: any slowdown in the relay to the Coordinator immediately stalls block delivery from the orderer. A larger buffer absorbs temporary throughput variations but uses more memory. Default: 100. + +### `last-committed-block-set-interval` + +How often the Sidecar sends the latest committed block number to the Coordinator. The Coordinator uses this for dependency resolution. + +Minimal effect on steady-state throughput. Shorter intervals improve recovery speed after failures. Default: 5s. + +### `notification.max-active-tx-ids` + +Global limit on active transaction ID subscriptions across all notification streams. When exhausted, new subscriptions are partially rejected. + +Too low causes clients to receive rejections under moderate load, forcing retries that increase end-to-end latency. Default: 100,000. + +### `notification.max-tx-ids-per-request` + +Maximum transaction IDs per single notification request. Requests exceeding this are rejected entirely. + +Prevents individual clients from consuming a disproportionate share of the subscription budget in a single call. Default: 1,000. + +### `server.max-concurrent-streams` + +Maximum concurrent streaming RPCs (Deliver + Notification) per client connection. Each stream holds server resources (goroutines, buffers). + +Too low limits client concurrency and can cause connection failures under load. Default: 10. + +## 4. Coordinator + +### `dependency-graph.num-of-local-dep-constructors` + +Number of goroutines that process transaction batches in parallel to construct batch-level dependency graphs. Each worker processes one batch at a time, and output is serialized in FIFO order. + +Increasing this parallelizes the CPU work of building dependency graphs. However, since output order is enforced via a condition variable, gains diminish beyond 2-4 workers. Default: 1. + +### `dependency-graph.waiting-txs-limit` + +Maximum number of transactions in the global dependency graph. The Coordinator acquires one slot per transaction before adding it to the graph. Slots are released when the VC returns validation results. When exhausted, the dependency graph construction blocks, channels fill up, and the Sidecar stops pulling blocks. + +The dependency graph is what enables parallel dispatch to Verifier and VC services. A small graph (e.g., 100) means once transactions are dispatched, no new ones enter until results return — creating idle gaps and reducing throughput. A very large graph increases memory for dependency tracking state and queuing latency. Incoming blocks are chunked into batches of `min(waiting-txs-limit, 500)` to prevent a single block from consuming all slots. Default: 100,000. + +### `per-channel-buffer-size-per-goroutine` + +Base buffer size for internal Go channels connecting the Coordinator's pipeline stages. The actual buffer for each channel is computed as base × number of endpoints (or constructors): + +``` +Coordinator → DepGraph: base × num-of-local-dep-constructors +DepGraph → Verifier: base × number-of-vc-endpoints +Verifier → VC: base × number-of-verifier-endpoints +VC → DepGraph: base × number-of-vc-endpoints +VC → Coordinator: base × number-of-vc-endpoints +``` + +When a channel is full, the producing stage blocks. A small buffer means any momentary downstream slowdown immediately stalls all upstream stages. A larger buffer absorbs temporary throughput variations but increases memory and queuing latency. With 3 verifiers and 6 VCs, the defaults produce 30-60 item buffers per channel. Default: 10. + +## 5. Verifier + +### `parallel-executor.parallelism` + +Number of goroutines that verify signatures in parallel. Signature verification is CPU-bound. Each stream from the Coordinator creates an independent executor with this many workers. + +Set this to match the number of CPU cores available on the Verifier node. Under-setting leaves CPU idle, reducing verification throughput and causing transactions to queue at the Coordinator. Over-setting causes context switching overhead with no throughput gain. The default of 40 assumes a 32+ core machine. + +### `parallel-executor.batch-size-cutoff` + +Minimum number of verification results to collect before emitting a batch to the Coordinator. Results are buffered until this threshold is reached or `batch-time-cutoff` expires, whichever comes first. + +Setting this too low causes many small batches to be emitted, increasing per-batch overhead in channel writes and gRPC communication. Setting this too high delays results — individual transactions wait longer for the batch to fill, increasing end-to-end latency. + +| Setting | Throughput | Latency | +|---------|-----------|---------| +| 10-20 | Lower (more frequent, smaller batches) | Lower | +| 50 (default) | Good balance | Moderate | +| 100-200 | Higher (fewer, larger batches) | Higher | + +### `parallel-executor.batch-time-cutoff` + +Maximum time to wait for a batch to reach `batch-size-cutoff` before emitting a partial batch. This is the latency safety valve — without it, a partially filled batch under low load would wait indefinitely. + +Setting this too low defeats batching (batches emit before they can fill). Setting this too high adds latency during periods of low or variable transaction arrival rates. + +The default of 10ms ensures results are emitted promptly even under low load. Increase to 50-100ms only if batching efficiency is more important than latency. + +### `parallel-executor.channel-buffer-size` + +Buffer size for internal channels in the verification pipeline. The actual channel capacity is computed as: + +``` +capacity = channel-buffer-size × parallelism +``` + +With the defaults (50 buffer × 40 parallelism = 2,000 capacity). When the input channel is full, the Coordinator's dispatch to the Verifier blocks, which stalls the dependency graph. When the output channel is full, verification workers block, reducing effective parallelism. + +Setting this too low causes frequent blocking that reduces the effective parallelism of the verification workers. Setting this too high increases memory usage proportionally. The defaults provide enough buffering for sustained high throughput. + +## 6. Validator-Committer (VC) + +The VC processes transactions through three pipeline stages: preparation, validation, and commit. Each stage has an independent worker pool. The overall throughput is limited by the slowest stage. + +### `resource-limits.max-workers-for-preparer` + +Number of goroutines that extract read/write sets from transactions and organize them for validation. Preparation is **CPU-bound** — it parses transaction payloads, builds namespace-to-reads maps, and categorizes writes. + +Setting this too low makes the preparer the bottleneck — transactions queue up waiting for preparation while the validator and committer workers sit idle. Setting this higher than needed wastes CPU on goroutine overhead. + +The default of 1 is sufficient for most workloads because preparation is fast relative to the database-bound stages. Increase to 2-4 if you observe the preparer queue growing while validator and committer queues are empty. + +### `resource-limits.max-workers-for-validator` + +Number of goroutines that perform MVCC validation against the database. Each worker calls a stored procedure (`validate_reads_ns_`) that checks whether read set versions still match the committed state. + +Validation is **database-bound** — each call makes at least one database round-trip per namespace in the transaction's read set. Each active validator worker holds a database connection for the duration of its query. + +Setting this too low causes transactions to queue at the validator stage while the database has spare capacity. Setting this too high exhausts the connection pool — workers compete for connections, and the overhead of connection acquisition negates the parallelism benefit. + +| Setting | Effect | +|---------|--------| +| 1 (default) | Serialized validation; minimal DB load; may under-utilize DB | +| 2-4 | Parallel validation; higher throughput; needs more DB connections | +| >4 | Diminishing returns unless DB can sustain the concurrent queries | + +### `resource-limits.max-workers-for-committer` + +Number of goroutines that commit validated transactions to the database using stored procedures (`insert_tx_status`, `insert_ns_`, `update_ns_`). Each commit involves multiple database round-trips within a transaction: writing transaction status, inserting new keys, and updating existing keys. + +The committer is typically the **pipeline bottleneck** because it performs the most database work per transaction. Each active committer worker holds a database connection for the duration of the commit. When transactions complete here, slots are released in the Coordinator's dependency graph and the Sidecar's waiting-txs pool — so committer throughput directly controls how fast the entire pipeline can flow. + +Setting this too low starves the pipeline — the Coordinator's dependency graph fills up, backpressure reaches the Sidecar, and block ingestion stalls. Setting this too high overwhelms the database with concurrent transactions, causing contention and increased retry rates. + +| Setting | Effect | +|---------|--------| +| 1-2 | Low throughput; database under-utilized; pipeline stalls | +| 10-20 (default: 20) | Good throughput; recommended starting point | +| >20 | Diminishing returns; may overwhelm the database with contention | + +### `resource-limits.min-transaction-batch-size` + +Minimum number of transactions that must accumulate before the batch is forwarded to the preparation stage. The VC waits for this many transactions or until `timeout-for-min-transaction-batch-size` expires, whichever comes first. Config blocks are always sent immediately regardless of batch size. + +Larger batches improve efficiency — stored procedures process keys in bulk, reducing per-transaction overhead. But larger batches also increase latency because early-arriving transactions wait for the batch to fill. + +Setting this too high under low transaction rates causes transactions to sit idle until the timeout expires, adding `timeout-for-min-transaction-batch-size` of latency to every batch. Setting this to 1 (default) disables batching entirely — every transaction is forwarded immediately for lowest latency. + +| Setting | Throughput | Latency | +|---------|-----------|---------| +| 1 (default) | Lower (no batching benefit) | Lowest | +| 10-50 | Higher (bulk stored procedure operations) | Higher (waits for batch to fill) | + +### `resource-limits.timeout-for-min-transaction-batch-size` + +Maximum time to wait for a batch to reach `min-transaction-batch-size`. This is the latency safety valve for batching — without it, a partially filled batch under low load would wait indefinitely. + +When `min-transaction-batch-size` is 1, this timeout has no effect (batches are sent immediately). When `min-transaction-batch-size` is higher, this timeout determines the worst-case additional latency during low-throughput periods. + +The default of 2s pairs with the default batch size of 1 (effectively unused). If you increase the batch size, consider reducing the timeout to 100-500ms to bound the latency impact. + +### `database.max-connections` + +Maximum number of connections in the database connection pool. The validator and committer stages share this pool (the preparer does not use the database — it performs in-memory parsing only). If the pool is exhausted, workers block waiting for a free connection, reducing effective parallelism. + +Size the pool to accommodate concurrent usage: + +``` +Required connections >= max-workers-for-validator + max-workers-for-committer +``` + +Setting this too low causes connection starvation — workers sit idle waiting for connections while the database has spare capacity. Setting this too high wastes database server memory and can cause connection-level contention. + +The default of 10 is conservative and likely insufficient for production workloads with the default committer worker count of 20. Size this to at least match the total validator + committer worker count. + +### `database.min-connections` + +Minimum number of idle connections maintained in the pool. Keeps connections warm to avoid the overhead of establishing new connections (TCP handshake + TLS negotiation + authentication) under sudden load spikes. Set to roughly 50% of `max-connections`. + +### `database.load-balance` + +Enables client-side load balancing across multiple database endpoints. When enabled, each new connection is distributed across the configured endpoints. + +- **`true`**: Required for YugabyteDB clusters to distribute operations across nodes and avoid hotspots. Without this, all connections go to the first endpoint, overloading one node while others sit idle. +- **`false`**: Use for single-node deployments. + +### `database.table-pre-split-tablets` + +Number of tablets to pre-split each table into at creation time (YugabyteDB only). Pre-splitting distributes data across tablet servers from the start, preventing the "hot tablet" bottleneck where a single tablet handles all initial writes before automatic splitting kicks in. + +Without pre-splitting, the first hours of operation can see severe write latency spikes as all writes converge on a single tablet. Pre-splitting eliminates this problem entirely. + +Set this to match the number of tablet servers in the cluster, or a small multiple (1x-2x). For example, with 9 tablet servers, set to 9 or 18. + +When the database is PostgreSQL, this setting is automatically ignored. + +| Setting | Effect | +|---------|--------| +| 0 (default) | No pre-splitting; hot tablet during initial writes; latency spikes | +| = tserver count | Even distribution from the start; predictable latency | +| 2x tserver count | Finer distribution; slightly more per-tablet overhead | + +### `database.retry` + +Exponential backoff retry strategy for database operations. Required for YugabyteDB to handle retryable transaction conflicts (e.g., serialization failures from concurrent MVCC operations). + +| Parameter | Default | Effect | +|-----------|---------|--------| +| `initial-interval` | 500ms | First retry delay | +| `randomization-factor` | 0.5 | Jitter range (+/- 50%) to avoid thundering herd | +| `multiplier` | 1.5 | Exponential backoff factor | +| `max-interval` | 60s | Cap on any single retry delay | +| `max-elapsed-time` | 15m | Total retry duration before giving up | + +For high-throughput workloads with frequent transaction conflicts, reduce `initial-interval` to 100-200ms and `max-interval` to 10-30s to retry faster. Too-aggressive retries (very short intervals) can amplify contention under heavy load. + +## 7. Query Service + +### `min-batch-keys` + +Minimum number of keys that must accumulate in a query batch before executing against the database. The batch is submitted when this threshold is reached or `max-batch-wait` expires, whichever comes first. + +Batching reduces database round-trips by combining keys from multiple concurrent requests into a single query. Setting this too low sends many small queries, increasing per-query overhead. Setting this too high delays queries waiting for the batch to fill, increasing latency for individual requests. + +| Setting | Throughput | Latency | +|---------|-----------|---------| +| 256-512 | Lower (smaller batches, more round-trips) | Lower | +| 1024 (default) | Good balance | Moderate | +| 2048-4096 | Higher (larger batches, fewer round-trips) | Higher | + +### `max-batch-wait` + +Maximum time to wait for a batch to reach `min-batch-keys`. This bounds the worst-case latency during low-load periods when keys accumulate slowly. Setting this too low defeats batching (queries execute before batches can fill). Setting this too high adds latency during quiet periods. + +The default of 100ms is appropriate for most deployments. Reduce to 50ms for lower latency. + +### `view-aggregation-window` + +Time window for aggregating multiple views with the same parameters (isolation level, deferrable mode) into a single batched view. Views created within this window share the same batcher, reducing database load. + +Setting this too low creates many independent batchers, each holding its own database connection — increasing connection pool pressure. Setting this too high delays the first query in a window, as new views must wait for the window to open a batcher. + +The default of 100ms balances throughput and latency. + +### `max-aggregated-views` + +Maximum number of views that can be aggregated into a single batcher. Once reached, a new batcher is created even if within the `view-aggregation-window`. This prevents a single batcher from becoming a contention point under very high concurrency. The default of 1024 is appropriate for most deployments. + +### `max-active-views` + +Maximum concurrent active views across all clients. New `BeginView` requests are rejected with `RESOURCE_EXHAUSTED` when this limit is reached. + +Setting this too low causes clients to receive errors during peak load, forcing retries that amplify latency. Setting this too high allows unbounded resource consumption. Set to 0 to disable the limit (not recommended in production). The default of 4096 is permissive. + +### `max-view-timeout` + +Maximum lifetime of a view from creation to completion. Views exceeding this timeout are aborted and their queries return errors. This prevents dangling views from holding resources indefinitely. + +Setting this too low causes legitimate long-running queries to be aborted. Setting this too high allows idle or abandoned views to hold database connections for extended periods, reducing pool availability for other views. + +This parameter interacts with the connection pool. The maximum number of database connections needed by the Query Service depends on: + +``` +max connections needed = (max-view-timeout / view-aggregation-window) × view-parameter-permutations +``` + +There are up to 8 view parameter permutations (4 isolation levels × 2 deferrable modes). With defaults: `(10s / 100ms) × 8 = 800`. In practice, not all permutations are used simultaneously, so fewer connections are needed. Monitor connection pool wait metrics to determine the right size. + +### `max-request-keys` + +Maximum number of keys allowed in a single query request. Applies to both `GetRows` (total keys across all namespaces) and `GetTransactionStatus` (number of transaction IDs). Setting this too low forces clients to split queries into many small requests, increasing round-trip overhead. Setting this too high allows individual requests to cause memory spikes and long-running queries that block the connection pool. The default of 10,000 is a reasonable balance. + +### `database.max-connections` (Query Service) + +The Query Service connection pool operates differently from the VC pool. Connections are held for the duration of a view's queries, which can be up to `max-view-timeout`. A view that holds a connection for 10 seconds blocks that connection from serving other views. + +Size the pool based on expected concurrent view load and the formula above. The default of 10 is conservative — increase for production workloads. Monitor the connection pool wait metrics to detect when views are blocking on connection acquisition. + +## 8. Database + +### YugabyteDB Considerations + +- **Tablet distribution**: Use `table-pre-split-tablets` (VC config) to distribute tablets evenly across all tablet servers. Without pre-splitting, initial writes hit a single tablet, causing latency spikes until automatic splitting occurs. +- **Rebalancing**: After adding new tablet servers, rebalance tablets to distribute data to the new nodes. +- **Connection load balancing**: Enable `database.load-balance: true` on all services connected to YugabyteDB to distribute queries across nodes. + +### PostgreSQL Considerations + +- **`table-pre-split-tablets`**: Automatically ignored when PostgreSQL is detected. +- **Connection pool**: Same `max-connections` and `min-connections` parameters apply. Size based on concurrent query volume. +- **Replication**: For read-heavy workloads, configure Query Service instances to connect to read replicas. + +## 9. Co-location Impact + +MVCC validation requires multiple database round-trips per transaction — read set validation, write set application, and status updates. When VC instances are co-located with database nodes, each of these round-trips takes microseconds instead of milliseconds. Without co-location, expect significantly higher commit latency, which directly limits overall system throughput. + +Co-location is most impactful for the VC service because it performs the most database operations per transaction. The Query Service benefits less because its read-only queries are less latency-sensitive. + +## 10. Benchmarking + +The tuning recommendations in this guide are starting points, not guarantees. Real performance depends on factors specific to your deployment: + +- Transaction size and complexity (number of read/write keys per transaction) +- Number of namespaces and endorsement policy complexity +- Read set and write set sizes +- Network topology and latency between nodes +- Storage hardware characteristics + +Benchmark with your actual workload on your target hardware to establish baseline performance and identify the tuning parameters that matter most for your use case.