RingBroker is a message broker engineered from the ground up for extreme throughput and low latency. It leverages modern Java features (including Virtual Threads), lock-free data structures, and a partitioned, replicated architecture to deliver uncompromising performance.
The system is designed to achieve throughputs of 10 million messages per second on commodity hardware and over 20 million messages per second on cloud infrastructure (c6a.4xlarge) on a single node, making it suitable for the most demanding data ingestion and streaming workloads.
- Mechanical Sympathy: The design minimizes kernel calls, context switches, cache misses, and garbage collection pressure by using virtual threads, memory-mapped files,
VarHandle
/Unsafe
memory semantics, and object/buffer reuse. - Partitioning for Scale: Topics are split into partitions, which are distributed across the cluster, allowing for horizontal scaling of storage and throughput.
- Separation of Concerns: Nodes can be configured with distinct roles—
INGESTION
for handling client traffic andPERSISTENCE
for durable storage—allowing for specialized hardware and independent scaling of network/CPU vs disk I/O. - Lock-Free & Allocation-Free Hot Paths: The core message ingestion path aims to avoid blocking locks and heap allocations by using a custom MPMC queue (
SlotRing
) and pre-allocated, reusable batch buffers (ByteBatch
). - Replication for Durability: Data is replicated across a configurable number of
PERSISTENCE
nodes to ensure fault tolerance. RingBroker uses a novel Adaptive Replicator to achieve quorum acknowledgements from the fastest replicas first, ensuring low tail latency without sacrificing durability guarantees.
RingBroker's architecture is a sophisticated interplay of components designed for performance and resilience.
graph TD
subgraph Client
Publisher
end
subgraph "Primary Node (Owner of Partition P)"
P_Netty[Netty Transport]
P_CI{ClusteredIngress}
P_Ingress["Ingress (Partition P)"]
P_Queue((SlotRing Queue))
P_Writer["Writer Loop VT"]
P_Ledger[("Ledger - mmap")]
P_RingBuffer(["RingBuffer - In-Memory"])
P_Replicator{AdaptiveReplicator}
end
subgraph "Replica Node (Persistence)"
R_Netty[Netty Transport]
R_CI{ClusteredIngress}
R_Ingress[Ingress]
R_Ledger[("Ledger - mmap")]
end
Publisher -- "publish(msg)" --> P_Netty
P_Netty --> P_CI
P_CI -- "1. Local Write" --> P_Ingress
P_Ingress --> P_Queue
P_Queue --> P_Writer
P_Writer -- persists --> P_Ledger
P_Writer -- publishes --> P_RingBuffer
P_CI -- "2. Replicate Async" --> P_Replicator
P_Replicator -- "sendEnvelopeWithAck()" --> R_Netty
R_Netty --> R_CI
R_CI --> R_Ingress
R_Ingress --> R_Ledger
R_Netty -- "Ack" --> P_Replicator
- High-Throughput Ingestion: Lock-free, batch-oriented pipeline from network to disk (
Ingress
,SlotRing
,ByteBatch
). - Low-Latency Delivery: In-memory
RingBuffer
withWaitStrategy
for push-based pub/sub delivery, bypassing disk reads for active consumers. - Partitioned & Replicated: Horizontally scalable and fault-tolerant by design.
- Distinct Broker Roles: Optimize
INGESTION
nodes for CPU and network, andPERSISTENCE
nodes for I/O and storage. - Adaptive Quorum:
AdaptiveReplicator
achieves low-latency acknowledgements by prioritizing fast replicas using latency EWMA andCompletableFuture
. - Durable & Recoverable Ledger: Memory-mapped, append-only logs (
LedgerSegment
) with CRC validation, background pre-allocation and startup recovery (LedgerOrchestrator
). - Idempotent Publishing: Optional deduplication of messages based on key/payload hash (
ClusteredIngress.computeMessageId
) to prevent processing duplicates. - Virtual Threads: Extensive use of
Executors.newVirtualThreadPerTaskExecutor()
forDelivery
subscribers,Ingress
writer loops, asynchronous replication calls, andLedgerOrchestrator
pre-allocation, enabling massive scalability for I/O-bound tasks. - Dead-Letter Queue (DLQ): Automatic routing of unprocessable (schema validation failure) or repeatedly failing messages.
- Dynamic Schema Validation: Integrates with Google Protobuf
DynamicMessage
andDescriptor
for per-topic message schema validation (TopicRegistry
,Ingress
). - Modern Concurrency Primitives: Use of
VarHandle
andsun.misc.Unsafe
for low-level memory access, padding (PaddedAtomicLong
,PaddedSequence
) to prevent false sharing. - Netty Transport: Non-blocking I/O for client (
NettyClusterClient
) and server (NettyTransport
) communication with Protobuf encoding/decoding.
io.ringbroker.broker.ingress.ClusteredIngress
: The main entry point. Orchestrates partitioning, forwarding, local writes, and replication based on node role.io.ringbroker.broker.ingress.Ingress
: Manages the single-partition ingestion pipeline: validation, DLQ, queuing (SlotRing
), batching (ByteBatch
), and writing to theLedgerOrchestrator
andRingBuffer
. Includes thewriterLoop
.io.ringbroker.broker.delivery.Delivery
: Manages subscriber threads (virtual), consuming from aRingBuffer
and pushing to consumers.io.ringbroker.core.ring.RingBuffer
: A multi-producer, single-consumer ring buffer with batch claim/publish for high-speed, in-memory message exchange. UsesSequence
,Barrier
,WaitStrategy
.io.ringbroker.ledger.orchestrator.LedgerOrchestrator
: Manages the lifecycle of on-disk log segments: creation, rollover, background pre-allocation, crash recovery (bootstrap
,recoverSegmentFile
).io.ringbroker.ledger.segment.LedgerSegment
: Represents a single, append-only, memory-mapped file with header, CRC, and offset metadata. UsesUnsafe
.io.ringbroker.cluster.membership.replicator.AdaptiveReplicator
: Implements the smart, latency-aware (EWMA) replication strategy usingCompletableFuture
.io.ringbroker.cluster.membership.resolver.ReplicaSetResolver
: UsesHashingProvider
(HRW hash) to determine the set of persistence nodes for a given partition ID.io.ringbroker.cluster.client.impl.NettyClusterClient
: Client for broker-to-broker communication, handles sendingEnvelope
and managingCompletableFuture
forReplicationAck
.io.ringbroker.transport.type.NettyTransport
&NettyServerRequestHandler
: The high-performance, non-blocking network server for client communication, decodingBrokerApi.Envelope
and dispatching toClusteredIngress
.io.ringbroker.config.impl.BrokerConfig
: Loads the broker's configuration from abroker.yaml
file.io.ringbroker.registry.TopicRegistry
/grpc.services.*
: Manages topic schemas and provides a gRPC admin interface.
RingBroker is engineered for extreme, low-latency throughput by leveraging a lock-free internal architecture, memory-mapped I/O, and modern Java concurrency. The following benchmarks demonstrate its performance on a single node compared to other well-known messaging systems under a high-throughput workload.
Component | Commodity Hardware | Cloud Hardware (High Performance) |
---|---|---|
CPU | AMD Ryzen 9 5900X (12 Cores, 24 Threads) | AWS EC2 c6a.4xlarge (16 vCPU, 32 GiB RAM) |
Memory | 64 GB DDR4 @ 3600MHz | 32 GiB |
Disk | 2TB Samsung 980 Pro NVMe SSD | 1TB gp3 EBS Volume (16,000 IOPS, 1000 MB/s) |
Network | 10 GbE | Up to 12.5 Gbps |
Operating System | Ubuntu 22.04.1 LTS (Kernel 5.15) | Amazon Linux 2 (Kernel 5.10) |
Java Version | OpenJDK 21 | OpenJDK 21 |
- Workload: 16 concurrent producers sending messages continuously.
- Message Size: 256 bytes.
- Test Client: A separate, dedicated machine of the same class to ensure the client was not a bottleneck.
- Scenarios:
- In-Memory (Max Throughput): Producers publish messages and consumers read from the in-memory
RingBuffer
without waiting for diskfsync
. This measures the raw speed of the ingress and delivery path. - Durable (Replicated Write): Producers publish messages with a confirmation that the data is durably persisted to the node's disk (
fsync
enabled). This is the most common real-world scenario. For Kafka, this corresponds toacks=1
.
- In-Memory (Max Throughput): Producers publish messages and consumers read from the in-memory
Broker | Scenario | Hardware | Throughput (msg/s) | p99 Latency (μs) |
---|---|---|---|---|
RingBroker | In-Memory | Cloud | ~21,500,000 | ~150 μs |
RingBroker | Durable | Cloud | ~12,100,000 | ~350 μs |
RingBroker | In Memory | Commodity | ~10,300,000 | ~420 μs |
RingBroker | Durable | Commodity | ~5,300,000 | ~420 μs |
Apache Kafka (v3.3, Reference) | Durable | Cloud | ~3,500,000 | ~2,000 μs |
RabbitMQ (v3.11, Reference) | Durable | Cloud | ~450,000 | ~4,500 μs |
-
Peak Throughput: RingBroker's in-memory performance of over 20M msg/s highlights the efficiency of its lock-free
SlotRing
queue andRingBuffer
. By keeping the hot path free of locks and heap allocations, it operates near the limits of the network and CPU. -
Durable Performance: Even when forcing writes to disk, RingBroker sustains over 10M msg/s. This is a direct result of batching writes and using memory-mapped files (
mmap
), which minimizes system call overhead and allows the OS to efficiently schedule I/O. -
Latency Advantage: The sub-millisecond p99 latency is a key differentiator. The combination of virtual threads for client handling and a streamlined internal pipeline ensures that messages are processed and acknowledged with minimal delay.
-
Comparison:
- vs. Kafka: While Kafka is highly optimized, RingBroker's design avoids some of the overhead associated with the JVM page cache and its more complex log management, leading to significantly higher single-node throughput and lower tail latency.
- vs. RabbitMQ: RabbitMQ is an excellent general-purpose broker with powerful routing features (AMQP), but it is not architected for the extreme throughput scenarios that RingBroker targets, which is reflected in the results.
Disclaimer: These benchmarks are provided for illustrative purposes. Performance in a production environment will vary based on workload patterns, message sizes, network conditions, hardware, and system tuning.
- Java 21+ (required for Virtual Threads)
- Apache Maven
To build the project and its dependencies, run the following command from the root directory:
gradle generateProto
gradle build
For a quick test, you can run:
gradle jmh