|
1 | | -# Porting shared memory MPSC queues to distributed context using MPI-3 RMA |
| 1 | +# Non-Blocking Distributed MPSC Queues |
2 | 2 |
|
3 | 3 |   [](https://h-dna.github.io/MPiSC/) |
4 | 4 |
|
5 | | -This project ports lock-free Multiple-Producer Single-Consumer (MPSC) queue algorithms from shared-memory to distributed systems using MPI-3 Remote Memory Access (RMA). |
6 | | - |
7 | 5 | ## Table of Contents |
8 | 6 |
|
9 | | -- [Objective](#objective) |
10 | | -- [Motivation](#motivation) |
11 | | -- [Approach](#approach) |
12 | | - - [Why MPI RMA?](#why-mpi-rma) |
13 | | - - [Why MPI-3 RMA?](#why-mpi-3-rma) |
14 | | - - [Hybrid MPI+MPI](#hybrid-mpimpi) |
15 | | - - [Hybrid MPI+MPI+C++11](#hybrid-mpimpic11) |
16 | | - - [Lock-Free MPI Porting](#lock-free-mpi-porting) |
17 | | -- [Literature Review](#literature-review) |
18 | | - - [Known Problems](#known-problems) |
19 | | - - [Trends](#trends) |
20 | | -- [Evaluation Strategy](#evaluation-strategy) |
21 | | - - [Correctness](#correctness) |
22 | | - - [Lock-Freedom](#lock-freedom) |
23 | | - - [Performance](#performance) |
24 | | - - [Scalability](#scalability) |
| 7 | +- [Abstract](#abstract) |
| 8 | +- [Motivation and Methodology](#motivation-and-methodology) |
| 9 | +- [Contributions](#contributions) |
| 10 | +- [Results](#results) |
25 | 11 | - [Related](#related) |
26 | 12 |
|
27 | | -## Objective |
28 | | - |
29 | | -- Survey shared-memory literature for lock-free, concurrent MPSC queue algorithms. |
30 | | -- Port candidate algorithms to distributed contexts using MPI-3 RMA. |
31 | | -- Optimize ports using MPI-3 SHM and the C++11 memory model. |
32 | | - |
33 | | -Target characteristics: |
34 | | - |
35 | | -| Dimension | Requirement | |
36 | | -| ------------------- | ----------------------- | |
37 | | -| Queue length | Fixed | |
38 | | -| Number of producers | Multiple | |
39 | | -| Number of consumers | Single | |
40 | | -| Operations | `enqueue`, `dequeue` | |
41 | | -| Progress guarantee | Lock-free | |
42 | | - |
43 | | -## Motivation |
44 | | - |
45 | | -Queues are fundamental to scheduling, event handling, and message buffering. Under high contention—such as multiple event sources writing simultaneously—a poorly designed queue becomes a scalability bottleneck. This holds for both shared-memory and distributed systems. |
46 | | - |
47 | | -Shared-memory research has produced efficient, scalable, lock-free queue algorithms. Distributed computing literature largely ignores these algorithms due to differing programming models. MPI-3 RMA bridges this gap by enabling one-sided communication that closely mirrors shared-memory semantics. This project investigates whether porting shared-memory algorithms via MPI-3 RMA yields competitive distributed queues. |
48 | | - |
49 | | -## Approach |
50 | | - |
51 | | -We port lock-free queue algorithms using MPI-3 RMA, then optimize with MPI SHM (hybrid MPI+MPI) and C++11 atomics for intra-node communication. |
52 | | - |
53 | | -### Why MPI RMA? |
54 | | - |
55 | | -MPSC queues are *irregular* applications: |
56 | | - |
57 | | -- Memory access patterns are dynamic. |
58 | | -- Data locations are determined at runtime. |
59 | | - |
60 | | -Traditional two-sided communication (`MPI_Send`/`MPI_Recv`) requires the receiver to anticipate requests—impractical when access patterns are unknown. MPI RMA allows one-sided communication where the initiator specifies all parameters. |
61 | | - |
62 | | -### Why MPI-3 RMA? |
63 | | - |
64 | | -MPI-3 introduces `MPI_Win_lock_all`, a non-collective operation for opening access epochs on process groups, enabling lock-free synchronization. |
65 | | - |
66 | | -### Hybrid MPI+MPI |
| 13 | +## Abstract |
67 | 14 |
|
68 | | -Pure MPI ignores intra-node locality. MPI-3 SHM provides `MPI_Win_allocate_shared` for allocating shared memory windows among co-located processes. These windows use the unified memory model and can leverage both MPI and native synchronization. This exploits multi-core parallelism within nodes. |
| 15 | +Distributed applications such as the actor model and fan-out/fan-in pattern require MPSC queues that are both performant and fault-tolerant. We address the absence of non-blocking distributed MPSC queues by adapting LTQueue — a wait-free shared-memory MPSC queue — to distributed environments using MPI-3 RMA. We introduce three novel **wait-free** distributed MPSC queues: **dLTQueue**, **Slotqueue**, and **dLTQueueV2**. Evaluation on SuperMUC-NG and CoolMUC-4 shows ~2x better enqueue throughput than the existing AMQueue while providing stronger fault tolerance. |
69 | 16 |
|
70 | | -### Hybrid MPI+MPI+C++11 |
| 17 | +## Motivation and Methodology |
71 | 18 |
|
72 | | -C++11 atomics outperform MPI synchronization for intra-node communication. Using C++11 within shared memory windows optimizes the intra-node path. |
| 19 | +### The Problem |
73 | 20 |
|
74 | | -### Lock-Free MPI Porting |
| 21 | +MPSC queues are essential for **irregular applications** — programs with unpredictable, data-dependent memory access patterns: |
75 | 22 |
|
76 | | -MPI-3 RMA enables lock-free implementations: |
| 23 | +- **Actor model**: Each actor maintains a mailbox (MPSC queue) receiving messages from other actors |
| 24 | +- **Fan-out/fan-in**: Worker nodes enqueue results to an aggregation node for processing |
77 | 25 |
|
78 | | -- `MPI_Win_lock_all` / `MPI_Win_unlock_all` manage access epochs. |
79 | | -- MPI atomic operations (`MPI_Fetch_and_op`, `MPI_Compare_and_swap`) provide synchronization. |
| 26 | +These patterns demand queues that are both performant and fault-tolerant. A slow or crashed producer should not block the entire system. |
80 | 27 |
|
81 | | -## Literature Review |
| 28 | +### Gap in the Literature |
82 | 29 |
|
83 | | -### Known Problems |
| 30 | +**Shared-memory** has several non-blocking MPSC queues: LTQueue, DQueue, WRLQueue, and Jiffy. However, our analysis reveals critical flaws in most: |
84 | 31 |
|
85 | | -* **ABA problem** |
| 32 | +| Queue | Issue | |
| 33 | +|-------|-------| |
| 34 | +| DQueue | Incorrect ABA solution and unsafe memory reclamation | |
| 35 | +| WRLQueue | Actually **blocking** — dequeuer waits for all enqueuers | |
| 36 | +| Jiffy | Insufficient memory reclamation, not truly wait-free | |
| 37 | +| **LTQueue** | **Correct** — uses LL/SC for ABA, proper memory reclamation | |
86 | 38 |
|
87 | | -A pointer is reused after deallocation, causing a CAS to incorrectly succeed. |
| 39 | +**Distributed** has only one MPSC queue: **AMQueue**. Despite claiming lock-freedom, it is actually **blocking** — the dequeuer must wait for all enqueuers to finish. A single slow enqueuer halts the entire system. ([Confirmed by the original author](assets/amqueue-blocking-evidence.jpg)) |
88 | 40 |
|
89 | | -Solutions: Monotonic counters, hazard pointers. |
| 41 | +### Our Approach |
90 | 42 |
|
91 | | -* **Safe memory reclamation** |
| 43 | +We adapt **LTQueue** — the only correct shared-memory MPSC queue — to distributed environments using MPI-3 RMA one-sided communication. |
92 | 44 |
|
93 | | -Premature deallocation while other threads hold references. |
| 45 | +**Key challenge**: LTQueue relies on LL/SC (Load-Link/Store-Conditional) to solve the ABA problem, but LL/SC is unavailable in MPI. |
94 | 46 |
|
95 | | -Solutions: Hazard pointers, epoch-based reclamation. |
| 47 | +**Our solution**: Replace LL/SC with CAS + unique timestamps. Each value is tagged with a monotonically increasing version number, preventing ABA without LL/SC. |
96 | 48 |
|
97 | | -* **Empty queue contention** |
| 49 | +**Key techniques**: |
| 50 | +- **SPSC-per-enqueuer**: Each producer maintains a local queue, eliminating producer contention |
| 51 | +- **Unique timestamps**: Solves ABA via monotonic version numbers |
| 52 | +- **Double-refresh**: Bounds retries to two per node, ensuring wait-freedom |
98 | 53 |
|
99 | | -Concurrent `enqueue` and `dequeue` on an empty queue can conflict. |
| 54 | +## Contributions |
100 | 55 |
|
101 | | -Solutions: Sentinel node to separate head and tail pointers. |
| 56 | +### Findings |
102 | 57 |
|
103 | | -* **Intermediate state from slow processes** |
| 58 | +- **3 of 4** shared-memory MPSC queues (DQueue, WRLQueue, Jiffy) have correctness or progress issues |
| 59 | +- **AMQueue**, the only distributed MPSC queue, is blocking despite claims of lock-freedom |
| 60 | +- **LTQueue** is the only correct candidate for distributed adaptation |
104 | 61 |
|
105 | | -A delayed process may leave the queue in an inconsistent state mid-operation. |
| 62 | +### Novel Algorithms |
106 | 63 |
|
107 | | -Solutions: Helping—other processes complete the pending operation. |
| 64 | +| Algorithm | Progress | Enqueue | Dequeue | |
| 65 | +|-----------|----------|---------|---------| |
| 66 | +| **dLTQueue** | Wait-free | O(log n) remote | O(log n) remote | |
| 67 | +| **Slotqueue** | Wait-free | O(1) remote | O(1) remote, O(n) local | |
| 68 | +| **dLTQueueV2** | Wait-free | O(1) remote | O(1) remote, O(log n) local | |
108 | 69 |
|
109 | | -* **Intermediate state from failed processes** |
| 70 | +All algorithms are **linearizable** with no dynamic memory allocation. |
110 | 71 |
|
111 | | -A crashed process may leave the queue permanently inconsistent. |
| 72 | +## Results |
112 | 73 |
|
113 | | -Solutions: Helping mechanisms that can complete any pending operation. |
| 74 | +Benchmarked on [SuperMUC-NG](https://doku.lrz.de/supermuc-ng-10745965.html) (6000+ nodes) and [CoolMUC-4](https://doku.lrz.de/coolmuc-4-10746415.html) (100+ nodes): |
114 | 75 |
|
115 | | -* **Help mechanism rationale** |
| 76 | +| Metric | Our Queues vs AMQueue | |
| 77 | +|--------|----------------------| |
| 78 | +| Enqueue throughput | **~2x better** | |
| 79 | +| Dequeue throughput | 3-10x worse | |
| 80 | +| Fault tolerance | **Wait-free** (vs blocking) | |
116 | 81 |
|
117 | | -Multi-step operations can leave the queue in intermediate states. Rather than blocking until consistency is restored, processes detect and complete pending operations. Implementation: |
118 | | - |
119 | | -1. Detect intermediate state |
120 | | -2. Attempt completion via CAS |
121 | | - |
122 | | -A failed CAS indicates another process already helped; retry is unnecessary. |
123 | | - |
124 | | -### Trends |
125 | | - |
126 | | -- Fast-path optimization |
127 | | - - Lock-free fast path with wait-free fallback |
128 | | - - Replace CAS with FAA or load/store where possible |
129 | | -- Contention reduction |
130 | | - - Per-producer local buffers |
131 | | - - Elimination and backoff (for MPMC) |
132 | | -- Cache-aware design |
133 | | - |
134 | | -## Evaluation Strategy |
135 | | - |
136 | | -We focus on the following criteria, in the order of decreasing importance: |
137 | | -* Correctness |
138 | | -* Lock-freedom |
139 | | -* Performance & Scalability |
140 | | - |
141 | | -### Correctness |
142 | | - |
143 | | -- Linearizability |
144 | | -- ABA-freedom |
145 | | -- Safe memory reclamation |
146 | | - |
147 | | -### Lock-Freedom |
148 | | - |
149 | | -No process may block system-wide progress. Note: lock-freedom depends on underlying primitives being lock-free on the target platform. |
150 | | - |
151 | | -### Performance |
152 | | - |
153 | | -Minimize latency and maximize throughput for target workloads. |
154 | | - |
155 | | -### Scalability |
156 | | - |
157 | | -Throughput should scale with process count. |
| 82 | +**Trade-off**: Stronger fault tolerance at the cost of dequeue performance. |
158 | 83 |
|
159 | 84 | ## Related |
160 | 85 |
|
161 | | -- [dLTQueue: A Non-Blocking Distributed-Memory Multi-Producer Single-Consumer Queue](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue) |
162 | | -- [Slotqueue: A Wait-Free Distributed Multi-Producer Single-Consumer Queue with Constant Remote Operations](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations) |
| 86 | +1. **dLTQueue** - FDSE 2025 ([ResearchGate](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue)) |
| 87 | +2. **Slotqueue** - NPC 2025 ([ResearchGate](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations)) |
| 88 | + |
| 89 | +[Full thesis](https://h-dna.github.io/MPiSC/) |
0 commit comments