|
1 | 1 | # Porting shared memory MPSC queues to distributed context using MPI-3 RMA |
2 | 2 |
|
| 3 | +  [](https://h-dna.github.io/MPiSC/) |
| 4 | + |
| 5 | +This project ports lock-free Multiple-Producer Single-Consumer (MPSC) queue algorithms from shared-memory to distributed systems using MPI-3 Remote Memory Access (RMA). |
| 6 | + |
| 7 | +## Table of Contents |
| 8 | + |
| 9 | +- [Publications](#publications) |
| 10 | +- [Objective](#objective) |
| 11 | +- [Motivation](#motivation) |
| 12 | +- [Approach](#approach) |
| 13 | + - [Why MPI RMA?](#why-mpi-rma) |
| 14 | + - [Why MPI-3 RMA?](#why-mpi-3-rma) |
| 15 | + - [Hybrid MPI+MPI](#hybrid-mpimpi) |
| 16 | + - [Hybrid MPI+MPI+C++11](#hybrid-mpimpic11) |
| 17 | + - [Lock-free MPI porting](#lock-free-mpi-porting) |
| 18 | +- [Literature Review](#literature-review) |
| 19 | + - [Known Problems](#known-problems) |
| 20 | + - [Trends](#trends) |
| 21 | +- [Evaluation Strategy](#evaluation-strategy) |
| 22 | + - [Correctness](#correctness) |
| 23 | + - [Lock-freedom](#lock-freedom) |
| 24 | + - [Performance](#performance) |
| 25 | + - [Scalability](#scalability) |
| 26 | + |
| 27 | +## Publications |
| 28 | + |
| 29 | +- [dLTQueue: A Non-Blocking Distributed-Memory Multi-Producer Single-Consumer Queue](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue) |
| 30 | +- [Slotqueue: A Wait-Free Distributed Multi-Producer Single-Consumer Queue with Constant Remote Operations](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations) |
| 31 | + |
3 | 32 | ## Objective |
4 | 33 |
|
5 | | -- Examination of the *shared-memory* literature to find potential *lock-free*, *concurrent*, *multiple-producer single-consumer* queue algorithms. |
6 | | -- Use the new MPI-3 RMA capabilities to port potential lock-free *shared-memory* queue algorithms to distributed context. |
7 | | -- Potentially optimize MPI RMA ports using MPI-3 SHM + C++11 memory model. |
| 34 | +- Survey shared-memory literature for lock-free, concurrent MPSC queue algorithms. |
| 35 | +- Port candidate algorithms to distributed contexts using MPI-3 RMA. |
| 36 | +- Optimize ports using MPI-3 SHM and the C++11 memory model. |
8 | 37 |
|
9 | | -- Minimum required characteristics: |
| 38 | +Target characteristics: |
10 | 39 |
|
11 | | -| Dimension | Desired property | |
| 40 | +| Dimension | Requirement | |
12 | 41 | | ------------------- | ----------------------- | |
13 | | -| Queue length | Fixed length | |
14 | | -| Number of producers | Many | |
15 | | -| Number of consumers | One | |
16 | | -| Operations | `queue`, `enqueue` | |
17 | | -| Concurrency | Concurrent & Lock-free | |
| 42 | +| Queue length | Fixed | |
| 43 | +| Number of producers | Multiple | |
| 44 | +| Number of consumers | Single | |
| 45 | +| Operations | `enqueue`, `dequeue` | |
| 46 | +| Concurrency | Lock-free | |
18 | 47 |
|
19 | 48 | ## Motivation |
20 | 49 |
|
21 | | -- Queue is the backbone data structures in many applications: scheduling, event handling, message bufferring. In these applications, the queue may be highly contended, for example, in event handling, there can be multiple sources of events & many consumers of events at the same time. If the queue has not been designed properly, it can become a bottleneck in a highly concurrent environment, adversely affecting the application's scalability. This sentiment also applies to queues in distributed contexts. |
22 | | -- Within the context of shared-memory, there have been plenty of research and testing going into efficient, scalable & lock-free queue algorithms. This presents an opportunity to port these high-quality algorithms to the distributed context, albeit some inherent differences that need to be taken into consideration between the two contexts. |
23 | | -- In the distributed literature, most of the proposed algorithms completely disregard the existing shared-memory algorithms, mostly due to the discrepancy between the programming model of shared memory and that of distributed computing. However, with MPI-3 RMA, the gap is bridged, and we can straightforwardly model shared memory application using MPI. This is why we investigate the porting approach & compare them with existing distributed queue algorithms. |
| 50 | +Queues are fundamental to scheduling, event handling, and message buffering. Under high contention—such as multiple event sources writing simultaneously—a poorly designed queue becomes a scalability bottleneck. This holds for both shared-memory and distributed systems. |
| 51 | + |
| 52 | +Shared-memory research has produced efficient, scalable, lock-free queue algorithms. Distributed computing literature largely ignores these algorithms due to differing programming models. MPI-3 RMA bridges this gap by enabling one-sided communication that closely mirrors shared-memory semantics. This project investigates whether porting shared-memory algorithms via MPI-3 RMA yields competitive distributed queues. |
24 | 53 |
|
25 | 54 | ## Approach |
26 | 55 |
|
27 | | -The porting approach we choose is to use MPI-3 RMA to port lock-free queue algorithms. We further optimize these ports using MPI SHM (or the so called MPI+MPI hybrid approach) and C++11 for shared memory synchronization. |
| 56 | +We port lock-free queue algorithms using MPI-3 RMA, then optimize with MPI SHM (hybrid MPI+MPI) and C++11 atomics for intra-node communication. |
| 57 | + |
| 58 | +### Why MPI RMA? |
| 59 | + |
| 60 | +MPSC queues are *irregular* applications: |
| 61 | + |
| 62 | +- Memory access patterns are dynamic. |
| 63 | +- Data locations are determined at runtime. |
| 64 | + |
| 65 | +Traditional two-sided communication (`MPI_Send`/`MPI_Recv`) requires the receiver to anticipate requests—impractical when access patterns are unknown. MPI RMA allows one-sided communication where the initiator specifies all parameters. |
| 66 | + |
| 67 | +### Why MPI-3 RMA? |
| 68 | + |
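| 69 | +MPI-3 introduces `MPI_Win_lock_all`, a non-collective call that opens a shared-lock access epoch to every process in a window. Despite its name, it does not serialize accesses from different origins, so combined with MPI atomic operations it allows lock-free synchronization. |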
| 70 | + |
| 71 | +### Hybrid MPI+MPI |
| 72 | + |
| 73 | +Pure MPI ignores intra-node locality. MPI-3 SHM provides `MPI_Win_allocate_shared` for allocating shared memory windows among co-located processes. These windows use the unified memory model and can leverage both MPI and native synchronization. This exploits multi-core parallelism within nodes. |
| 74 | + |
| 75 | +### Hybrid MPI+MPI+C++11 |
28 | 76 |
|
29 | | -<details> |
30 | | - <summary>Why MPI RMA?</summary> |
31 | | - |
32 | | - MPSC queue belongs to the class of <i>irregular</i> applications, this means that: |
33 | | - <ul> |
34 | | - <li>Memory access pattern is not known.</li> |
35 | | - <li>Data locations cannot be known in advance, it can change during execution.</li> |
36 | | - </ul> |
37 | | - |
38 | | - In other words, we cannot statically analyze where the data may be stored - data can be stored anywhere and we can only determine its location at runtime. This means the tradition message passing interface using <code>MPI_Send</code> and <code>MPI_Recv</code> is insufficient: Suppose at runtime, process <code>A</code> wants and knows to access a piece of data at <code>B</code>, then <code>A</code> must issue <code>MPI_Recv(B)</code>, but this requires <code>B</code> to anticipate that it should issue <code>MPI_Send(A, data)</code> and know that which data <code>A</code> actually wants. The latter issue can be worked around by having <code>A</code> issue <code>MPI_Send(B, data_descriptor)</code> first. Then, <code>B</code> must have waited for <code>MPI_Recv(A)</code>. However, because the memory access pattern is not known, <code>B</code> must anticipate that any other processes may want to access its data. It is possible but cumbersome. |
39 | | - |
40 | | - MPI RMA is specifically designed to conveniently express irregular applications by having one side specify all it wants. |
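| 77 | +Within a shared memory window, C++11 atomics are typically cheaper than MPI synchronization calls because they compile to native atomic instructions. Using C++11 atomics inside shared memory windows is therefore an optimization of the intra-node path. |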
41 | 78 |
|
42 | | -</details> |
| 79 | +### Lock-free MPI porting |
43 | 80 |
|
44 | | -<details> |
45 | | - <summary>Why MPI-3 RMA?</summary> |
| 81 | +MPI-3 RMA enables lock-free implementations: |
46 | 82 |
|
47 | | - MPI-3 improves the RMA API, providing the non-collective <code>MPI_Win_lock_all</code> for a process to open an access epoch on a group of processes. This allows for lock-free synchronization. |
48 | | -</details> |
| 83 | +- `MPI_Win_lock_all` / `MPI_Win_unlock_all` manage access epochs. |
| 84 | +- MPI atomic operations (`MPI_Fetch_and_op`, `MPI_Compare_and_swap`) provide synchronization. |
49 | 85 |
|
50 | | -<details> |
51 | | - <summary>Hybrid MPI+MPI</summary> |
52 | | - The Pure MPI approach is oblivious to the fact that some MPI processes are on the same node, which causes some unnecessary overhead. MPI-3 introduces the MPI SHM API, allowing us to obtain a communicator containing processes on a single node. From this communicator, we can allocate a shared memory window using <code>MPI_Win_allocate_shared</code>. Hybrid MPI+MPI means that MPI is used for both intra-node and inter-node communication. This shared memory window follows the <em>unified memory model</em> and can be synchronized both using MPI facilities or any other alternatives. Hybrid MPI+MPI can take advantage of the many cores of current computer processors. |
53 | | -</details> |
| 86 | +## Literature Review |
54 | 87 |
|
55 | | -<details> |
56 | | - <summary>Hybrid MPI+MPI+C++11</summary> |
57 | | - Within the shared memory window, C++11 synchronization facilities can be used and prove to be much more efficient than MPI. So incorporating C++11 can be thought of as an optimization step for intra-node communication. |
58 | | -</details> |
| 88 | +### Known Problems |
59 | 89 |
|
60 | | -<details> |
61 | | - <summary>How to perform an MPI port in a lock-free manner?</summary> |
62 | | - |
63 | | - With MPI-3 RMA capabilities: |
64 | | - <ul> |
65 | | - <li>Use <code>MPI_Win_lock_all</code> and <code>MPI_Win_unlock_all</code> to open and end access epochs.</li> |
66 | | - <li>Within an access epoch, MPI atomics are used.</li> |
67 | | - </ul> |
68 | | -</details> |
| 90 | +* **ABA problem** |
69 | 91 |
|
70 | | -## Literature review |
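| 92 | +A location changes from value A to B and back to A (for example, when the address of a freed node is reused), so a CAS that expected A succeeds even though the state changed in between. |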
71 | 93 |
|
72 | | -### Known problems |
73 | | -- ABA problem. |
| 94 | +Solutions: Monotonic counters, hazard pointers. |
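The monotonic-counter remedy can be illustrated with a small sketch. It assumes a queue that refers to nodes by a 32-bit slot index, so that the index and a version counter fit in one 64-bit word that a single CAS can check. The names `TaggedIndex`, `pack`, `unpack`, `update_index`, and `stale_cas_is_rejected` are hypothetical and do not come from the project.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical illustration: a slot index and a version counter share one
// 64-bit word, so one CAS compares both at the same time.
struct TaggedIndex {
    std::uint32_t index;
    std::uint32_t version;
};

inline std::uint64_t pack(TaggedIndex t) {
    return (static_cast<std::uint64_t>(t.version) << 32) | t.index;
}

inline TaggedIndex unpack(std::uint64_t word) {
    return TaggedIndex{static_cast<std::uint32_t>(word & 0xFFFFFFFFu),
                       static_cast<std::uint32_t>(word >> 32)};
}

// Replace the index only if the whole word still equals `expected`.
// Every successful update bumps the version, so a word whose index went
// A -> B -> A no longer compares equal to a stale snapshot of A.
inline bool update_index(std::atomic<std::uint64_t>& head,
                         TaggedIndex expected, std::uint32_t new_index) {
    std::uint64_t old_word = pack(expected);
    std::uint64_t new_word = pack(TaggedIndex{new_index, expected.version + 1});
    return head.compare_exchange_strong(old_word, new_word,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire);
}

// Single-threaded replay of an ABA interleaving.
// Returns true if the CAS based on the stale snapshot is rejected.
inline bool stale_cas_is_rejected() {
    std::atomic<std::uint64_t> head(pack(TaggedIndex{7, 0}));
    TaggedIndex stale = unpack(head.load());            // slow process reads index 7
    bool first = update_index(head, stale, 9);          // other process: 7 -> 9
    bool second = update_index(head, unpack(head.load()), 7);  // 9 -> 7, index reused
    bool stale_ok = update_index(head, stale, 3);       // stale version 0, current is 2
    assert(first && second);
    return !stale_ok && unpack(head.load()).index == 7;
}
```

The version counter can wrap around in principle. A 32-bit counter makes that unlikely within the lifetime of one stale snapshot, but it reduces the risk rather than removing it.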
74 | 95 |
|
75 | | - Possible solutions: Monotonic counter, hazard pointer. |
| 96 | +* **Safe memory reclamation** |
76 | 97 |
|
77 | | -- Safe memory reclamation problem. |
| 98 | +Premature deallocation while other threads hold references. |
78 | 99 |
|
79 | | - Possible solutions: Hazard pointer. |
| 100 | +Solutions: Hazard pointers, epoch-based reclamation. |
80 | 101 |
|
81 | | -- Special case: empty queue - Concurrent `enqueue` and `dequeue` can conflict with each other. |
| 102 | +* **Empty queue contention** |
82 | 103 |
|
83 | | - Possible solutions: Dummy node to decouple head and tail. |
| 104 | +Concurrent `enqueue` and `dequeue` on an empty queue can conflict. |
84 | 105 |
|
85 | | -- A slow process performing `enqueue` and `dequeue` could leave the queue in an intermediate state. |
| 106 | +Solutions: Sentinel node to separate head and tail pointers. |
86 | 107 |
|
87 | | - Possible solutions: |
88 | | - - Help mechanism: To be lock-free, the other processes can help out patching up the queue (do not wait). |
| 108 | +* **Intermediate state from slow processes** |
89 | 109 |
|
90 | | -- A dead process performing `enqueue` and `dequeue` could leave the queue broken. |
91 | | - |
92 | | - Possible solutions: |
93 | | - - Help mechanism: The other processes can help out patching up the queue. |
| 110 | +A delayed process may leave the queue in an inconsistent state mid-operation. |
94 | 111 |
|
95 | | -- Motivation for the help mechanism? |
| 112 | +Solutions: Helping—other processes complete the pending operation. |
96 | 113 |
|
97 | | - Why: If `enqueue` or `dequeue` needs to perform some updates on the queue to move it to a consistent state, then a suspended process may leave the queue in an intermediate state. The `enqueue` and `dequeue` should not wait until it sees a consistent state or else the algorithm is blocking. Rather, they should help the suspended process complete the operation. |
| 114 | +* **Intermediate state from failed processes** |
98 | 115 |
|
99 | | - Solutions often involve (1) detecting intermediate state (2) trying to patch. |
| 116 | +A crashed process may leave the queue permanently inconsistent. |
100 | 117 |
|
101 | | - Possible solutions: |
102 | | - - Typically, updates are performed using CAS. If CAS fails, some state changes have occurred, we can detect if this is intermediary & try to perform another CAS to patch up the queue. |
103 | | - Note that the patching CAS may fail in case the queue is just patched up, so looping until a successful CAS may not be necessary. |
| 118 | +Solutions: Helping mechanisms that can complete any pending operation. |
| 119 | + |
| 120 | +* **Help mechanism rationale** |
| 121 | + |
| 122 | +Multi-step operations can leave the queue in intermediate states. Rather than blocking until consistency is restored, processes detect and complete pending operations. Implementation: |
| 123 | + |
| 124 | +1. Detect intermediate state |
| 125 | +2. Attempt completion via CAS |
| 126 | + |
| 127 | +A failed patching CAS usually means another process has already completed the pending operation, so the helper does not need to loop until its own CAS succeeds. |
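The detect-then-patch pattern can be sketched on a linked queue in the style of the Michael-Scott algorithm, where an enqueue first links a node and then swings the tail. This is a shared-memory illustration under simplifying assumptions (no dequeue, no memory reclamation), not the project's algorithm. `Node`, `Queue`, `link_node`, `help_advance_tail`, `enqueue`, and `helper_repairs_stalled_enqueue` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int value;
    std::atomic<Node*> next;
    explicit Node(int v) : value(v), next(nullptr) {}
};

struct Queue {
    std::atomic<Node*> tail;
    explicit Queue(Node* sentinel) : tail(sentinel) {}
};

// Step 1 of an enqueue: link `node` after the node that `tail` points to.
// A process that stalls right after this step leaves `tail` one node behind.
inline bool link_node(Queue& q, Node* node) {
    Node* last = q.tail.load(std::memory_order_acquire);
    Node* expected = nullptr;
    return last->next.compare_exchange_strong(expected, node,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire);
}

// Detect a lagging tail and try once to swing it forward.
// Returns true if an intermediate state was observed.
inline bool help_advance_tail(Queue& q) {
    Node* last = q.tail.load(std::memory_order_acquire);
    Node* next = last->next.load(std::memory_order_acquire);
    if (next == nullptr) return false;  // queue is consistent
    // One attempt only: if this CAS fails, someone else already moved the tail.
    q.tail.compare_exchange_strong(last, next,
                                   std::memory_order_acq_rel,
                                   std::memory_order_acquire);
    return true;
}

inline void enqueue(Queue& q, Node* node) {
    while (!link_node(q, node)) {
        help_advance_tail(q);  // finish the other enqueue instead of waiting
    }
    help_advance_tail(q);      // step 2 for our own node (a helper may have done it)
}

// Replays a stalled enqueuer followed by a helping enqueuer.
inline bool helper_repairs_stalled_enqueue() {
    Node sentinel(0), a(1), b(2);
    Queue q(&sentinel);
    bool linked = link_node(q, &a);                 // "stalled" after step 1
    bool lagging = (q.tail.load() == &sentinel);    // tail was not swung
    enqueue(q, &b);                                 // helps, then links b
    return linked && lagging && sentinel.next.load() == &a &&
           a.next.load() == &b && q.tail.load() == &b;
}
```

The second enqueuer never waits for the stalled one. Its first link attempt fails, it advances the tail on the stalled process's behalf, and then completes its own operation.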
104 | 128 |
|
105 | 129 | ### Trends |
106 | 130 |
|
107 | | -- Speed up happy paths. |
108 | | - - The happy path can use lock-free algorithm and fall back to the wait-free algorithm. As lock-free algorithms are typically more efficient, this can lead to speedups. |
109 | | - - Replacing CAS with simpler operations like FAA, load/store in the fast path. |
110 | | -- Avoid contention: Enqueuers or dequeuers performing on a shared data structures can harm each other's progress. |
111 | | - - Local buffers can be used at the enqueuers' side in MPSC queue so that enqueuers do not contend with each other. |
112 | | - - Elimination + Backing off techniques in MPMC. |
113 | | -- Cache-aware solutions. |
| 131 | +- Fast-path optimization |
| 132 | + - Lock-free fast path with wait-free fallback |
| 133 | + - Replace CAS with FAA or load/store where possible |
| 134 | +- Contention reduction |
| 135 | + - Per-producer local buffers |
| 136 | + - Elimination and backoff (for MPMC) |
| 137 | +- Cache-aware design |
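As an illustration of the per-producer local buffer idea, the sketch below builds an MPSC queue from one single-producer single-consumer ring per producer, so producers never write to the same memory. It is a shared-memory sketch with hypothetical names (`SpscRing`, `BufferedMpsc`), not the project's implementation. It preserves order per producer only, and a real design would need extra information, such as timestamps, to order items across producers.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Bounded ring with exactly one producer and one consumer.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    bool try_push(const T& v) {  // called by the owning producer only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == Capacity) return false;  // full
        slots_[t % Capacity] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    bool try_pop(T& out) {       // called by the single consumer only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[h % Capacity];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return true;
    }
private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// MPSC queue in which each producer owns a private ring.
template <typename T, std::size_t Capacity>
class BufferedMpsc {
public:
    explicit BufferedMpsc(std::size_t producers) : rings_(producers) {}

    bool enqueue(std::size_t producer_id, const T& v) {
        assert(producer_id < rings_.size());
        return rings_[producer_id].try_push(v);  // no contention with other producers
    }

    // Round-robin scan so that no producer's ring is starved.
    bool dequeue(T& out) {
        for (std::size_t i = 0; i < rings_.size(); ++i) {
            std::size_t r = (next_ + i) % rings_.size();
            if (rings_[r].try_pop(out)) {
                next_ = r + 1;
                return true;
            }
        }
        return false;  // every ring was empty when scanned
    }
private:
    std::vector<SpscRing<T, Capacity>> rings_;
    std::size_t next_ = 0;  // touched by the consumer only
};
```

A usage example would construct `BufferedMpsc<int, 1024> q(num_producers)`, have producer `i` call `q.enqueue(i, value)`, and have the consumer call `q.dequeue(out)` in a loop. The trade-off is that the consumer scans up to one ring per producer on each dequeue.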
114 | 138 |
|
115 | | -## Evaluation strategy |
| 139 | +## Evaluation Strategy |
116 | 140 |
|
117 | | -We need to evaluate at least 3 levels: |
118 | | -- Theory verification: Prove that the algorithm possesses the desired properties. |
119 | | -- Implementation verification: Even though theory is correct, implementation details nuances can affect the desired properties. |
120 | | - - Static verification: *Verify* the source code + its dependencies. |
121 | | - - Dynamic verification: *Verify* its behavior at runtime & *Benchmark*. |
| 141 | +We focus on the following criteria, in order of decreasing importance: |
| 142 | +* Correctness |
| 143 | +* Lock-freedom |
| 144 | +* Performance & Scalability |
122 | 145 |
|
123 | 146 | ### Correctness |
124 | | -- Linearizability |
125 | | -- No problematic ABA problem |
126 | | -- Memory safety: |
127 | | - - Safe memory reclamation |
128 | 147 |
|
129 | | -### Performance |
130 | | -- Performance: The less time it takes to serve common workloads on the target platform the better. |
| 148 | +- Linearizability |
| 149 | +- ABA-freedom |
| 150 | +- Safe memory reclamation |
131 | 151 |
|
132 | 152 | ### Lock-freedom |
133 | | -- Lock-freedom: A process suspended while using the queue should not prevent other processes from making progress using the queue. |
134 | 153 |
|
135 | | -<details> |
136 | | - <summary>Caution - Lock-freedom of dependencies</summary> |
137 | | - A lock-free algorithm often <em>assumes</em> that some synchronization primitive is lock-free. This depends on the target platform and during implementation, the library used. Care must be taken to avoid accidental non-lock-free operation usage. |
138 | | -</details> |
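| 154 | +A process that is suspended or delayed while using the queue must not prevent other processes from making progress. Note: lock-freedom depends on the underlying synchronization primitives being lock-free on the target platform and in the library used. |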
| 155 | + |
| 156 | +### Performance |
| 157 | + |
| 158 | +Minimize latency and maximize throughput for target workloads. |
139 | 159 |
|
140 | 160 | ### Scalability |
141 | | -- Scalability: The performance gain for `queue` and `enqueue` should scale with the number of threads on the target platform. |
| 161 | + |
| 162 | +Throughput should scale with process count. |