Skip to content

Commit cf55251

Browse files
committed
doc: improve readme
1 parent 83e687a commit cf55251

File tree

1 file changed

+115
-94
lines changed

1 file changed

+115
-94
lines changed

README.md

Lines changed: 115 additions & 94 deletions
Original file line numberDiff line numberDiff line change
@@ -1,141 +1,162 @@
11
# Porting shared memory MPSC queues to distributed context using MPI-3 RMA
22

3+
![MPiSC](https://img.shields.io/badge/MPiSC-blue?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyTDIgN2wxMCA1IDEwLTV6TTIgMTdsMTAgNSAxMC01TTIgMTJsMTAgNSAxMC01Ii8+PC9zdmc+) ![Status](https://img.shields.io/badge/status-complete-brightgreen) [![Thesis](https://img.shields.io/badge/thesis-h--dna.github.io-informational)](https://h-dna.github.io/MPiSC/)
4+
5+
This project ports lock-free Multiple-Producer Single-Consumer (MPSC) queue algorithms from shared-memory to distributed systems using MPI-3 Remote Memory Access (RMA).
6+
7+
## Table of Contents
8+
9+
- [Publications](#publications)
10+
- [Objective](#objective)
11+
- [Motivation](#motivation)
12+
- [Approach](#approach)
13+
- [Why MPI RMA?](#why-mpi-rma)
14+
- [Why MPI-3 RMA?](#why-mpi-3-rma)
15+
- [Hybrid MPI+MPI](#hybrid-mpimpi)
16+
- [Hybrid MPI+MPI+C++11](#hybrid-mpimpic11)
17+
- [Lock-free MPI porting](#lock-free-mpi-porting)
18+
- [Literature Review](#literature-review)
19+
- [Known Problems](#known-problems)
20+
- [Trends](#trends)
21+
- [Evaluation Strategy](#evaluation-strategy)
22+
- [Correctness](#correctness)
23+
- [Lock-freedom](#lock-freedom)
24+
- [Performance](#performance)
25+
- [Scalability](#scalability)
26+
27+
## Publications
28+
29+
- [dLTQueue: A Non-Blocking Distributed-Memory Multi-Producer Single-Consumer Queue](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue)
30+
- [Slotqueue: A Wait-Free Distributed Multi-Producer Single-Consumer Queue with Constant Remote Operations](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations)
31+
332
## Objective
433

5-
- Examination of the *shared-memory* literature to find potential *lock-free*, *concurrent*, *multiple-producer single-consumer* queue algorithms.
6-
- Use the new MPI-3 RMA capabilities to port potential lock-free *shared-memory* queue algorithms to distributed context.
7-
- Potentially optimize MPI RMA ports using MPI-3 SHM + C++11 memory model.
34+
- Survey shared-memory literature for lock-free, concurrent MPSC queue algorithms.
35+
- Port candidate algorithms to distributed contexts using MPI-3 RMA.
36+
- Optimize ports using MPI-3 SHM and the C++11 memory model.
837

9-
- Minimum required characteristics:
38+
Target characteristics:
1039

11-
| Dimension | Desired property |
40+
| Dimension | Requirement |
1241
| ------------------- | ----------------------- |
13-
| Queue length | Fixed length |
14-
| Number of producers | Many |
15-
| Number of consumers | One |
16-
| Operations | `queue`, `enqueue` |
17-
| Concurrency | Concurrent & Lock-free |
42+
| Queue length | Fixed |
43+
| Number of producers | Multiple |
44+
| Number of consumers | Single |
45+
| Operations | `enqueue`, `dequeue` |
46+
| Concurrency | Lock-free |
1847

1948
## Motivation
2049

21-
- Queue is the backbone data structures in many applications: scheduling, event handling, message bufferring. In these applications, the queue may be highly contended, for example, in event handling, there can be multiple sources of events & many consumers of events at the same time. If the queue has not been designed properly, it can become a bottleneck in a highly concurrent environment, adversely affecting the application's scalability. This sentiment also applies to queues in distributed contexts.
22-
- Within the context of shared-memory, there have been plenty of research and testing going into efficient, scalable & lock-free queue algorithms. This presents an opportunity to port these high-quality algorithms to the distributed context, albeit some inherent differences that need to be taken into consideration between the two contexts.
23-
- In the distributed literature, most of the proposed algorithms completely disregard the existing shared-memory algorithms, mostly due to the discrepancy between the programming model of shared memory and that of distributed computing. However, with MPI-3 RMA, the gap is bridged, and we can straightforwardly model shared memory application using MPI. This is why we investigate the porting approach & compare them with existing distributed queue algorithms.
50+
Queues are fundamental to scheduling, event handling, and message buffering. Under high contention—such as multiple event sources writing simultaneously—a poorly designed queue becomes a scalability bottleneck. This holds for both shared-memory and distributed systems.
51+
52+
Shared-memory research has produced efficient, scalable, lock-free queue algorithms. Distributed computing literature largely ignores these algorithms due to differing programming models. MPI-3 RMA bridges this gap by enabling one-sided communication that closely mirrors shared-memory semantics. This project investigates whether porting shared-memory algorithms via MPI-3 RMA yields competitive distributed queues.
2453

2554
## Approach
2655

27-
The porting approach we choose is to use MPI-3 RMA to port lock-free queue algorithms. We further optimize these ports using MPI SHM (or the so called MPI+MPI hybrid approach) and C++11 for shared memory synchronization.
56+
We port lock-free queue algorithms using MPI-3 RMA, then optimize with MPI SHM (hybrid MPI+MPI) and C++11 atomics for intra-node communication.
57+
58+
### Why MPI RMA?
59+
60+
MPSC queues are *irregular* applications:
61+
62+
- Memory access patterns are dynamic.
63+
- Data locations are determined at runtime.
64+
65+
Traditional two-sided communication (`MPI_Send`/`MPI_Recv`) requires the receiver to anticipate requests—impractical when access patterns are unknown. MPI RMA allows one-sided communication where the initiator specifies all parameters.
66+
67+
### Why MPI-3 RMA?
68+
69+
MPI-3 introduces `MPI_Win_lock_all`, a non-collective operation for opening access epochs on process groups, enabling lock-free synchronization.
70+
71+
### Hybrid MPI+MPI
72+
73+
Pure MPI ignores intra-node locality. MPI-3 SHM provides `MPI_Win_allocate_shared` for allocating shared memory windows among co-located processes. These windows use the unified memory model and can leverage both MPI and native synchronization. This exploits multi-core parallelism within nodes.
74+
75+
### Hybrid MPI+MPI+C++11
2876

29-
<details>
30-
<summary>Why MPI RMA?</summary>
31-
32-
MPSC queue belongs to the class of <i>irregular</i> applications, this means that:
33-
<ul>
34-
<li>Memory access pattern is not known.</li>
35-
<li>Data locations cannot be known in advance, it can change during execution.</li>
36-
</ul>
37-
38-
In other words, we cannot statically analyze where the data may be stored - data can be stored anywhere and we can only determine its location at runtime. This means the tradition message passing interface using <code>MPI_Send</code> and <code>MPI_Recv</code> is insufficient: Suppose at runtime, process <code>A</code> wants and knows to access a piece of data at <code>B</code>, then <code>A</code> must issue <code>MPI_Recv(B)</code>, but this requires <code>B</code> to anticipate that it should issue <code>MPI_Send(A, data)</code> and know that which data <code>A</code> actually wants. The latter issue can be worked around by having <code>A</code> issue <code>MPI_Send(B, data_descriptor)</code> first. Then, <code>B</code> must have waited for <code>MPI_Recv(A)</code>. However, because the memory access pattern is not known, <code>B</code> must anticipate that any other processes may want to access its data. It is possible but cumbersome.
39-
40-
MPI RMA is specifically designed to conveniently express irregular applications by having one side specify all it wants.
77+
C++11 atomics outperform MPI synchronization for intra-node communication. Using C++11 within shared memory windows optimizes the intra-node path.
4178

42-
</details>
79+
### Lock-free MPI porting
4380

44-
<details>
45-
<summary>Why MPI-3 RMA?</summary>
81+
MPI-3 RMA enables lock-free implementations:
4682

47-
MPI-3 improves the RMA API, providing the non-collective <code>MPI_Win_lock_all</code> for a process to open an access epoch on a group of processes. This allows for lock-free synchronization.
48-
</details>
83+
- `MPI_Win_lock_all` / `MPI_Win_unlock_all` manage access epochs.
84+
- MPI atomic operations (`MPI_Fetch_and_op`, `MPI_Compare_and_swap`) provide synchronization.
4985

50-
<details>
51-
<summary>Hybrid MPI+MPI</summary>
52-
The Pure MPI approach is oblivious to the fact that some MPI processes are on the same node, which causes some unnecessary overhead. MPI-3 introduces the MPI SHM API, allowing us to obtain a communicator containing processes on a single node. From this communicator, we can allocate a shared memory window using <code>MPI_Win_allocate_shared</code>. Hybrid MPI+MPI means that MPI is used for both intra-node and inter-node communication. This shared memory window follows the <em>unified memory model</em> and can be synchronized both using MPI facilities or any other alternatives. Hybrid MPI+MPI can take advantage of the many cores of current computer processors.
53-
</details>
86+
## Literature Review
5487

55-
<details>
56-
<summary>Hybrid MPI+MPI+C++11</summary>
57-
Within the shared memory window, C++11 synchronization facilities can be used and prove to be much more efficient than MPI. So incorporating C++11 can be thought of as an optimization step for intra-node communication.
58-
</details>
88+
### Known Problems
5989

60-
<details>
61-
<summary>How to perform an MPI port in a lock-free manner?</summary>
62-
63-
With MPI-3 RMA capabilities:
64-
<ul>
65-
<li>Use <code>MPI_Win_lock_all</code> and <code>MPI_Win_unlock_all</code> to open and end access epochs.</li>
66-
<li>Within an access epoch, MPI atomics are used.</li>
67-
</ul>
68-
</details>
90+
* **ABA problem**
6991

70-
## Literature review
92+
A pointer is reused after deallocation, causing a CAS to incorrectly succeed.
7193

72-
### Known problems
73-
- ABA problem.
94+
Solutions: Monotonic counters, hazard pointers.
7495

75-
Possible solutions: Monotonic counter, hazard pointer.
96+
* **Safe memory reclamation**
7697

77-
- Safe memory reclamation problem.
98+
Premature deallocation while other threads hold references.
7899

79-
Possible solutions: Hazard pointer.
100+
Solutions: Hazard pointers, epoch-based reclamation.
80101

81-
- Special case: empty queue - Concurrent `enqueue` and `dequeue` can conflict with each other.
102+
* **Empty queue contention**
82103

83-
Possible solutions: Dummy node to decouple head and tail.
104+
Concurrent `enqueue` and `dequeue` on an empty queue can conflict.
84105

85-
- A slow process performing `enqueue` and `dequeue` could leave the queue in an intermediate state.
106+
Solutions: Sentinel node to separate head and tail pointers.
86107

87-
Possible solutions:
88-
- Help mechanism: To be lock-free, the other processes can help out patching up the queue (do not wait).
108+
* **Intermediate state from slow processes**
89109

90-
- A dead process performing `enqueue` and `dequeue` could leave the queue broken.
91-
92-
Possible solutions:
93-
- Help mechanism: The other processes can help out patching up the queue.
110+
A delayed process may leave the queue in an inconsistent state mid-operation.
94111

95-
- Motivation for the help mechanism?
112+
Solutions: Helping—other processes complete the pending operation.
96113

97-
Why: If `enqueue` or `dequeue` needs to perform some updates on the queue to move it to a consistent state, then a suspended process may leave the queue in an intermediate state. The `enqueue` and `dequeue` should not wait until it sees a consistent state or else the algorithm is blocking. Rather, they should help the suspended process complete the operation.
114+
* **Intermediate state from failed processes**
98115

99-
Solutions often involve (1) detecting intermediate state (2) trying to patch.
116+
A crashed process may leave the queue permanently inconsistent.
100117

101-
Possible solutions:
102-
- Typically, updates are performed using CAS. If CAS fails, some state changes have occurred, we can detect if this is intermediary & try to perform another CAS to patch up the queue.
103-
Note that the patching CAS may fail in case the queue is just patched up, so looping until a successful CAS may not be necessary.
118+
Solutions: Helping mechanisms that can complete any pending operation.
119+
120+
* **Help mechanism rationale**
121+
122+
Multi-step operations can leave the queue in intermediate states. Rather than blocking until consistency is restored, processes detect and complete pending operations. Implementation:
123+
124+
1. Detect intermediate state
125+
2. Attempt completion via CAS
126+
127+
A failed CAS indicates another process already helped; retry is unnecessary.
104128

105129
### Trends
106130

107-
- Speed up happy paths.
108-
- The happy path can use lock-free algorithm and fall back to the wait-free algorithm. As lock-free algorithms are typically more efficient, this can lead to speedups.
109-
- Replacing CAS with simpler operations like FAA, load/store in the fast path.
110-
- Avoid contention: Enqueuers or dequeuers performing on a shared data structures can harm each other's progress.
111-
- Local buffers can be used at the enqueuers' side in MPSC queue so that enqueuers do not contend with each other.
112-
- Elimination + Backing off techniques in MPMC.
113-
- Cache-aware solutions.
131+
- Fast-path optimization
132+
- Lock-free fast path with wait-free fallback
133+
- Replace CAS with FAA or load/store where possible
134+
- Contention reduction
135+
- Per-producer local buffers
136+
- Elimination and backoff (for MPMC)
137+
- Cache-aware design
114138

115-
## Evaluation strategy
139+
## Evaluation Strategy
116140

117-
We need to evaluate at least 3 levels:
118-
- Theory verification: Prove that the algorithm possesses the desired properties.
119-
- Implementation verification: Even though theory is correct, implementation details nuances can affect the desired properties.
120-
- Static verification: *Verify* the source code + its dependencies.
121-
- Dynamic verification: *Verify* its behavior at runtime & *Benchmark*.
141+
We focus on the following criteria, in the order of decreasing importance:
142+
* Correctness
143+
* Lock-freedom
144+
* Performance & Scalability
122145

123146
### Correctness
124-
- Linearizability
125-
- No problematic ABA problem
126-
- Memory safety:
127-
- Safe memory reclamation
128147

129-
### Performance
130-
- Performance: The less time it takes to serve common workloads on the target platform the better.
148+
- Linearizability
149+
- ABA-freedom
150+
- Safe memory reclamation
131151

132152
### Lock-freedom
133-
- Lock-freedom: A process suspended while using the queue should not prevent other processes from making progress using the queue.
134153

135-
<details>
136-
<summary>Caution - Lock-freedom of dependencies</summary>
137-
A lock-free algorithm often <em>assumes</em> that some synchronization primitive is lock-free. This depends on the target platform and during implementation, the library used. Care must be taken to avoid accidental non-lock-free operation usage.
138-
</details>
154+
No process may block system-wide progress. Note: lock-freedom depends on underlying primitives being lock-free on the target platform.
155+
156+
### Performance
157+
158+
Minimize latency and maximize throughput for target workloads.
139159

140160
### Scalability
141-
- Scalability: The performance gain for `queue` and `enqueue` should scale with the number of threads on the target platform.
161+
162+
Throughput should scale with process count.

0 commit comments

Comments
 (0)