Skip to content

Commit bd4634a

Browse files
committed
doc: improve readme
1 parent f40f633 commit bd4634a

File tree

2 files changed

+54
-127
lines changed

2 files changed

+54
-127
lines changed

README.md

Lines changed: 54 additions & 127 deletions
Original file line numberDiff line numberDiff line change
@@ -1,162 +1,89 @@
1-
# Porting shared memory MPSC queues to distributed context using MPI-3 RMA
1+
# Non-Blocking Distributed MPSC Queues
22

33
![MPiSC](https://img.shields.io/badge/MPiSC-blue?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyTDIgN2wxMCA1IDEwLTV6TTIgMTdsMTAgNSAxMC01TTIgMTJsMTAgNSAxMC01Ii8+PC9zdmc+) ![Status](https://img.shields.io/badge/status-complete-brightgreen) [![Thesis](https://img.shields.io/badge/thesis-h--dna.github.io-informational)](https://h-dna.github.io/MPiSC/)
44

5-
This project ports lock-free Multiple-Producer Single-Consumer (MPSC) queue algorithms from shared-memory to distributed systems using MPI-3 Remote Memory Access (RMA).
6-
75
## Table of Contents
86

9-
- [Objective](#objective)
10-
- [Motivation](#motivation)
11-
- [Approach](#approach)
12-
- [Why MPI RMA?](#why-mpi-rma)
13-
- [Why MPI-3 RMA?](#why-mpi-3-rma)
14-
- [Hybrid MPI+MPI](#hybrid-mpimpi)
15-
- [Hybrid MPI+MPI+C++11](#hybrid-mpimpic11)
16-
- [Lock-Free MPI Porting](#lock-free-mpi-porting)
17-
- [Literature Review](#literature-review)
18-
- [Known Problems](#known-problems)
19-
- [Trends](#trends)
20-
- [Evaluation Strategy](#evaluation-strategy)
21-
- [Correctness](#correctness)
22-
- [Lock-Freedom](#lock-freedom)
23-
- [Performance](#performance)
24-
- [Scalability](#scalability)
7+
- [Abstract](#abstract)
8+
- [Motivation and Methodology](#motivation-and-methodology)
9+
- [Contributions](#contributions)
10+
- [Results](#results)
2511
- [Related](#related)
2612

27-
## Objective
28-
29-
- Survey shared-memory literature for lock-free, concurrent MPSC queue algorithms.
30-
- Port candidate algorithms to distributed contexts using MPI-3 RMA.
31-
- Optimize ports using MPI-3 SHM and the C++11 memory model.
32-
33-
Target characteristics:
34-
35-
| Dimension | Requirement |
36-
| ------------------- | ----------------------- |
37-
| Queue length | Fixed |
38-
| Number of producers | Multiple |
39-
| Number of consumers | Single |
40-
| Operations | `enqueue`, `dequeue` |
41-
| Progress guarantee | Lock-free |
42-
43-
## Motivation
44-
45-
Queues are fundamental to scheduling, event handling, and message buffering. Under high contention—such as multiple event sources writing simultaneously—a poorly designed queue becomes a scalability bottleneck. This holds for both shared-memory and distributed systems.
46-
47-
Shared-memory research has produced efficient, scalable, lock-free queue algorithms. Distributed computing literature largely ignores these algorithms due to differing programming models. MPI-3 RMA bridges this gap by enabling one-sided communication that closely mirrors shared-memory semantics. This project investigates whether porting shared-memory algorithms via MPI-3 RMA yields competitive distributed queues.
48-
49-
## Approach
50-
51-
We port lock-free queue algorithms using MPI-3 RMA, then optimize with MPI SHM (hybrid MPI+MPI) and C++11 atomics for intra-node communication.
52-
53-
### Why MPI RMA?
54-
55-
MPSC queues are *irregular* applications:
56-
57-
- Memory access patterns are dynamic.
58-
- Data locations are determined at runtime.
59-
60-
Traditional two-sided communication (`MPI_Send`/`MPI_Recv`) requires the receiver to anticipate requests—impractical when access patterns are unknown. MPI RMA allows one-sided communication where the initiator specifies all parameters.
61-
62-
### Why MPI-3 RMA?
63-
64-
MPI-3 introduces `MPI_Win_lock_all`, a non-collective operation for opening access epochs on process groups, enabling lock-free synchronization.
65-
66-
### Hybrid MPI+MPI
13+
## Abstract
6714

68-
Pure MPI ignores intra-node locality. MPI-3 SHM provides `MPI_Win_allocate_shared` for allocating shared memory windows among co-located processes. These windows use the unified memory model and can leverage both MPI and native synchronization. This exploits multi-core parallelism within nodes.
15+
Distributed applications such as the actor model and fan-out/fan-in pattern require MPSC queues that are both performant and fault-tolerant. We address the absence of non-blocking distributed MPSC queues by adapting LTQueue — a wait-free shared-memory MPSC queue — to distributed environments using MPI-3 RMA. We introduce three novel **wait-free** distributed MPSC queues: **dLTQueue**, **Slotqueue**, and **dLTQueueV2**. Evaluation on SuperMUC-NG and CoolMUC-4 shows ~2x better enqueue throughput than the existing AMQueue while providing stronger fault tolerance.
6916

70-
### Hybrid MPI+MPI+C++11
17+
## Motivation and Methodology
7118

72-
C++11 atomics outperform MPI synchronization for intra-node communication. Using C++11 within shared memory windows optimizes the intra-node path.
19+
### The Problem
7320

74-
### Lock-Free MPI Porting
21+
MPSC queues are essential for **irregular applications** — programs with unpredictable, data-dependent memory access patterns:
7522

76-
MPI-3 RMA enables lock-free implementations:
23+
- **Actor model**: Each actor maintains a mailbox (MPSC queue) receiving messages from other actors
24+
- **Fan-out/fan-in**: Worker nodes enqueue results to an aggregation node for processing
7725

78-
- `MPI_Win_lock_all` / `MPI_Win_unlock_all` manage access epochs.
79-
- MPI atomic operations (`MPI_Fetch_and_op`, `MPI_Compare_and_swap`) provide synchronization.
26+
These patterns demand queues that are both performant and fault-tolerant. A slow or crashed producer should not block the entire system.
8027

81-
## Literature Review
28+
### Gap in the Literature
8229

83-
### Known Problems
30+
**Shared-memory** has several non-blocking MPSC queues: LTQueue, DQueue, WRLQueue, and Jiffy. However, our analysis reveals critical flaws in most:
8431

85-
* **ABA problem**
32+
| Queue | Issue |
33+
|-------|-------|
34+
| DQueue | Incorrect ABA solution and unsafe memory reclamation |
35+
| WRLQueue | Actually **blocking** — dequeuer waits for all enqueuers |
36+
| Jiffy | Insufficient memory reclamation, not truly wait-free |
37+
| **LTQueue** | **Correct** — uses LL/SC for ABA, proper memory reclamation |
8638

87-
A pointer is reused after deallocation, causing a CAS to incorrectly succeed.
39+
**Distributed** has only one MPSC queue: **AMQueue**. Despite claiming lock-freedom, it is actually **blocking** — the dequeuer must wait for all enqueuers to finish. A single slow enqueuer halts the entire system. ([Confirmed by the original author](assets/amqueue-blocking-evidence.jpg))
8840

89-
Solutions: Monotonic counters, hazard pointers.
41+
### Our Approach
9042

91-
* **Safe memory reclamation**
43+
We adapt **LTQueue** — the only correct shared-memory MPSC queue — to distributed environments using MPI-3 RMA one-sided communication.
9244

93-
Premature deallocation while other threads hold references.
45+
**Key challenge**: LTQueue relies on LL/SC (Load-Link/Store-Conditional) to solve the ABA problem, but LL/SC is unavailable in MPI.
9446

95-
Solutions: Hazard pointers, epoch-based reclamation.
47+
**Our solution**: Replace LL/SC with CAS + unique timestamps. Each value is tagged with a monotonically increasing version number, preventing ABA without LL/SC.
9648

97-
* **Empty queue contention**
49+
**Key techniques**:
50+
- **SPSC-per-enqueuer**: Each producer maintains a local queue, eliminating producer contention
51+
- **Unique timestamps**: Solves ABA via monotonic version numbers
52+
- **Double-refresh**: Bounds retries to two per node, ensuring wait-freedom
9853

99-
Concurrent `enqueue` and `dequeue` on an empty queue can conflict.
54+
## Contributions
10055

101-
Solutions: Sentinel node to separate head and tail pointers.
56+
### Findings
10257

103-
* **Intermediate state from slow processes**
58+
- **3 of 4** shared-memory MPSC queues (DQueue, WRLQueue, Jiffy) have correctness or progress issues
59+
- **AMQueue**, the only distributed MPSC queue, is blocking despite claims of lock-freedom
60+
- **LTQueue** is the only correct candidate for distributed adaptation
10461

105-
A delayed process may leave the queue in an inconsistent state mid-operation.
62+
### Novel Algorithms
10663

107-
Solutions: Helping—other processes complete the pending operation.
64+
| Algorithm | Progress | Enqueue | Dequeue |
65+
|-----------|----------|---------|---------|
66+
| **dLTQueue** | Wait-free | O(log n) remote | O(log n) remote |
67+
| **Slotqueue** | Wait-free | O(1) remote | O(1) remote, O(n) local |
68+
| **dLTQueueV2** | Wait-free | O(1) remote | O(1) remote, O(log n) local |
10869

109-
* **Intermediate state from failed processes**
70+
All algorithms are **linearizable** with no dynamic memory allocation.
11071

111-
A crashed process may leave the queue permanently inconsistent.
72+
## Results
11273

113-
Solutions: Helping mechanisms that can complete any pending operation.
74+
Benchmarked on [SuperMUC-NG](https://doku.lrz.de/supermuc-ng-10745965.html) (6000+ nodes) and [CoolMUC-4](https://doku.lrz.de/coolmuc-4-10746415.html) (100+ nodes):
11475

115-
* **Help mechanism rationale**
76+
| Metric | Our Queues vs AMQueue |
77+
|--------|----------------------|
78+
| Enqueue throughput | **~2x better** |
79+
| Dequeue throughput | 3-10x worse |
80+
| Fault tolerance | **Wait-free** (vs blocking) |
11681

117-
Multi-step operations can leave the queue in intermediate states. Rather than blocking until consistency is restored, processes detect and complete pending operations. Implementation:
118-
119-
1. Detect intermediate state
120-
2. Attempt completion via CAS
121-
122-
A failed CAS indicates another process already helped; retry is unnecessary.
123-
124-
### Trends
125-
126-
- Fast-path optimization
127-
- Lock-free fast path with wait-free fallback
128-
- Replace CAS with FAA or load/store where possible
129-
- Contention reduction
130-
- Per-producer local buffers
131-
- Elimination and backoff (for MPMC)
132-
- Cache-aware design
133-
134-
## Evaluation Strategy
135-
136-
We focus on the following criteria, in the order of decreasing importance:
137-
* Correctness
138-
* Lock-freedom
139-
* Performance & Scalability
140-
141-
### Correctness
142-
143-
- Linearizability
144-
- ABA-freedom
145-
- Safe memory reclamation
146-
147-
### Lock-Freedom
148-
149-
No process may block system-wide progress. Note: lock-freedom depends on underlying primitives being lock-free on the target platform.
150-
151-
### Performance
152-
153-
Minimize latency and maximize throughput for target workloads.
154-
155-
### Scalability
156-
157-
Throughput should scale with process count.
82+
**Trade-off**: Stronger fault tolerance at the cost of dequeue performance.
15883

15984
## Related
16085

161-
- [dLTQueue: A Non-Blocking Distributed-Memory Multi-Producer Single-Consumer Queue](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue)
162-
- [Slotqueue: A Wait-Free Distributed Multi-Producer Single-Consumer Queue with Constant Remote Operations](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations)
86+
1. **dLTQueue** - FDSE 2025 ([ResearchGate](https://www.researchgate.net/publication/395381301_dLTQueue_A_Non-Blocking_Distributed-Memory_Multi-Producer_Single-Consumer_Queue))
87+
2. **Slotqueue** - NPC 2025 ([ResearchGate](https://www.researchgate.net/publication/395448251_Slotqueue_A_Wait-Free_Distributed_Multi-Producer_Single-Consumer_Queue_with_Constant_Remote_Operations))
88+
89+
[Full thesis](https://h-dna.github.io/MPiSC/)
583 KB
Loading

0 commit comments

Comments
 (0)