
Commit 4449c28

[Libraries/MPI] Samples refactoring
1 parent 726a357 commit 4449c28

14 files changed: +1125 -823 lines changed

Libraries/MPI/jacobian_solver/README.md

Lines changed: 129 additions & 0 deletions
@@ -29,16 +29,97 @@ This sample implements a well-known distributed 2D Jacobian solver with 1D data

The sample has three variants demonstrating different approaches to the Jacobi solver.

### Data layout description

The data layout is a 2D grid of size (Nx+2) x (Ny+2), distributed across MPI processes along the Y-axis.
The first and last row/column are constant and used for boundary conditions.
Each process handles an Nx x (Ny/comm_size) subarray.

```
            Left border                                 Right border
                 |                                            |
                 v                                            v
                ------------------------------------------------
Top border ---> |X|                                          |X|
                ------------------------------------------------
                | |                    /\                    | |
                | |                     |                    | |
                | |                     |                    | |
                | |                     |                    | |    ------------------------------------------------
                | |                     |                    | |    |X|                                          |X| <- Last row of i-1 subarray from the previous iteration, used for calculation
                | |                     |                    | |....------------------------------------------------
                | |<--------- Nx x Ny array ---------------->| |    |                                              |
                | |                     |                    | |    |            i-th process subarray             |
                | |                     |                    | |    |            Nx x (Ny/comm_size)               |
                | |                     |                    | |    |                                              |
                | |                     |                    | |....------------------------------------------------
                | |                     |                    | |    |X|                                          |X| <- First row of i+1 subarray from the previous iteration, used for calculation
                | |                     V                    | |    ------------------------------------------------
                ------------------------------------------------
Bottom border-> |X|                                          |X|
                ------------------------------------------------
```
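
As a rough illustration of this decomposition (a minimal sketch, not the sample's source: `Nx`, `Ny`, `rows_per_rank`, and `subarray` are placeholder names, and `Ny` is assumed to be divisible by the number of ranks), each rank can derive and allocate its local piece like this:

```c
/* Sketch of the 1D decomposition along Y; names and sizes are placeholders. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int Nx = 1024, Ny = 1024;   /* interior grid size, Ny assumed divisible by comm_size */
    int rank, comm_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

    /* Each rank owns Ny/comm_size interior rows plus two halo rows that
     * mirror the neighbours' edge rows (or hold the global boundary). */
    int rows_per_rank = Ny / comm_size;

    /* Local storage: (rows_per_rank + 2) x (Nx + 2), halo rows/columns included. */
    double *subarray = calloc((size_t)(rows_per_rank + 2) * (Nx + 2), sizeof(double));

    /* ... solver iterations go here ... */

    free(subarray);
    MPI_Finalize();
    return 0;
}
```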

### `01_jacobian_host_mpi_one-sided`

This program demonstrates the baseline implementation of the distributed Jacobian solver. In this sample you will see the basic idea of the algorithm, as well as how to implement the halo exchange using the MPI-3 one-sided primitives required for this solver.

The solver is an iterative algorithm: each iteration first recalculates the border values, then transfers them to the neighbor processes, where they are used in the next iteration. In parallel with this communication, each process recalculates the internal point values for the next iteration. After a number of iterations, the algorithm reports NORM values for validation purposes.

```mermaid
sequenceDiagram
    participant APP as Application
    participant HC as Host compute
    participant COMM as Communication
    participant GC as GPU compute

    loop Solver: batch iterations
        loop Solver: single iteration
            APP ->>+ HC: Calculate values on the edges
            HC ->>- APP: edge values
            APP ->>+ COMM: transfer data to neighbours using MPI_Put
            APP ->>+ HC: Recalculate internal points
            HC ->> HC: Main compute loop
            HC ->>- APP: Updated internal points
            APP ->> COMM: RMA window synchronization
            COMM ->>- APP: RMA synchronization completion
        end
        APP ->>+ HC: start compute of local norm
        HC ->>- APP: local norm value
        APP ->>+ COMM: Collect global norm using MPI_Reduce
        COMM ->>- APP: global norm value
    end
```
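
The halo exchange shown above maps naturally onto an RMA window that exposes each rank's subarray. The following is a minimal sketch, not the sample's source: the buffer layout and names are assumptions, and fence-based synchronization is used for brevity, whereas the actual sample may synchronize the window differently.

```c
/* Sketch of one halo exchange over an existing RMA window `win` that exposes
 * the (rows + 2) x (Nx + 2) local subarray of every rank. */
#include <mpi.h>

void halo_exchange(const double *local, int rows, int Nx,
                   int rank, int comm_size, MPI_Win win)
{
    int row_len = Nx + 2;
    int up   = (rank > 0)             ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < comm_size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Win_fence(0, win);

    /* My first computed row becomes the bottom halo row (index rows + 1) of
     * the rank above; my last computed row becomes the top halo row (index 0)
     * of the rank below. Puts to MPI_PROC_NULL are no-ops at the borders. */
    MPI_Put(&local[1 * row_len], row_len, MPI_DOUBLE,
            up,   (MPI_Aint)(rows + 1) * row_len, row_len, MPI_DOUBLE, win);
    MPI_Put(&local[(size_t)rows * row_len], row_len, MPI_DOUBLE,
            down, 0, row_len, MPI_DOUBLE, win);

    /* Internal points can be recalculated here, overlapping with the
     * transfers, before the closing synchronization. */
    MPI_Win_fence(0, win);
}
```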

### `02_jacobian_device_mpi_one-sided_gpu_aware`

This program demonstrates how the same algorithm can be modified to add GPU offload capability. The program comes in two versions: OpenMP and SYCL. It illustrates how device memory can be passed directly to MPI one-sided primitives. In particular, device memory may be passed to the `MPI_Win_create` call to create an RMA window placed on the device. Aside from device RMA-window placement, device memory can also be passed to the `MPI_Put`/`MPI_Get` primitives as a target or origin buffer.

```mermaid
sequenceDiagram
    participant APP as Application
    participant HC as Host compute
    participant GC as GPU compute
    participant COMM as Communication

    loop Solver: batch iterations
        loop Solver: single iteration
            APP ->>+ GC: Calculate values on the edges
            GC ->>- APP: edge values
            APP ->>+ COMM: transfer data to neighbours using MPI_Put
            APP ->>+ GC: Recalculate internal points
            GC ->> GC: Main compute loop
            GC ->>- APP: Updated internal points
            APP ->> COMM: RMA window synchronization
            COMM ->>- APP: RMA synchronization completion
        end
        APP ->>+ GC: start compute of local norm
        GC ->>- APP: local norm value
        APP ->>+ COMM: Collect global norm using MPI_Reduce
        COMM ->>- APP: global norm value
    end
```

> **Note**: Only contiguous MPI datatypes are supported.
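
As a sketch of the GPU-aware path (an illustration built on assumptions, not the sample's OpenMP or SYCL source: the buffer size, names, and allocation scheme are placeholders), device memory obtained from `omp_target_alloc` can be handed directly to `MPI_Win_create`:

```c
/* Sketch: place the RMA window in device memory (GPU-aware path).
 * Requires an OpenMP-offload-capable compiler and a GPU-aware Intel MPI setup. */
#include <mpi.h>
#include <omp.h>

MPI_Win create_device_window(MPI_Aint bytes)
{
    int dev = omp_get_default_device();

    /* Allocate the subarray (including halo rows) directly on the device. */
    void *dev_buf = omp_target_alloc((size_t)bytes, dev);

    /* The device pointer is passed to MPI_Win_create, so the RMA window is
     * placed in GPU memory; device pointers may likewise be used as origin
     * or target buffers in MPI_Put/MPI_Get. */
    MPI_Win win;
    MPI_Win_create(dev_buf, bytes, (int)sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    /* A real program would also keep dev_buf, and eventually call
     * MPI_Win_free and omp_target_free to release both resources. */
    return win;
}
```
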
### `03_jacobian_device_mpi_one-sided_device_initiated`
@@ -54,12 +135,60 @@ This program demonstrates how to initiate one-sided communications directly from

To enable device-initiated communications, you must set an extra environment variable: `I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1`.

```mermaid
sequenceDiagram
    participant APP as Application
    participant HC as Host compute
    participant GC as GPU compute
    participant COMM as Communication

    loop Solver: batch iterations
        APP ->>+ GC: Start fused kernel
        loop Solver: single iteration
            GC ->> GC: Calculate values on the edges
            GC ->>+ COMM: transfer data to neighbours using MPI_Put
            GC ->> GC: Recalculate internal points
            GC ->> COMM: RMA window synchronization
            COMM ->>- GC: RMA synchronization completion
        end
        GC ->>- APP: Fused kernel completion
        APP ->>+ GC: start compute of local norm
        GC ->>- APP: local norm value
        APP ->>+ COMM: Collect global norm using MPI_Reduce
        COMM ->>- APP: global norm value
    end
```
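
Schematically, the fused kernel moves the whole iteration loop, including the one-sided MPI calls, into offloaded code. The sketch below is only an assumption of how such a region could be organized with OpenMP offload (compute steps omitted, names invented); the exact device-initiated API usage may differ, so refer to the sample sources.

```c
/* Schematic fused kernel: device-initiated MPI_Put and window synchronization
 * issued from an offloaded region. Illustrative only; requires
 * I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1 and Intel MPI GPU offload support. */
#include <mpi.h>

void fused_iterations(double *dev_buf, int rows, int Nx, int up, int down,
                      MPI_Win win, int batch)
{
    int row_len = Nx + 2;

    #pragma omp target is_device_ptr(dev_buf)   /* one fused kernel for `batch` iterations */
    for (int it = 0; it < batch; ++it) {
        /* 1. recalculate edge rows of dev_buf (omitted) */

        /* 2. device-initiated halo exchange */
        MPI_Put(&dev_buf[1 * row_len], row_len, MPI_DOUBLE,
                up,   (MPI_Aint)(rows + 1) * row_len, row_len, MPI_DOUBLE, win);
        MPI_Put(&dev_buf[(size_t)rows * row_len], row_len, MPI_DOUBLE,
                down, 0, row_len, MPI_DOUBLE, win);

        /* 3. recalculate internal points (omitted) */

        /* 4. window synchronization issued from the device */
        MPI_Win_fence(0, win);
    }
}
```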

### `04_jacobian_device_mpi_one-sided_device_initiated_notify`

This program demonstrates how to initiate one-sided communications directly from the offloaded code. The Intel® MPI Library allows calls to some communication primitives directly from the offloaded code (SYCL or OpenMP). In contrast to the prior example, this one demonstrates the usage of one-sided communications with notification (an extension of the MPI-4.1 standard).

To enable device-initiated communications, you must set an extra environment variable: `I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1`.

```mermaid
sequenceDiagram
    participant APP as Application
    participant HC as Host compute
    participant GC as GPU compute
    participant COMM as Communication

    loop Solver: batch iterations
        APP ->>+ GC: Start fused kernel
        loop Solver: single iteration
            GC ->> GC: Calculate values on the edges
            GC ->>+ COMM: transfer data to neighbours using MPI_Put_notify
            GC ->> GC: Recalculate internal points
            COMM -->>- GC: notification from the remote rank
        end
        GC ->>- APP: Fused kernel completion
        APP ->>+ GC: start compute of local norm
        GC ->>- APP: local norm value
        APP ->>+ COMM: Collect global norm using MPI_Reduce
        COMM ->>- APP: global norm value
    end
```

## Build the `Distributed Jacobian Solver SYCL/MPI` Sample

> **Note**: If you have not already done so, set up your CLI

Libraries/MPI/jacobian_solver/src/01_jacobian_host_mpi_one-sided/GNUmakefile

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,8 @@ debug: CFLAGS += -O0 -g
debug: CXXFLAGS += -O0 -g
debug: $(example)

$(example): ../include/common.h

% : %.c
	$(CC) $(CFLAGS) $(INCLUDES) -o $@ $< $(LDFLAGS)
