# `Distributed Jacobian Solver SYCL/MPI` Sample

The `Distributed Jacobian Solver SYCL/MPI` sample demonstrates the use of GPU-aware MPI-3 one-sided communications available in the Intel® MPI Library.
## Purpose
The sample demonstrates an actual use case (a Jacobi solver) for MPI-3 one-sided communications, which allow the compute kernel and the communications to overlap. It illustrates how to use host- and device-initiated one-sided communication with SYCL kernels.

## Prerequisites
| Optimized for | Description
|:--- |:---
| OS | Linux*
| Hardware | 4th Generation Intel® Xeon® Scalable processors <br> Intel® Data Center GPU Max Series
| Software | Intel® MPI Library 2021.11

## Key Implementation Details
This sample implements a well-known distributed 2D Jacobi solver with 1D data distribution. The sample uses Intel® MPI [GPU Support](https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/current/gpu-support.html).
The sample has three variants demonstrating different approaches to the Jacobi solver.
### `Data layout description`
The data layout is a 2D grid of size (Nx+2) x (Ny+2), distributed across MPI processes along the Y-axis.
The first and last rows/columns are constant and used for boundary conditions.
Each process handles an Nx x (Ny/comm_size) subarray.
```
Left border Right border
| | | | | |X| |X| <- First row of i+1 subarray from the previous iteration used for calculation
| | V | | ------------------------------------------------
------------------------------------------------
Bottom border-> |X| |X|
```
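
To make the 1D decomposition concrete, the following is a minimal sketch of how each rank could size and allocate its local block, including the halo/boundary rows and the two constant border columns. It is not taken from the sample sources; `Nx`, `Ny`, and `rows_per_rank` are illustrative names and the sizes are placeholders.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, comm_size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

    // Interior size of the global grid; the full grid is (Nx+2) x (Ny+2)
    // once the constant boundary rows/columns are added.
    const int Nx = 1024;
    const int Ny = 1024;

    // 1D distribution along the Y-axis: each rank owns Ny / comm_size rows
    // (this sketch assumes Ny is divisible by comm_size).
    const int rows_per_rank = Ny / comm_size;
    const int first_global_row = 1 + rank * rows_per_rank; // used when mapping local rows to global coordinates

    // Local storage: owned rows plus one halo/boundary row above and below,
    // and the two constant border columns on the left and right.
    std::vector<double> local((rows_per_rank + 2) * (Nx + 2), 0.0);

    MPI_Finalize();
    return 0;
}
```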
### `01_jacobian_host_mpi_one-sided`
This program demonstrates a baseline implementation of the distributed Jacobi solver. In this sample you will see the basic idea of the algorithm, as well as how to implement the halo exchange using the MPI-3 one-sided primitives required for this solver.

The solver is an iterative algorithm: each iteration first recalculates the border values, then transfers them to the neighbor processes, which use them in the next iteration. Each process recalculates its internal point values for the next iteration in parallel with the communication. After a number of iterations, the algorithm reports norm values for validation purposes. A minimal code sketch of the halo exchange follows the sequence diagram below.

```mermaid
sequenceDiagram
    participant COMM as Communication
    participant GC as GPU compute
    loop Solver: batch iterations
        loop Solver: single iteration
            APP ->>+ HC: Calculate values on the edges
            HC ->>- APP: edge values
            HC ->> HC: Main compute loop
            HC ->>- APP: Updated internal points
            APP ->> COMM: RMA window synchronization
            COMM ->>- APP: RMA synchronization completion
        end
        APP ->>+ HC: start compute of local norm
        HC ->>- APP: local norm value
```
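
The halo exchange the diagram refers to can be written with MPI-3 one-sided primitives roughly as below. This is a minimal sketch rather than the sample's actual code: it assumes the local layout described in the data-layout section, a window created over the local array with a displacement unit of `sizeof(double)`, and `MPI_Win_fence` synchronization (the sample's synchronization scheme may differ).

```cpp
#include <mpi.h>

// Minimal sketch of a one-sided halo exchange for the 1D decomposition above.
// Assumes: `local` holds (rows_per_rank + 2) rows of (Nx + 2) doubles, where
// row 0 and row rows_per_rank + 1 are the halo rows, and `win` was created
// over this buffer with disp_unit == sizeof(double).
void halo_exchange(double *local, int Nx, int rows_per_rank,
                   int rank, int comm_size, MPI_Win win) {
    const int row = Nx + 2;
    const int up   = (rank > 0)             ? rank - 1 : MPI_PROC_NULL;
    const int down = (rank < comm_size - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Win_fence(0, win);
    // Push my first owned row into the bottom halo row of the rank above.
    MPI_Put(&local[1 * row], row, MPI_DOUBLE, up,
            (MPI_Aint)(rows_per_rank + 1) * row, row, MPI_DOUBLE, win);
    // Push my last owned row into the top halo row of the rank below.
    MPI_Put(&local[rows_per_rank * row], row, MPI_DOUBLE, down,
            0, row, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);
}
```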
### `02_jacobian_device_mpi_one-sided_gpu_aware`
This program demonstrates how the same algorithm can be modified to add GPU offload capability. The program comes in two versions: OpenMP and SYCL. It illustrates how device memory can be passed directly to MPI one-sided primitives. In particular, device memory can be passed to the `MPI_Win_create` call to create an RMA window placed on a device. Aside from device RMA-window placement, device memory can also be passed to the `MPI_Put`/`MPI_Get` primitives as a target or origin buffer.

```mermaid
sequenceDiagram
    participant GC as GPU compute
    participant COMM as Communication
    loop Solver: batch iterations
        loop Solver: single iteration
            APP ->>+ GC: Calculate values on the edges
            GC ->>- APP: edge values
            GC ->> GC: Main compute loop
            GC ->>- APP: Updated internal points
            APP ->> COMM: RMA window synchronization
            COMM ->>- APP: RMA synchronization completion
        end
        APP ->>+ GC: start compute of local norm
        GC ->>- APP: local norm value
    end
```
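
As a hedged illustration of the GPU-aware path described above, and not the sample's actual source, the sketch below allocates the working array in device memory with SYCL and passes that device pointer directly to `MPI_Win_create`; the sizes are placeholders, and it assumes GPU support is enabled in the Intel® MPI Library as described in the GPU Support documentation linked earlier.

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, comm_size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

    sycl::queue q{sycl::gpu_selector_v};

    // Illustrative sizes; the real sample derives these from its input.
    const int Nx = 1024, rows_per_rank = 256;
    const size_t count = (size_t)(rows_per_rank + 2) * (Nx + 2);

    // Allocate the working array directly in device memory.
    double *d_data = sycl::malloc_device<double>(count, q);

    // Expose the device allocation through an RMA window; with GPU-aware MPI
    // this device pointer can also be used as an MPI_Put/MPI_Get buffer.
    MPI_Win win;
    MPI_Win_create(d_data, count * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // ... Jacobi iterations with SYCL kernels and one-sided halo exchange ...

    MPI_Win_free(&win);
    sycl::free(d_data, q);
    MPI_Finalize();
    return 0;
}
```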
> **Note**: Only contiguous MPI datatypes are supported.
---
**NOTE**
Intel® MPI Library 2021.13 is the minimum version required to run this sample.
Intel® MPI Library 2021.14 or later is the recommended version to run this sample.

---

This program demonstrates how to initiate one-sided communications directly from the offloaded code. The Intel® MPI Library allows calls to some communication primitives directly from offloaded code (SYCL or OpenMP). In contrast to the prior example, this one demonstrates the usage of one-sided communications with notification (an extension of the MPI-4.1 standard).
To enable device-initiated communications, you must set an extra environment variable: `I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1`.
If you receive an error message, troubleshoot the problem using the Diagnostics Utility for Intel® oneAPI Toolkits.

Device-initiated communications require that you set an extra environment variable: `I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1`.

If everything worked, the Jacobi solver started an iterative computation for the defined number of iterations. By default, the sample reports norm values after every 10 computation iterations and reports the overall solver time at the end.
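
The README does not spell out how the reported norm is computed; as an assumption-laden illustration only, a typical choice is a global L2 norm built from per-rank partial sums and combined with `MPI_Allreduce`, along these lines:

```cpp
#include <mpi.h>
#include <cmath>

// Illustrative only: a global L2 norm of the difference between two local
// arrays, combined across ranks with MPI_Allreduce. The sample's actual
// norm definition may differ.
double global_norm(const double *a, const double *b, size_t n, MPI_Comm comm) {
    double local = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = a[i] - b[i];
        local += d * d;
    }
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return std::sqrt(global);
}
```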