Commit 4ba57f6

[Libraries/MPI] Corrected samples comments and README
1 parent 4187aed commit 4ba57f6

12 files changed (+233, -228 lines)

Libraries/MPI/jacobian_solver/README.md

Lines changed: 27 additions & 20 deletions
@@ -1,4 +1,4 @@
-# `Distributed Jacobian Solver SYCL/MPI` Sample
+# `Distributed Jacobian Solver SYCL/MPI` sample
 
 The `Distributed Jacobian Solver SYCL/MPI` demonstrates using GPU-aware MPI-3, one-sided communications available in the Intel® MPI Library.
 
@@ -13,27 +13,27 @@ see the [Intel® MPI Library Documentation](https://www.intel.com/content/www/us
 
 ## Purpose
 
-The sample demonstrates an actual use case (Jacobian solver) for MPI-3 one-sided communications allowing to overlap compute kernel and communications. The sample illustrated how to use host- and device-initiated onesided communication with SYCL kernels.
+The sample demonstrates an actual use case (Jacobian solver) for MPI-3 one-sided communications allowing to overlap compute kernel and communications. The sample illustrates how to use host- and device-initiated one-sided communication with SYCL kernels.
 
 ## Prerequisites
 
 | Optimized for | Description
 |:--- |:---
 | OS | Linux*
-| Hardware | 4th Generation Intel® Xeon® Scalable Processors <br> Intel® Data Center GPU Max Series
+| Hardware | 4th Generation Intel® Xeon® Scalable processors <br> Intel® Data Center GPU Max Series
 | Software | Intel® MPI Library 2021.11
 
 ## Key Implementation Details
 
-This sample implements a well-known distributed 2D Jacobian solver with 1D data distribution. The sampple uses Intel® MPI [GPU Support](https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/current/gpu-support.html).
+This sample implements a well-known distributed 2D Jacobi solver with 1D data distribution. The sample uses Intel® MPI [GPU Support](https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/current/gpu-support.html).
 
 The sample has three variants demonstrating different approaches to the Jacobi solver.
 
 ### `Data layout description`
 
 The data layout is a 2D grid of size (Nx+2) x (Ny+2), distributed across MPI processes along the Y-axis.
-Where first and last row/column areconstant and used for boundary conditions.
-Each porcess handles Nx x (Ny/comm_size) subarray.
+The first and last rows/columns are constant and used for boundary conditions.
+Each process handles Nx x (Ny/comm_size) subarray.
 
 ```
 Left border Right border
@@ -46,14 +46,14 @@ Each porcess handles Nx x (Ny/comm_size) subarray.
 | | | | |
 | | | | |
 | | | | | ------------------------------------------------
-| | | | | |X| |X| <- Last row of of i-1 subarray from previous iterarion used for calculation
+| | | | | |X| |X| <- Last row of i-1 subarray from the previous iteration used for calculation
 | | | | |....................------------------------------------------------
 | |<--------- Nx x Ny array ---------------->| | | | | |
 | | | | | | | i-th process subarray | |
 | | | | | | | Nx x (Ny/comm_size) | |
 | | | | | | | | |
 | | | | |....................------------------------------------------------
-| | | | | |X| |X| <- First row of of i+1 subarray from previous iterarion used for calculation
+| | | | | |X| |X| <- First row of i+1 subarray from the previous iteration used for calculation
 | | V | | ------------------------------------------------
 ------------------------------------------------
 Bottom border-> |X| |X|
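
For readers following the layout above: each rank owns an `Nx x (Ny/comm_size)` block plus one halo row on each side, and its upper and lower neighbours are simply the adjacent ranks. A minimal C sketch of that bookkeeping, assuming `Ny` divides evenly by `comm_size` as the README's formula implies (the struct and function names here are illustrative, not the sample's `InitSubarryAndWindows`):

```c
/* Illustrative 1D decomposition along the Y-axis, as described in the README.
 * Not the sample's actual code. */
#include <mpi.h>

struct decomp {
    int x_size;        /* Nx: columns owned by every rank                  */
    int y_size;        /* Ny / comm_size: interior rows owned by this rank */
    int up_neighbour;  /* rank owning the rows above, or MPI_PROC_NULL     */
    int dn_neighbour;  /* rank owning the rows below, or MPI_PROC_NULL     */
};

static struct decomp make_decomp(int Nx, int Ny, int rank, int comm_size)
{
    struct decomp d;
    d.x_size = Nx;
    d.y_size = Ny / comm_size;   /* assumes Ny % comm_size == 0 */
    d.up_neighbour = (rank == 0) ? MPI_PROC_NULL : rank - 1;
    d.dn_neighbour = (rank == comm_size - 1) ? MPI_PROC_NULL : rank + 1;
    return d;
}
```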
@@ -62,9 +62,9 @@ Each porcess handles Nx x (Ny/comm_size) subarray.
 
 ### `01_jacobian_host_mpi_one-sided`
 
-This program demonstrates baseline implementation of the distributed Jacobian solver. In this sample you will see the basic idea of the algorithm, as well as how to implement the halo-exchange using MPI-3 one-sided primitives required for this solver.
+This program demonstrates a baseline implementation of the distributed Jacobian solver. In this sample you will see the basic idea of the algorithm, as well as how to implement the halo-exchange using MPI-3 one-sided primitives required for this solver.
 
-The solver is an iterative algorithm where each iteration of the program recalculates border values first, then border values transfer to neighbor processes, which are used in next iteration of algorithm. Each process recalculate internal points values for the next iteration in parallel with communication. After a number of iterations, the algorithm reports NORM values for validation purposes.
+The solver is an iterative algorithm where each iteration of the program recalculates border values first, then border values transfer to neighbor processes, which are used in next iteration of algorithm. Each process recalculates internal point values for the next iteration in parallel with communication. After a number of iterations, the algorithm reports norm values for validation purposes.
 
 ```mermaid
 sequenceDiagram
@@ -73,7 +73,7 @@ sequenceDiagram
 participant COMM as Communication
 participant GC as GPU compute
 
-loop Solever: batch iterations
+loop Solver: batch iterations
 loop Solver: single iteration
 APP ->>+ HC: Calculate values on the edges
 HC ->>- APP: edge values
@@ -82,7 +82,7 @@ sequenceDiagram
 HC ->> HC: Main compute loop
 HC ->>- APP: Updated internal points
 APP ->> COMM: RMA window synchronization
-COMM ->>- APP: RMA syncronization completion
+COMM ->>- APP: RMA synchronization completion
 end
 APP ->>+ HC: start compute of local norm
 HC ->>- APP: local norm value
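
The halo-exchange step described in this section maps onto two `MPI_Put` calls followed by an `MPI_Win_fence`, as the `mpi3_onesided_jacobian.c` diff further down shows. A condensed, self-contained sketch of just that step; it assumes a local layout of `y_size + 2` rows of `row_size` doubles with halo rows `0` and `y_size + 1`, and a window displacement unit of `sizeof(double)` (the sample's own `XY_2_IDX`/`ROW_SIZE` macros hide these details):

```c
/* Schematic halo exchange for the host (01) variant: push this rank's first
 * and last computed rows into the neighbours' RMA windows, then synchronize. */
#include <mpi.h>

static void halo_exchange(double *out, int x_size, int y_size,
                          int up_neighbour, int dn_neighbour,
                          int row_size, MPI_Win win)
{
    if (up_neighbour != MPI_PROC_NULL) {
        /* First owned row -> bottom halo row of the upper neighbour. */
        MPI_Put(&out[1 * row_size + 1], x_size, MPI_DOUBLE,
                up_neighbour, (MPI_Aint)((y_size + 1) * row_size + 1),
                x_size, MPI_DOUBLE, win);
    }
    if (dn_neighbour != MPI_PROC_NULL) {
        /* Last owned row -> top halo row of the lower neighbour. */
        MPI_Put(&out[y_size * row_size + 1], x_size, MPI_DOUBLE,
                dn_neighbour, (MPI_Aint)(0 * row_size + 1),
                x_size, MPI_DOUBLE, win);
    }
    /* Close the access/exposure epoch: all puts complete before the next
     * iteration reads the halo rows, matching the diagram above. */
    MPI_Win_fence(0, win);
}
```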
@@ -93,7 +93,7 @@ sequenceDiagram
 
 ### `02_jacobian_device_mpi_one-sided_gpu_aware`
 
-This program demonstrates how the same algorithm can be modified to add GPU offload capability. The program comes in two versions: OpenMP and SYCL. The program illustrates how device memory can be passed directly to MPI one-sided primitives. In particular, device memory may be passed to `MPI_Win_create` call to create an RMA Window placed on a device. Also, aside from a device RMA-window placement, device memory can be passed to `MPI_Put`/`MPI_Get` primitives as a target or origin buffer.
+This program demonstrates how the same algorithm can be modified to add GPU offload capability. The program comes in two versions: OpenMP and SYCL. The program illustrates how device memory can be passed directly to MPI one-sided primitives. In particular, device memory can be passed to `MPI_Win_create` call to create an RMA Window placed on a device. Also, aside from a device RMA-window placement, device memory can be passed to `MPI_Put`/`MPI_Get` primitives as a target or origin buffer.
 
 ```mermaid
 sequenceDiagram
@@ -102,7 +102,7 @@ sequenceDiagram
 participant GC as GPU compute
 participant COMM as Communication
 
-loop Solever: batch iterations
+loop Solver: batch iterations
 loop Solver: single iteration
 APP ->>+ GC: Calculate values on the edges
 GC ->>- APP: edge values
@@ -111,7 +111,7 @@ sequenceDiagram
 GC ->> GC: Main compute loop
 GC ->>- APP: Updated internal points
 APP ->> COMM: RMA window synchronization
-COMM ->>- APP: RMA syncronization completion
+COMM ->>- APP: RMA synchronization completion
 end
 APP ->>+ GC: start compute of local norm
 GC ->>- APP: local norm value
@@ -120,7 +120,7 @@ sequenceDiagram
 end
 ```
 
-> **Note**: Only contigouous MPI datatypes are supported.
+> **Note**: Only contiguous MPI datatypes are supported.
 
 ### `03_jacobian_device_mpi_one-sided_device_initiated`
 
@@ -149,7 +149,7 @@ sequenceDiagram
 GC ->>+ COMM: transfer data to neighbours using MPI_Put
 GC ->> GC: Recalculate internal points
 GC ->> COMM: RMA window synchronization
-COMM ->>- GC: RMA syncronization completion
+COMM ->>- GC: RMA synchronization completion
 end
 GC ->>- APP: Fused kernel completion
 APP ->>+ GC: start compute of local norm
@@ -162,7 +162,14 @@ sequenceDiagram
 
 ### `04_jacobian_device_mpi_one-sided_device_initiated_notify`
 
-This program demonstrates how to initiate one-sided communications directly from the offloaded code. The Intel® MPI Library allows calls to some communication primitives directly from the offloaded code (SYCL or OpenMP). In contrast to prior example, this one demonstrates usage of one-sided communications with notification (extention of MPI-4.1 standard).
+---
+**NOTE**
+Intel® MPI Library 2021.13 is minimaly required version to run this sample.
+Intel® MPI Library 2021.14 or later is recommended version to run this sample.
+
+---
+
+This program demonstrates how to initiate one-sided communications directly from the offloaded code. The Intel® MPI Library allows calls to some communication primitives directly from the offloaded code (SYCL or OpenMP). In contrast to the prior example, this one demonstrates the usage of one-sided communications with notification (extension of MPI-4.1 standard).
 
 To enable device-initiated communications, you must set an extra environment variable: `I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1`.
 
@@ -239,9 +246,9 @@ If you receive an error message, troubleshoot the problem using the Diagnostics
 mpirun -n 2 -genv I_MPI_OFFLOAD=1 ./src/02_jacobian_device_mpi_one-sided_gpu_aware/mpi3_onesided_jacobian_gpu_sycl
 ```
 
-Device-initiated communications requires that you set an extra environment variable: `I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1`.
+Device-initiated communications require to set an extra environment variable: `I_MPI_OFFLOAD_ONESIDED_DEVICE_INITIATED=1`.
 
-If everything worked, the Jacobi solver started an iterative computation for defined number of iterations. By default, the sample reports NORM values after every 10 computation iterations and reports the overall solver time at the end.
+If everything worked, the Jacobi solver started an iterative computation for a defined number of iterations. By default, the sample reports norm values after every 10 computation iterations and reports the overall solver time at the end.
 
 ## Example Output
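
As a companion to the `02_jacobian_device_mpi_one-sided_gpu_aware` description in this README, here is a minimal OpenMP-offload C sketch of the GPU-aware idea: device memory backs the RMA window, so `MPI_Put`/`MPI_Get` can use it as an origin or target buffer. This is an illustration only (the buffer size and names are made up, not taken from the sample), and it assumes a run line like the `mpirun ... -genv I_MPI_OFFLOAD=1` command shown above:

```c
/* Sketch: allocate the working buffer on the GPU and expose it as an RMA
 * window, so one-sided transfers can move data device-to-device. */
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    const MPI_Aint n = 1024;                 /* illustrative window length */
    int dev = omp_get_default_device();
    double *dev_buf = omp_target_alloc(n * sizeof(double), dev);

    MPI_Win win;
    MPI_Win_create(dev_buf, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* ... MPI_Put/MPI_Get calls with dev_buf as origin or target buffer ... */
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    omp_target_free(dev_buf, dev);
    MPI_Finalize();
    return 0;
}
```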

Libraries/MPI/jacobian_solver/src/01_jacobian_host_mpi_one-sided/GNUmakefile

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ INCLUDES =
 LDFLAGS = -lm
 CFLAGS = -Wall -Wformat-security -Werror=format-security
 CXXFLAGS = -Wall -Wformat-security -Werror=format-security
-# Use icx from DPC++ oneAPI toolkit to compile. Please source DPCPP's vars.sh before compilation.
+# Use icx from the DPC++ oneAPI toolkit to compile. Please source DPCPP's vars.sh before compilation.
 CC = mpiicx
 CXX = mpiicpx
 example = mpi3_onesided_jacobian

Libraries/MPI/jacobian_solver/src/01_jacobian_host_mpi_one-sided/mpi3_onesided_jacobian.c

Lines changed: 19 additions & 19 deletions
@@ -18,48 +18,48 @@ int main(int argc, char *argv[])
 {
     double t_start;
     struct subarray my_subarray = { };
-    /* Here we uses double buffering to allow overlap of compute and communication phase.
-     * Odd iterations use buffs[0] as input and buffs[1] as output and vice versa.
-     * Same scheme is used for MPI_Win objects.
+    /* Here we use double buffering to allow the overlap of the compute and communication phases.
+     * Odd iterations use buffs[0] as input and buffs[1] as output, and vice versa.
+     * The same scheme is used for MPI_Win objects.
      */
     double *buffs[2] = { NULL, NULL };
    MPI_Win win[2] = { MPI_WIN_NULL, MPI_WIN_NULL };
 
-    /* Initialization of runtime and initial state of data */
+    /* Initialization of runtime and initial state of data. */
     MPI_Init(&argc, &argv);
-    /* Initialize subarray owned by current process
-     * and create RMA-windows for MPI-3 one-sided communications.
+    /* Initialize the subarray owned by the current process
+     * and create RMA windows for MPI-3 one-sided communications.
      * - For this sample, we use host memory for buffers and windows.
-     * - Sample uses MPI_Win_fence for synchronization.
+     * - This sample uses MPI_Win_fence for synchronization.
      */
     InitSubarryAndWindows(&my_subarray, buffs, win, "host", false);
 
-    /* Start RMA exposure epoch */
+    /* Start the RMA exposure epoch. */
     MPI_Win_fence(0, win[0]);
     MPI_Win_fence(0, win[1]);
 
     const int row_size = ROW_SIZE(my_subarray);
-    /* Amount of iterations to perform between norm calculations */
+    /* Number of iterations to perform between norm calculations. */
     const int iterations_batch = (NormIteration <= 0) ? Niter : NormIteration;
 
-    /* Timestamp start time to measure overall execution time */
+    /* Timestamp the start time to measure overall execution time. */
     BEGIN_PROFILING
-    /* Main computation loop */
+    /* Main computation loop. */
     for (int passed_iters = 0; passed_iters < Niter; passed_iters += iterations_batch) {
-        /* Perfrom a batch of iterations before checking norm */
+        /* Perform a batch of iterations before checking the norm. */
         for (int k = 0; k < iterations_batch; ++k) {
             int i = passed_iters + k;
             double *in = buffs[i % 2];
             double *out = buffs[(1 + i) % 2];
             MPI_Win current_win = win[(i + 1) % 2];
 
-            /* Calculate values on borders to initiate communications early */
+            /* Calculate values on the borders to initiate communications early. */
             for (int column = 0; column < my_subarray.x_size; ++column) {
                 RECALCULATE_POINT(out, in, column, 0, row_size);
                 RECALCULATE_POINT(out, in, column, my_subarray.y_size - 1, row_size);
             }
 
-            /* Perform 1D halo-exchange with neighbours */
+            /* Perform 1D halo-exchange with neighbors. */
             if (my_subarray.up_neighbour != MPI_PROC_NULL) {
                 int idx = XY_2_IDX(0, 0, row_size);
                 MPI_Put(&out[idx], my_subarray.x_size, MPI_DOUBLE,
@@ -74,18 +74,18 @@ int main(int argc, char *argv[])
                         my_subarray.x_size, MPI_DOUBLE, current_win);
             }
 
-            /* Recalculate internal points in parallel with communication */
+            /* Recalculate internal points in parallel with communications. */
             for (int row = 1; row < my_subarray.y_size - 1; ++row) {
                 for (int column = 0; column < my_subarray.x_size; ++column) {
                     RECALCULATE_POINT(out, in, column, row, row_size);
                 }
             }
 
-            /* Ensure all communications are complete before next iteration */
+            /* Ensure all communications are complete before the next iteration. */
             MPI_Win_fence(0, current_win);
         }
 
-        /* Calculate norm value after given number of iterations */
+        /* Calculate the norm value after the given number of iterations. */
         if (NormIteration > 0) {
             double result_norm = 0.0;
             double norm = 0.0;
@@ -104,10 +104,10 @@ int main(int argc, char *argv[])
             }
         }
     }
-    /* Timestamp end time to measure overall execution time and report average compute time */
+    /* Timestamp the end time to measure overall execution time and report average compute time. */
     END_PROFILING
 
-    /* Close RMA exposure epoch and free resources */
+    /* Close the RMA exposure epoch and free resources. */
     MPI_Win_fence(0, win[0]);
     MPI_Win_fence(0, win[1]);
     MPI_Win_free(&win[1]);
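
The `RECALCULATE_POINT` and `XY_2_IDX` macros used in the hunks above are defined elsewhere in the sample and are not touched by this commit. For readers following the loop structure, a purely illustrative assumption of what such a 5-point Jacobi update typically expands to (not the sample's actual definitions):

```c
/* Assumed layout: row_size doubles per row with border/halo cells included,
 * so owned cell (x, y) maps to index (y + 1) * row_size + (x + 1).
 * The update itself is the standard 5-point Jacobi average. */
#define XY_2_IDX(x, y, row_size)  (((y) + 1) * (row_size) + (x) + 1)

#define RECALCULATE_POINT(out, in, x, y, row_size)                    \
    do {                                                              \
        int idx_ = XY_2_IDX(x, y, row_size);                          \
        (out)[idx_] = 0.25 * ((in)[idx_ - 1] + (in)[idx_ + 1] +       \
                              (in)[idx_ - (row_size)] +               \
                              (in)[idx_ + (row_size)]);               \
    } while (0)
```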

Libraries/MPI/jacobian_solver/src/02_jacobian_device_mpi_one-sided_gpu_aware/GNUmakefile

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ INCLUDES =
 LDFLAGS = -lm
 CFLAGS = -qopenmp -fopenmp-targets=spir64 -Wall -Wformat-security -Werror=format-security
 CXXFLAGS = -fsycl -Wall -Wformat-security -Werror=format-security
-# Use icx from DPC++ oneAPI toolkit to compile. Please source DPCPP's vars.sh before compilation.
+# Use icx from the DPC++ oneAPI toolkit to compile. Please source DPCPP's vars.sh before compilation.
 CC = mpiicx
 CXX = mpiicpx
 example = mpi3_onesided_jacobian_gpu_openmp mpi3_onesided_jacobian_gpu_sycl
