
Commit 1511744

Merge branch 'cuda-aware' of https://github.com/tharittk/pylops-mpi into cuda-aware

2 parents 2cdb8f7 + a317a88

8 files changed (+936, -144 lines)

Makefile

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ doc_nccl:
 	rm tutorials/*_cupy.py tutorials/*_nccl.py
 
 docupdate:
-	cd docs && make html && cd ..
+	cd docs && NCCL_PYLOPS_MPI=0 make html && cd ..
 
 servedoc:
 	$(PYTHON) -m http.server --directory docs/build/
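The updated docupdate target builds the documentation with NCCL_PYLOPS_MPI=0, i.e. with the NCCL backend switched off. As a rough illustration (not part of this commit) of how user code can honour the same switch, the sketch below gates NCCL initialization on that variable; initialize_nccl_comm is assumed from the pylops-mpi NCCL utilities and may differ in your version:

import os

# Hypothetical sketch: respect the NCCL_PYLOPS_MPI convention used in the
# Makefile above (0 means "do not use NCCL").
use_nccl = os.environ.get("NCCL_PYLOPS_MPI", "1") != "0"

if use_nccl:
    # assumed helper from pylops-mpi's NCCL utilities
    from pylops_mpi.utils._nccl import initialize_nccl_comm
    nccl_comm = initialize_nccl_comm()
else:
    nccl_comm = None  # fall back to plain MPI communication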

docs/source/api/index.rst

Lines changed: 9 additions & 0 deletions
@@ -118,6 +118,15 @@ Utils
    local_split
 
 
+.. currentmodule:: pylops_mpi.basicoperators.MatrixMult
+
+.. autosummary::
+   :toctree: generated/
+
+   block_gather
+   local_block_split
+   active_grid_comm
+
 .. currentmodule:: pylops_mpi.utils
 
 .. autosummary::
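The three helpers added to the API index are exercised in the new examples further down this commit; a minimal sketch of how they fit together is shown here (the signature of block_gather is not visible in this diff and is only referenced as an assumption in a comment):

import numpy as np
from mpi4py import MPI
from pylops_mpi.basicoperators.MatrixMult import active_grid_comm, local_block_split

N, K = 6, 6
base_comm = MPI.COMM_WORLD

# Restrict the communicator to a square grid of active ranks, as done in the
# examples of this commit.
comm, rank, row_id, col_id, is_active = active_grid_comm(base_comm, N, K)

if is_active:
    # Slice selecting this rank's 2D block of an (N, K) matrix.
    A = np.arange(N * K, dtype=np.float32).reshape(N, K)
    A_local = A[local_block_split((N, K), rank, comm)]
    print(f"rank {rank} owns a block of shape {A_local.shape}")
    # block_gather (assumed) would reassemble such blocks into the full matrix;
    # see the generated API documentation for its exact signature.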

docs/source/gpu.rst

Lines changed: 45 additions & 42 deletions
@@ -150,67 +150,70 @@ one MPI process. In fact, minor communications like those dealing with array-rel
 The CuPy and NCCL backend is in active development, with many examples not yet in the docs.
 You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
 
-
-Supports for NCCL Backend
-----------------------------
-In the following, we provide a list of modules (i.e., operators and solvers)
-and their current status in terms of support for the 3 different communication
-backends:
+Supports for CuPy and NCCL
+--------------------------
+In the following, we provide a list of modules (i.e., operators and solvers) with their current status (available on CPU+MPI,
+GPU+MPI, and GPU+NCCL):
 
 .. list-table::
    :widths: 50 25 25 25
    :header-rows: 1
 
    * - Operator/method
      - CPU
-     - GPU with MPI
-     - GPU with NCCL
+     - GPU+MPI
+     - GPU+NCCL
   * - :class:`pylops_mpi.DistributedArray`
-     - ✅
-     - ✅
-     - ✅
+     - ✅
+     - ✅
+     - ✅
   * - :class:`pylops_mpi.basicoperators.MPIMatrixMult`
-     - ✅
+     - ✅
      - 🔴
      - 🔴
   * - :class:`pylops_mpi.basicoperators.MPIVStack`
-     - ✅
-     - ✅
-     - ✅
+     - ✅
+     - ✅
+     - ✅
   * - :class:`pylops_mpi.basicoperators.MPIHStack`
-     - ✅
-     - ✅
-     - ✅
+     - ✅
+     - ✅
+     - ✅
   * - :class:`pylops_mpi.basicoperators.MPIBlockDiag`
-     - ✅
-     - ✅
-     - ✅
-   * - :class:`pylops_mpi.basicoperators.MPIGradient`
-     - ✅
-     - ✅
-     - ✅
+     - ✅
+     - ✅
+     - ✅
   * - :class:`pylops_mpi.basicoperators.MPIFirstDerivative`
+     - ✅
+     - ✅
      - ✅
      - ✅
      - ✅
   * - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
-     - ✅
-     - ✅
-     - ✅
+     - ✅
+     - ✅
+     - ✅
   * - :class:`pylops_mpi.basicoperators.MPILaplacian`
-     - ✅
-     - ✅
-     - ✅
-   * - :class:`pylops_mpi.signalprocessing.Fredhoml1`
-     - ✅
-     - ✅
-     - ✅
+     - ✅
+     - ✅
+     - ✅
+   * - :class:`pylops_mpi.basicoperators.MPIGradient`
+     - ✅
+     - ✅
+     - ✅
+   * - :class:`pylops_mpi.signalprocessing.MPIFredhoml1`
+     - ✅
+     - ✅
+     - ✅
+   * - :class:`pylops_mpi.waveeqprocessing.MPIMDC`
+     - ✅
+     - ✅
+     - ✅
   * - :class:`pylops_mpi.optimization.basic.cg`
-     - ✅
-     - ✅
-     - ✅
+     - ✅
+     - ✅
+     - ✅
   * - :class:`pylops_mpi.optimization.basic.cgls`
-     - ✅
-     - ✅
-     - ✅
-
+     - ✅
+     - ✅
+     - ✅
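The GPU+MPI column in the table above corresponds to running the same operators on CuPy-backed arrays communicated through a CUDA-aware MPI. A minimal sketch, not part of this commit, assuming the engine keyword of pylops_mpi.DistributedArray selects the local array module as described in the GPU documentation:

import cupy as cp
from mpi4py import MPI
import pylops_mpi

comm = MPI.COMM_WORLD
n_local = 10

# DistributedArray backed by a CuPy array on each rank (GPU+MPI column);
# the engine keyword is assumed here to pick the local array module.
x = pylops_mpi.DistributedArray(global_shape=comm.Get_size() * n_local,
                                base_comm=comm, engine="cupy")
x[:] = cp.ones(n_local, dtype=x.dtype)

# Operators marked with a green check above can then be applied to x, with
# communication handled by MPI (or NCCL, for the GPU+NCCL column).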

examples/plot_matrixmult.py

Lines changed: 13 additions & 14 deletions
@@ -1,13 +1,13 @@
 r"""
-Distributed Matrix Multiplication
-=================================
+Distributed Matrix Multiplication - Block-row-column decomposition
+==================================================================
 This example shows how to use the :py:class:`pylops_mpi.basicoperators.MPIMatrixMult`
-operator to perform matrix-matrix multiplication between a matrix :math:`\mathbf{A}`
-blocked over rows (i.e., blocks of rows are stored over different ranks) and a
-matrix :math:`\mathbf{X}` blocked over columns (i.e., blocks of columns are
-stored over different ranks), with equal number of row and column blocks.
-Similarly, the adjoint operation can be peformed with a matrix :math:`\mathbf{Y}`
-blocked in the same fashion of matrix :math:`\mathbf{X}`.
+operator with ``kind='blocked'`` to perform matrix-matrix multiplication between
+a matrix :math:`\mathbf{A}` blocked over rows (i.e., blocks of rows are stored
+over different ranks) and a matrix :math:`\mathbf{X}` blocked over columns
+(i.e., blocks of columns are stored over different ranks), with an equal number
+of row and column blocks. Similarly, the adjoint operation can be performed with
+a matrix :math:`\mathbf{Y}` blocked in the same fashion as matrix :math:`\mathbf{X}`.
 
 Note that whilst the different blocks of the matrix :math:`\mathbf{A}` are directly
 stored in the operator on different ranks, the matrix :math:`\mathbf{X}` is

@@ -19,15 +19,16 @@
 
 """
 
-from matplotlib import pyplot as plt
 import math
 import numpy as np
 from mpi4py import MPI
+from matplotlib import pyplot as plt
 
 import pylops
 
 import pylops_mpi
 from pylops_mpi import Partition
+from pylops_mpi.basicoperators.MatrixMult import active_grid_comm, MPIMatrixMult
 
 plt.close("all")
 

@@ -39,8 +40,7 @@
 
 ###############################################################################
 # We are now ready to create the input matrices :math:`\mathbf{A}` of size
-# :math:`M \times k` :math:`\mathbf{A}` of size and :math:`\mathbf{A}` of size
-# :math:`K \times N`.
+# :math:`N \times K` and :math:`\mathbf{X}` of size :math:`K \times M`.
 N, K, M = 4, 4, 4
 A = np.random.rand(N * K).astype(dtype=np.float32).reshape(N, K)
 X = np.random.rand(K * M).astype(dtype=np.float32).reshape(K, M)

@@ -88,8 +88,7 @@
 # than the row or columm ranks.
 
 base_comm = MPI.COMM_WORLD
-comm, rank, row_id, col_id, is_active = \
-    pylops_mpi.MPIMatrixMult.active_grid_comm(base_comm, N, M)
+comm, rank, row_id, col_id, is_active = active_grid_comm(base_comm, N, M)
 print(f"Process {base_comm.Get_rank()} is {'active' if is_active else 'inactive'}")
 if not is_active: exit(0)
 

@@ -147,7 +146,7 @@
 ################################################################################
 # We are now ready to create the :py:class:`pylops_mpi.basicoperators.MPIMatrixMult`
 # operator and the input matrix :math:`\mathbf{X}`
-Aop = pylops_mpi.MPIMatrixMult(A_p, M, base_comm=comm, dtype="float32")
+Aop = MPIMatrixMult(A_p, M, base_comm=comm, dtype="float32", kind="block")
 
 col_lens = comm.allgather(my_own_cols)
 total_cols = np.sum(col_lens)
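To see what the blocked kind='block' layout computes, here is a plain-NumPy sketch (not from the commit) that simulates two ranks in a single process: each rank owns a block of rows of A and a block of columns of X, and the forward/adjoint products are assembled column block by column block; the actual inter-rank communication is internal to the operator.

import numpy as np

N, K, M, n_ranks = 4, 4, 4, 2
A = np.random.rand(N, K).astype(np.float32)
X = np.random.rand(K, M).astype(np.float32)

A_rows = np.array_split(A, n_ranks, axis=0)  # rank r stores a block of rows of A
X_cols = np.array_split(X, n_ranks, axis=1)  # rank r stores a block of columns of X

# Forward: rank r produces the column block Y[:, cols_r] = A @ X[:, cols_r].
Y_cols = [np.vstack(A_rows) @ Xc for Xc in X_cols]
Y = np.hstack(Y_cols)
assert np.allclose(Y, A @ X)

# Adjoint: rank r produces Xadj[:, cols_r] = A^H @ Y[:, cols_r], so the result
# is blocked over columns exactly like X.
Xadj = np.hstack([np.vstack(A_rows).conj().T @ Yc for Yc in Y_cols])
assert np.allclose(Xadj, A.conj().T @ Y)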

examples/plot_summamatrixmult.py

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@ (new file; full contents shown below)
r"""
Distributed Matrix Multiplication - SUMMA
=========================================
This example shows how to use the :py:class:`pylops_mpi.basicoperators.MPIMatrixMult`
operator with ``kind='summa'`` to perform matrix-matrix multiplication between
a matrix :math:`\mathbf{A}` distributed in 2D blocks across a square process
grid and matrices :math:`\mathbf{X}` and :math:`\mathbf{Y}` distributed in 2D
blocks across the same grid. Similarly, the adjoint operation can be performed
with a matrix :math:`\mathbf{Y}` distributed in the same fashion as matrix
:math:`\mathbf{X}`.

Note that whilst the different blocks of matrix :math:`\mathbf{A}` are directly
stored in the operator on different ranks, the matrices :math:`\mathbf{X}` and
:math:`\mathbf{Y}` are effectively represented by 1-D :py:class:`pylops_mpi.DistributedArray`
objects where the different blocks are flattened and stored on different ranks.
Note that to optimize communications, the ranks are organized in a square grid and
blocks of :math:`\mathbf{A}` and :math:`\mathbf{X}` are systematically broadcast
across different ranks during computation - see below for details.
"""

import math
import numpy as np
from mpi4py import MPI
from matplotlib import pyplot as plt

import pylops_mpi
from pylops import Conj
from pylops_mpi.basicoperators.MatrixMult import \
    local_block_split, MPIMatrixMult, active_grid_comm

plt.close("all")

###############################################################################
# We set the seed such that all processes can create the input matrices filled
# with the same random number. In practical applications, such matrices will be
# filled with data that is appropriate to the use-case.
np.random.seed(42)

###############################################################################
# We are now ready to create the input matrices for our distributed matrix
# multiplication example. We need to set up:
#
# - Matrix :math:`\mathbf{A}` of size :math:`N \times K` (the left operand)
# - Matrix :math:`\mathbf{X}` of size :math:`K \times M` (the right operand)
# - The result will be :math:`\mathbf{Y} = \mathbf{A} \mathbf{X}` of size
#   :math:`N \times M`
#
# We create here global test matrices with sequential values for easy verification:
#
# - Matrix A: Each element :math:`A_{i,j} = i \cdot K + j` (row-major ordering)
# - Matrix X: Each element :math:`X_{i,j} = i \cdot M + j`

N, M, K = 6, 6, 6
A_shape, x_shape, y_shape = (N, K), (K, M), (N, M)

A_data = np.arange(int(A_shape[0] * A_shape[1])).reshape(A_shape)
x_data = np.arange(int(x_shape[0] * x_shape[1])).reshape(x_shape)

################################################################################
# For distributed computation, we arrange processes in a square grid of size
# :math:`P' \times P'` where :math:`P' = \sqrt{P}` and :math:`P` is the total
# number of MPI processes. Each process will own a block of each matrix
# according to this 2D grid layout.

base_comm = MPI.COMM_WORLD
comm, rank, row_id, col_id, is_active = active_grid_comm(base_comm, N, M)
print(f"Process {base_comm.Get_rank()} is {'active' if is_active else 'inactive'}")

p_prime = math.isqrt(comm.Get_size())
print(f"Process grid: {p_prime} x {p_prime} = {comm.Get_size()} processes")

if rank == 0:
    print(f"Global matrix A shape: {A_shape} (N={A_shape[0]}, K={A_shape[1]})")
    print(f"Global matrix X shape: {x_shape} (K={x_shape[0]}, M={x_shape[1]})")
    print(f"Expected Global result Y shape: ({A_shape[0]}, {x_shape[1]}) = (N, M)")

################################################################################
# Next we must determine which block of each matrix each process should own.
#
# The 2D block distribution requires:
#
# - Process at grid position :math:`(i,j)` gets block
#   :math:`\mathbf{A}[i_{start}:i_{end}, j_{start}:j_{end}]`
# - Block sizes are approximately :math:`\lceil N/P' \rceil \times \lceil K/P' \rceil`
#   with edge processes handling remainder
#
# .. raw:: html
#
#    <div style="text-align: left; font-family: monospace; white-space: pre;">
#    <b>Example: 2x2 Process Grid with 6x6 Matrices</b>
#
#    Matrix A (6x6):                Matrix X (6x6):
#    ┌───────────┬───────────┐      ┌───────────┬───────────┐
#    │  0  1  2  │  3  4  5  │      │  0  1  2  │  3  4  5  │
#    │  6  7  8  │  9 10 11  │      │  6  7  8  │  9 10 11  │
#    │ 12 13 14  │ 15 16 17  │      │ 12 13 14  │ 15 16 17  │
#    ├───────────┼───────────┤      ├───────────┼───────────┤
#    │ 18 19 20  │ 21 22 23  │      │ 18 19 20  │ 21 22 23  │
#    │ 24 25 26  │ 27 28 29  │      │ 24 25 26  │ 27 28 29  │
#    │ 30 31 32  │ 33 34 35  │      │ 30 31 32  │ 33 34 35  │
#    └───────────┴───────────┘      └───────────┴───────────┘
#
#    Process (0,0): A[0:3, 0:3], X[0:3, 0:3]
#    Process (0,1): A[0:3, 3:6], X[0:3, 3:6]
#    Process (1,0): A[3:6, 0:3], X[3:6, 0:3]
#    Process (1,1): A[3:6, 3:6], X[3:6, 3:6]
#    </div>
#
A_slice = local_block_split(A_shape, rank, comm)
x_slice = local_block_split(x_shape, rank, comm)

################################################################################
# Extract the local portion of each matrix for this process
A_local = A_data[A_slice]
x_local = x_data[x_slice]

print(f"Process {rank}: A_local shape {A_local.shape}, X_local shape {x_local.shape}")
print(f"Process {rank}: A slice {A_slice}, X slice {x_slice}")

################################################################################

################################################################################
# We are now ready to create the SUMMA :py:class:`pylops_mpi.basicoperators.MPIMatrixMult`
# operator and the input matrix :math:`\mathbf{X}`

Aop = MPIMatrixMult(A_local, M, base_comm=comm, kind="summa", dtype=A_local.dtype)

x_dist = pylops_mpi.DistributedArray(
    global_shape=(K * M),
    local_shapes=comm.allgather(x_local.shape[0] * x_local.shape[1]),
    base_comm=comm,
    partition=pylops_mpi.Partition.SCATTER,
    dtype=x_local.dtype)
x_dist[:] = x_local.flatten()

################################################################################
# We can now apply the forward pass :math:`\mathbf{y} = \mathbf{Ax}` (which
# effectively implements a distributed matrix-matrix multiplication
# :math:`\mathbf{Y} = \mathbf{AX}`). Note :math:`\mathbf{Y}` is distributed in
# the same way as the input :math:`\mathbf{X}` in a block-block fashion.
y_dist = Aop @ x_dist

###############################################################################
# Next we apply the adjoint pass :math:`\mathbf{x}_{adj} = \mathbf{A}^H \mathbf{y}`
# (which effectively implements a distributed SUMMA matrix-matrix multiplication
# :math:`\mathbf{X}_{adj} = \mathbf{A}^H \mathbf{Y}`). Note that
# :math:`\mathbf{X}_{adj}` is again distributed in the same way as the input
# :math:`\mathbf{X}` in a block-block fashion.
xadj_dist = Aop.H @ y_dist

###############################################################################
# Finally, we show that the SUMMA :py:class:`pylops_mpi.basicoperators.MPIMatrixMult`
# operator can be combined with any other PyLops-MPI operator. We are going to
# apply here a conjugate operator to the output of the matrix multiplication.
Dop = Conj(dims=(A_local.shape[0], x_local.shape[1]))
DBop = pylops_mpi.MPIBlockDiag(ops=[Dop, ])
Op = DBop @ Aop
y1 = Op @ x_dist
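The broadcast pattern mentioned in the docstring above is the classical SUMMA algorithm. As a serial NumPy sketch (not from the commit), one can simulate a p x p grid in a single process and accumulate one block outer product per broadcast step:

import numpy as np

p = 2
N = K = M = 6
A = np.arange(N * K, dtype=np.float64).reshape(N, K)
B = np.arange(K * M, dtype=np.float64).reshape(K, M)

def blocks(mat, nsplit):
    # Split a matrix into an nsplit x nsplit grid of blocks.
    return [np.array_split(row, nsplit, axis=1)
            for row in np.array_split(mat, nsplit, axis=0)]

Ab, Bb, Cb = blocks(A, p), blocks(B, p), blocks(np.zeros((N, M)), p)

for k in range(p):          # one broadcast step per block column/row
    for i in range(p):
        for j in range(p):
            # In the parallel version Ab[i][k] is broadcast along grid row i
            # and Bb[k][j] along grid column j; here we simply index them.
            Cb[i][j] += Ab[i][k] @ Bb[k][j]

assert np.allclose(np.block(Cb), A @ B)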

pylops_mpi/LinearOperator.py

Lines changed: 0 additions & 1 deletion
@@ -76,7 +76,6 @@ def matvec(self, x: DistributedArray) -> DistributedArray:
 
         """
         M, N = self.shape
-
         if x.global_shape != (N,):
             raise ValueError("dimension mismatch")
 
