Commit 256b98e

Merge branch 'main' into actual-SUMMA
2 parents 25f30bc + 1cacf8b commit 256b98e

16 files changed: +393 −152 lines

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -1,3 +1,19 @@
+
+# 0.3.0
+* Added `pylops_mpi.basicoperators.MPIMatrixMult` operator.
+* Added NCCL support to all operators in `pylops_mpi.basicoperators`
+  and `pylops_mpi.signalprocessing`.
+* Added `base_comm_nccl` to the constructor of `pylops_mpi.DistributedArray`
+  to enable the NCCL communication backend.
+* Added `pylops_mpi.utils.benchmark` subpackage providing methods
+  to decorate and mark functions / class methods to measure their execution
+  time.
+* Added `pylops_mpi.utils._nccl` subpackage implementing methods
+  for the NCCL communication backend.
+* Added `pylops_mpi.utils.deps` subpackage to safely import `nccl`.
+* Fixed partition in the creation of the output distributed array in
+  `pylops_mpi.signalprocessing.MPIFredholm1`.
+
 # 0.2.0
 - Added support for using CuPy arrays with PyLops-MPI.
 - Introduced the `pylops_mpi.signalprocessing.MPIFredholm1` and `pylops_mpi.waveeqprocessing.MPIMDC` operators.

Makefile

Lines changed: 5 additions & 1 deletion
@@ -29,7 +29,7 @@ dev-install:
 
 dev-install_nccl:
 	make pipcheck
-	$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 $(PIP) install -e .
+	$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 && $(PIP) install -e .
 
 install_conda:
 	conda env create -f environment.yml && conda activate pylops_mpi && pip install .
@@ -49,6 +49,10 @@ lint:
 tests:
 	mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
 
+# assuming NUM_PROCESSES <= number of gpus available
+tests_gpu:
+	export TEST_CUPY_PYLOPS=1 && mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
+
 # assuming NUM_PROCESSES <= number of gpus available
 tests_nccl:
 	mpiexec -n $(NUM_PROCESSES) pytest tests_nccl/ --with-mpi
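
The new tests_gpu target reuses the CPU test suite as-is: it exports TEST_CUPY_PYLOPS=1 on the same shell line as mpiexec, and each test module inspects that variable at import time to choose between NumPy and CuPy (the tests/test_blockdiag.py diff below shows the full pattern). A minimal sketch of the gate, including the alias trick that keeps the test bodies backend-agnostic:

    import os

    # Any non-zero value of TEST_CUPY_PYLOPS selects the CuPy backend.
    if int(os.environ.get("TEST_CUPY_PYLOPS", 0)):
        import cupy as np  # alias CuPy as np so the test bodies stay unchanged
        backend = "cupy"
    else:
        import numpy as np
        backend = "numpy"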

docs/source/changelog.rst

Lines changed: 22 additions & 0 deletions
@@ -3,6 +3,27 @@
 Changelog
 =========
 
+
+Version 0.3.0
+-------------
+
+*Released on: 05/08/2025*
+
+* Added :class:`pylops_mpi.basicoperators.MPIMatrixMult` operator.
+* Added NCCL support to all operators in :mod:`pylops_mpi.basicoperators`
+  and :mod:`pylops_mpi.signalprocessing`.
+* Added ``base_comm_nccl`` to the constructor of :class:`pylops_mpi.DistributedArray`
+  to enable the NCCL communication backend.
+* Added :mod:`pylops_mpi.utils.benchmark` subpackage providing methods
+  to decorate and mark functions / class methods to measure their execution
+  time.
+* Added :mod:`pylops_mpi.utils._nccl` subpackage implementing methods
+  for the NCCL communication backend.
+* Added :mod:`pylops_mpi.utils.deps` subpackage to safely import ``nccl``.
+* Fixed partition in the creation of the output distributed array in
+  :class:`pylops_mpi.signalprocessing.MPIFredholm1`.
+
+
 Version 0.2.0
 -------------
 
@@ -14,6 +35,7 @@ Version 0.2.0
 * Added a dottest function to perform dot tests on PyLops-MPI operators.
 * Created a tutorial for Multi-Dimensional Deconvolution (MDD).
 
+
 Version 0.1.0
 -------------
 
docs/source/contributing.rst

Lines changed: 17 additions & 2 deletions
@@ -69,6 +69,18 @@ that both the old and new tests pass successfully:
 
    >> make tests
 
+If you run PyLops-MPI with GPUs, you may also do:
+
+.. code-block:: bash
+
+   >> make tests_gpu
+
+Additionally, if you have an NCCL-enabled environment, you may also check:
+
+.. code-block:: bash
+
+   >> make tests_nccl
+
 4. Make sure the ``examples`` python scripts are executed using 3 processes without any errors:
 
 .. code-block:: bash
@@ -123,8 +135,11 @@ Project structure
 This repository is organized as follows:
 
 * **pylops_mpi**: Python library containing various mpi linear operators.
-* **tests**: Set of tests using pytest-mpi.
+* **tests**: Set of tests using pytest-mpi for both CPU and GPU.
+* **tests_nccl**: Set of tests for NCCL-enabled environments using pytest-mpi.
 * **testdata**: Sample datasets used in tests and documentation.
 * **docs**: Sphinx documentation.
 * **examples**: Set of python script examples for each mpi linear operator to be embedded in documentation using sphinx-gallery.
-* **tutorials**: Set of python script tutorials to be embedded in documentation using sphinx-gallery.
+* **tutorials**: Set of python script tutorials (NumPy & MPI) to be embedded in documentation using sphinx-gallery.
+* **tutorials_cupy**: Same set of scripts as above but with CuPy & MPI.
+* **tutorials_nccl**: Same set of scripts as above but with CuPy & NCCL.

pylops_mpi/DistributedArray.py

Lines changed: 17 additions & 6 deletions
@@ -694,14 +694,25 @@ def _compute_vector_norm(self, local_array: NDArray,
             recv_buf = self._allreduce_subcomm(ncp.count_nonzero(local_array, axis=axis).astype(ncp.float64))
         elif ord == ncp.inf:
             # Calculate max followed by max reduction
-            recv_buf = self._allreduce_subcomm(ncp.max(ncp.abs(local_array), axis=axis).astype(ncp.float64),
-                                               recv_buf, op=MPI.MAX)
-            recv_buf = ncp.squeeze(recv_buf, axis=axis)
+            # TODO (tharitt): currently CuPy + MPI does not work well with buffered communication, particularly
+            # with the MAX and MIN operators; here we copy the array to the CPU, transfer it, and copy the result back to the GPU
+            send_buf = ncp.max(ncp.abs(local_array), axis=axis).astype(ncp.float64)
+            if self.engine == "cupy" and self.base_comm_nccl is None:
+                recv_buf = self._allreduce_subcomm(send_buf.get(), recv_buf.get(), op=MPI.MAX)
+                recv_buf = ncp.asarray(ncp.squeeze(recv_buf, axis=axis))
+            else:
+                recv_buf = self._allreduce_subcomm(send_buf, recv_buf, op=MPI.MAX)
+                recv_buf = ncp.squeeze(recv_buf, axis=axis)
         elif ord == -ncp.inf:
             # Calculate min followed by min reduction
-            recv_buf = self._allreduce_subcomm(ncp.min(ncp.abs(local_array), axis=axis).astype(ncp.float64),
-                                               recv_buf, op=MPI.MIN)
-            recv_buf = ncp.squeeze(recv_buf, axis=axis)
+            # TODO (tharitt): see the comment above in the infinity norm
+            send_buf = ncp.min(ncp.abs(local_array), axis=axis).astype(ncp.float64)
+            if self.engine == "cupy" and self.base_comm_nccl is None:
+                recv_buf = self._allreduce_subcomm(send_buf.get(), recv_buf.get(), op=MPI.MIN)
+                recv_buf = ncp.asarray(ncp.squeeze(recv_buf, axis=axis))
+            else:
+                recv_buf = self._allreduce_subcomm(send_buf, recv_buf, op=MPI.MIN)
+                recv_buf = ncp.asarray(ncp.squeeze(recv_buf, axis=axis))
 
         else:
             recv_buf = self._allreduce_subcomm(ncp.sum(ncp.abs(ncp.float_power(local_array, ord)), axis=axis))
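
The change above works around buffered MPI reductions on device memory: for a CuPy array with no NCCL communicator, the partial max/min is copied to the host with .get(), reduced there with MPI.MAX or MPI.MIN, and the result is moved back to the GPU via ncp.asarray. A standalone sketch of the same host roundtrip, assuming mpi4py and CuPy are installed (function and variable names here are illustrative, not the library's API):

    import numpy
    from mpi4py import MPI

    def allreduce_max_host(comm, send_buf):
        # All-reduce with MPI.MAX, detouring through host memory when the
        # input lives on the GPU (CuPy arrays expose .get()).
        on_gpu = hasattr(send_buf, "get")
        host_send = send_buf.get() if on_gpu else send_buf
        host_recv = numpy.empty_like(host_send)
        comm.Allreduce(host_send, host_recv, op=MPI.MAX)
        if on_gpu:
            import cupy
            return cupy.asarray(host_recv)  # copy the reduced result back to the GPU
        return host_recv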

pylops_mpi/utils/benchmark.py

Lines changed: 1 addition & 0 deletions
@@ -133,6 +133,7 @@ def wrapper(*args, **kwargs):
         header_index = len(_markers) - 1
 
         def local_mark(label):
+            _sync()
             _markers.append((label, time.perf_counter(), level))
 
         _mark_func_stack.append(local_mark)
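
The added _sync() call matters for timing on GPUs: CuPy kernels are launched asynchronously, so a bare time.perf_counter() records when work was queued, not when it finished. A self-contained illustration of the difference, assuming CuPy is available (the _sync helper itself belongs to the benchmark subpackage and is not reproduced here; a device synchronization is one thing such a helper would plausibly do):

    import time

    import cupy as cp

    x = cp.random.rand(4000, 4000)
    t0 = time.perf_counter()
    y = x @ x                       # returns immediately: the kernel is only queued
    t1 = time.perf_counter()        # without a sync this underestimates the work
    cp.cuda.Device().synchronize()  # wait for the device to finish
    t2 = time.perf_counter()        # now the matmul has actually completed
    print(f"launch only: {t1 - t0:.6f} s, after sync: {t2 - t0:.6f} s")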

tests/test_blockdiag.py

Lines changed: 22 additions & 8 deletions
@@ -2,9 +2,19 @@
 Designed to run with n processes
 $ mpiexec -n 10 pytest test_blockdiag.py --with-mpi
 """
+import os
+
+if int(os.environ.get("TEST_CUPY_PYLOPS", 0)):
+    import cupy as np
+    from cupy.testing import assert_allclose
+
+    backend = "cupy"
+else:
+    import numpy as np
+    from numpy.testing import assert_allclose
+
+    backend = "numpy"
 from mpi4py import MPI
-import numpy as np
-from numpy.testing import assert_allclose
 import pytest
 
 import pylops
@@ -17,6 +27,10 @@
 par2j = {'ny': 301, 'nx': 101, 'dtype': np.complex128}
 
 np.random.seed(42)
+rank = MPI.COMM_WORLD.Get_rank()
+if backend == "cupy":
+    device_id = rank % np.cuda.runtime.getDeviceCount()
+    np.cuda.Device(device_id).use()
 
 
 @pytest.mark.mpi(min_size=2)
@@ -27,11 +41,11 @@ def test_blockdiag(par):
     Op = pylops.MatrixMult(A=((rank + 1) * np.ones(shape=(par['ny'], par['nx']))).astype(par['dtype']))
     BDiag_MPI = pylops_mpi.MPIBlockDiag(ops=[Op, ])
 
-    x = pylops_mpi.DistributedArray(global_shape=size * par['nx'], dtype=par['dtype'])
+    x = pylops_mpi.DistributedArray(global_shape=size * par['nx'], dtype=par['dtype'], engine=backend)
     x[:] = np.ones(shape=par['nx'], dtype=par['dtype'])
     x_global = x.asarray()
 
-    y = pylops_mpi.DistributedArray(global_shape=size * par['ny'], dtype=par['dtype'])
+    y = pylops_mpi.DistributedArray(global_shape=size * par['ny'], dtype=par['dtype'], engine=backend)
     y[:] = np.ones(shape=par['ny'], dtype=par['dtype'])
     y_global = y.asarray()
@@ -68,16 +82,16 @@ def test_stacked_blockdiag(par):
     FirstDeriv_MPI = pylops_mpi.MPIFirstDerivative(dims=(par['ny'], par['nx']), dtype=par['dtype'])
     StackedBDiag_MPI = pylops_mpi.MPIStackedBlockDiag(ops=[BDiag_MPI, FirstDeriv_MPI])
 
-    dist1 = pylops_mpi.DistributedArray(global_shape=size * par['nx'], dtype=par['dtype'])
+    dist1 = pylops_mpi.DistributedArray(global_shape=size * par['nx'], dtype=par['dtype'], engine=backend)
     dist1[:] = np.ones(dist1.local_shape, dtype=par['dtype'])
-    dist2 = pylops_mpi.DistributedArray(global_shape=par['nx'] * par['ny'], dtype=par['dtype'])
+    dist2 = pylops_mpi.DistributedArray(global_shape=par['nx'] * par['ny'], dtype=par['dtype'], engine=backend)
     dist2[:] = np.ones(dist2.local_shape, dtype=par['dtype'])
     x = pylops_mpi.StackedDistributedArray(distarrays=[dist1, dist2])
     x_global = x.asarray()
 
-    dist1 = pylops_mpi.DistributedArray(global_shape=size * par['ny'], dtype=par['dtype'])
+    dist1 = pylops_mpi.DistributedArray(global_shape=size * par['ny'], dtype=par['dtype'], engine=backend)
     dist1[:] = np.ones(dist1.local_shape, dtype=par['dtype'])
-    dist2 = pylops_mpi.DistributedArray(global_shape=par['nx'] * par['ny'], dtype=par['dtype'])
+    dist2 = pylops_mpi.DistributedArray(global_shape=par['nx'] * par['ny'], dtype=par['dtype'], engine=backend)
     dist2[:] = np.ones(dist2.local_shape, dtype=par['dtype'])
     y = pylops_mpi.StackedDistributedArray(distarrays=[dist1, dist2])
     y_global = y.asarray()
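
At import time the test module pins each MPI rank to a GPU with a round-robin rule (rank modulo the number of visible devices); the modulo keeps device ids valid even when more ranks than GPUs are launched, though the Makefile targets assume at most one process per GPU. A hypothetical standalone check of that mapping, not part of the repository:

    # check_devices.py -- run with: mpiexec -n 4 python check_devices.py
    from mpi4py import MPI

    import cupy as cp

    rank = MPI.COMM_WORLD.Get_rank()
    ngpus = cp.cuda.runtime.getDeviceCount()
    device_id = rank % ngpus  # same round-robin rule as in the tests
    cp.cuda.Device(device_id).use()
    print(f"rank {rank} -> GPU {device_id} of {ngpus}")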
