Commit dd9b43c

Merge branch 'main' into astroC86-SUMMA
2 parents 3e9659e + 45bee36 commit dd9b43c

19 files changed

+1428
-145
lines changed

Makefile

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@ PIP := $(shell command -v pip3 2> /dev/null || command which pip 2> /dev/null)
22
PYTHON := $(shell command -v python3 2> /dev/null || command which python 2> /dev/null)
33
NUM_PROCESSES = 3
44

5-
.PHONY: install dev-install install_conda dev-install_conda tests doc docupdate run_examples run_tutorials
5+
.PHONY: install dev-install dev-install_nccl install_conda install_conda_nccl dev-install_conda dev-install_conda_nccl tests tests_nccl doc docupdate run_examples run_tutorials
66

77
pipcheck:
88
ifndef PIP
@@ -24,19 +24,29 @@ dev-install:
2424
make pipcheck
2525
$(PIP) install -r requirements-dev.txt && $(PIP) install -e .
2626

27+
dev-install_nccl:
28+
make pipcheck
29+
$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 && $(PIP) install -e .
30+
2731
install_conda:
2832
conda env create -f environment.yml && conda activate pylops_mpi && pip install .
2933

34+
install_conda_nccl:
35+
conda env create -f environment.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install .
36+
3037
dev-install_conda:
3138
conda env create -f environment-dev.yml && conda activate pylops_mpi && pip install -e .
3239

40+
dev-install_conda_nccl:
41+
conda env create -f environment-dev.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install -e .
42+
3343
lint:
3444
flake8 pylops_mpi/ tests/ examples/ tutorials/
3545

3646
tests:
3747
mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
3848

39-
# assuming NUM_PRCESS <= number of gpus available
49+
# assuming NUM_PROCESSES <= number of gpus available
4050
tests_nccl:
4151
mpiexec -n $(NUM_PROCESSES) pytest tests_nccl/ --with-mpi
4252

README.md

Lines changed: 73 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -7,99 +7,118 @@
77
[![Slack Status](https://img.shields.io/badge/chat-slack-green.svg)](https://pylops.slack.com)
88
[![DOI](https://joss.theoj.org/papers/10.21105/joss.07512/status.svg)](https://doi.org/10.21105/joss.07512)
99

10-
## PyLops MPI
11-
pylops-mpi is a Python library built on top of [PyLops](https://pylops.readthedocs.io/en/stable/), designed to enable distributed and parallel processing of
10+
# Distributed linear operators and solvers
11+
Pylops-mpi is a Python library built on top of [PyLops](https://pylops.readthedocs.io/en/stable/), designed to enable distributed and parallel processing of
1212
large-scale linear algebra operations and computations.
1313

1414
## Installation
15-
To install pylops-mpi, you need to have MPI (Message Passing Interface) installed on your system.
15+
To install pylops-mpi, you need an implementation of the Message Passing Interface (MPI) and, optionally, NVIDIA's Collective Communications Library (NCCL) installed on your system.
16+
1617
1. **Download and Install MPI**: Visit the official MPI website to download an appropriate MPI implementation for your system.
1718
Follow the installation instructions provided by the MPI vendor.
1819
- [Open MPI](https://www.open-mpi.org/software/ompi/v1.10/)
1920
- [MPICH](https://www.mpich.org/downloads/)
2021
- [Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.10j8fx)
22+
2123
2. **Verify MPI Installation**: After installing MPI, verify its installation by opening a terminal or command prompt
2224
and running the following command:
23-
```
24-
mpiexec --version
2525
```
26-
3. **Install pylops-mpi**: Once MPI is installed and verified, you can proceed to install `pylops-mpi`.
27-
28-
You can install with `pip`:
29-
```
30-
pip install pylops-mpi
31-
```
32-
33-
You can install with `make` and `conda`:
34-
```
35-
make install_conda
36-
```
26+
mpiexec --version
27+
```
28+
29+
3. **Install pylops-mpi**: Once MPI is installed and verified, you can proceed to install `pylops-mpi` via `pip`:
30+
```
31+
pip install pylops-mpi
32+
```
33+
34+
4. (Optional) To enable the NCCL backend for multi-GPU systems, install `cupy` and `nccl` via `pip`:
35+
```
36+
pip install cupy-cudaXx nvidia-nccl-cuX
37+
```
3738

39+
with `X=11,12`.
40+
41+
Alternatively, if the Conda package manager is used to set up the Python environment, steps 1 and 2 can be skipped and `mpi4py` can be installed directly alongside the MPI distribution of choice:
42+
43+
```
44+
conda install -c conda-forge mpi4py X
45+
```
46+
47+
with `X=mpich, openmpi, impi_rt, msmpi`. Similarly, step 4 can be accomplished using:
48+
49+
```
50+
conda install -c conda-forge cupy nccl
51+
```
52+
53+
See the docs ([Installation](https://pylops.github.io/pylops-mpi/installation.html)) for more information.
54+
3855
## Run Pylops-MPI
3956
Once you have installed the prerequisites and pylops-mpi, you can run pylops-mpi using the `mpiexec` command.
40-
Here's an example on how to run the command:
57+
58+
Here is an example of how to run a Python script called `<script_name>.py`:
4159
```
4260
mpiexec -n <NUM_PROCESSES> python <script_name>.py
4361
```
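As a minimal sketch of what such a script might contain, the snippet below reports the rank and size of the MPI communicator it is launched in. It assumes `mpi4py` is importable; the helper name `rank_and_size` and the fallback to a single process when `mpi4py` is missing are illustrative assumptions and not part of pylops-mpi.

```python
def rank_and_size():
    """Return (rank, size) of the world communicator.

    Falls back to (0, 1) when mpi4py cannot be imported (assumed behavior
    for this sketch only, so it can also run outside an MPI environment).
    """
    try:
        from mpi4py import MPI
    except ImportError:
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


rank, size = rank_and_size()
print(f"Hello from rank {rank} of {size}")
```

When launched with `mpiexec -n <NUM_PROCESSES>`, each process runs the same script and should print its own rank.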
4462

45-
## Example
46-
The DistributedArray can be used to either broadcast or scatter the NumPy array across different
47-
ranks or processes.
63+
## Example: A distributed finite-difference operator
64+
The following example is a modified version of
65+
[PyLops' README](https://github.com/PyLops/pylops/blob/dev/README.md) starting
66+
example that can handle a 2D-array distributed across ranks over the first dimension
67+
via the `DistributedArray` object:
68+
4869
```python
70+
import numpy as np
import pylops_mpi
4971
from pylops_mpi import DistributedArray, Partition
5072

51-
global_shape = (10, 5)
73+
# Initialize DistributedArray with partition set to Scatter
74+
nx, ny = 11, 21
75+
x = np.zeros((nx, ny), dtype=np.float64)
76+
x[nx // 2, ny // 2] = 1.0
5277

53-
# Initialize a DistributedArray with partition set to Broadcast
54-
dist_array_broadcast = DistributedArray(global_shape=global_shape,
55-
partition=Partition.BROADCAST)
78+
x_dist = pylops_mpi.DistributedArray.to_dist(
79+
x=x.flatten(),
80+
partition=Partition.SCATTER)
5681

57-
# Initialize a DistributedArray with partition set to Scatter
58-
dist_array_scatter = DistributedArray(global_shape=global_shape,
59-
partition=Partition.SCATTER)
60-
```
82+
# Distributed first-derivative
83+
D_op = pylops_mpi.MPIFirstDerivative((nx, ny), dtype=np.float64)
6184

62-
Additionally, the DistributedArray can be used to scatter the array along any
63-
specified axis.
85+
# y = Dx
86+
y_dist = D_op @ x_dist
6487

65-
```python
66-
# Partition axis = 0
67-
dist_array_0 = DistributedArray(global_shape=global_shape,
68-
partition=Partition.SCATTER, axis=0)
88+
# xadj = D^H y
89+
xadj_dist = D_op.H @ y_dist
6990

70-
# Partition axis = 1
71-
dist_array_1 = DistributedArray(global_shape=global_shape,
72-
partition=Partition.SCATTER, axis=1)
91+
# xinv = D^-1 y
92+
x0_dist = pylops_mpi.DistributedArray(D_op.shape[1], dtype=np.float64)
93+
x0_dist[:] = 0
94+
xinv_dist = pylops_mpi.cgls(D_op, y_dist, x0=x0_dist, niter=10)[0]
7395
```
7496

75-
The DistributedArray class provides a `to_dist` class method that accepts a NumPy array as input and converts it into an
76-
instance of the `DistributedArray` class. This method is used to transform a regular NumPy array into a DistributedArray that can be distributed
77-
and processed across multiple nodes or processes.
78-
79-
```python
80-
import numpy as np
81-
np.random.seed(42)
97+
Note that the `DistributedArray` class provides the `to_dist` class method that accepts a NumPy array as input and converts it into an instance of the `DistributedArray` class. This method is used to transform a regular NumPy array into a DistributedArray that is distributed and processed across multiple nodes or processes.
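To make the idea of scattering concrete, here is a small sketch of how an axis of length 11 might be split across 3 ranks. The helpers `local_extent` and `local_slice` are hypothetical, and the convention used (chunk sizes differ by at most one, with the extra entries going to the lowest ranks) is an assumption; pylops-mpi's internal partitioning may differ.

```python
def local_extent(n_global, size, rank):
    """Number of entries of a length-n_global axis owned by `rank`.

    Assumed convention: chunk sizes differ by at most one and the
    extra entries are assigned to the lowest ranks.
    """
    base, extra = divmod(n_global, size)
    return base + (1 if rank < extra else 0)


def local_slice(n_global, size, rank):
    """Slice of the global axis owned by `rank` under the same convention."""
    start = sum(local_extent(n_global, size, r) for r in range(rank))
    return slice(start, start + local_extent(n_global, size, rank))


nx, size = 11, 3
extents = [local_extent(nx, size, r) for r in range(size)]
slices = [local_slice(nx, size, r) for r in range(size)]
print(extents)  # [4, 4, 3]
print(slices)   # [slice(0, 4, None), slice(4, 8, None), slice(8, 11, None)]
```

Each rank then only stores and processes its own slice of the global array.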
8298

83-
dist_arr = DistributedArray.to_dist(x=np.random.normal(100, 100, global_shape),
84-
partition=Partition.SCATTER, axis=0)
85-
```
86-
The DistributedArray also provides fundamental mathematical operations, like element-wise addition, subtraction, and multiplication,
87-
as well as dot product and the [`np.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) function in a distributed fashion,
88-
thus utilizing the efficiency of the MPI protocol. This enables efficient computation and processing of large-scale distributed arrays.
99+
Moreover, the `DistributedArray` class also provides fundamental mathematical operations, such as element-wise addition, subtraction, multiplication, dot product, and an equivalent of the [`np.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) function, all of which operate in a distributed fashion,
100+
thus utilizing the efficiency of the MPI/NCCL protocols. This enables efficient computation and processing of large-scale distributed arrays.
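The pattern behind a distributed dot product or norm can be sketched without MPI at all: each rank reduces its local chunk, and a sum-reduction (an AllReduce in MPI or NCCL) combines the partial results. The snippet below simulates the ranks with plain Python lists; `split`, `distributed_dot`, and `distributed_norm` are hypothetical names used only for illustration and do not reflect pylops-mpi's actual implementation.

```python
import math


def split(values, size):
    """Split a list into `size` contiguous chunks (lowest ranks get the extras)."""
    base, extra = divmod(len(values), size)
    chunks, start = [], 0
    for rank in range(size):
        stop = start + base + (1 if rank < extra else 0)
        chunks.append(values[start:stop])
        start = stop
    return chunks


def distributed_dot(x_chunks, y_chunks):
    # Each simulated rank computes a partial dot product on its local chunk ...
    partials = [sum(a * b for a, b in zip(xc, yc))
                for xc, yc in zip(x_chunks, y_chunks)]
    # ... and a sum-reduction over ranks combines the partial results.
    return sum(partials)


def distributed_norm(x_chunks):
    # The 2-norm is the square root of the dot product of x with itself.
    return math.sqrt(distributed_dot(x_chunks, x_chunks))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 4.0, 3.0, 2.0, 1.0]
x_chunks, y_chunks = split(x, 3), split(y, 3)
dot = distributed_dot(x_chunks, y_chunks)
norm = distributed_norm(x_chunks)
print(dot)  # 35.0
```

Only one scalar per rank has to be communicated, which is why such reductions scale well to large arrays.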
89101

90102
## Running Tests
91-
The test scripts are located in the tests folder.
103+
The MPI test scripts are located in the `tests` folder.
92104
Use the following command to run the tests:
93105
```
94-
mpiexec -n <NUM_PROCESSES> pytest --with-mpi
106+
mpiexec -n <NUM_PROCESSES> pytest tests/ --with-mpi
107+
```
108+
where the `--with-mpi` option tells pytest to enable the `pytest-mpi` plugin, allowing the tests to utilize the MPI functionality.
109+
110+
Similarly, to run the NCCL test scripts in the `tests_nccl` folder,
111+
use the following command:
112+
```
113+
mpiexec -n <NUM_PROCESSES> pytest tests_nccl/ --with-mpi
95114
```
96-
The `--with-mpi` option tells pytest to enable the `pytest-mpi` plugin,
97-
allowing the tests to utilize the MPI functionality.
98115

99116
## Documentation
100117
The official documentation of Pylops-MPI is available [here](https://pylops.github.io/pylops-mpi/).
101118
Visit the official docs to learn more about pylops-mpi.
102119

103120
## Contributors
104121
* Rohan Babbar, rohanbabbar04
122+
* Yuxi Hong, hongyx11
105123
* Matteo Ravasi, mrava87
124+
* Tharit Tangkijwanichakul, tharittk

docs/source/credits.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,6 @@ Contributors
44
============
55

66
* `Rohan Babbar <https://github.com/rohanbabbar04>`_, rohanbabbar04
7-
* `Matteo Ravasi <https://github.com/mrava87>`_, mrava87
87
* `Yuxi Hong <https://github.com/hongyx11>`_, hongyx11
9-
* `Carlos da Costa <https://github.com/cako>`_, cako
8+
* `Matteo Ravasi <https://github.com/mrava87>`_, mrava87
9+
* `Tharit Tangkijwanichakul <https://github.com/tharittk>`_, tharittk

docs/source/gpu.rst

Lines changed: 73 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,15 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thi
2222
some of the operators in PyLops that lack a GPU implementation cannot be used also in PyLops-mpi when working with
2323
cupy arrays.
2424

25+
Moreover, PyLops-MPI also supports NVIDIA's Collective Communications Library (NCCL) for highly optimized
26+
collective operations, such as AllReduce and AllGather. This allows PyLops-MPI users to leverage
27+
proprietary technologies such as NVLink that might be available in their infrastructure for fast data communication.
28+
29+
.. note::
30+
31+
Set the environment variable ``NCCL_PYLOPS_MPI=0`` to explicitly force PyLops-MPI to ignore the ``NCCL`` backend.
32+
However, this is optional, as users may opt out of NCCL simply by not passing a ``cupy.cuda.nccl.NcclCommunicator`` to
33+
:class:`pylops_mpi.DistributedArray`.
2534

2635
Example
2736
-------
@@ -79,7 +88,69 @@ your GPU:
7988
The code is almost unchanged apart from the fact that we now use ``cupy`` arrays,
8089
PyLops-mpi will figure this out!
8190

91+
Finally, if NCCL is available, a ``cupy.cuda.nccl.NcclCommunicator`` can be initialized and passed to :class:`pylops_mpi.DistributedArray`
92+
as follows:
93+
94+
.. code-block:: python
95+
96+
from pylops_mpi.utils._nccl import initialize_nccl_comm
97+
98+
# Initialize NCCL Communicator
99+
nccl_comm = initialize_nccl_comm()
100+
101+
# Create distributed data (broadcast)
102+
nxl, nt = 20, 20
103+
dtype = np.float32
104+
d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
105+
base_comm_nccl=nccl_comm,
106+
partition=pylops_mpi.Partition.BROADCAST,
107+
engine="cupy", dtype=dtype)
108+
d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)
109+
110+
# Create and apply VStack operator
111+
Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
112+
HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
113+
y_dist = HOp @ d_dist
114+
115+
Under the hood, PyLops-MPI uses both an MPI communicator and an NCCL communicator to manage distributed operations. Each GPU is logically bound to
116+
one MPI process. Minor communications, such as those dealing with array shapes and sizes, are still performed using MPI, while collective calls on arrays, such as AllReduce, are carried out through NCCL.
117+
82118
.. note::
83119

84-
The CuPy backend is in active development, with many examples not yet in the docs.
85-
You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
120+
The CuPy and NCCL backends are in active development, with many examples not yet in the docs.
121+
You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
122+
123+
Support for NCCL Backend
124+
----------------------------
125+
In the following, we provide a list of modules (i.e., operators and solvers) where we plan to support NCCL and the current status:
126+
127+
.. list-table::
128+
:widths: 50 25
129+
:header-rows: 1
130+
131+
* - modules
132+
- NCCL supported
133+
* - :class:`pylops_mpi.DistributedArray`
134+
- /
135+
* - :class:`pylops_mpi.basicoperators.MPIVStack`
136+
- Ongoing
137+
* - :class:`pylops_mpi.basicoperators.MPIHStack`
138+
- Ongoing
139+
* - :class:`pylops_mpi.basicoperators.MPIBlockDiag`
140+
- Ongoing
141+
* - :class:`pylops_mpi.basicoperators.MPIGradient`
142+
- Ongoing
143+
* - :class:`pylops_mpi.basicoperators.MPIFirstDerivative`
144+
- Ongoing
145+
* - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
146+
- Ongoing
147+
* - :class:`pylops_mpi.basicoperators.MPILaplacian`
148+
- Ongoing
149+
* - :class:`pylops_mpi.optimization.basic.cg`
150+
- Ongoing
151+
* - :class:`pylops_mpi.optimization.basic.cgls`
152+
- Ongoing
153+
* - ISTA Solver
154+
- Planned
155+
* - Complex Numeric Data Type for NCCL
156+
- Planned

docs/source/index.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,10 @@ By integrating MPI (Message Passing Interface), PyLops-MPI optimizes the collabo
1414
computing nodes, enabling large and intricate tasks to be divided, solved, and aggregated in an efficient and
1515
parallelized manner.
1616

17+
PyLops-MPI also supports NVIDIA's Collective Communications Library `(NCCL) <https://developer.nvidia.com/nccl>`_ for high-performance
18+
GPU-to-GPU communications. PyLops-MPI's NCCL engine works in tandem with MPI by delegating GPU-to-GPU communication tasks to
19+
the highly optimized NCCL, while leveraging MPI for CPU-side coordination and orchestration.
20+
1721
Get started by :ref:`installing PyLops-MPI <Installation>` and following our quick tour.
1822

1923
Terminology
