PyLops
diff --git a/Makefile b/Makefile
Lines changed: 12 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 73 additions & 54 deletions
diff --git a/docs/source/credits.rst b/docs/source/credits.rst
Lines changed: 2 additions & 2 deletions
diff --git a/docs/source/gpu.rst b/docs/source/gpu.rst
Lines changed: 73 additions & 2 deletions
diff --git a/docs/source/index.rst b/docs/source/index.rst
Lines changed: 4 additions & 0 deletions
@@ -2,7 +2,7 @@ PIP := $(shell command -v pip3 2> /dev/null || command which pip 2> /dev/null)
 PYTHON := $(shell command -v python3 2> /dev/null || command which python 2> /dev/null)
 NUM_PROCESSES = 3
 
-.PHONY: install dev-install install_conda dev-install_conda tests doc docupdate run_examples run_tutorials
+.PHONY: install dev-install dev-install_nccl install_conda install_conda_nccl dev-install_conda dev-install_conda_nccl tests tests_nccl doc docupdate run_examples run_tutorials
 
 pipcheck:
 ifndef PIP
@@ -24,19 +24,29 @@ dev-install:
 	make pipcheck
 	$(PIP) install -r requirements-dev.txt && $(PIP) install -e .
 
+dev-install_nccl:
+	make pipcheck
+	$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 && $(PIP) install -e .
+
 install_conda:
 	conda env create -f environment.yml && conda activate pylops_mpi && pip install .
 
+install_conda_nccl:
+	conda env create -f environment.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install .
+
 dev-install_conda:
 	conda env create -f environment-dev.yml && conda activate pylops_mpi && pip install -e .
 
+dev-install_conda_nccl:
+	conda env create -f environment-dev.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install -e .
+
 lint:
 	flake8 pylops_mpi/ tests/ examples/ tutorials/
 
 tests:
 	mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
 
-# assuming NUM_PRCESS <= number of gpus available
+# assuming NUM_PROCESSES <= number of gpus available
 tests_nccl:	
 	mpiexec -n $(NUM_PROCESSES) pytest tests_nccl/ --with-mpi
 
 
@@ -7,99 +7,118 @@
 [![Slack Status](https://img.shields.io/badge/chat-slack-green.svg)](https://pylops.slack.com)
 [![DOI](https://joss.theoj.org/papers/10.21105/joss.07512/status.svg)](https://doi.org/10.21105/joss.07512)
 
-## PyLops MPI
-pylops-mpi is a Python library built on top of [PyLops](https://pylops.readthedocs.io/en/stable/), designed to enable distributed and parallel processing of 
+# Distributed linear operators and solvers
+PyLops-MPI is a Python library built on top of [PyLops](https://pylops.readthedocs.io/en/stable/), designed to enable distributed and parallel processing of 
 large-scale linear algebra operations and computations.  
 
 ## Installation
-To install pylops-mpi, you need to have MPI (Message Passing Interface) installed on your system.
+To install pylops-mpi, you need to have the Message Passing Interface (MPI) and, optionally, NVIDIA's Collective Communication Library (NCCL) installed on your system.
+
 1. **Download and Install MPI**: Visit the official MPI website to download an appropriate MPI implementation for your system. 
 Follow the installation instructions provided by the MPI vendor.
    - [Open MPI](https://www.open-mpi.org/software/ompi/v1.10/)
    - [MPICH](https://www.mpich.org/downloads/)
    - [Intel MPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.10j8fx)
+
 2. **Verify MPI Installation**: After installing MPI, verify its installation by opening a terminal or command prompt 
 and running the following command:
-    ```
-    mpiexec --version
    ```
- 3. **Install pylops-mpi**: Once MPI is installed and verified, you can proceed to install `pylops-mpi`. 
-   
-      You can install with `pip`:
-      ```
-      pip install pylops-mpi
-      ```
-   
-      You can install with `make` and `conda`:
-      ```
-      make install_conda
-      ```
+   mpiexec --version
+   ```
+
+3. **Install pylops-mpi**: Once MPI is installed and verified, you can proceed to install `pylops-mpi` via `pip`:
+   ```
+   pip install pylops-mpi
+   ```
+
+4. (Optional) To enable the NCCL backend for multi-GPU systems, install `cupy` and `nccl` via `pip`:
+   ```
+   pip install cupy-cudaXx nvidia-nccl-cuX
+   ```
 
+   with `X=11,12`.
+
+Alternatively, if the Conda package manager is used to set up the Python environment, steps 1 and 2 can be skipped and `mpi4py` can be installed directly alongside the MPI distribution of choice:
+
+```
+conda install -c conda-forge mpi4py X
+```
+
+with `X=mpich, openmpi, impi_rt, msmpi`. Similarly, step 4 can be accomplished using:
+
+```
+conda install -c conda-forge cupy nccl 
+```
+
+See the docs ([Installation](https://pylops.github.io/pylops-mpi/installation.html)) for more information.
+
 ## Run Pylops-MPI
 Once you have installed the prerequisites and pylops-mpi, you can run pylops-mpi using the `mpiexec` command. 
-Here's an example on how to run the command:
+
+Here is an example of how to run a Python script called `<script_name>.py`:
 ```
 mpiexec -n <NUM_PROCESSES> python <script_name>.py
 ```
 
-## Example
-The DistributedArray can be used to either broadcast or scatter the NumPy array across different 
-ranks or processes.
+## Example: A distributed finite-difference operator
+The following example is a modified version of the starting example in 
+[PyLops' README](https://github.com/PyLops/pylops/blob/dev/README.md) 
+that can handle a 2D array distributed across ranks over the first dimension 
+via the `DistributedArray` object:
+
 ```python
+import numpy as np
+import pylops_mpi
 from pylops_mpi import DistributedArray, Partition
 
-global_shape = (10, 5)
+# Initialize DistributedArray with partition set to Scatter
+nx, ny = 11, 21
+x = np.zeros((nx, ny), dtype=np.float64)
+x[nx // 2, ny // 2] = 1.0
 
-# Initialize a DistributedArray with partition set to Broadcast
-dist_array_broadcast = DistributedArray(global_shape=global_shape,
-                                        partition=Partition.BROADCAST)
+x_dist = pylops_mpi.DistributedArray.to_dist(
+            x=x.flatten(), 
+            partition=Partition.SCATTER)
 
-# Initialize a DistributedArray with partition set to Scatter
-dist_array_scatter = DistributedArray(global_shape=global_shape,
-                                      partition=Partition.SCATTER)
-```
+# Distributed first-derivative
+D_op = pylops_mpi.MPIFirstDerivative((nx, ny), dtype=np.float64)
 
-Additionally, the DistributedArray can be used to scatter the array along any
-specified axis.
+# y = Dx
+y_dist = D_op @ x_dist
 
-```python
-# Partition axis = 0
-dist_array_0 = DistributedArray(global_shape=global_shape, 
-                                partition=Partition.SCATTER, axis=0)
+# xadj = D^H y
+xadj_dist = D_op.H @ y_dist
 
-# Partition axis = 1
-dist_array_1 = DistributedArray(global_shape=global_shape, 
-                                partition=Partition.SCATTER, axis=1)
+# xinv = D^-1 y
+x0_dist = pylops_mpi.DistributedArray(D_op.shape[1], dtype=np.float64)
+x0_dist[:] = 0
+xinv_dist = pylops_mpi.cgls(D_op, y_dist, x0=x0_dist, niter=10)[0]
 ```
 
-The DistributedArray class provides a `to_dist` class method that accepts a NumPy array as input and converts it into an 
-instance of the `DistributedArray` class. This method is used to transform a regular NumPy array into a DistributedArray that can be distributed 
-and processed across multiple nodes or processes.
-
-```python
-import numpy as np
-np.random.seed(42)
+Note that the `DistributedArray` class provides the `to_dist` class method that accepts a NumPy array as input and converts it into an instance of the `DistributedArray` class. This method is used to transform a regular NumPy array into a DistributedArray that is distributed and processed across multiple nodes or processes.
 
-dist_arr = DistributedArray.to_dist(x=np.random.normal(100, 100, global_shape), 
-                                    partition=Partition.SCATTER, axis=0)
-```
-The DistributedArray also provides fundamental mathematical operations, like element-wise addition, subtraction, and multiplication, 
-as well as dot product and the [`np.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) function in a distributed fashion, 
-thus utilizing the efficiency of the MPI protocol. This enables efficient computation and processing of large-scale distributed arrays.
+Moreover, the `DistributedArray` class also provides fundamental mathematical operations, such as element-wise addition, subtraction, multiplication, dot product, and an equivalent of the [`np.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) function, that operate in a distributed fashion, 
+thus utilizing the efficiency of the MPI/NCCL protocols. This enables efficient computation and processing of large-scale distributed arrays.
 
 ## Running Tests
-The test scripts are located in the tests folder.
+The MPI test scripts are located in the `tests` folder.
 Use the following command to run the tests:
 ```
-mpiexec -n <NUM_PROCESSES> pytest --with-mpi
+mpiexec -n <NUM_PROCESSES> pytest tests/ --with-mpi
+```
+where the `--with-mpi` option tells pytest to enable the `pytest-mpi` plugin, allowing the tests to utilize the MPI functionality.
+
+Similarly, to run the NCCL test scripts in the `tests_nccl` folder, 
+use the following command:
+```
+mpiexec -n <NUM_PROCESSES> pytest tests_nccl/ --with-mpi
 ```
-The `--with-mpi` option tells pytest to enable the `pytest-mpi` plugin, 
-allowing the tests to utilize the MPI functionality.
 
 ## Documentation 
 The official documentation of Pylops-MPI is available [here](https://pylops.github.io/pylops-mpi/).
 Visit the official docs to learn more about pylops-mpi.
 
 ## Contributors
 * Rohan Babbar, rohanbabbar04
+* Yuxi Hong, hongyx11
 * Matteo Ravasi, mrava87
+* Tharit Tangkijwanichakul, tharittk
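The README text in this hunk states that `DistributedArray` provides dot products and norms that operate in a distributed fashion. As a rough, MPI-free illustration of the idea (not the library's implementation), the sketch below simulates ranks with plain Python lists: each rank computes a partial result on its own chunk, and the partial results are summed, which is the role an all-reduce would play. The helper names `scatter`, `distributed_dot`, and `distributed_norm` are hypothetical, and the chunking rule is an assumption.

```python
import math


def scatter(values, n_ranks):
    """Split a list into contiguous chunks, one per simulated rank."""
    base, extra = divmod(len(values), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        # Assumption: the first `extra` ranks hold one additional element
        stop = start + base + (1 if rank < extra else 0)
        chunks.append(values[start:stop])
        start = stop
    return chunks


def distributed_dot(x_chunks, y_chunks):
    """Sum of per-rank partial dot products (stands in for an all-reduce)."""
    partials = [sum(a * b for a, b in zip(xc, yc))
                for xc, yc in zip(x_chunks, y_chunks)]
    return sum(partials)


def distributed_norm(x_chunks):
    """Euclidean norm obtained from the reduced sum of squares."""
    return math.sqrt(distributed_dot(x_chunks, x_chunks))


x = [float(i) for i in range(10)]
y = [1.0] * 10
x_chunks, y_chunks = scatter(x, 3), scatter(y, 3)
print(distributed_dot(x_chunks, y_chunks))  # equals sum(x)
print(distributed_norm(x_chunks))           # equals sqrt(sum(v * v for v in x))
```

In an actual run, each rank would only ever see its own chunk, and the final summation would be a collective call rather than a Python `sum` over a list.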
@@ -4,6 +4,6 @@ Contributors
 ============
 
 *  `Rohan Babbar <https://github.com/rohanbabbar04>`_, rohanbabbar04
-*  `Matteo Ravasi <https://github.com/mrava87>`_, mrava87
 *  `Yuxi Hong <https://github.com/hongyx11>`_, hongyx11
-*  `Carlos da Costa <https://github.com/cako>`_, cako
+*  `Matteo Ravasi <https://github.com/mrava87>`_, mrava87
+*  `Tharit Tangkijwanichakul <https://github.com/tharittk>`_, tharittk
@@ -22,6 +22,15 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thi
 some of the operators in PyLops that lack a GPU implementation cannot be used also in PyLops-mpi when working with
 cupy arrays.
 
+Moreover, PyLops-MPI also supports NVIDIA's Collective Communication Library (NCCL) for highly optimized
+collective operations, such as AllReduce, AllGather, etc. This allows PyLops-MPI users to leverage
+proprietary technologies such as NVLink that might be available in their infrastructure for fast data communication.
+
+.. note::
+
+   Set the environment variable ``NCCL_PYLOPS_MPI=0`` to explicitly force PyLops-MPI to ignore the ``NCCL`` backend.
+   This is optional, however, as users may also opt out of NCCL by simply not passing a ``cupy.cuda.nccl.NcclCommunicator`` to
+   :class:`pylops_mpi.DistributedArray`.
 
 Example
 -------
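The note added in this hunk describes two ways to opt out of NCCL: setting ``NCCL_PYLOPS_MPI=0`` or not passing an NCCL communicator. A minimal sketch of how such a switch could be evaluated is given below. It is not PyLops-MPI's actual logic: the function names are hypothetical, and treating every value other than ``"0"`` as "enabled" is an assumption.

```python
import importlib.util
import os


def nccl_requested(environ=None):
    """Return False only when NCCL_PYLOPS_MPI is explicitly set to "0"."""
    environ = os.environ if environ is None else environ
    # Assumption: an unset variable or any value other than "0" leaves NCCL enabled
    return environ.get("NCCL_PYLOPS_MPI", "1") != "0"


def cupy_available():
    """Check whether CuPy is installed without importing it (no GPU is touched)."""
    return importlib.util.find_spec("cupy") is not None


def choose_backend(environ=None, has_cupy=None):
    """Pick "nccl" when requested and CuPy is installed, otherwise "mpi"."""
    has_cupy = cupy_available() if has_cupy is None else has_cupy
    return "nccl" if nccl_requested(environ) and has_cupy else "mpi"


print(choose_backend({"NCCL_PYLOPS_MPI": "0"}, has_cupy=True))  # mpi
print(choose_backend({}, has_cupy=True))                        # nccl
```

Passing the environment and the CuPy flag as arguments keeps the decision testable on machines without a GPU.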
@@ -79,7 +88,69 @@ your GPU:
 The code is almost unchanged apart from the fact that we now use ``cupy`` arrays,
 PyLops-mpi will figure this out!
 
+Finally, if NCCL is available, a ``cupy.cuda.nccl.NcclCommunicator`` can be initialized and passed to :class:`pylops_mpi.DistributedArray`
+as follows:
+
+.. code-block:: python
+
+    import numpy as np
+    import cupy as cp
+    import pylops
+    import pylops_mpi
+    from pylops_mpi.utils._nccl import initialize_nccl_comm
+
+    # Initialize NCCL Communicator
+    nccl_comm = initialize_nccl_comm()
+
+    # Create distributed data (broadcast)
+    nxl, nt = 20, 20
+    dtype = np.float32
+    d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
+                                         base_comm_nccl=nccl_comm,
+                                         partition=pylops_mpi.Partition.BROADCAST,
+                                         engine="cupy", dtype=dtype)
+    d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)
+
+    # Create and apply VStack operator
+    Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
+    HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
+    y_dist = HOp @ d_dist
+
+Under the hood, PyLops-MPI uses both an MPI communicator and an NCCL communicator to manage distributed operations. Each GPU is logically bound to 
+one MPI process. Minor communications, such as those dealing with array shapes and sizes, are still performed using MPI, while collective calls on arrays, such as AllReduce, are carried out through NCCL.
+
 .. note::
 
-   The CuPy backend is in active development, with many examples not yet in the docs.
-   You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
+   The CuPy and NCCL backends are in active development, with many examples not yet in the docs.
+   You can find many `other examples <https://github.com/PyLops/pylops_notebooks/tree/master/developement-mpi/Cupy_MPI>`_ from the `PyLops Notebooks repository <https://github.com/PyLops/pylops_notebooks>`_.
+
+Support for NCCL Backend
+------------------------
+In the following, we list the modules (i.e., operators and solvers) for which NCCL support is planned, along with their current status:
+
+.. list-table::
+   :widths: 50 25 
+   :header-rows: 1
+
+   * - Module
+     - NCCL supported
+   * - :class:`pylops_mpi.DistributedArray`
+     - / 
+   * - :class:`pylops_mpi.basicoperators.MPIVStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIHStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIBlockDiag`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIGradient`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIFirstDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPILaplacian`
+     - Ongoing
+   * - :func:`pylops_mpi.optimization.basic.cg`
+     - Ongoing
+   * - :func:`pylops_mpi.optimization.basic.cgls`
+     - Ongoing
+   * - ISTA Solver
+     - Planned 
+   * - Complex Numeric Data Type for NCCL 
+     - Planned 
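The gpu.rst text in this hunk says that each GPU is logically bound to one MPI process, and the Makefile comment assumes `NUM_PROCESSES` does not exceed the number of GPUs. The sketch below illustrates one way such a binding could be derived from a rank. It is a plain-Python illustration under stated assumptions, not the code used by `initialize_nccl_comm`: the helper names are hypothetical and the round-robin mapping is an assumption.

```python
def device_for_rank(rank, n_devices):
    """Map an MPI rank to a GPU index (hypothetical helper)."""
    if n_devices < 1:
        raise ValueError("at least one GPU is required")
    # Assumption: ranks are assigned to devices round-robin
    return rank % n_devices


def one_gpu_per_process(n_processes, n_devices):
    """True when every process can own its own GPU."""
    return n_processes <= n_devices


n_processes, n_devices = 3, 4
assignment = {rank: device_for_rank(rank, n_devices)
              for rank in range(n_processes)}
print(assignment)
print(one_gpu_per_process(n_processes, n_devices))
```

In a real script, the rank would typically come from `MPI.COMM_WORLD.Get_rank()` (mpi4py), the device count from `cupy.cuda.runtime.getDeviceCount()`, and the chosen device would be activated with `cupy.cuda.Device(device_id).use()`.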
@@ -14,6 +14,10 @@ By integrating MPI (Message Passing Interface), PyLops-MPI optimizes the collabo
 computing nodes, enabling large and intricate tasks to be divided, solved, and aggregated in an efficient and
 parallelized manner.
 
+PyLops-MPI also supports NVIDIA's Collective Communication Library `(NCCL) <https://developer.nvidia.com/nccl>`_ for high-performance
+GPU-to-GPU communications. PyLops-MPI's NCCL engine works in tandem with MPI by delegating the GPU-to-GPU communication tasks to 
+the highly optimized NCCL library, while leveraging MPI for CPU-side coordination and orchestration.
+
 Get started by :ref:`installing PyLops-MPI <Installation>` and following our quick tour.
 
 Terminology