diff --git a/Makefile b/Makefile
index 33065808..bb33445f 100644
--- a/Makefile
+++ b/Makefile
@@ -2,7 +2,7 @@ PIP := $(shell command -v pip3 2> /dev/null || command which pip 2> /dev/null)
 PYTHON := $(shell command -v python3 2> /dev/null || command which python 2> /dev/null)
 NUM_PROCESSES = 3
 
-.PHONY: install dev-install install_conda dev-install_conda tests doc docupdate run_examples run_tutorials
+.PHONY: install dev-install dev-install_nccl install_conda install_conda_nccl dev-install_conda dev-install_conda_nccl tests tests_nccl doc docupdate run_examples run_tutorials
 
 pipcheck:
 ifndef PIP
@@ -24,19 +24,29 @@ dev-install:
 	make pipcheck
 	$(PIP) install -r requirements-dev.txt && $(PIP) install -e .
 
+dev-install_nccl:
+	make pipcheck
+	$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 && $(PIP) install -e .
+
 install_conda:
 	conda env create -f environment.yml && conda activate pylops_mpi && pip install .
 
+install_conda_nccl:
+	conda env create -f environment.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install .
+
 dev-install_conda:
 	conda env create -f environment-dev.yml && conda activate pylops_mpi && pip install -e .
 
+dev-install_conda_nccl:
+	conda env create -f environment-dev.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install -e .
+
 lint:
 	flake8 pylops_mpi/ tests/ examples/ tutorials/
 
 tests:
 	mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
 
-# assuming NUM_PRCESS <= number of gpus available
+# assuming NUM_PROCESSES <= number of GPUs available
 tests_nccl:
 	mpiexec -n $(NUM_PROCESSES) pytest tests_nccl/ --with-mpi
diff --git a/README.md b/README.md
index 4a9a9bfe..81f24fb0 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,10 @@ and running the following command:
 ```
 make install_conda
 ```
+Optionally, if you work in a multi-GPU environment and want to enable NVIDIA's Collective Communication Library (NCCL), install the environment with
+```
+make install_conda_nccl
+```
 
 ## Run Pylops-MPI
 Once you have installed the prerequisites and pylops-mpi, you can run pylops-mpi using the `mpiexec` command.
diff --git a/docs/source/gpu.rst b/docs/source/gpu.rst
index 43c9e768..7afe24aa 100644
--- a/docs/source/gpu.rst
+++ b/docs/source/gpu.rst
@@ -22,6 +22,15 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thi
 some of the operators in PyLops that lack a GPU implementation cannot be used also in PyLops-mpi
 when working with cupy arrays.
 
+Moreover, PyLops-MPI also supports NVIDIA's Collective Communication Library (NCCL) for highly-optimized
+collective operations, such as AllReduce, AllGather, etc. This allows PyLops-MPI users to leverage
+proprietary technologies such as NVLink, when available in their infrastructure, for fast data communication.
+
+.. note::
+
+   Set the environment variable ``NCCL_PYLOPS_MPI=0`` to explicitly force PyLops-MPI to ignore the NCCL backend.
+   This is however optional, as users can also opt out of NCCL simply by not passing a
+   ``cupy.cuda.nccl.NcclCommunicator`` to the :class:`pylops_mpi.DistributedArray`.
 
 Example
 -------
@@ -79,7 +88,69 @@ your GPU:
 
 The code is almost unchanged apart from the fact that we now use ``cupy`` arrays,
 PyLops-mpi will figure this out!
+Finally, if NCCL is available, a ``cupy.cuda.nccl.NcclCommunicator`` can be initialized and passed to
+:class:`pylops_mpi.DistributedArray` as follows:
+
+.. code-block:: python
+
+   import numpy as np
+   import cupy as cp
+   import pylops
+   import pylops_mpi
+
+   from pylops_mpi.utils._nccl import initialize_nccl_comm
+
+   # Initialize NCCL communicator
+   nccl_comm = initialize_nccl_comm()
+
+   # Create distributed data (broadcast)
+   nxl, nt = 20, 20
+   dtype = np.float32
+   d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
+                                        base_comm_nccl=nccl_comm,
+                                        partition=pylops_mpi.Partition.BROADCAST,
+                                        engine="cupy", dtype=dtype)
+   d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)
+
+   # Create and apply VStack operator
+   Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
+   HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
+   y_dist = HOp @ d_dist
+
+Under the hood, PyLops-MPI uses both the MPI and NCCL communicators to manage distributed operations: each GPU is
+logically bound to one MPI process. Minor communications, such as those exchanging array-related shapes and sizes,
+are still performed with MPI, while collective calls on arrays, such as AllReduce, are carried out through NCCL.
 
 .. note::
 
-   The CuPy backend is in active development, with many examples not yet in the docs.
-   You can find many `other examples `_ from the `PyLops Notebooks repository `_.
\ No newline at end of file
+   The CuPy and NCCL backends are in active development, with many examples not yet in the docs.
+   You can find many `other examples `_ from the `PyLops Notebooks repository `_.
+
+Support for the NCCL Backend
+----------------------------
+In the following, we provide a list of modules (i.e., operators and solvers) for which we plan to support NCCL, together with their current status:
+
+.. list-table::
+   :widths: 50 25
+   :header-rows: 1
+
+   * - Module
+     - NCCL supported
+   * - :class:`pylops_mpi.DistributedArray`
+     - ✓
+   * - :class:`pylops_mpi.basicoperators.MPIVStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIHStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIBlockDiag`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIGradient`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIFirstDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPILaplacian`
+     - Ongoing
+   * - :class:`pylops_mpi.optimization.basic.cg`
+     - Ongoing
+   * - :class:`pylops_mpi.optimization.basic.cgls`
+     - Ongoing
+   * - ISTA Solver
+     - Planned
+   * - Complex Numeric Data Type for NCCL
+     - Planned
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index b5d538ee..e044cd9b 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -14,6 +14,10 @@ By integrating MPI (Message Passing Interface), PyLops-MPI optimizes the collabo
 computing nodes, enabling large and intricate tasks to be divided, solved, and aggregated
 in an efficient and parallelized manner.
 
+PyLops-MPI also supports NVIDIA's Collective Communication Library `(NCCL) `_ for high-performance
+GPU-to-GPU communication. PyLops-MPI's NCCL engine works congruently with MPI by delegating GPU-to-GPU communication
+tasks to the highly-optimized NCCL, while leveraging MPI for CPU-side coordination and orchestration.
+
 Get started by :ref:`installing PyLops-MPI ` and following our quick tour.
 
 Terminology
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index 1ba5509e..127acfb3 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -45,6 +45,14 @@ Fork the `PyLops-MPI repository `_ and clo
 We recommend installing dependencies into a separate environment.
 For that end, we provide a `Makefile` with useful commands for setting up the environment.
 
+Enable NVIDIA Collective Communication Library
+=======================================================
+To obtain highly-optimized performance on GPU clusters, PyLops-MPI also supports NVIDIA's Collective Communication Library
+`(NCCL) `_. Two additional dependencies are required: CuPy and NCCL.
+
+* `CuPy with NCCL `_
+
 
 Step-by-step installation for users
 ***********************************
@@ -89,6 +97,12 @@ For a ``conda`` environment, run
 
 This will create and activate an environment called ``pylops_mpi``, with all
 required and optional dependencies.
 
+If you want to enable `NCCL `_ in PyLops-MPI, run this instead:
+
+.. code-block:: bash
+
+   >> make dev-install_conda_nccl
+
 Pip
 ---
 If you prefer a ``pip`` installation, we provide the following command
@@ -100,6 +114,23 @@ If you prefer a ``pip`` installation, we provide the following command
 
 Note that, differently from the ``conda`` command, the above **will not** create a virtual environment.
 Make sure you create and activate your environment previously.
 
+Similarly, if you want to enable `NCCL `_ but prefer using pip,
+you must first check the CUDA version of your system:
+
+.. code-block:: bash
+
+   >> nvidia-smi
+
+The `Makefile` is pre-configured for CUDA 12.x. If this matches your system, run
+
+.. code-block:: bash
+
+   >> make dev-install_nccl
+
+Otherwise, change the command in the `Makefile` to match your CUDA version;
+e.g., for CUDA 11.x, change ``cupy-cuda12x`` and ``nvidia-nccl-cu12`` to ``cupy-cuda11x`` and ``nvidia-nccl-cu11``,
+and then run the command.
+
 Run tests
 =========
 To ensure that everything has been setup correctly, run tests:
@@ -110,6 +141,12 @@ To ensure that everything has been setup correctly, run tests:
 
 Make sure no tests fail, this guarantees that the installation has been successful.
 
+If PyLops-MPI was installed with NCCL, also run the NCCL tests:
+
+.. code-block:: bash
+
+   >> make tests_nccl
+
 Run examples and tutorials
 ==========================
 Since the sphinx-gallery creates examples/tutorials using only a single process, it is highly recommended to test the
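Note for reviewers: the opt-out semantics documented in ``gpu.rst`` (the ``NCCL_PYLOPS_MPI=0`` environment variable plus the optional communicator argument) can be sketched as follows. This is a minimal, hypothetical illustration of the documented behaviour only; the helper ``nccl_enabled`` is an invented name and not part of the PyLops-MPI API:

```python
import os


def nccl_enabled(nccl_comm) -> bool:
    """Decide whether the NCCL backend should be used.

    Hypothetical helper mirroring the documented rule: NCCL is used only
    when a communicator object was passed AND the user has not explicitly
    opted out by setting NCCL_PYLOPS_MPI=0.
    """
    if os.environ.get("NCCL_PYLOPS_MPI", "1") == "0":
        # Explicit opt-out: ignore NCCL even if a communicator was given
        return False
    # Implicit opt-out: no communicator means fall back to pure MPI
    return nccl_comm is not None


# Explicit opt-out via the environment variable
os.environ["NCCL_PYLOPS_MPI"] = "0"
print(nccl_enabled(object()))  # False

# Default: the backend follows whether a communicator was provided
del os.environ["NCCL_PYLOPS_MPI"]
print(nccl_enabled(None))      # False
print(nccl_enabled(object()))  # True
```

This mirrors why the docs call the environment variable optional: skipping the communicator argument already disables NCCL on its own.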