diff --git a/Makefile b/Makefile
index 33065808..bb33445f 100644
--- a/Makefile
+++ b/Makefile
@@ -2,7 +2,7 @@ PIP := $(shell command -v pip3 2> /dev/null || command which pip 2> /dev/null)
 PYTHON := $(shell command -v python3 2> /dev/null || command which python 2> /dev/null)
 NUM_PROCESSES = 3
 
-.PHONY: install dev-install install_conda dev-install_conda tests doc docupdate run_examples run_tutorials
+.PHONY: install dev-install dev-install_nccl install_conda install_conda_nccl dev-install_conda dev-install_conda_nccl tests tests_nccl doc docupdate run_examples run_tutorials
 
 pipcheck:
 ifndef PIP
@@ -24,19 +24,29 @@ dev-install:
 	make pipcheck
 	$(PIP) install -r requirements-dev.txt && $(PIP) install -e .
 
+dev-install_nccl:
+	make pipcheck
+	$(PIP) install -r requirements-dev.txt && $(PIP) install cupy-cuda12x nvidia-nccl-cu12 && $(PIP) install -e .
+
 install_conda:
 	conda env create -f environment.yml && conda activate pylops_mpi && pip install .
 
+install_conda_nccl:
+	conda env create -f environment.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install .
+
 dev-install_conda:
 	conda env create -f environment-dev.yml && conda activate pylops_mpi && pip install -e .
 
+dev-install_conda_nccl:
+	conda env create -f environment-dev.yml && conda activate pylops_mpi && conda install -c conda-forge cupy nccl && pip install -e .
+
 lint:
 	flake8 pylops_mpi/ tests/ examples/ tutorials/
 
 tests:
 	mpiexec -n $(NUM_PROCESSES) pytest tests/ --with-mpi
 
-# assuming NUM_PRCESS <= number of gpus available
+# assuming NUM_PROCESSES <= number of GPUs available
 tests_nccl:
 	mpiexec -n $(NUM_PROCESSES) pytest tests_nccl/ --with-mpi
diff --git a/README.md b/README.md
index 4a9a9bfe..81f24fb0 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,10 @@ and running the following command:
 ```
 make install_conda
 ```
+Optionally, if you work in a multi-GPU environment and want to enable NVIDIA's Collective Communication Library (NCCL), install the environment with
+```
+make install_conda_nccl
+```
 
 ## Run Pylops-MPI
 Once you have installed the prerequisites and pylops-mpi, you can run pylops-mpi using the `mpiexec` command.
diff --git a/docs/source/gpu.rst b/docs/source/gpu.rst
index 43c9e768..7afe24aa 100644
--- a/docs/source/gpu.rst
+++ b/docs/source/gpu.rst
@@ -22,6 +22,15 @@ can handle both scenarios. Note that, since most operators in PyLops-mpi are thi
 some of the operators in PyLops that lack a GPU implementation cannot be used also in PyLops-mpi
 when working with cupy arrays.
 
+Moreover, PyLops-MPI also supports NVIDIA's Collective Communication Library (NCCL) for highly-optimized
+collective operations, such as AllReduce, AllGather, etc. This allows PyLops-MPI users to leverage
+proprietary technologies such as NVLink, when available in their infrastructure, for fast data communication.
+
+.. note::
+
+   Set the environment variable ``NCCL_PYLOPS_MPI=0`` to explicitly force PyLops-MPI to ignore the NCCL backend.
+   This is however optional, as users can also opt out of NCCL simply by not passing a
+   ``cupy.cuda.nccl.NcclCommunicator`` to the :class:`pylops_mpi.DistributedArray`.
 
 Example
 -------
@@ -79,7 +88,69 @@ your GPU:
 
 The code is almost unchanged apart from the fact that we now use ``cupy`` arrays,
 PyLops-mpi will figure this out!
+Finally, if NCCL is available, a ``cupy.cuda.nccl.NcclCommunicator`` can be initialized and passed to
+:class:`pylops_mpi.DistributedArray` as follows:
+
+.. code-block:: python
+
+   import numpy as np
+   import cupy as cp
+   import pylops
+   import pylops_mpi
+
+   from pylops_mpi.utils._nccl import initialize_nccl_comm
+
+   # Initialize NCCL communicator
+   nccl_comm = initialize_nccl_comm()
+
+   # Create distributed data (broadcast)
+   nxl, nt = 20, 20
+   dtype = np.float32
+   d_dist = pylops_mpi.DistributedArray(global_shape=nxl * nt,
+                                        base_comm_nccl=nccl_comm,
+                                        partition=pylops_mpi.Partition.BROADCAST,
+                                        engine="cupy", dtype=dtype)
+   d_dist[:] = cp.ones(d_dist.local_shape, dtype=dtype)
+
+   # Create and apply VStack operator
+   Sop = pylops.MatrixMult(cp.ones((nxl, nxl)), otherdims=(nt, ))
+   HOp = pylops_mpi.MPIVStack(ops=[Sop, ])
+   y_dist = HOp @ d_dist
+
+Under the hood, PyLops-MPI uses both the MPI and NCCL communicators to manage distributed operations: each GPU is
+logically bound to one MPI process. Minor communications, such as those exchanging array-related shapes and sizes,
+are still performed with MPI, while collective calls on arrays, such as AllReduce, are carried out through NCCL.
 
 .. note::
 
-   The CuPy backend is in active development, with many examples not yet in the docs.
-   You can find many `other examples `_ from the `PyLops Notebooks repository `_.
\ No newline at end of file
+   The CuPy and NCCL backends are in active development, with many examples not yet in the docs.
+   You can find many `other examples `_ from the `PyLops Notebooks repository `_.
+
+Support for the NCCL Backend
+----------------------------
+In the following, we provide a list of modules (i.e., operators and solvers) for which we plan to support NCCL, together with their current status:
+
+.. list-table::
+   :widths: 50 25
+   :header-rows: 1
+
+   * - Module
+     - NCCL supported
+   * - :class:`pylops_mpi.DistributedArray`
+     - ✓
+   * - :class:`pylops_mpi.basicoperators.MPIVStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIHStack`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIBlockDiag`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIGradient`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPIFirstDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPISecondDerivative`
+     - Ongoing
+   * - :class:`pylops_mpi.basicoperators.MPILaplacian`
+     - Ongoing
+   * - :class:`pylops_mpi.optimization.basic.cg`
+     - Ongoing
+   * - :class:`pylops_mpi.optimization.basic.cgls`
+     - Ongoing
+   * - ISTA Solver
+     - Planned
+   * - Complex Numeric Data Type for NCCL
+     - Planned
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index b5d538ee..e044cd9b 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -14,6 +14,10 @@ By integrating MPI (Message Passing Interface), PyLops-MPI optimizes the collabo
 computing nodes, enabling large and intricate tasks to be divided, solved, and aggregated
 in an efficient and parallelized manner.
 
+PyLops-MPI also supports NVIDIA's Collective Communication Library `(NCCL) `_ for high-performance
+GPU-to-GPU communication. PyLops-MPI's NCCL engine works congruently with MPI by delegating GPU-to-GPU communication
+tasks to the highly-optimized NCCL, while leveraging MPI for CPU-side coordination and orchestration.
+
 Get started by :ref:`installing PyLops-MPI ` and following our quick tour.
 
 Terminology
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
index 1ba5509e..127acfb3 100644
--- a/docs/source/installation.rst
+++ b/docs/source/installation.rst
@@ -45,6 +45,14 @@ Fork the `PyLops-MPI repository `_ and clo
 We recommend installing dependencies into a separate environment.
 For that end, we provide a `Makefile` with useful commands for setting up the environment.
 
+Enable NVIDIA Collective Communication Library
+=======================================================
+To obtain highly-optimized performance on GPU clusters, PyLops-MPI also supports NVIDIA's Collective Communication Library
+`(NCCL) `_. Two additional dependencies are required: CuPy and NCCL.
+
+* `CuPy with NCCL `_
+
 
 Step-by-step installation for users
 ***********************************
@@ -89,6 +97,12 @@ For a ``conda`` environment, run
 
 This will create and activate an environment called ``pylops_mpi``, with all
 required and optional dependencies.
 
+If you want to enable `NCCL `_ in PyLops-MPI, run this instead:
+
+.. code-block:: bash
+
+   >> make dev-install_conda_nccl
+
 Pip
 ---
 If you prefer a ``pip`` installation, we provide the following command
@@ -100,6 +114,23 @@ If you prefer a ``pip`` installation, we provide the following command
 
 Note that, differently from the ``conda`` command, the above **will not** create a virtual environment.
 Make sure you create and activate your environment previously.
 
+Similarly, if you want to enable `NCCL `_ but prefer using pip,
+you must first check the CUDA version of your system:
+
+.. code-block:: bash
+
+   >> nvidia-smi
+
+The `Makefile` is pre-configured for CUDA 12.x. If this matches your system, run
+
+.. code-block:: bash
+
+   >> make dev-install_nccl
+
+Otherwise, change the command in the `Makefile` to match your CUDA version;
+e.g., for CUDA 11.x, change ``cupy-cuda12x`` and ``nvidia-nccl-cu12`` to ``cupy-cuda11x`` and ``nvidia-nccl-cu11``,
+and then run the command.
+
 Run tests
 =========
 To ensure that everything has been setup correctly, run tests:
@@ -110,6 +141,12 @@ To ensure that everything has been setup correctly, run tests:
 
 Make sure no tests fail, this guarantees that the installation has been successful.
 
+If PyLops-MPI was installed with NCCL, also run the NCCL tests:
+
+.. code-block:: bash
+
+   >> make tests_nccl
+
 Run examples and tutorials
 ==========================
 Since the sphinx-gallery creates examples/tutorials using only a single process, it is highly recommended to test the
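Note for reviewers: the opt-out semantics documented in ``gpu.rst`` (the ``NCCL_PYLOPS_MPI=0`` environment variable plus the optional communicator argument) can be sketched as follows. This is a minimal, hypothetical illustration of the documented behaviour only; the helper ``nccl_enabled`` is an invented name and not part of the PyLops-MPI API:

```python
import os


def nccl_enabled(nccl_comm) -> bool:
    """Decide whether the NCCL backend should be used.

    Hypothetical helper mirroring the documented rule: NCCL is used only
    when a communicator object was passed AND the user has not explicitly
    opted out by setting NCCL_PYLOPS_MPI=0.
    """
    if os.environ.get("NCCL_PYLOPS_MPI", "1") == "0":
        # Explicit opt-out: ignore NCCL even if a communicator was given
        return False
    # Implicit opt-out: no communicator means fall back to pure MPI
    return nccl_comm is not None


# Explicit opt-out via the environment variable
os.environ["NCCL_PYLOPS_MPI"] = "0"
print(nccl_enabled(object()))  # False

# Default: the backend follows whether a communicator was provided
del os.environ["NCCL_PYLOPS_MPI"]
print(nccl_enabled(None))      # False
print(nccl_enabled(object()))  # True
```

This mirrors why the docs call the environment variable optional: skipping the communicator argument already disables NCCL on its own.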