Commit 1e618fa

Allow multiple workers to share a CUDA device, intended for use with MPS mode (#3509)
This change allows, for CUDA devices, the same value of CUDA_VISIBLE_DEVICES to be set for multiple Parsl workers on a node when using the high throughput executor. This lets the user make use of the MPS mode for CUDA devices to partition a GPU and run multiple processes per GPU. To use MPS mode with this functionality, several settings must be set by the user in their config:

* available_accelerators should be set to the total number of GPU processes to be run on the node. For example, for a node with 4 Nvidia GPUs, if you wish to run 4 processes per GPU, available_accelerators should be set to 16.

* worker_init should include commands to start the MPS service and set any associated environment variables. For example, on the ALCF machine Polaris it is recommended that the user make use of a bash script, enable_mps_polaris.sh, that starts the MPS service on a node. worker_init should then contain:

  worker_init='export NNODES=`wc -l < $PBS_NODEFILE`; mpiexec -n ${NNODES} --ppn 1 /path/to/mps/script/enable_mps_polaris.sh'
1 parent 2e8b10e commit 1e618fa
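
As a concrete illustration of the two settings described in the commit message, here is a minimal sketch of a config for the 4-GPU, 4-processes-per-GPU example. The executor label, the choice of PBSProProvider, and its arguments are illustrative assumptions, not part of this commit:

# Sketch: oversubscribing 4 GPUs with 4 MPS processes each (16 workers total).
# The provider choice and label are illustrative assumptions.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import PBSProProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_mps",
            available_accelerators=16,  # 4 GPUs x 4 processes per GPU
            provider=PBSProProvider(
                # Start the MPS control daemon on every node in the block.
                worker_init=(
                    "export NNODES=`wc -l < $PBS_NODEFILE`; "
                    "mpiexec -n ${NNODES} --ppn 1 "
                    "/path/to/mps/script/enable_mps_polaris.sh"
                ),
            ),
        )
    ]
)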

2 files changed: +22 -2 lines

docs/userguide/configuring.rst

Lines changed: 2 additions & 1 deletion
@@ -346,7 +346,8 @@ Provide either the number of executors (Parsl will assume they are named in inte
         strategy='none',
     )
 
-
+For hardware that uses Nvidia devices, Parsl allows for the oversubscription of workers to GPUs. This is intended to make use of Nvidia's `Multi-Process Service (MPS) <https://docs.nvidia.com/deploy/mps/>`_, available on many of their GPUs, which allows users to run multiple concurrent processes on a single GPU. The user needs to include in ``worker_init`` the commands to start MPS on every node in the block (this is machine dependent). The ``available_accelerators`` option should then be set to the total number of GPU partitions run on a single node in the block. For example, for a node with 4 Nvidia GPUs, to create 8 workers per GPU, set ``available_accelerators=32``. GPUs will be assigned to workers in ascending order in contiguous blocks. In this example, workers 0-7 will be placed on GPU 0, workers 8-15 on GPU 1, workers 16-23 on GPU 2, and workers 24-31 on GPU 3.
+
 Multi-Threaded Applications
 ---------------------------
 
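
To make the contiguous-block assignment described in the new documentation paragraph concrete, here is a small illustrative Python sketch (not Parsl code) of how 32 workers map onto 4 GPUs under this scheme:

# Illustrative sketch of the contiguous-block mapping described above:
# 4 GPUs and available_accelerators=32, so 8 workers share each GPU.
available_accelerators = 32
num_gpus = 4
workers_per_gpu = available_accelerators // num_gpus  # 8

for worker_id in range(available_accelerators):
    gpu = worker_id // workers_per_gpu
    print(f"worker {worker_id:2d} -> CUDA_VISIBLE_DEVICES={gpu}")
# workers 0-7 land on GPU 0, 8-15 on GPU 1, 16-23 on GPU 2, 24-31 on GPU 3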

parsl/executors/high_throughput/process_worker_pool.py

Lines changed: 20 additions & 1 deletion
@@ -9,6 +9,7 @@
 import pickle
 import platform
 import queue
+import subprocess
 import sys
 import threading
 import time
@@ -731,9 +732,27 @@ def worker(
         os.sched_setaffinity(0, my_cores)  # type: ignore[attr-defined, unused-ignore]
         logger.info("Set worker CPU affinity to {}".format(my_cores))
 
+    # If CUDA devices, find total number of devices to allow for MPS
+    # See: https://developer.nvidia.com/system-management-interface
+    nvidia_smi_cmd = "nvidia-smi -L > /dev/null && nvidia-smi -L | wc -l"
+    nvidia_smi_ret = subprocess.run(nvidia_smi_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+    if nvidia_smi_ret.returncode == 0:
+        num_cuda_devices = int(nvidia_smi_ret.stdout.split()[0])
+    else:
+        num_cuda_devices = None
+
     # If desired, pin to accelerator
     if accelerator is not None:
-        os.environ["CUDA_VISIBLE_DEVICES"] = accelerator
+        try:
+            if num_cuda_devices is not None:
+                procs_per_cuda_device = pool_size // num_cuda_devices
+                partitioned_accelerator = str(int(accelerator) // procs_per_cuda_device)  # multiple workers will share a GPU
+                os.environ["CUDA_VISIBLE_DEVICES"] = partitioned_accelerator
+                logger.info(f'Pinned worker to partitioned cuda device: {partitioned_accelerator}')
+            else:
+                os.environ["CUDA_VISIBLE_DEVICES"] = accelerator
+        except (TypeError, ValueError, ZeroDivisionError):
+            os.environ["CUDA_VISIBLE_DEVICES"] = accelerator
         os.environ["ROCR_VISIBLE_DEVICES"] = accelerator
         os.environ["ZE_AFFINITY_MASK"] = accelerator
         os.environ["ZE_ENABLE_PCI_ID_DEVICE_ORDER"] = '1'
