Upgrade the NVIDIA GPU driver on a Slurm cluster managed with AWS ParallelCluster
An AWS ParallelCluster release comes with a set of AMIs for the supported operating systems and EC2 platforms. Each AMI contains a software stack, including the NVIDIA Drivers, that has been validated at ParallelCluster release time.
Other versions of the NVIDIA Drivers are likely to work with the rest of the software stack, but technical support for them will be limited.
If you wish to upgrade the NVIDIA GPU driver on your cluster, you can follow this guide.
To upgrade the NVIDIA GPU driver and CUDA version, it is recommended to create a new custom AMI containing the new versions via the pcluster build-image command.
After the custom AMI has been built successfully, you can use it for a new cluster, or update the compute nodes of a running cluster by setting the Scheduling/SlurmQueues/Queue/Image/CustomAmi cluster configuration parameter and running the pcluster update-cluster command.
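As a minimal sketch, assuming a GPU queue named gpu-queue and a placeholder AMI ID (both hypothetical), the relevant part of the cluster configuration could look like this:
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      Image:
        CustomAmi: ami-0123456789abcdef0  # custom AMI built with the upgraded NVIDIA driver and CUDA
      ComputeResources:
        - Name: gpu-nodes
          InstanceType: g4dn.xlarge
          MinCount: 0
          MaxCount: 4
The change can then be applied with pcluster update-cluster --cluster-name <cluster-name> --cluster-configuration <cluster-config-file>. Note that changing the AMI of a queue may require stopping the compute fleet or configuring a queue update strategy, depending on your ParallelCluster version.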
Once the update is applied and the compute nodes have started with the new custom AMI, verify that the new driver version is installed by running the nvidia-smi command.
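For example, assuming a GPU queue named gpu-queue (hypothetical name), the check can be launched from the head node through Slurm:
# Run nvidia-smi on one compute node of the GPU queue and report the driver version
srun --partition=gpu-queue --nodes=1 nvidia-smi --query-gpu=driver_version,name --format=csv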
To build the custom AMI, you need to add to the image configuration file a custom component that upgrades both the NVIDIA driver and CUDA versions. Here is a configuration snippet referencing the custom component:
Image:
  # Due to the large size of files, make sure to have a large enough root volume size.
  RootVolume:
    Size: 50
Build:
  InstanceType: g4dn.xlarge # instance type with NVIDIA GPUs
  ParentImage: ami-04823729c75214919 # base AMI of your desired OS, e.g. alinux2
  Components:
    - Type: arn
      Value: arn:{{PARTITION}}:imagebuilder:{{REGION}}:{{ACCOUNT_ID}}:component/nvidiacudainstall/1.0.0/1
The following document should be used for your custom component (the NVIDIA driver version, CUDA version, and architecture can be adapted to your needs):
name: NvidiaAndCudaInstall
description: Install the NVIDIA driver and the CUDA toolkit
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: InstallNvidia
        action: ExecuteBash
        inputs:
          commands:
            - |
              #!/bin/bash
              set -ex
              NVIDIA_DRIVER_VERSION="580.95.05"
              ARCH="x86_64"
              # Create temporary directory
              TMP_DIR="/pcluster-tmp/$(date +"%Y-%m-%dT%H-%M-%S")"
              # Use the system GCC to build the kernel modules
              COMPILER_PATH="/usr/bin/gcc"
              export CC="${COMPILER_PATH}"
              # Download the driver runfile and install it silently with DKMS and the open kernel modules
              NVIDIA_RUNFILE="NVIDIA-Linux-${ARCH}-${NVIDIA_DRIVER_VERSION}.run"
              wget -P "${TMP_DIR}" "https://us.download.nvidia.com/tesla/${NVIDIA_DRIVER_VERSION}/${NVIDIA_RUNFILE}"
              chmod +x "${TMP_DIR}/${NVIDIA_RUNFILE}"
              "${TMP_DIR}/${NVIDIA_RUNFILE}" --silent --dkms --disable-nouveau -m="kernel-open"
              # Cleanup
              rm -rf "${TMP_DIR}"
      - name: InstallCuda
        action: ExecuteBash
        inputs:
          commands:
            - |
              #!/bin/bash
              set -ex
              CUDA_VERSION="13.0.2"
              CUDA_SAMPLES_VERSION="13.0"
              CUDA_RELEASE_NVIDIA_VERSION="580.95.05"
              # Create temporary directory
              TMP_DIR="/pcluster-tmp/$(date +"%Y-%m-%dT%H-%M-%S")"
              # Download the CUDA runfile and install the toolkit and samples
              CUDA_RUNFILE="cuda_${CUDA_VERSION}_${CUDA_RELEASE_NVIDIA_VERSION}_linux.run"
              wget -P "${TMP_DIR}" "https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION}/local_installers/${CUDA_RUNFILE}"
              chmod +x "${TMP_DIR}/${CUDA_RUNFILE}"
              CUDA_TMP_INSTALL_DIR="${TMP_DIR}/cuda-install"
              mkdir -p "${CUDA_TMP_INSTALL_DIR}"
              "${TMP_DIR}/${CUDA_RUNFILE}" --silent --toolkit --samples --tmpdir="${CUDA_TMP_INSTALL_DIR}"
              # Download and extract the CUDA samples
              CUDA_SAMPLES_ARCHIVE="v${CUDA_SAMPLES_VERSION}.tar.gz"
              wget -P "${TMP_DIR}" "https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v${CUDA_SAMPLES_VERSION}.tar.gz"
              tar xf "${TMP_DIR}/${CUDA_SAMPLES_ARCHIVE}" --directory "/usr/local/"
              # Cleanup
              rm -rf "${TMP_DIR}"
              ## Add CUDA to PATH
              CUDA_PATH="/usr/local/cuda"
              echo "export PATH=${CUDA_PATH}/bin:\${PATH}" > /etc/profile.d/pcluster_cuda.sh
              echo "export LD_LIBRARY_PATH=${CUDA_PATH}/lib64:\${LD_LIBRARY_PATH}" >> /etc/profile.d/pcluster_cuda.sh
              chmod +x /etc/profile.d/pcluster_cuda.sh
      - name: Validation
        action: ExecuteBash
        inputs:
          commands:
            - |
              #!/bin/bash
              set -ex
              ## Validation
              source /etc/profile.d/pcluster_cuda.sh
              ls -l /usr/local
              which nvcc
              nvcc --version
              which nvidia-smi
              nvidia-smi
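The configuration snippet above references the custom component by ARN, so the component document must first be registered in EC2 Image Builder before launching the build. A minimal sketch, assuming the component document is saved as component.yaml and the image configuration as image-config.yaml (both hypothetical file names):
# Register the component document in EC2 Image Builder (the name and version must match the ARN used in the image configuration)
aws imagebuilder create-component \
    --name nvidiacudainstall \
    --semantic-version 1.0.0 \
    --platform Linux \
    --data file://component.yaml
# Build the custom AMI from the image configuration
pcluster build-image --image-id nvidia-driver-upgrade --image-configuration image-config.yaml
# Monitor the build until the image reaches the BUILD_COMPLETE state
pcluster describe-image --image-id nvidia-driver-upgrade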