Description
I was surprised to see we don't run the CUDA sanity check on CUDA itself.
eb --cuda-compute-capabilities=10.0 --accept-eula-for=CUDA CUDA-12.6.0.eb
...
cat <eblog>
== 2025-12-24 16:08:27,790 easyblock.py:4431 DEBUG Skipping CUDA sanity check: CUDA is not in dependencies
So the check is skipped, even though e.g. $EBROOTCUDA/lib/libcublas.so contains CUDA device code.
The reason is that we only run the sanity check if CUDA is in the dependencies. We should probably extend that to also cover the case where CUDA itself is the software being installed.
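A minimal sketch of what the extended condition could look like, assuming the check is factored out into a helper on the easyblock and that `get_software_root` is used to detect CUDA among the dependencies; the method name is hypothetical, this is not the actual easyblock.py code:

```python
# Hypothetical sketch, not the actual easyblock.py code: extend the skip
# condition so the CUDA sanity check also runs when installing CUDA itself.
from easybuild.tools.modules import get_software_root

def should_run_cuda_sanity_check(self):
    """Run the CUDA sanity check if CUDA is a dependency or the software being installed."""
    cuda_in_deps = bool(get_software_root('CUDA'))
    installing_cuda = self.name == 'CUDA'  # proposed addition
    return cuda_in_deps or installing_cuda
```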
While it may seem silly at first to run the sanity check on CUDA itself, it does provide a clear warning mechanism for people trying to run an older CUDA on a newer GPU arch. E.g.
[casparl@tcn78 software-layer]$ cuobjdump /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/lib/libcublas.so | grep sm_100 | wc -l
0
[casparl@tcn78 software-layer]$ cuobjdump /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.8.0/lib/libcublas.so | grep sm_100 | wc -l
195
I.e. it'd be great if the CUDA sanity check would tell you that
eb --cuda-compute-capabilities=10.0 --accept-eula-for=CUDA CUDA-12.6.0.eb
is actually not such a great idea, since 12.6.0 doesn't support CC 10.0 - information that is surprisingly hard to find. The only place where I found it is in the release notes, which state that support for certain archs has been added in a particular version (e.g. the 12.8.0 notes state that support for 10.0 was added, see https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html#new-features ). That's not very easy to find, and it's thus easy to get things wrong. There is a nice third-party overview table at https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ but even that got it wrong (it states that CC 10.0 is supported from CUDA 12.6 onwards).
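For reference, the manual cuobjdump check above could be automated along these lines; this is only an illustrative sketch (the path and helper name are made up), not an existing EasyBuild function:

```python
# Illustrative sketch: check whether a CUDA library ships device code for a
# given compute capability by scanning cuobjdump output (requires cuobjdump in $PATH).
import subprocess

def has_device_code(lib_path, compute_capability):
    """Return True if cuobjdump reports sm_XY code for the given compute capability."""
    target = 'sm_' + compute_capability.replace('.', '')  # e.g. '10.0' -> 'sm_100'
    result = subprocess.run(['cuobjdump', lib_path], capture_output=True, text=True)
    return target in result.stdout

# With CUDA 12.6.0 this should print False for CC 10.0, matching the output above.
print(has_device_code('/path/to/CUDA/12.6.0/lib/libcublas.so', '10.0'))
```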
Note that specifically for CUDA installations, there is another check which could actually be done:
$ which nvcc
~/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/bin/nvcc
$ nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
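A sketch of how that nvcc-based check could be expressed, assuming the requested compute capabilities are available as a list of strings like '10.0'; the function name is illustrative only:

```python
# Illustrative sketch: compare the archs reported by `nvcc --list-gpu-arch`
# with the requested CUDA compute capabilities (requires nvcc in $PATH).
import subprocess

def unsupported_compute_capabilities(requested_ccs):
    """Return the requested compute capabilities that this nvcc does not support."""
    result = subprocess.run(['nvcc', '--list-gpu-arch'], capture_output=True, text=True)
    supported = {line.strip().replace('compute_', '') for line in result.stdout.splitlines() if line.strip()}
    # '10.0' -> '100', '8.6' -> '86'
    return [cc for cc in requested_ccs if cc.replace('.', '') not in supported]

# With CUDA 12.6.0 this should report ['10.0'] as unsupported.
print(unsupported_compute_capabilities(['10.0']))
```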
But since we already have the general CUDA sanity check, I think it's easier to just enable that one for CUDA itself.