CUDA sanity check is not run on installations of CUDA itself #5084

Description

@casparvl

I was surprised to see we don't run the CUDA sanity check on CUDA itself.

eb --cuda-compute-capabilities=10.0 --accept-eula-for=CUDA CUDA-12.6.0.eb
...
cat <eblog>
== 2025-12-24 16:08:27,790 easyblock.py:4431 DEBUG Skipping CUDA sanity check: CUDA is not in dependencies

even though e.g. $EBROOTCUDA/lib/libcublas.so contains CUDA device code.

The reason is that we only run the sanity check if CUDA is in the dependencies. We should probably extend that condition to also cover the case where CUDA itself is the software being installed.
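
A minimal sketch of what that extended condition could look like, assuming the usual EasyBlock attributes self.name and self.cfg.dependencies() (a hypothetical helper, not the actual easyblock.py code):

def should_run_cuda_sanity_check(self):
    """Run the CUDA sanity check if CUDA is among the dependencies,
    or if CUDA itself is the software being installed (sketch only)."""
    # dependencies() returns dependency dicts with a 'name' key
    dep_names = [dep['name'] for dep in self.cfg.dependencies()]
    return 'CUDA' in dep_names or self.name == 'CUDA'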

While it may seem silly at first to run the sanity check on CUDA itself, it does provide a clear warning mechanism for people trying to run an older CUDA on a newer GPU arch. E.g.

[casparl@tcn78 software-layer]$ cuobjdump /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/lib/libcublas.so | grep sm_100 | wc -l
0
[casparl@tcn78 software-layer]$ cuobjdump /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.8.0/lib/libcublas.so | grep sm_100 | wc -l
195

In other words, it would be great if the CUDA sanity check told you that

eb --cuda-compute-capabilities=10.0 --accept-eula-for=CUDA CUDA-12.6.0.eb

is actually not such a great idea, since 12.6.0 doesn't support 10.0 - information that is surprisingly hard to find. The only place where I found it is the release notes, which state that support for certain archs has been added in a particular version (e.g. the 12.8.0 release notes state that support for 10.0 was added, see https://docs.nvidia.com/cuda/archive/12.8.0/cuda-toolkit-release-notes/index.html#new-features ). That's not very easy to find, and it's thus easy to get things wrong. There is a nice third-party overview table at https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ but even that gets it wrong (it states that CC 10.0 is supported from CUDA 12.6 onwards, which is incorrect).
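
As an aside, the same cuobjdump-based approach can enumerate every device architecture a library was built for, which makes mismatches like the one above easy to spot. A minimal sketch (hypothetical path; assumes cuobjdump is on $PATH and that its output contains 'arch = sm_XY' lines for each embedded fatbin section):

import re
import subprocess

def device_archs(lib_path):
    """Return the set of sm_* architectures embedded in a shared object."""
    out = subprocess.run(['cuobjdump', lib_path],
                         capture_output=True, text=True).stdout
    return set(re.findall(r'arch = (sm_\d+)', out))

# For CUDA 12.6.0's libcublas.so, this set will not contain 'sm_100':
print(sorted(device_archs('/path/to/CUDA/12.6.0/lib/libcublas.so')))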

Note that specifically for CUDA installations, there is another check which could actually be done:

$ which nvcc
~/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/bin/nvcc
$ nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90

But since we have the general CUDA sanity check already, I think it's easier to just enable that one.
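
For reference, if we did want that extra nvcc-based check at some point, a minimal sketch could look like this (hypothetical function, assuming the 'compute_XY' output format shown above):

import subprocess

def nvcc_supported_ccs(nvcc='nvcc'):
    """Return the compute capabilities supported by this nvcc, parsed
    from 'nvcc --list-gpu-arch' output (e.g. 'compute_90' -> '9.0')."""
    out = subprocess.run([nvcc, '--list-gpu-arch'],
                         capture_output=True, text=True).stdout
    ccs = set()
    for token in out.split():
        if token.startswith('compute_'):
            digits = token[len('compute_'):]
            # '90' -> '9.0', '100' -> '10.0'
            ccs.add(digits[:-1] + '.' + digits[-1])
    return ccs

# Flag requested compute capabilities this CUDA version cannot target:
requested = {'10.0'}
missing = requested - nvcc_supported_ccs()
if missing:
    print('unsupported compute capabilities: ' + ', '.join(sorted(missing)))

With CUDA 12.6.0's nvcc on $PATH, this would report 10.0 as unsupported, matching the release-notes finding above.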
