1 change: 1 addition & 0 deletions ansible/roles/cuda/README.md
@@ -10,6 +10,7 @@ Requires OFED to be installed to provide required kernel-* packages.

- `cuda_repo_url`: Optional. URL of `.repo` file. Default is upstream for appropriate OS/architecture.
- `cuda_nvidia_driver_stream`: Optional. Version of `nvidia-driver` stream to enable. This controls whether the open or proprietary drivers are installed and the major version. Changing this once the drivers are installed does not change the version.
- `cuda_nvidia_driver_version`: Optional. Version of `nvidia-driver` module to install.
- `cuda_packages`: Optional. Default: `['cuda', 'nvidia-gds', 'cmake', 'cuda-toolkit-12-8']`.
- `cuda_package_version`: Optional. Default `latest` which will install the latest packages if not installed but won't upgrade already-installed packages. Use `'none'` to skip installing CUDA.
- `cuda_persistenced_state`: Optional. State of systemd `nvidia-persistenced` service. Values as [ansible.builtin.systemd:state](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_module.html#parameter-state). Default `started`.
1 change: 1 addition & 0 deletions ansible/roles/cuda/defaults/main.yml
@@ -1,5 +1,6 @@
cuda_repo_url: "https://developer.download.nvidia.com/compute/cuda/repos/rhel{{ ansible_distribution_major_version }}/{{ ansible_architecture }}/cuda-rhel{{ ansible_distribution_major_version }}.repo"
cuda_nvidia_driver_stream: '570-open'
cuda_nvidia_driver_version: '570.133.20-1'
cuda_package_version: '12.8.1-1'
cuda_version_short: '12.8'
cuda_packages:
21 changes: 13 additions & 8 deletions ansible/roles/cuda/tasks/install.yml
@@ -29,17 +29,22 @@
  when: "'No matching Modules to list' in _cuda_driver_module_enabled.stderr"
  changed_when: "'Nothing to do' not in _cuda_driver_module_enable.stdout"

- name: Check if nvidia driver module is installed
  ansible.builtin.command: dnf module list --installed nvidia-driver
- name: Read module info for list of packages
  ansible.builtin.shell:
    cmd: >-
      dnf module info nvidia-driver:{{ cuda_nvidia_driver_stream }} |
      grep -F {{ cuda_nvidia_driver_version }}.el{{ ansible_distribution_major_version }}.{{ ansible_architecture }}
Collaborator

I think some packages don't have an el9 suffix (they're distro-independent), so does that mean this isn't a complete list?

Collaborator Author

@sjpb sjpb May 29, 2025

Oh yeah, like `nvidia-imex-570-0:570.124.06-1.x86_64`.

So `dnf module info nvidia-driver:570-open` only appears to show packages ending in `.el9.x86_64`, `.el9.noarch` and `.x86_64`. So maybe it's enough to just suffix the version with a `.` when grepping. I was trying to avoid the case where you are after e.g. `570-0:570.124.06-1` and grepping for just that also gets you `570-0:570.124.06-10`.

Edit: I'd missed the fact that it's an el9 repo, so I think this should be OK.

What do you think?
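
The `-1` versus `-10` concern above can be checked with a small sketch. The artifact lines below are made-up sample data, not real `dnf module info` output, and the `-10` release is hypothetical; the only point is how `grep -F` behaves with and without the trailing `.`.

```shell
#!/bin/sh
# Hypothetical sample of artifact lines (illustrative only, not real dnf output).
sample='                 : nvidia-imex-570-0:570.124.06-1.x86_64
                 : nvidia-imex-570-0:570.124.06-10.x86_64
                 : libnvidia-cfg-3:570.124.06-1.el9.x86_64'

version='570.124.06-1'

# Bare version string: the -10 release matches too, because -1 is a prefix of -10.
loose=$(printf '%s\n' "$sample" | grep -cF "$version")

# Version plus a literal dot: only lines for the exact release match.
strict=$(printf '%s\n' "$sample" | grep -cF "${version}.")

echo "loose=$loose strict=$strict"
```

With this sample the bare string matches all three lines, while the dotted form matches only the two `-1` lines, including the one without an `el9` suffix.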

Collaborator

Annoyingly, there are a few i686 packages too (which I guess we don't need).

Collaborator Author

Oh good spot, I'd missed those 🤦

Collaborator

So we might need to filter for x86_64 and noarch on top of what you suggested? 🤔
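
One way this combined filter could look, as a rough sketch rather than the task's actual implementation: match the version with a trailing `.`, keep only lines ending in the target architecture or `noarch`, then strip the leading ` : ` and de-duplicate. The sample lines, the `example-noarch-pkg` name, and the `i686` and `-10` entries are all hypothetical.

```shell
#!/bin/sh
# Hypothetical artifact lines; names, the i686 entry and the -10 release are illustrative.
sample='                 : libnvidia-cfg-3:570.133.20-1.el9.x86_64
                 : libnvidia-cfg-3:570.133.20-1.el9.i686
                 : nvidia-imex-570-0:570.133.20-1.x86_64
                 : example-noarch-pkg-3:570.133.20-1.el9.noarch
                 : libnvidia-cfg-3:570.133.20-10.el9.x86_64'

version='570.133.20-1'
arch='x86_64'

# 1. fixed-string match on "<version>." so that -10 is not picked up
# 2. keep only lines ending in .<arch> or .noarch (this drops i686)
# 3. strip the leading spaces and colon left over from the dnf output
# 4. sort and de-duplicate
packages=$(printf '%s\n' "$sample" |
  grep -F "${version}." |
  grep -E "\.(${arch}|noarch)\$" |
  sed 's/^[ :]*//' |
  sort -u)

printf '%s\n' "$packages"
```

Splitting the match into a fixed-string `grep -F` and a separate anchored `grep -E` avoids having to escape the dots in the version string for a single regular expression.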

Collaborator Author

I guess we can just `| unique` the list?

Collaborator Author

Doh, no: grepping for just the version plus `.` obviously returns the entire package name.

Collaborator

Or the other option is to just install the i686 packages too (not sure how much bloat that adds).

Collaborator Author

Hmm, even with this we get:

Depsolve Error occurred: 
 Problem 1: conflicting requests
  - nothing provides cuda-drivers-570 = 570.133.20 needed by cuda-drivers-fabricmanager-570-570.133.20-1.x86_64 from cuda-rhel9-x86_64
 Problem 2: package cuda-drivers-fabricmanager-570.133.20-1.x86_64 from cuda-rhel9-x86_64 requires cuda-drivers-fabricmanager-570 = 570.133.20, but none of the providers can be installed
  - conflicting requests
  - nothing provides cuda-drivers-570 = 570.133.20 needed by cuda-drivers-fabricmanager-570-570.133.20-1.x86_64 from cuda-rhel9-x86_64

Collaborator

Strange. I do note that one isn't installed with the standard `dnf module install nvidia-driver`.

  changed_when: false
  failed_when: false
  register: _cuda_driver_module_installed
  register: _cuda_driver_module_packages
  # returns a list of lines like ' : libnvidia-cfg-3:570.133.20-1.el9.x86_64'

- name: Install nvidia drivers
  ansible.builtin.command: dnf module install -y nvidia-driver
- name: Install nvidia driver packages
  # it's not possible to install a version of a module
  # apparently this is the best way of approximating that
  # but it is more idempotent than the module install anyway
  ansible.builtin.dnf:
    name: "{{ _cuda_driver_module_packages.stdout_lines | map('trim', ': ') }}"
  register: _cuda_driver_install
  when: "'No matching Modules to list' in _cuda_driver_module_installed.stderr"
  changed_when: "'Nothing to do' not in _cuda_driver_install.stdout"

- name: Check kernel has not been modified
  assert: