Example rocky 9 image doesn't work with Nvidia 570 #69

@brandonbiggs

I took the example Rocky Linux 9 NVIDIA Containerfile, converted it to an Apptainer definition (.def) file, and tried to build it.

Example file: https://github.com/warewulf/warewulf-node-images/blob/main/examples/rockylinux-9-nvidia/Containerfile

Definition file (test.def):

Bootstrap: docker
From: ghcr.io/warewulf/warewulf-rockylinux:9

%post
  dnf -y install dnf-plugins-core epel-release kernel-headers \
    && dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(arch)/cuda-rhel9.repo \
    && dnf -y module install nvidia-driver:latest-dkms \
    && dnf -y install datacenter-gpu-manager \
    && dnf clean all \
    && for dir in /usr/src/kernels/*; do dkms autoinstall --kernelver $(basename $dir); done \
    && dkms status

Build command:

apptainer build test.sif test.def

Error while building:

+ dkms autoinstall --kernelver 5.14.0-503.22.1.el9_5.x86_64
Autoinstall of module nvidia/570.86.15 for kernel 5.14.0-503.22.1.el9_5.x86_64 (x86_64)

Sign command: /lib/modules/5.14.0-503.22.1.el9_5.x86_64/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub

Cleaning build area...(bad exit status: 2)
Failed command:
'make' clean
Building module(s)...(bad exit status: 2)
Failed command:
'make' -j2 modules

Error! Bad return status for module build on kernel: 5.14.0-503.22.1.el9_5.x86_64 (x86_64)
Consult /var/lib/dkms/nvidia/570.86.15/build/make.log for more information.
Autoinstall on 5.14.0-503.22.1.el9_5.x86_64 failed for module(s) nvidia(10).

Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
FATAL:   While performing build: while running engine: while running %post section: exit status 11
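The error points at make.log, but the build environment is discarded when %post fails, so the log is hard to inspect afterwards. A minimal sketch for surfacing it during the build, assuming the logs live under /var/lib/dkms as in the message above (`dump_dkms_logs` is a hypothetical helper name, not part of dkms or Apptainer):

```shell
# Hypothetical helper: print the tail of every DKMS make.log under a base directory.
# The default base directory comes from the path shown in the error message;
# the default of 50 lines is an arbitrary choice.
dump_dkms_logs() {
    base="${1:-/var/lib/dkms}"
    lines="${2:-50}"
    find "$base" -type f -name make.log 2>/dev/null | while IFS= read -r log; do
        echo "==> $log <=="
        tail -n "$lines" "$log"
    done
}

# Possible use inside %post (untested against this image): show the log,
# then still fail the build.
#   dkms autoinstall --kernelver "$kver" || { dump_dkms_logs; exit 1; }
```

With something like this in place, the compiler error behind "bad exit status: 2" should appear in the apptainer build output itself.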

If I pin the NVIDIA driver module stream to 565 instead of latest, the apptainer build finishes successfully. I changed
&& dnf -y module install nvidia-driver:latest-dkms \
to
&& dnf -y module install nvidia-driver:565-dkms \

New output:

Complete!
+ dnf clean all
49 files removed
+ for dir in /usr/src/kernels/*
++ basename /usr/src/kernels/5.14.0-503.22.1.el9_5.x86_64
+ dkms autoinstall --kernelver 5.14.0-503.22.1.el9_5.x86_64
+ dkms status
nvidia/565.57.01, 5.14.0-503.22.1.el9_5.x86_64, x86_64: installed
INFO:    Creating SIF file...
INFO:    Build complete: test.sif
