DSS status incorrectly reports the NVIDIA GPU operator is enabled when microk8s failed to enable the GPU operator #206

@kenvandine

Bug Description

On my laptop, which has an NVIDIA Quadro T2000, the GPU operator consistently fails to enable. However, the microk8s command doesn't fail, and DSS status continues to report that the GPU is enabled.

From microk8s, when requesting logs for the validator pod:

Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-f7f2k" is waiting to start: PodInitializing

To Reproduce

Follow the getting started guide https://documentation.ubuntu.com/data-science-stack/en/latest/

Environment

DSS from latest stable:

latest/stable:    0.1-8742e6d3c0a5450c6dbc4ea3788a 2024-09-10 (36)

microk8s from 1.28 stable:

1.28/stable:           v1.28.15 2024-11-09 (7399) 186MB classic

NOTE: I also tried microk8s v1.31.5 (classic), with the same result.

Relevant Log Output

NAME                                                          READY   STATUS                  RESTARTS        AGE
gpu-feature-discovery-lxsr7                                   0/1     Init:0/1                0               2d20h
gpu-operator-86765669fc-zvbn6                                 1/1     Running                 5 (41h ago)     2d20h
gpu-operator-node-feature-discovery-gc-555ccf7687-h6phm       1/1     Running                 3 (22m ago)     2d20h
gpu-operator-node-feature-discovery-master-68d694564d-d22bg   1/1     Running                 3 (22m ago)     2d20h
gpu-operator-node-feature-discovery-worker-lfcnw              1/1     Running                 3 (41h ago)     2d20h
nvidia-container-toolkit-daemonset-hjlg8                      0/1     Init:CrashLoopBackOff   352 (40s ago)   2d20h
nvidia-dcgm-exporter-9p89v                                    0/1     Init:0/1                0               2d20h
nvidia-device-plugin-daemonset-5bscs                          0/1     Init:0/1                0               2d20h
nvidia-operator-validator-pq4x7                               0/1     Init:0/4                0               2d20h

Additional Context

DSS should detect that the GPU operator isn't functional and report it as such, instead of reporting the GPU as enabled.
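
As a rough illustration of the kind of check meant here, the sketch below parses the default tabular output of `kubectl get pods` (the same format as the log above) and flags pods whose containers are not all ready. This is not how DSS works today and not a proposed patch; the names `PodRow`, `parse_pod_table`, `unhealthy_pods` and `gpu_operator_functional` are hypothetical, and a real implementation would more likely query the Kubernetes API than parse text.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    """One row of `kubectl get pods` output (hypothetical helper type)."""
    name: str
    ready: int   # containers ready, left side of the READY column
    total: int   # containers expected, right side of the READY column
    status: str


def parse_pod_table(text: str) -> List[PodRow]:
    """Parse the default tabular output of `kubectl get pods`."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        ready, total = fields[1].split("/")
        rows.append(PodRow(fields[0], int(ready), int(total), fields[2]))
    return rows


def unhealthy_pods(rows: List[PodRow]) -> List[PodRow]:
    """Return pods that are not fully ready, ignoring completed ones."""
    return [r for r in rows if r.status != "Completed" and r.ready < r.total]


def gpu_operator_functional(rows: List[PodRow]) -> bool:
    """Assumption: the operator counts as functional only if pods exist
    and every non-completed pod is fully ready."""
    return bool(rows) and not unhealthy_pods(rows)


# Three rows copied from the log output in this issue.
SAMPLE = """
NAME                                        READY   STATUS                  RESTARTS        AGE
gpu-operator-86765669fc-zvbn6               1/1     Running                 5 (41h ago)     2d20h
nvidia-container-toolkit-daemonset-hjlg8    0/1     Init:CrashLoopBackOff   352 (40s ago)   2d20h
nvidia-operator-validator-pq4x7             0/1     Init:0/4                0               2d20h
"""

if __name__ == "__main__":
    pods = parse_pod_table(SAMPLE)
    for pod in unhealthy_pods(pods):
        print(f"not ready: {pod.name} ({pod.status})")
    print("GPU operator functional:", gpu_operator_functional(pods))
```

With the pod list from this report, a check along these lines would flag the toolkit daemonset and the validator as not ready, so the status could say that the GPU operator is deployed but not functional.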

Labels: bug
