fix: add talos support#695

Open
hydazz wants to merge 2 commits into NVIDIA:main from hydazz:main

Conversation

@hydazz

@hydazz hydazz commented Oct 19, 2025

This is a starter PR to add support for Talos OS's different nvidia paths.

Tested the gpu component with the changes here in my environment and it works.

Feedback is needed: I'm unsure how to add the usr/local/glibc path to CDI cleanly, and I don't believe getTalosLibrarySearchPaths will cut it globally...

@copy-pr-bot

copy-pr-bot bot commented Oct 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@asymingt

I can independently confirm that this works on my Talos cluster. After installing with helm I finally see resource slices being made available on a machine with five RTX A4000 GPUs. Thank you, @hydazz!

$ kubectl get resourceslices
NAME                                 NODE            DRIVER           POOL            AGE
talos-pxs-ia1-gpu.nvidia.com-gsvft   talos-pxs-ia1   gpu.nvidia.com   talos-pxs-ia1   11m

For repeatability, you will need a container image to be built. I have pushed one to asymingt/k8s-dra-driver-gpu. You will need to modify this line to asymingt/k8s-dra-driver-gpu:v25.8.0-dev before installing the chart this way:

helm upgrade -i nvidia-dra-driver-gpu ./k8s-dra-driver-gpu/deployments/helm/nvidia-dra-driver-gpu \
   --create-namespace --namespace drivers \
   --set gpuResourcesEnabledOverride=true \
   --set resources.gpus.enabled=true \
   --set resources.computeDomains.enabled=false \
   --wait

To optionally rebuild the container image, install docker + qemu-binfmt + buildx, check out this code and run:

export IMAGE_NAME=<your_docker_hub>/k8s-dra-driver-gpu
export VERSION=v25.8.0-dev
export PUSH_ON_BUILD=true
export BUILD_MULTI_ARCH_IMAGES=true

make -f deployments/container/Makefile build

@hydazz
Author

hydazz commented Nov 15, 2025

I believe this is set to be fixed on the Talos side, by them installing the nvidia stuff where this expects it to go, not the other way around

@asymingt

asymingt commented Nov 16, 2025

While we wait for Talos to update its driver install location, I've been trying to get MPS working on Talos using this PR branch and the following helm values.

gpuResourcesEnabledOverride: true
resources:
  gpus:
    enabled: true
  computeDomains:
    enabled: false
featureGates:
  MPSSupport: true

Looks like the mps-control-daemon keeps restarting with the following error:

$ k logs mps-control-daemon-49e0f7b0-e884-4ab0-ac35-b50bca50f681-e4dlqgl -n drivers
chroot: can't execute 'sh': No such file or directory

It's probably related to this issue: #469

I've opened a PR to fix it on your branch: hydazz#1

@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Nov 24, 2025
@klueska klueska added this to the unscheduled milestone Nov 24, 2025
@klueska
Collaborator

klueska commented Nov 24, 2025

@hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR?

@hydazz
Author

hydazz commented Nov 25, 2025

> @hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR?

@klueska I don't have definitive knowledge, I just inferred that conclusion based on:
https://discord.com/channels/673534664354430999/942576972943491113/1434096797562703983

> if we could move this to a github discussion under extensions repo, we could collaborate more (I believe /usr/local/glibc/usr/lib was a wrong choice of path from first place [thinking about merged /usr/ though it was kind of done to fix issues with musl libs co-existing]), but let's do a discussion on a good path moving forward and trying to use the operator and dra plugins as much as possible without platform specific hacks

(I could not find such referenced discussion)

siderolabs/extensions#836
siderolabs/extensions#476
#605

I don't know if there are talks between NVIDIA and Talos, or what exists beyond the links above, but it could easily be fixed here, just with something better than getTalosLibrarySearchPaths (pretty easily?), or on the extension side, but that's probably a larger change.

Perhaps @frezbo would have more insight?

Comment on lines +49 to +55
func getTalosLibrarySearchPaths() []string {
	return []string{
		"/driver-root/usr/local/glibc/usr/lib",
		"/driver-root/usr/local/glibc/lib",
		"/driver-root/usr/local/glibc/lib64",
	}
}
Collaborator

@elezar is this something we would want to add directly to nvcdi in the nvidia-container-toolkit as a standard search path?

Member

I don't think there's a problem in adding this to the toolkit. At the moment the defaults are defined at quite a low level (which is where @hydazz has added them in NVIDIA/nvidia-container-toolkit#1621) and we may want to consider making these easier to specify at a higher level.
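
To make the discussion above concrete, here is a minimal sketch of how a combined search list could be walked to locate a driver library. It is not the toolkit's actual implementation: `findLibrary`, `fileExists`, and the contents of `defaultLibraryDirs` are hypothetical names and values chosen for illustration. Only the three Talos directories come from the snippet under review.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defaultLibraryDirs is an illustrative placeholder for the existing defaults.
// The real default list lives in nvidia-container-toolkit and may differ.
var defaultLibraryDirs = []string{
	"/driver-root/usr/lib64",
	"/driver-root/usr/lib/x86_64-linux-gnu",
}

// talosLibraryDirs mirrors the paths returned by getTalosLibrarySearchPaths in this PR.
var talosLibraryDirs = []string{
	"/driver-root/usr/local/glibc/usr/lib",
	"/driver-root/usr/local/glibc/lib",
	"/driver-root/usr/local/glibc/lib64",
}

// findLibrary returns the first dir/name candidate for which exists reports true.
// The exists callback is injected so the lookup can be exercised without a real filesystem.
func findLibrary(name string, dirs []string, exists func(string) bool) (string, bool) {
	for _, dir := range dirs {
		candidate := filepath.Join(dir, name)
		if exists(candidate) {
			return candidate, true
		}
	}
	return "", false
}

// fileExists reports whether path refers to an existing non-directory entry.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && !info.IsDir()
}

func main() {
	// Defaults are searched first, then the Talos-specific directories.
	dirs := append(append([]string{}, defaultLibraryDirs...), talosLibraryDirs...)
	if path, ok := findLibrary("libnvidia-ml.so.1", dirs, fileExists); ok {
		fmt.Println("found:", path)
	} else {
		fmt.Println("libnvidia-ml.so.1 not found in", dirs)
	}
}
```

Keeping the Talos directories in a separate slice, rather than hard-coding them into the lookup, is what would let a higher layer decide whether and where to append them.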

@frezbo

frezbo commented Dec 2, 2025

> > @hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR?

> @klueska I don't have definitive knowledge, I just inferred that conclusion based on: https://discord.com/channels/673534664354430999/942576972943491113/1434096797562703983

> > if we could move this to a github discussion under extensions repo, we could collaborate more (I believe /usr/local/glibc/usr/lib was a wrong choice of path from first place [thinking about merged /usr/ though it was kind of done to fix issues with musl libs co-existing]), but let's do a discussion on a good path moving forward and trying to use the operator and dra plugins as much as possible without platform specific hacks

> (I could not find such referenced discussion)

> siderolabs/extensions#836 siderolabs/extensions#476 #605

> I don't know if there are talks between NVIDIA and Talos, or what exists beyond the links above, but it could easily be fixed here, just with something better than getTalosLibrarySearchPaths (pretty easily?), or on the extension side, but that's probably a larger change.

> Perhaps @frezbo would have more insight?

It would be nice if these paths were supported on the NVIDIA side. We (SideroLabs) are open to using better paths, but we have a constraint: it cannot be the standard /usr/lib, since the Talos base is musl, so glibc and musl libs need to exist at different paths. It could just as well be usr/local/nvidia/lib, for example.

@jgehrcke
Collaborator

@klueska what do you think: can we still do something here for the next release? I feel like we should. But now it's rather tight again.

Signed-off-by: hydazz <alexanderhyde@icloud.com>
Signed-off-by: hydazz <alexanderhyde@icloud.com>
@hydazz
Author

hydazz commented Jan 30, 2026

opened PR in the toolkit to remove the overwrite here
NVIDIA/nvidia-container-toolkit#1621

please review and lmk if any changes are needed, keen to jump onto the DRA train 🙂

@elezar
Member

elezar commented Jan 30, 2026

@hydazz @jgehrcke if we're expecting the toolkit to be updated to include this change, I don't think that's something that can be done for the upcoming release.

Would a middle ground be adding a configurable option (envvar / config file option) that allows a user to explicitly specify the paths in the container to be searched for libraries? This can be passed to the CDI library on construction and used in the prestart scripts.

Note that although NVIDIA/nvidia-container-toolkit#1621 gets us some of the way there, we may also need to update the detection logic to be able to locate nvidia-smi at the non-standard location.
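
As a rough illustration of the envvar idea, the sketch below builds a search list from a colon-separated value plus a set of defaults, all rooted under the driver root. Everything here is an assumption made for the example: the variable name `DRA_EXTRA_LIBRARY_SEARCH_PATHS`, the helper `librarySearchPaths`, and the default directory are hypothetical and do not exist in the driver or the toolkit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// extraPathsEnvVar is a hypothetical variable name used only for this sketch.
const extraPathsEnvVar = "DRA_EXTRA_LIBRARY_SEARCH_PATHS"

// librarySearchPaths returns the user-supplied directories (colon-separated in
// extra) followed by the defaults, each joined onto driverRoot. Empty entries
// are skipped and duplicates are dropped, keeping the first occurrence.
func librarySearchPaths(driverRoot, extra string, defaults []string) []string {
	seen := make(map[string]bool)
	var paths []string
	add := func(p string) {
		if !seen[p] {
			seen[p] = true
			paths = append(paths, p)
		}
	}
	for _, p := range strings.Split(extra, ":") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		add(filepath.Join(driverRoot, p))
	}
	for _, p := range defaults {
		add(filepath.Join(driverRoot, p))
	}
	return paths
}

func main() {
	// The default below is illustrative, not the toolkit's real default list.
	defaults := []string{"/usr/lib64"}
	paths := librarySearchPaths("/driver-root", os.Getenv(extraPathsEnvVar), defaults)
	fmt.Println(paths)
}
```

On Talos, a user could then set the variable to something like `/usr/local/glibc/usr/lib:/usr/local/glibc/lib` without any platform-specific code in the driver. Placing user paths ahead of the defaults is a design choice of this sketch, not something decided in the thread.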

@jpsalvesen

Thank you, @asymingt, for sharing enough to get started with a PoC!

@frezbo

frezbo commented Mar 10, 2026

Talos 1.13 now ships /etc/ld.* files, so this might not be a problem anymore; the GPU operator works now.
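
For readers unfamiliar with why those files matter: assuming the shipped files include an ld.so.conf-style text file, a tool can derive the library directories from it instead of hard-coding them. The sketch below is a simplified parser (one directory per line, `#` comments, `include` directives collected but not expanded). The function name `parseLdSoConf` and the sample contents are hypothetical, and this is not how the toolkit or Talos necessarily handles it.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseLdSoConf extracts library directories and include patterns from
// ld.so.conf-style text. It is deliberately simplified: each non-comment line
// is treated as a single directory, and include patterns are returned
// unexpanded for the caller to resolve.
func parseLdSoConf(content string) (dirs, includes []string) {
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := scanner.Text()
		// Drop everything after a comment marker.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		fields := strings.Fields(line)
		if fields[0] == "include" {
			includes = append(includes, fields[1:]...)
			continue
		}
		dirs = append(dirs, line)
	}
	return dirs, includes
}

func main() {
	// Hypothetical file contents; the actual files shipped by Talos may differ.
	sample := "# glibc libraries\n" +
		"include /etc/ld.so.conf.d/*.conf\n" +
		"/usr/local/glibc/usr/lib\n" +
		"\n" +
		"/usr/local/glibc/lib # trailing comment\n"
	dirs, includes := parseLdSoConf(sample)
	fmt.Println("directories:", dirs)
	fmt.Println("includes:", includes)
}
```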

@rajatchopra

@hydazz do we need this addendum to the search list based on this comment? Close the PR?

We may still want the cdiRoot as an overriding parameter - we need a real example though.

@hydazz
Author

hydazz commented Mar 17, 2026

> @hydazz do we need this addendum to the search list based on this comment? Close the PR?

> We may still want the cdiRoot as an overriding parameter - we need a real example though.

Will play around with 1.13, doubt it'll just work but will confirm minimum changes needed here to get it to work.

arsac added a commit to arsac/containers that referenced this pull request Mar 18, 2026
Builds a patched version of nvcr.io/nvidia/k8s-dra-driver-gpu that adds
/usr/local/glibc/usr/lib and /usr/local/bin to the library/binary search
paths, matching upstream NVIDIA/k8s-dra-driver-gpu#695.

Renovate will track NVIDIA/k8s-dra-driver-gpu releases to keep the VERSION
in sync. Remove this app once PR #695 is released upstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

feature issue/PR that proposes a new feature or functionality

Projects

Status: Backlog

8 participants