Conversation
|
I can independently confirm that this works on my Talos cluster. After installing with For repeatability, you will need a container image to be built. I have pushed one to asymingt/k8s-dra-driver-gpu. You will need to modify this line to To optionally rebuild the container image, install docker + qemu-binfmt + buildx, checkout this code and run: |
|
I believe this is set to be fixed on the Talos side, by them installing the nvidia stuff where this expects it to go, not the other way around |
|
While we wait for Talos to update its driver install location, I've been trying to get MPS working on Talos using this PR branch and the following helm values. gpuResourcesEnabledOverride: true
resources:
gpus:
enabled: true
computeDomains:
enabled: false
featureGates:
MPSSupport: trueLooks like the It's probably related to this issue: #469 I've opened a PR to fix it on your branch: hydazz#1 |
|
@hydazz given your comment about Talos adjusting themselves to accommodate the existing search paths, how would you propose moving forward with this PR? |
@klueska I don't have definitive knowledge, I just inferred that conclusion based on:
(I could not find such referenced discussion) siderolabs/extensions#836 I don't know if there is talks between nvidia/talos, or whats outside of linked above, but it could easily be fixed here, just with something better than Perhaps @frezbo would have more insight? |
| func getTalosLibrarySearchPaths() []string { | ||
| return []string{ | ||
| "/driver-root/usr/local/glibc/usr/lib", | ||
| "/driver-root/usr/local/glibc/lib", | ||
| "/driver-root/usr/local/glibc/lib64", | ||
| } | ||
| } |
There was a problem hiding this comment.
@elezar is this something we would want to add directly to nvcdi in the nvidia-container-toolkit as a standard search path?
There was a problem hiding this comment.
I don't think there's a problem in adding this to the toolkit. At the moment the defaults are defined at quite a low level (which is where @hydazz has added them in NVIDIA/nvidia-container-toolkit#1621) and we may want to consider making these easier to specify at a higher level.
It would be nice if these paths are supported on the nvidia side, we're (SideroLabs) is open to using better paths, but we have a constraint that it cannot be standard |
|
@klueska what do you think: can we still do something here for the next release? I feel like we should. But now it's rather tight again. |
Signed-off-by: hydazz <alexanderhyde@icloud.com>
Signed-off-by: hydazz <alexanderhyde@icloud.com>
|
opened PR in the toolkit to remove the overwrite here please review and lmk if any changes are needed, keen to jump onto the DRA train 🙂 |
|
@hydazz @jgehrcke if we're expecting the toolkit to be updated to include this change, I don't think that's something that can be done for the upcoming release. Would a middle ground be adding a configurable option (envvar / config file option) that allows a user to specify the paths in the container to be searched for libraries explicilty? This can be passed to the CDI library on construction and used in the prestart scripts. Note that although NVIDIA/nvidia-container-toolkit#1621 gets us some of the way there, we may need to also update the detection logic to also be able to locate |
|
Thank you, @asymingt, for sharing enough to get started with a PoC! |
|
Talos 1.13 now ships |
Builds a patched version of nvcr.io/nvidia/k8s-dra-driver-gpu that adds /usr/local/glibc/usr/lib and /usr/local/bin to the library/binary search paths, matching upstream NVIDIA/k8s-dra-driver-gpu#695. Renovate will track NVIDIA/k8s-dra-driver-gpu releases to keep the VERSION in sync. Remove this app once PR #695 is released upstream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This is a starter PR to add support for Talos OS's different nvidia paths.
Tested the gpu component with the changes here in my environment and it works.
Feedback is needed, as i'm unsure how to add the
usr/local/glibcpath to CDI nicely, I don't believegetTalosLibrarySearchPathswill cut it globally...