Does anyone have a deployment with GPU and nvidia-smi working on AWS? #3073
-
I'm having trouble finding nvidia-smi on a GPU node group on AWS, which seems counter to the documentation. Does anyone have a config that works with the latest Nebari release? Also, relatedly, the daemonset image being used is quite old and has been discontinued on Docker Hub. Has there been any attempt to use a more recent nvidia-device-plugin (e.g., nvcr.io/nvidia/k8s-device-plugin:v0.17.2)?
-
Hey @satra, thanks for bringing this up! I was about to share a working configuration with you, but after testing it again I realized it no longer works as expected. I’ve opened an issue to track it, as well as a PR with a fix to address it: #3075. The PR includes detailed testing instructions, and following them should get you to an AWS deployment with working GPU support. For reference, here's the relevant configuration:

```yaml
# ...
amazon_web_services:
  # ...
  node_groups:
    # ...
    gpu-t4-x1:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      gpu: true
profiles:
  jupyterlab:
    - display_name: T4 GPU Instance
      description: Stable environment with 4 cpu / 16 GB ram and 1 T4 GPU / 16 GB GPU memory
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_resource_limits:
          nvidia.com/gpu: 1
        image: quay.io/nebari/nebari-jupyterlab-gpu:2025.6.1
        node_selector:
          "dedicated": "gpu-t4-x1"
```

You can deploy Nebari from this branch to try it out. This fix will be included in the next release.
I'll go ahead and open an issue for this. I think it would certainly be beneficial to upgrade the image being used.
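For reference, NVIDIA publishes a static DaemonSet manifest for the newer plugin; a trimmed sketch pointed at the v0.17.2 image mentioned above is below. This is not Nebari's actual manifest (that lives in Nebari's deployment layer), just an illustration of what the upgrade roughly looks like; the field names follow NVIDIA's documented static deployment and may differ from what Nebari ships.

```yaml
# Trimmed sketch of NVIDIA's static device-plugin DaemonSet with the newer
# image. Not Nebari's manifest -- field names follow NVIDIA's upstream example.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.17.2
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```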