Does anyone have a deployment with GPU and nvidia-smi working on AWS? #3073
-
I'm having trouble finding nvidia-smi on a GPU node group on AWS, which seems counter to the documentation. Does anyone have a config that works with the latest Nebari release? Also, relatedly, the daemonset image being used is quite old and has been discontinued on Docker Hub. Has there been any attempt to use a more recent nvidia-device-plugin (e.g., nvcr.io/nvidia/k8s-device-plugin:v0.17.2)?
-
Hey @satra, thanks for bringing this up! I was about to share a working configuration with you, but after testing it again I realized it no longer works as expected. I’ve opened an issue to track it, as well as a PR with a fix to address it: #3075. The PR includes detailed testing instructions, and following them should get you to an AWS deployment with working GPU support. For reference, here's the relevant configuration:

```yaml
# ...
amazon_web_services:
  # ...
  node_groups:
    # ...
    gpu-t4-x1:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      gpu: true
profiles:
  jupyterlab:
    - display_name: T4 GPU Instance
      description: Stable environment with 4 cpu / 16 GB ram and 1 T4 GPU / 16 GB GPU memory
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_resource_limits:
          nvidia.com/gpu: 1
        image: quay.io/nebari/nebari-jupyterlab-gpu:2025.6.1
        node_selector:
          "dedicated": "gpu-t4-x1"
```

You can deploy Nebari from this branch to try it out. This fix will be included in the next release.
I'll go ahead and open an issue for this. I think it would certainly be beneficial to upgrade the image being used.
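For reference, NVIDIA publishes a static DaemonSet manifest for the newer plugin; a trimmed sketch pointed at the v0.17.2 image mentioned above is below. This is not Nebari's actual manifest (that lives in Nebari's deployment layer), just an illustration of what the upgrade roughly looks like; the field names follow NVIDIA's documented static deployment and may differ from what Nebari ships.

```yaml
# Trimmed sketch of NVIDIA's static device-plugin DaemonSet with the newer
# image. Not Nebari's manifest -- field names follow NVIDIA's upstream example.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
        - name: nvidia-device-plugin-ctr
          image: nvcr.io/nvidia/k8s-device-plugin:v0.17.2
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```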