ML environments on GPU instances #675
Replies: 6 comments · 8 replies
-
Here is the relevant documentation: https://github.com/Quansight/qhub/blob/main/docs/source/04_how_to_guides/7_qhub_gpu.md#amazon-web-services
You'll create a new node group as described in the documentation above, and then a new JupyterLab profile that selects the GPU node. That gives users a separate GPU profile to choose from, and servers launched with it will always run on the GPU node. The same link covers the profile setup.
-
Are you able to spin up a GPU instance, but the GPU doesn't work in Python? i.e. after spinning up a GPU instance, does running GPU code fail? Also, I assume you are on AWS.
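If it helps narrow things down, here is a minimal sketch of a notebook sanity check (assuming TensorFlow and/or PyTorch are installed in the active environment) to see whether the GPU is visible from Python:

```python
import shutil
import subprocess

# On a GPU node the NVIDIA driver should be present and list the device.
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found; pod is likely not on a GPU node")

try:
    import torch
    print("PyTorch sees CUDA:", torch.cuda.is_available())
except ImportError:
    pass

try:
    import tensorflow as tf
    print("TensorFlow GPUs:", tf.config.list_physical_devices("GPU"))
except ImportError:
    pass
```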
-
@dharhas, no, we were not sure how to set up the nebari-config to allow users to select either a GPU-backed or a CPU-backed server. Would GPU and CPU be different jhub-apps?
-
OK, let me share a config from an AWS deployment; I need to sanitize it first.
-
```yaml
amazon_web_services:
  region: eu-west-1
  kubernetes_version: '1.26'
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5
    user:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    worker:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    fly-weight:
      instance: m5.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    middle-weight:
      instance: m5.2xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
    gpu-1x-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false

profiles:
  jupyterlab:
    - display_name: Small Instance
      description: Stable environment with 1.5-2 cpu / 6-8 GB ram
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 6G
        node_selector:
          "dedicated": "fly-weight"
    - display_name: Medium Instance
      description: Stable environment with 2-4 cpu / 8-12 GB ram
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 2
        mem_limit: 12G
        mem_guarantee: 8G
        node_selector:
          "dedicated": "middle-weight"
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.1.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        extra_resource_limits:
          nvidia.com/gpu: 1
        node_selector:
          "dedicated": "gpu-1x-t4"
```

The `extra_container_config` is important: PyTorch needs `/dev/shm` larger than 1 GB when using multiple GPUs (I'm not sure it matters for a single GPU). You also have to specify the number of GPUs in `extra_resource_limits`, and that must match the number of GPUs in the instance type you selected; the g4dn.xlarge above has a single T4. We will work on getting this into the docs.

A secondary issue: the released conda-store currently can't install GPU versions of PyTorch from conda-forge, because we are not able to set environment variables (see conda-incubator/conda-store#759); that will be fixed in the upcoming release. Until then you need to get PyTorch using the following pinning:

```yaml
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - pytorch::pytorch
  - pytorch::pytorch-cuda
  - etc
```
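Once the environment solves, a quick sanity check (a minimal sketch; it assumes you launched a server on the GPU profile with the environment above active) confirms both that a CUDA build of PyTorch was installed and that the `dshm` mount took effect:

```python
import os
import torch

# Confirm we got a CUDA build of PyTorch rather than the CPU-only variant.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())

# Confirm the dshm emptyDir is mounted: PyTorch DataLoader workers pass
# batches through shared memory, so /dev/shm should report roughly the
# 2Gi sizeLimit set in the profile above.
stat = os.statvfs("/dev/shm")
print(f"/dev/shm size: {stat.f_frsize * stat.f_blocks / 2**30:.1f} GiB")
```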
-
Thanks to @pt247 and @marcelovilla who helped me get this going!
-
We have some users who would like to run their ML workflows on a single GPU using `tensorflow-gpu`. I can create a custom environment with that package, but how can I specify the GPU instance type needed to run it? Our config for AWS looks like the one below, but if I change the user group to, say, a g4dn.4xlarge instance, then all users will get it, which is of course not what I want. Is there a way for users to choose the instance type as well as the environment at server launch time?