@@ -70,11 +70,8 @@ they generally outperform the best desktop GPUs.
7070
7171
7272
73- Running a typical GPU program
74- -----------------------------
75-
7673Reserving resources for GPU programs
77- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
74+ ------------------------------------
7875
7976Slurm keeps track of the GPU resources as generic resources (GRES) or
8077trackable resources (TRES). They are basically limited resources that you
@@ -83,28 +80,69 @@ can request in addition to normal resources such as CPUs and RAM.
8380To request GPUs on Slurm, you should use the ``--gpus=1 `` or ``--gres=gpu:1 ``
8481-flags.
8582
86- You can also use syntax ``--gpus=GPU_TYPE:1 `` (or ``--gres=gpu:GPU_TYPE:1 ``),
87- where ``GPU_TYPE `` is a name chosen by the admins for the GPU.
88- For example, ``--gpus=v100:1 `` would give you a V100 card. See section on
89- :ref: `reserving specific GPU architectures <gpu-constraint >` for more information.
83+ Choosing a specific type of GPU
84+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
85+
86+ In most cases you will want to choose a GPU that suits your specific use case.
87+
88+ There are three ways to choose the GPU type:
89+
90+ 1. All compute nodes with GPUs are separated into partitions based on their GPU
91+ architectures.
92+
93+ Thus you can choose the GPU type by limiting your job to the partitions that
94+ have GPUs that you want to use with ``--partition=GPU_PARTITION ``, where
95+ ``GPU_PARTITION `` is the name of the partition. You can specify multiple
96+ partitions, separated by commas.
97+
98+ For example, ``--partition=gpu-a100-80g,gpu-h100-80g `` would give you an
99+ A100 or H100 GPU.
100+
101+ 2. You can restrict yourself to a certain type of GPU card by
102+ using the ``--constraint `` option. For example, to restrict the submission to
103+ Ampere generation GPUs only you can use ``--constraint='ampere' ``.
104+
105+ For choosing between multiple generations, you can use the ``| ``-character
106+ between generations. For example, if you want to restrict the submission to
107+ Volta or Ampere generations you can use ``--constraint='volta|ampere' ``.
108+ Remember to use the quotes since ``| `` is the shell pipe.
109+
110+ 3. You can use the syntax ``--gpus=GPU_TYPE:1 `` (or ``--gres=gpu:GPU_TYPE:1 ``),
111+ where ``GPU_TYPE `` is a name chosen by the admins for the GPU.
112+
113+ For example, ``--gpus=v100:1 `` would give you a V100 card.
114+
115+ See the :ref: `available GPUs reference <available-gpus >` for more information on
116+ available partitions and feature names.
117+
118+ In the cluster you can run ``slurm features `` or
119+ ``sinfo -o '%50N %18F %26f %30G' `` to see what GPU resources are available.
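
As an illustration of the first approach, a batch script could look roughly like the sketch below. The partition names are the ones used in the example above; the time limit and the ``nvidia-smi`` check are assumptions added for illustration and are not part of the cluster's official examples. To use one of the other two approaches, the ``--partition`` line would be replaced with a ``--constraint`` line or a ``--gpus=GPU_TYPE:1`` line.

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --gpus=1
#SBATCH --partition=gpu-a100-80g,gpu-h100-80g

# Print the model of the allocated GPU so the choice can be verified.
# The fallback keeps the script usable on machines without NVIDIA tools.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null || echo "query failed")
else
    gpu_name="nvidia-smi not available"
fi
echo "Allocated GPU: ${gpu_name}"
```

Submitting the script with ``sbatch`` and reading the job output should show which GPU model the scheduler picked from the listed partitions.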
120+
121+
122+
123+ Reserving more than one GPU
124+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
90125
91126You can request more than one GPU with ``--gpus=G ``, where ``G `` is
92127the number of the requested GPUs.
93128
94- Some GPUs are placed in a quick debugging queue. See section on
95- :ref: `reserving quick debugging resources <gpu-debug >` for more
96- information.
97-
98129.. note ::
99130
100131 Most GPU programs cannot utilize more than one GPU at a time. Before
101132 trying to reserve multiple GPUs you should verify that your code
102133 can utilize them.
103134
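
One way to check how many GPUs a job actually received is to count the entries in ``CUDA_VISIBLE_DEVICES``. The sketch below assumes that the cluster's Slurm configuration exports this variable as a comma-separated list for GPU jobs, which is common but not guaranteed; ``count_visible_gpus`` is a hypothetical helper name, and the time limit is illustrative.

```shell
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --gpus=2

# Hypothetical helper: count the entries in a comma-separated device list.
count_visible_gpus() {
    # $1: device list such as "0,1"; an empty string means no GPUs are visible
    if [ -z "$1" ]; then
        echo 0
    else
        printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
    fi
}

echo "GPUs visible to this job: $(count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Seeing two devices here only confirms the reservation. Whether the program uses both of them still has to be verified separately, for example with the monitoring tools described later on this page.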
135+ Reserving a GPU from the debug queue
136+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
137+
138+
139+ There is a ``gpu-debug ``-partition that you can use to run short jobs
140+ (30 minutes or less) for quick tests and debugging. Use
141+ ``--partition=gpu-debug `` for this.
104142
105143
106144Running an example program that utilizes GPU
107- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
145+ --------------------------------------------
108146
109147.. include :: ../ref/examples-repo.rst
110148
@@ -162,43 +200,14 @@ Using a slurm script setting the requirements and loading the correct modules be
162200 :ref: `section on missing CUDA libraries <cuda-missing >`.
163201
164202
165- Special cases and common pitfalls
166- ---------------------------------
167-
168203Monitoring efficient use of GPUs
169- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
204+ --------------------------------
170205
171206.. include :: ../examples/monitoring/gpu.rst
172207
173- .. _gpu-constraint :
174-
175- Reserving specific GPU types
176- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
177-
178- You can restrict yourself to a certain type of GPU card by using
179- using the ``--constraint `` option. For example, to restrict the submission to
180- Pascal generation GPUs only you can use ``--constraint='pascal' ``.
181-
182- For choosing between multiple generations, you can use the ``| ``-character
183- between generations. For example, if you want to restrict the submission
184- Volta or Ampere generations you can use ``--constraint='volta|ampere' ``.
185- Remember to use the quotes since ``| `` is the shell pipe.
186208
187- To see what GPU resources are available, run ``slurm features `` or
188- ``sinfo -o '%50N %18F %26f %30G' ``.
189-
190- Alternative way is to use syntax ``--gres=gpu:GPU_TYPE:1 ``, where ``GPU_TYPE ``
191- is a name chosen by the admins for the GPU. For example, ``--gres=gpu:v100:1 ``
192- would give you a V100 card.
193-
194- .. _gpu-debug :
195-
196- Reserving resources from the short job queue for quick debugging
197- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
198-
199- There is a ``gpu-debug ``-partition that you can use to run short jobs
200- (30 minutes or less) for quick tests and debugging. Use
201- ``--partition=gpu-debug `` for this.
209+ Special cases and common pitfalls
210+ ---------------------------------
202211
203212.. _cuda-missing :
204213
@@ -324,6 +333,8 @@ Additionally, PyTorch offers its own set of profilers, like torch.profiler, whic
324333
325334For a detailed introduction to both Torch and NVIDIA profilers, please refer to GPU profiling section :ref: `gpu-profiling `.
326335
336+ .. _available-gpus :
337+
327338Available GPUs and architectures
328339--------------------------------
329340