
Commit a8eb8be

Merge pull request #764 from AaltoSciComp/winter-kickstart-gpu-update
gpu: Small updates to GPU pages for the winter kickstart 2025.
2 parents 4e1ed30 + 33659a4 commit a8eb8be

File tree

2 files changed: +84 −52 lines changed


triton/examples/monitoring/gpu.rst

Lines changed: 28 additions & 7 deletions
@@ -7,17 +7,38 @@ When your job has started, you can ``ssh`` to the node and run
 Once the job has finished, you can use ``slurm history`` to obtain the
 ``jobID`` and run::
 
+    $ module load seff-gpu
+    $ seff JOBID
+    seff 5817422
+    Job ID: 5817422
+    Cluster: triton
+    User/Group: tuomiss1/tuomiss1
+    State: COMPLETED (exit code 0)
+    Nodes: 1
+    Cores per node: 2
+    CPU Utilized: 00:08:25
+    CPU Efficiency: 63.28% of 00:13:18 core-walltime
+    Job Wall-clock time: 00:06:39
+    Memory Utilized: 2.10 GB
+    Memory Efficiency: 26.31% of 8.00 GB
+    GPUs reserved: v100 (x1)
+    GPU Utilized: 10%
+    GPU VRAM Utilized: 15114 MB
+
+Alternatively, you can run::
+
     $ sacct -j JOBID -o TRESUsageInAve -p
-    cpu=01:09:20,energy=909169,fs/disk=192466115,gres/gpumem=1648M,gres/gpuutil=66,mem=2810884K,pages=8,vmem=0|
+    cpu=00:08:24,energy=95240,fs/disk=147861134,gres/gpumem=15114M,gres/gpuutil=10,mem=2207116K,pages=3473,vmem=0|
 
+This shows the GPU utilization.
 
-This also shows the GPU utilization.
+In the example, you can see that the GPU utilization is low.
 
-If the GPU utilization of your job is low, you should check whether
-its CPU utilization is close to 100% with ``seff JOBID``. Having a high
-CPU utilization and a low GPU utilization can indicate that the CPUs are
-trying to keep the GPU occupied with calculations, but the workload
-is too much for the CPUs and thus GPUs are not constantly working.
+If this is the case, you should check whether the job's CPU utilization is
+close to 100% with ``seff JOBID``. Having a high CPU utilization and a low
+GPU utilization can indicate that the CPUs are trying to keep the GPU
+occupied with calculations, but the workload is too much for the CPUs and
+thus the GPUs are not constantly working.
 
 Increasing the number of CPUs you request can help, especially in tasks
 that involve data loading or preprocessing, but your program must know how
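As an aside, the ``TRESUsageInAve`` string printed by ``sacct`` above is a plain ``key=value`` list separated by commas, so the GPU figures can be pulled out with standard shell tools. A minimal sketch, using the example string from this diff (the field names are the ones shown in the output above):

```shell
# Extract GPU utilization and GPU memory from a TRESUsageInAve string.
# The string below is the example sacct output shown above.
tres='cpu=00:08:24,energy=95240,fs/disk=147861134,gres/gpumem=15114M,gres/gpuutil=10,mem=2207116K,pages=3473,vmem=0|'

# Split on commas, select the wanted key, keep everything after '='.
gpuutil=$(printf '%s' "$tres" | tr ',' '\n' | grep '^gres/gpuutil=' | cut -d= -f2)
gpumem=$(printf '%s' "$tres" | tr ',' '\n' | grep '^gres/gpumem=' | cut -d= -f2)

echo "GPU utilization: ${gpuutil}%"   # prints: GPU utilization: 10%
echo "GPU memory: ${gpumem}"          # prints: GPU memory: 15114M
```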

triton/tut/gpu.rst

Lines changed: 56 additions & 45 deletions
@@ -70,11 +70,8 @@ they generally outperform the best desktop GPUs.
 
 
 
-Running a typical GPU program
------------------------------
-
 Reserving resources for GPU programs
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+------------------------------------
 
 Slurm keeps track of the GPU resources as generic resources (GRES) or
 trackable resources (TRES). They are basically limited resources that you
@@ -83,28 +80,69 @@ can request in addition to normal resources such as CPUs and RAM.
 To request GPUs on Slurm, you should use the ``--gpus=1`` or ``--gres=gpu:1``
 flags.
 
-You can also use syntax ``--gpus=GPU_TYPE:1`` (or ``--gres=gpu:GPU_TYPE:1``),
-where ``GPU_TYPE`` is a name chosen by the admins for the GPU.
-For example, ``--gpus=v100:1`` would give you a V100 card. See section on
-:ref:`reserving specific GPU architectures <gpu-constraint>` for more information.
+Choosing a specific type of GPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In most cases you will want to choose a GPU that suits your specific use case.
+
+There are three ways to choose the GPU type:
+
+1. All compute nodes with GPUs are separated into partitions based on their GPU
+   architectures.
+
+   Thus you can choose the GPU type by limiting your job to the partitions that
+   have the GPUs that you want to use with ``--partition=GPU_PARTITION``, where
+   ``GPU_PARTITION`` is the name of the partition. You can specify multiple
+   partitions, separated by commas.
+
+   For example, ``--partition=gpu-a100-80g,gpu-h100-80g`` would give you an
+   A100 or H100 GPU.
+
+2. You can restrict yourself to a certain type of GPU card by using
+   the ``--constraint`` option. For example, to restrict the submission to
+   Ampere generation GPUs only, you can use ``--constraint='ampere'``.
+
+   For choosing between multiple generations, you can use the ``|``-character
+   between generations. For example, if you want to restrict the submission to
+   Volta or Ampere generations, you can use ``--constraint='volta|ampere'``.
+   Remember to use the quotes since ``|`` is the shell pipe.
+
+3. You can use the syntax ``--gpus=GPU_TYPE:1`` (or ``--gres=gpu:GPU_TYPE:1``),
+   where ``GPU_TYPE`` is a name chosen by the admins for the GPU.
+
+   For example, ``--gpus=v100:1`` would give you a V100 card.
+
+See the :ref:`available GPUs reference <available-gpus>` for more information on
+available partitions and feature names.
+
+In the cluster you can run ``slurm features`` or
+``sinfo -o '%50N %18F %26f %30G'`` to see what GPU resources are available.
+
+
+
+Reserving more than one GPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 You can request more than one GPU with ``--gpus=G``, where ``G`` is
 the number of the requested GPUs.
 
-Some GPUs are placed in a quick debugging queue. See section on
-:ref:`reserving quick debugging resources <gpu-debug>` for more
-information.
-
 .. note::
 
    Most GPU programs cannot utilize more than one GPU at a time. Before
   trying to reserve multiple GPUs you should verify that your code
   can utilize them.
 
+Reserving a GPU from the debug queue
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+There is a ``gpu-debug``-partition that you can use to run short jobs
+(30 minutes or less) for quick tests and debugging. Use
+``--partition=gpu-debug`` for this.
 
 
 Running an example program that utilizes GPU
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+--------------------------------------------
 
 .. include:: ../ref/examples-repo.rst
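To make the three reservation options added above concrete, here is a sketch of a batch script. This is hypothetical: the partition, constraint, and GPU-type names are the examples used on this page, and the payload script name is a placeholder; check your own cluster's names before using it.

```bash
#!/bin/bash
# Sketch of a GPU job script; keep exactly ONE of the three options active
# (the doubled ## lines are inactive #SBATCH directives).
#SBATCH --time=00:30:00
#SBATCH --mem=8G

# Option 1: pick the GPU type via partitions (example names from this page).
#SBATCH --partition=gpu-a100-80g,gpu-h100-80g
#SBATCH --gpus=1

# Option 2: pick the GPU generation via a constraint
# (the quotes keep '|' away from the shell).
##SBATCH --gpus=1
##SBATCH --constraint='volta|ampere'

# Option 3: name the GPU type directly in the GPU request.
##SBATCH --gpus=v100:1

srun python my_gpu_program.py   # placeholder payload
```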

@@ -162,43 +200,14 @@ Using a slurm script setting the requirements and loading the correct modules be
 :ref:`section on missing CUDA libraries <cuda-missing>`.
 
 
-Special cases and common pitfalls
----------------------------------
-
 Monitoring efficient use of GPUs
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+--------------------------------
 
 .. include:: ../examples/monitoring/gpu.rst
 
-.. _gpu-constraint:
-
-Reserving specific GPU types
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You can restrict yourself to a certain type of GPU card by using
-using the ``--constraint`` option. For example, to restrict the submission to
-Pascal generation GPUs only you can use ``--constraint='pascal'``.
-
-For choosing between multiple generations, you can use the ``|``-character
-between generations. For example, if you want to restrict the submission
-Volta or Ampere generations you can use ``--constraint='volta|ampere'``.
-Remember to use the quotes since ``|`` is the shell pipe.
 
-To see what GPU resources are available, run ``slurm features`` or
-``sinfo -o '%50N %18F %26f %30G'``.
-
-Alternative way is to use syntax ``--gres=gpu:GPU_TYPE:1``, where ``GPU_TYPE``
-is a name chosen by the admins for the GPU. For example, ``--gres=gpu:v100:1``
-would give you a V100 card.
-
-.. _gpu-debug:
-
-Reserving resources from the short job queue for quick debugging
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-There is a ``gpu-debug``-partition that you can use to run short jobs
-(30 minutes or less) for quick tests and debugging. Use
-``--partition=gpu-debug`` for this.
+Special cases and common pitfalls
+---------------------------------
 
 .. _cuda-missing:
 
@@ -324,6 +333,8 @@ Additionally, PyTorch offers its own set of profilers, like torch.profiler, whic
 
 For a detailed introduction to both Torch and NVIDIA profilers, please refer to GPU profiling section :ref:`gpu-profiling`.
 
+.. _available-gpus:
+
 Available GPUs and architectures
 --------------------------------
 