
Commit 8d4abfe

boeschf and msimberg authored

jobreport (#113)

* port from confluence
* Update docs/running/jobreport.md (five commits, each Co-authored-by: Mikael Simberg <[email protected]>)
* Update mkdocs.yml (Co-authored-by: Mikael Simberg <[email protected]>)
* typo
* typo

Co-authored-by: Mikael Simberg <[email protected]>

1 parent 8cafd88 commit 8d4abfe

File tree

2 files changed: +273 -0 lines changed

docs/running/jobreport.md

Lines changed: 272 additions & 0 deletions
@@ -0,0 +1,272 @@
[](){#ref-jobreport}
# Job report

A batch job summary report is often requested in project proposals at CSCS to demonstrate the effective use of GPUs.
[jobreport](https://github.com/eth-cscs/alps-jobreport/releases) is used in two stages.
The first stage monitors an application and records the GPU usage statistics.
The monitoring stage must be executed within a `slurm` environment.
The information is recorded as `.csv` data within a directory `jobreport_${SLURM_JOB_ID}`, or a directory supplied on the command line.
The second stage prints this information in a tabular form that can be inserted into a project proposal.
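For example, the two stages might look like the following (a minimal sketch based on the examples later on this page; `./my_app` and the output directory `my_report` are placeholder names):

```console
$ srun --nodes=1 ./jobreport -o my_report -- ./my_app
$ ./jobreport print my_report
```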
## Downloading the job summary report tool

A precompiled binary for the `jobreport` utility can be obtained directly from the [repository](https://github.com/eth-cscs/alps-jobreport/releases) or via the command line:

```console
$ wget https://github.com/eth-cscs/alps-jobreport/releases/download/v0.1/jobreport
$ chmod +x ./jobreport
```
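If you plan to use `jobreport` regularly, it may be convenient to place the binary in a directory on your `$PATH` (a sketch; `~/bin` is just an example location):

```console
$ mkdir -p ~/bin && mv ./jobreport ~/bin/
$ export PATH="$HOME/bin:$PATH"
```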
## Command line options

A full list of command line options with explanations can be obtained by running the command with the `--help` option:

```console
$ ./jobreport --help
Usage: jobreport [-v -h] [subcommand] -- COMMAND

Options:
    -h, --help      Show this help message
    -v, --version   Show version information

Subcommands:
    monitor   Monitor the performance metrics for a job. (Default)
        -h, --help                      Shows help message
        -o, --output <path>             Specify output directory (default: ./jobreport_<SLURM_JOB_ID>)
        -u, --sampling_time <seconds>   Set the time between samples (default: automatically determined)
        -t, --max_time <time>           Set the maximum monitoring time (format: DD-HH:MM:SS, default: 24:00:00)
    print     Print a job report
        -h, --help            Shows help message
        -o, --output <path>   Output path for the report file
    container-hook   Write enroot hook for jobreport
        -h, --help            Shows help message
        -o, --output <path>   Output path for the enroot hook file
                              (default: $HOME/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh)

Arguments:
    COMMAND   The command to run as the workload
```
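As a sketch of how the `monitor` subcommand's options combine (the values here are illustrative, not recommendations), the following samples every 5 seconds, caps monitoring at one hour, and writes to a custom directory:

```console
$ srun --nodes=1 ./jobreport monitor -o my_report -u 5 -t 01:00:00 -- ./my_app
```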
## Reported information

The final output from `jobreport` is a table summarizing the most important details of how your application used the compute resources during its execution.
The report is divided into two parts: a general summary and GPU-specific values.
### Job statistics

| Field | Description |
| ----- | ----------- |
| Job Id | The Slurm job id |
| Step Id | The Slurm step id. A job step in Slurm is a subdivision of a job, started with `srun` |
| User | The user account that submitted the job |
| SLURM Account | The project account that will be billed |
| Start Time, End Time, Elapsed Time | The time the job started and ended, and how long it ran |
| Number of Nodes | The number of nodes allocated to the job |
| Number of GPUs | The number of GPUs allocated to the job |
| Total Energy Consumed | The total energy consumed, based on the average power usage (below) over the elapsed time |
| Average Power Usage | The average power draw over the elapsed time in Watts (W), summed over all GPUs |
| Average SM Utilization | The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel, averaged over all GPUs |
| Average Memory Utilization | The percentage of the process's lifetime during which global (device) memory was being read or written, averaged over all GPUs |
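The energy figure is simply the average power integrated over the elapsed time: in the `sleep` example below, 348.8 W × 5 s = 1744 J, or about 0.48 Wh, which the report rounds to 0.5 Wh.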
### GPU-specific values

| Field | Description |
| ----- | ----------- |
| Host | The compute node executing a job step |
| GPU | The GPU id on a node |
| Elapsed | The elapsed time |
| SM Utilization % | The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel |
| Memory Utilization % | The percentage of the process's lifetime during which global (device) memory was being read or written |
## Example with slurm: srun

The simplest example to test `jobreport` is to run it with the `sleep` command.
It is important to separate `jobreport` (and its options) from your command with `--`.
```console
$ srun -A my_account -t 5:00 --nodes=1 ./jobreport -- sleep 5
$ ls
jobreport_16133
$ ./jobreport print jobreport_16133
Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| Job Id                                  | 16133                                   |
+-----------------------------------------+-----------------------------------------+
| Step Id                                 | 0                                       |
+-----------------------------------------+-----------------------------------------+
| User                                    | jpcoles                                 |
+-----------------------------------------+-----------------------------------------+
| SLURM Account                           | unknown_account                         |
+-----------------------------------------+-----------------------------------------+
| Start Time                              | 03-07-2024 15:32:24                     |
+-----------------------------------------+-----------------------------------------+
| End Time                                | 03-07-2024 15:32:29                     |
+-----------------------------------------+-----------------------------------------+
| Elapsed Time                            | 5s                                      |
+-----------------------------------------+-----------------------------------------+
| Number of Nodes                         | 1                                       |
+-----------------------------------------+-----------------------------------------+
| Number of GPUs                          | 4                                       |
+-----------------------------------------+-----------------------------------------+
| Total Energy Consumed                   | 0.5 Wh                                  |
+-----------------------------------------+-----------------------------------------+
| Average Power Usage                     | 348.8 W                                 |
+-----------------------------------------+-----------------------------------------+
| Average SM Utilization                  | 0%                                      |
+-----------------------------------------+-----------------------------------------+
| Average Memory Utilization              | 0%                                      |
+-----------------------------------------+-----------------------------------------+

GPU Specific Values
+---------------+------+------------------+------------------+----------------------+
| Host          | GPU  | Elapsed          | SM Utilization % | Memory Utilization % |
|               |      |                  | (avg/min/max)    | (avg/min/max)        |
+---------------+------+------------------+------------------+----------------------+
| nid006212     | 0    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 1    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 2    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 3    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
+---------------+------+------------------+------------------+----------------------+
```
!!! warning "`jobreport` requires successful completion of the application"

    The `jobreport` tool requires the application to complete successfully.
    If the application crashes or the job is killed by `slurm` prematurely, `jobreport` will not be able to write any output.
!!! warning "workaround known issue on macOS"
137+
Currently, there is an issue when generating the report file via `jobreport print` on the macOS terminal:
138+
139+
```console
140+
what(): locale::facet::_S_create_c_locale name not valid
141+
/var/spool/slurmd/job32394/slurm_script: line 21: 199992 Aborted (core dumped) ./jobreport print report
142+
```
143+
144+
To fix this follow these steps:
145+
146+
1. Open the terminal application
147+
2. In the top-left corner menu select Terminal -> Settings
148+
3. Select your default profile
149+
4. Uncheck "Set locale environment variables on startup"
150+
5. Quit and reopen the terminal and try again. This should fix the issue.
151+
152+
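    Alternatively, as an untested sketch (not from the original instructions): since the error stems from the locale environment variables forwarded by the terminal, overriding the locale for the single command may also work:

    ```console
    $ LC_ALL=C ./jobreport print report
    ```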
## Example with slurm: batch script

The `jobreport` command can be used in a batch script.
The report printing can also be included in the script, and does not need the `srun` command.
```bash title="submit script with jobreport"
158+
#!/bin/bash
159+
#SBATCH -t 5:00
160+
#SBATCH --nodes=2
161+
162+
srun ./jobreport -o report -- my_command
163+
./jobreport print report
164+
```
165+
166+
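The script can then be submitted with `sbatch` as usual (`submit.sh` is a placeholder name):

```console
$ sbatch submit.sh
```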
When used within a job script, `jobreport` will work across multiple calls to `srun`.
Each time `srun` is called, `slurm` creates a new job step, and `jobreport` records data for each one.
Multiple job steps running simultaneously are also allowed.
The generated job report contains a section for each `slurm` job step.
```bash title="submit script with multiple steps"
173+
#!/bin/bash
174+
#SBATCH -t 5:00
175+
#SBATCH --nodes=2
176+
177+
srun ./jobreport -o report -- my_command_1
178+
srun ./jobreport -o report -- my_command_2
179+
180+
srun --nodes=1 ./jobreport -o report -- my_command_3 &
181+
srun --nodes=1 ./jobreport -o report -- my_command_4 &
182+
183+
wait
184+
```
185+
186+
187+
## Example with uenv

The following example runs a program called `burn` that computes repeated matrix multiplications to stress the GPUs.
It was built with, and requires at runtime, the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv.
```console
$ srun --uenv=prgenv-gnu/24.2:v1 -t 5:00 --nodes=1 --ntasks-per-node=4 --gpus-per-task=1 ./jobreport -o report -- ./burn --gpu=gemm -d 30

$ ./jobreport print report
Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| Job Id                                  | 15923                                   |
+-----------------------------------------+-----------------------------------------+
| Step Id                                 | 0                                       |
+-----------------------------------------+-----------------------------------------+
| User                                    | jpcoles                                 |
+-----------------------------------------+-----------------------------------------+
| SLURM Account                           | unknown_account                         |
+-----------------------------------------+-----------------------------------------+
| Start Time                              | 03-07-2024 14:54:48                     |
+-----------------------------------------+-----------------------------------------+
| End Time                                | 03-07-2024 14:55:25                     |
+-----------------------------------------+-----------------------------------------+
| Elapsed Time                            | 36s                                     |
+-----------------------------------------+-----------------------------------------+
| Number of Nodes                         | 1                                       |
+-----------------------------------------+-----------------------------------------+
| Number of GPUs                          | 4                                       |
+-----------------------------------------+-----------------------------------------+
| Total Energy Consumed                   | 18.7 Wh                                 |
+-----------------------------------------+-----------------------------------------+
| Average Power Usage                     | 1.8 kW                                  |
+-----------------------------------------+-----------------------------------------+
| Average SM Utilization                  | 88%                                     |
+-----------------------------------------+-----------------------------------------+
| Average Memory Utilization              | 43%                                     |
+-----------------------------------------+-----------------------------------------+

GPU Specific Values
+---------------+------+------------------+------------------+----------------------+
| Host          | GPU  | Elapsed          | SM Utilization % | Memory Utilization % |
|               |      |                  | (avg/min/max)    | (avg/min/max)        |
+---------------+------+------------------+------------------+----------------------+
| nid007044     | 0    | 36s              | 83 / 0 / 100     | 39 / 0 / 50          |
| nid007044     | 1    | 36s              | 90 / 0 / 100     | 43 / 0 / 50          |
| nid007044     | 2    | 36s              | 90 / 0 / 100     | 43 / 0 / 48          |
| nid007044     | 3    | 36s              | 90 / 0 / 100     | 47 / 0 / 54          |
+---------------+------+------------------+------------------+----------------------+
```
!!! note "Using `jobreport` with other uenvs"

    `jobreport` works with any uenv, not just `prgenv-gnu`.
## Example with container-engine (CE)

Running `jobreport` with the [container-engine (CE)][ref-container-engine] requires a little more setup to allow the CE to mount the required GPU library paths inside the container.

A script to set up the mount points needs to be placed in `${HOME}/.config/enroot/hooks.d/`.
This script can be generated with the `jobreport` tool; by default, it is placed in `${HOME}/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh`.
```console title="Generate DCGM hook"
249+
$ ./jobreport container-hook
250+
Writing enroot hook to "/users/myuser/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh"
251+
Add the following to your container .toml file:
252+
253+
[annotations]
254+
com.hooks.dcgm.enabled = "true"
255+
```
256+
257+
As indicated by the output, the annotation must be added to the container's `.toml` file (the EDF).
```toml title="Example .toml file"
[annotations]
com.hooks.dcgm.enabled = "true"
```
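A complete EDF would also specify the container image alongside the annotation; a hypothetical minimal example (the image reference is a placeholder, not from the original) might look like:

```toml title="Hypothetical complete EDF"
image = "nvcr.io#nvidia/pytorch:24.01-py3"

[annotations]
com.hooks.dcgm.enabled = "true"
```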
Once the CE is configured, only the EDF file (here `my-edf.toml`) needs to be specified along with a call to `jobreport`:

```console title="Run jobreport in a container"
$ srun --environment=my-edf.toml ./jobreport -- sleep 5
```
!!! note "Using `jobreport` with other container images"
271+
272+
`jobreport` works with any container image, as long as the hook is set up and the EDF file has the correct annotation.

mkdocs.yml

Lines changed: 1 addition & 0 deletions

@@ -89,6 +89,7 @@ nav:
   - 'Running Jobs':
       - running/index.md
       - 'slurm': running/slurm.md
+      - 'Job report': running/jobreport.md
   - 'Data Management and Storage':
       - storage/index.md
       - 'File Systems': storage/filesystems.md
