[](){#ref-jobreport}
# Job report

A batch job summary report is often requested in project proposals at CSCS to demonstrate the effective use of GPUs.
The [jobreport](https://github.com/eth-cscs/alps-jobreport/releases) tool is used in two stages.
The first stage monitors an application and records its GPU usage statistics.
The monitoring stage must be executed within a Slurm environment.
The information is recorded as `.csv` data in a directory named `jobreport_${SLURM_JOB_ID}`, or in a directory supplied on the command line.
The second stage prints this information in a tabular form that can be inserted into a project proposal.
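
In practice the two stages look like this (a minimal sketch; `my_account` and `my_app` are placeholders for your Slurm account and workload):

```console
$ srun -A my_account ./jobreport -- my_app      # stage 1: monitor the application
$ ./jobreport print jobreport_<SLURM_JOB_ID>    # stage 2: print the recorded statistics
```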

## Downloading the job summary report tool

A precompiled binary for the `jobreport` utility can be obtained directly from the [repository](https://github.com/eth-cscs/alps-jobreport/releases) or via the command line:

```console
$ wget https://github.com/eth-cscs/alps-jobreport/releases/download/v0.1/jobreport
$ chmod +x ./jobreport
```
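
To check that the downloaded binary runs on the cluster, you can print its version (output not shown here; the `-v` flag is listed in the help below):

```console
$ ./jobreport -v
```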

## Command line options

A full list of command line options with explanations can be obtained by running the command with the `--help` option:

```console
$ ./jobreport --help
Usage: jobreport [-v -h] [subcommand] -- COMMAND

Options:
    -h, --help      Show this help message
    -v, --version   Show version information

Subcommands:
    monitor                            Monitor the performance metrics for a job. (Default)
        -h, --help                     Shows help message
        -o, --output <path>            Specify output directory (default: ./jobreport_<SLURM_JOB_ID>)
        -u, --sampling_time <seconds>  Set the time between samples (default: automatically determined)
        -t, --max_time <time>          Set the maximum monitoring time (format: DD-HH:MM:SS, default: 24:00:00)
    print                              Print a job report
        -h, --help                     Shows help message
        -o, --output <path>            Output path for the report file
    container-hook                     Write enroot hook for jobreport
        -h, --help                     Shows help message
        -o, --output <path>            Output path for the enroot hook file
                                       (default: $HOME/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh)

Arguments:
    COMMAND                            The command to run as the workload
```
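
Note that options before `./jobreport` go to `srun`, while the `monitor` options sit between `jobreport` and the `--` separator. For example (a sketch; `my_account` and `my_app` are placeholders), sampling every 10 seconds for at most one hour:

```console
$ srun -A my_account --nodes=1 ./jobreport -o my_report -u 10 -t 00-01:00:00 -- my_app
```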

## Reported information

The final output from `jobreport` is a table summarizing the most important details of how your application used the compute resources during its execution.
The report is divided into two parts: a general summary and GPU-specific values.

### Job statistics

| Field | Description |
| ----- | ----------- |
| Job Id | The Slurm job id |
| Step Id | The Slurm step id. A job step in Slurm is a subdivision of a job started with `srun` |
| User | The user account that submitted the job |
| SLURM Account | The project account that will be billed |
| Start Time, End Time, Elapsed Time | The time the job started and ended, and how long it ran |
| Number of Nodes | The number of nodes allocated to the job |
| Number of GPUs | The number of GPUs allocated to the job |
| Total Energy Consumed | The total energy consumed, computed from the average power usage (below) over the elapsed time |
| Average Power Usage | The average power draw over the elapsed time in watts (W), summed over all GPUs |
| Average SM Utilization | The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel, averaged over all GPUs |
| Average Memory Utilization | The percentage of the process's lifetime during which global (device) memory was being read or written, averaged over all GPUs |

### GPU-specific values

| Field | Description |
| ----- | ----------- |
| Host | The compute node executing a job step |
| GPU | The GPU id on a node |
| Elapsed | The elapsed time of the job step |
| SM Utilization % | The percentage of the process's lifetime during which Streaming Multiprocessors (SM) were executing a kernel |
| Memory Utilization % | The percentage of the process's lifetime during which global (device) memory was being read or written |

## Example with Slurm: srun

The simplest example to test `jobreport` is to run it with the `sleep` command.
It is important to separate `jobreport` (and its options) from your command with `--`.

```console
$ srun -A my_account -t 5:00 --nodes=1 ./jobreport -- sleep 5
$ ls
jobreport_16133
$ ./jobreport print jobreport_16133
Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| Job Id                                  | 16133                                   |
+-----------------------------------------+-----------------------------------------+
| Step Id                                 | 0                                       |
+-----------------------------------------+-----------------------------------------+
| User                                    | jpcoles                                 |
+-----------------------------------------+-----------------------------------------+
| SLURM Account                           | unknown_account                         |
+-----------------------------------------+-----------------------------------------+
| Start Time                              | 03-07-2024 15:32:24                     |
+-----------------------------------------+-----------------------------------------+
| End Time                                | 03-07-2024 15:32:29                     |
+-----------------------------------------+-----------------------------------------+
| Elapsed Time                            | 5s                                      |
+-----------------------------------------+-----------------------------------------+
| Number of Nodes                         | 1                                       |
+-----------------------------------------+-----------------------------------------+
| Number of GPUs                          | 4                                       |
+-----------------------------------------+-----------------------------------------+
| Total Energy Consumed                   | 0.5 Wh                                  |
+-----------------------------------------+-----------------------------------------+
| Average Power Usage                     | 348.8 W                                 |
+-----------------------------------------+-----------------------------------------+
| Average SM Utilization                  | 0%                                      |
+-----------------------------------------+-----------------------------------------+
| Average Memory Utilization              | 0%                                      |
+-----------------------------------------+-----------------------------------------+

GPU Specific Values
+---------------+------+------------------+------------------+----------------------+
| Host          | GPU  | Elapsed          | SM Utilization % | Memory Utilization % |
|               |      |                  | (avg/min/max)    | (avg/min/max)        |
+---------------+------+------------------+------------------+----------------------+
| nid006212     | 0    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 1    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 2    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
| nid006212     | 3    | 5s               | 0 / 0 / 0        | 0 / 0 / 0            |
+---------------+------+------------------+------------------+----------------------+
```
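
As a quick consistency check, the total energy matches the average power over the elapsed time: 348.8 W × 5 s ≈ 0.48 Wh, which is reported (rounded) as 0.5 Wh.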

!!! warning "`jobreport` requires successful completion of the application"

    The `jobreport` tool requires the application to complete successfully.
    If the application crashes, or if the job is killed prematurely by Slurm, `jobreport` will not be able to write any output.

!!! warning "Workaround for a known issue on macOS"

    Currently, there is an issue when generating the report file via `jobreport print` from a session started with the macOS Terminal application:

    ```console
    what(): locale::facet::_S_create_c_locale name not valid
    /var/spool/slurmd/job32394/slurm_script: line 21: 199992 Aborted (core dumped) ./jobreport print report
    ```

    To fix this, follow these steps:

    1. Open the Terminal application
    2. In the top-left corner menu, select Terminal -> Settings
    3. Select your default profile
    4. Uncheck "Set locale environment variables on startup"
    5. Quit and reopen the terminal and try again. This should fix the issue.

## Example with Slurm: batch script

The `jobreport` command can also be used in a batch script.
The report printing step can be included in the same script and does not need to be launched with `srun`.

```bash title="submit script with jobreport"
#!/bin/bash
#SBATCH -t 5:00
#SBATCH --nodes=2

srun ./jobreport -o report -- my_command
./jobreport print report
```
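
The script is submitted with `sbatch` as usual, and the printed report then appears in the job's Slurm output file (a sketch, assuming the script above is saved as `submit.sh`):

```console
$ sbatch submit.sh
$ cat slurm-<SLURM_JOB_ID>.out
```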

When used within a job script, `jobreport` will work across multiple calls to `srun`.
Each time `srun` is called, Slurm creates a new job step, and `jobreport` records data for each one.
Multiple job steps running simultaneously are also allowed.
The generated job report contains a section for each Slurm job step.

```bash title="submit script with multiple steps"
#!/bin/bash
#SBATCH -t 5:00
#SBATCH --nodes=2

srun ./jobreport -o report -- my_command_1
srun ./jobreport -o report -- my_command_2

srun --nodes=1 ./jobreport -o report -- my_command_3 &
srun --nodes=1 ./jobreport -o report -- my_command_4 &

wait
```
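
After all steps have completed, a single `print` invocation renders the recorded data, with a separate section for each job step:

```console
$ ./jobreport print report
```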

## Example with uenv

The following example runs a program called `burn` that computes repeated matrix multiplications to stress the GPUs.
It was built with, and must be run with, the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv.

```console
$ srun --uenv=prgenv-gnu/24.2:v1 -t 5:00 --nodes=1 --ntasks-per-node=4 --gpus-per-task=1 ./jobreport -o report -- ./burn --gpu=gemm -d 30

$ ./jobreport print report
Summary of Job Statistics
+-----------------------------------------+-----------------------------------------+
| Job Id                                  | 15923                                   |
+-----------------------------------------+-----------------------------------------+
| Step Id                                 | 0                                       |
+-----------------------------------------+-----------------------------------------+
| User                                    | jpcoles                                 |
+-----------------------------------------+-----------------------------------------+
| SLURM Account                           | unknown_account                         |
+-----------------------------------------+-----------------------------------------+
| Start Time                              | 03-07-2024 14:54:48                     |
+-----------------------------------------+-----------------------------------------+
| End Time                                | 03-07-2024 14:55:25                     |
+-----------------------------------------+-----------------------------------------+
| Elapsed Time                            | 36s                                     |
+-----------------------------------------+-----------------------------------------+
| Number of Nodes                         | 1                                       |
+-----------------------------------------+-----------------------------------------+
| Number of GPUs                          | 4                                       |
+-----------------------------------------+-----------------------------------------+
| Total Energy Consumed                   | 18.7 Wh                                 |
+-----------------------------------------+-----------------------------------------+
| Average Power Usage                     | 1.8 kW                                  |
+-----------------------------------------+-----------------------------------------+
| Average SM Utilization                  | 88%                                     |
+-----------------------------------------+-----------------------------------------+
| Average Memory Utilization              | 43%                                     |
+-----------------------------------------+-----------------------------------------+

GPU Specific Values
+---------------+------+------------------+------------------+----------------------+
| Host          | GPU  | Elapsed          | SM Utilization % | Memory Utilization % |
|               |      |                  | (avg/min/max)    | (avg/min/max)        |
+---------------+------+------------------+------------------+----------------------+
| nid007044     | 0    | 36s              | 83 / 0 / 100     | 39 / 0 / 50          |
| nid007044     | 1    | 36s              | 90 / 0 / 100     | 43 / 0 / 50          |
| nid007044     | 2    | 36s              | 90 / 0 / 100     | 43 / 0 / 48          |
| nid007044     | 3    | 36s              | 90 / 0 / 100     | 47 / 0 / 54          |
+---------------+------+------------------+------------------+----------------------+
```

!!! note "Using `jobreport` with other uenvs"

    `jobreport` works with any uenv, not just `prgenv-gnu`.

## Example with container-engine (CE)

Running `jobreport` with the [container-engine (CE)][ref-container-engine] requires a little more setup to allow the CE to mount the required GPU library paths inside the container.

A script to set up the mount points needs to be placed in `${HOME}/.config/enroot/hooks.d/`.
This script can be generated with the `jobreport` tool itself; by default, it is written to `${HOME}/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh`.

```console title="Generate DCGM hook"
$ ./jobreport container-hook
Writing enroot hook to "/users/myuser/.config/enroot/hooks.d/cscs_jobreport_dcgm_hook.sh"
Add the following to your container .toml file:

[annotations]
com.hooks.dcgm.enabled = "true"
```
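
You can verify that the hook script was written to the expected location:

```console
$ ls ${HOME}/.config/enroot/hooks.d/
cscs_jobreport_dcgm_hook.sh
```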

As indicated by the output, the annotation must be added to the container's EDF (`.toml`) file.

```toml title="Example .toml file"
[annotations]
com.hooks.dcgm.enabled = "true"
```

Once the CE is configured, only the EDF file (here `my-edf.toml`) needs to be specified along with a call to `jobreport`:

```console title="Run jobreport in a container"
$ srun --environment=my-edf.toml ./jobreport -- sleep 5
```
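
Since no output directory is given here, the statistics are written to the default `jobreport_<SLURM_JOB_ID>` directory, which can then be printed as in the earlier examples:

```console
$ ./jobreport print jobreport_<SLURM_JOB_ID>
```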

!!! note "Using `jobreport` with other container images"

    `jobreport` works with any container image, as long as the hook is set up and the EDF file has the correct annotation.