Commit cbe0908

author
cerlane
committed
added gssr documentation
1 parent 83642a8 commit cbe0908

6 files changed

+193
-0
lines changed

docs/images/gssr/heatmap_eg.png

69.3 KB

docs/images/gssr/timeseries_eg.png

29.1 KB

docs/software/gssr/containers.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
1+
[](){#ref-gssr-containers}
2+
# gssr - Containers Guide
3+
4+
CSCS strongly recommends that all users run their workloads in containers on the Alps platforms. Containers let you configure the user environment you need without being tied to a specific operating system or software stack.
5+
6+
The following guide will explain how to install and use `gssr` within a container.
7+
8+
Most CSCS users start from NVIDIA base containers with CUDA pre-installed. This guide therefore uses a PyTorch base container as its example.
9+
10+
## Preparing a container with `gssr`
11+
12+
### Base Container from NVIDIA
13+
14+
The NVIDIA container most commonly used on Alps is [NVIDIA's PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). The latest version is typically preferred, as it provides the most up-to-date PyTorch functionality.
15+
16+
#### Example: Preparing an NVIDIA PyTorch Containerfile
17+
```dockerfile
18+
FROM --platform=linux/arm64 nvcr.io/nvidia/pytorch:25.08-py3
19+
20+
ENV DEBIAN_FRONTEND=noninteractive
21+
22+
RUN apt-get update \
23+
&& apt-get install -y wget rsync rclone vim git htop nvtop nano \
24+
&& apt-get clean \
25+
&& rm -rf /var/lib/apt/lists/*
26+
27+
# Installing gssr
28+
RUN pip install gssr
29+
30+
# Install your application and dependencies as required
31+
...
32+
```
33+
As the example above shows, `gssr` can be installed with a single `RUN pip install gssr` command. Add the installation steps for your application and its dependencies where the `...` is.
34+
35+
Once your `Containerfile` is ready, you can build it on any Alps platform with the following commands, which create a container image labelled `mycontainer`.
36+
37+
```bash
38+
srun -A {groupID} --pty bash
39+
# Once you have an interactive session, use the podman command to build your container
40+
# -v is to mount the fast storage on Alps into the container.
41+
podman build -v $SCRATCH:$SCRATCH -t mycontainer:0.1 .
42+
# Export the container from podman's cache to a local squashfs file with enroot
43+
enroot import -x mount -o mycontainer.sqsh podman://local:mycontainer:0.1
44+
```
45+
46+
You should now have a `.sqsh` file containing your container image. You can replace the `mycontainer` label with any label of your choice. The version `0.1` can also be omitted or replaced with another version as required.
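Before pointing the container engine at the image, it can be worth checking that the import actually produced a non-empty file. The following is a minimal sketch: the function name `check_sqsh` is hypothetical, and it only tests that the file exists and is non-empty, not that it is a valid squashfs image.

```bash
# check_sqsh: hypothetical helper (not part of gssr or enroot).
# Returns 0 if the given file exists and is non-empty, 1 otherwise.
check_sqsh() {
    image="$1"
    if [ -s "$image" ]; then
        echo "OK: $image"
    else
        echo "Missing or empty: $image" >&2
        return 1
    fi
}
```

For the example above, this could be called as `check_sqsh mycontainer.sqsh` right after the `enroot import` step.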
47+
48+
## Create the CSCS configuration for the container
49+
50+
Next, tell the CSCS container engine where your container image is and how you would like to run it. To do so, create a `{label}.toml` file in your `$HOME/.edf` directory.
51+
52+
### Example of a `mycontainer.toml` file
53+
```toml
54+
image = "/capstor/scratch/cscs/username/directoryWhereYourContainerIs/mycontainer.sqsh"
55+
mounts = ["/capstor/scratch/cscs/username:/capstor/scratch/cscs/username"]
56+
workdir = "/capstor/scratch/cscs/username"
57+
writable = true
58+
59+
[annotations]
60+
com.hooks.dcgm.enabled = "true"
61+
```
62+
63+
The `mounts` line is required if you want `$SCRATCH` to be available in your container. You can also mount specific directories or files from `$HOME` and/or `$SCRATCH` as required. Replace the username and the image directory to match your setup.
64+
65+
To use `gssr` in a container, you need the `dcgm` hook, enabled in the `[annotations]` section, which makes the DCGM libraries available within the container.
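If you prefer to create the configuration file from the shell, the sketch below writes a file with the same keys as the example above. The function name `write_edf` and the `EDF_DIR` override variable are hypothetical and exist only for illustration; by default the file is written to `$HOME/.edf` as described above.

```bash
# write_edf: hypothetical helper that writes <edf_dir>/<label>.toml.
# Arguments: label, absolute path to the .sqsh image, directory to mount.
# EDF_DIR is an illustrative override; the default is $HOME/.edf.
write_edf() {
    label="$1"
    image="$2"
    mount_dir="$3"
    edf_dir="${EDF_DIR:-$HOME/.edf}"
    mkdir -p "$edf_dir"
    # The heredoc body mirrors the mycontainer.toml example, including the dcgm hook.
    cat > "$edf_dir/$label.toml" <<EOF
image = "$image"
mounts = ["$mount_dir:$mount_dir"]
workdir = "$mount_dir"
writable = true

[annotations]
com.hooks.dcgm.enabled = "true"
EOF
}
```

A call such as `write_edf mycontainer "$SCRATCH/directoryWhereYourContainerIs/mycontainer.sqsh" "$SCRATCH"` would then produce `$HOME/.edf/mycontainer.toml`.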
66+
67+
### Run the application and container with gssr
68+
69+
To invoke `gssr`, you can do the following in your sbatch file.
70+
71+
#### Example of a mycontainer.sbatch file
72+
```bash
73+
#!/bin/bash
74+
#SBATCH -N4
75+
#SBATCH -A groupname
76+
#SBATCH -J mycontainer
77+
#SBATCH -t 1:00:00
78+
#SBATCH ...
79+
80+
srun --environment=mycontainer bash -c 'gssr profile --wrap="python mycode.py"'
81+
82+
```
83+
84+
Replace `...` with any other SBATCH options that your job requires.
85+
The `--environment` flag tells Slurm which container (name of the toml file) you would like to run.
86+
The `bash -c` wrapper is required to initialise the bash environment within your container.
87+
88+
Without `gssr`, the `srun` command for your container would look like this:
89+
90+
```bash
91+
srun --environment=mycontainer bash -c 'python mycode.py'
92+
```
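To switch between the profiled and unprofiled `srun` variants above without editing the script, one option is a small helper that builds the inner command. This is only a sketch: the function name `build_app_cmd` and the `USE_GSSR` variable are hypothetical, and the `profile` subcommand is taken from the Quickstart Guide.

```bash
# build_app_cmd: hypothetical helper that prints the command to run inside the container.
# If USE_GSSR is 1 (the default in this sketch), the application is wrapped with gssr;
# any other value prints the application command unchanged.
build_app_cmd() {
    app="$1"
    if [ "${USE_GSSR:-1}" = "1" ]; then
        printf 'gssr profile --wrap="%s"\n' "$app"
    else
        printf '%s\n' "$app"
    fi
}

# In an sbatch script this could be used as:
#   srun --environment=mycontainer bash -c "$(build_app_cmd 'python mycode.py')"
```

Submitting with `USE_GSSR=0` exported in the job environment would then run the application without profiling.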
93+
94+
You are now ready to submit your sbatch file to Slurm with the `sbatch` command.
95+
96+
## Analyze the output
97+
98+
Once your job has completed successfully, you should find a folder named `profile_out_{slurm_jobid}` containing the `gssr` JSON output. This data can then be analysed to produce metrics and reports.
99+
100+
You can run the analysis interactively within the container where `gssr` is installed.
101+
102+
To get an interactive session in your container:
103+
104+
```bash
105+
srun -A groupname --environment=mycontainer --pty bash
106+
cd {directory where the gssr output data is generated}
107+
```
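If several jobs have written to the same directory, it may help to locate the most recent output folder before analysing it. The sketch below assumes only the `profile_out_{slurm_jobid}` naming described above; the function name `latest_profile_dir` is hypothetical.

```bash
# latest_profile_dir: hypothetical helper that prints the most recently
# modified profile_out_* directory under the given directory (default: .).
# Returns 1 if no such directory exists.
latest_profile_dir() {
    base="${1:-.}"
    newest=""
    for d in "$base"/profile_out_*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$d" ] || continue
        if [ -z "$newest" ] || [ "$d" -nt "$newest" ]; then
            newest="$d"
        fi
    done
    [ -n "$newest" ] || return 1
    # Strip the trailing slash added by the glob.
    printf '%s\n' "${newest%/}"
}
```

Inside the interactive session, `cd "$(latest_profile_dir .)"` would then change into the newest output folder.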
108+
Alternatively, you can install `gssr` locally, copy the `profile_out_{slurm_jobid}` folder to your computer and visualise it there.
109+
110+
#### Metric Output
111+
The profiled output can be analysed as follows:
112+
113+
gssr analyze -i ./profile_out
114+
115+
#### PDF File Output with Plots
116+
117+
gssr analyze -i ./profile_out --report
118+
119+
One or more PDF reports will be generated.
120+

docs/software/gssr/index.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
[](){#ref-gssr-overview}
2+
# gssr
3+
4+
GPU Saturation Scorer (gssr) provides a simple way to profile your code and get the results in both tables and plots for easy visualisation. gssr works on top of [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) and thus only NVIDIA GPUs are currently supported.
5+
6+
The following guides are available:
7+
8+
* [Quickstart Guide][ref-gssr-quickstart]
9+
* [Container Guide][ref-gssr-containers]
10+
11+
The tool produces time-series plots and heatmaps of the profiled metric values. Below is an example set of plots generated by the tool for the Megatron-LLM application from EPFL.
12+
13+
![gssr timeseries](../../images/gssr/timeseries_eg.png)
14+
![gssr heatmap](../../images/gssr/heatmap_eg.png)

docs/software/gssr/quickstart.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
1+
[](){#ref-gssr-quickstart}
2+
# gssr - Quickstart Guide
3+
4+
## Installation
5+
6+
### From PyPI
7+
8+
`gssr` can be installed as follows:
9+
10+
pip install gssr
11+
12+
### From GitHub Source
13+
14+
To install directly from the source:
15+
16+
pip install git+https://github.com/eth-cscs/GPU-saturation-scorer.git
17+
18+
To install from a specific branch, e.g. the development branch, from the source:
19+
20+
pip install git+https://github.com/eth-cscs/GPU-saturation-scorer.git@dev
21+
22+
To install a specific release tag, e.g. gssr-v0.3, from the source:
23+
24+
pip install git+https://github.com/eth-cscs/GPU-saturation-scorer.git@gssr-v0.3
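After any of the install routes above, a quick way to confirm that the `gssr` executable is reachable is to look it up on `PATH`. The helper name `require_cmd` below is hypothetical; the sketch only checks that the command exists and does not call any `gssr` option.

```bash
# require_cmd: hypothetical helper that fails with a message
# if the named command cannot be found on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "not found on PATH: $1" >&2
        return 1
    fi
}
```

Running `require_cmd gssr` at the top of a batch script lets the job stop early if the installation is missing from the active environment.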
25+
26+
## Profile
27+
28+
### Example
29+
30+
If you are submitting a batch job and the command you are executing is:
31+
32+
srun python test.py
33+
34+
The corresponding `srun` command should be modified as follows:
35+
36+
srun gssr profile --wrap="python test.py"
37+
38+
* The `gssr` option to run is `profile`
39+
* The `--wrap` flag wraps the command that you would like to run
40+
* The default output directory is `profile_out_{slurm_job_id}`
41+
* A label for the output data can be set with the `-l` flag
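Because the default output directory is derived from the Slurm job ID, a batch script can compute its expected name for later steps. This is a minimal sketch with a hypothetical function name, `gssr_default_outdir`; it relies on the `SLURM_JOB_ID` variable that Slurm sets inside a job and does not account for a label set with `-l`.

```bash
# gssr_default_outdir: hypothetical helper that prints the default output
# directory name, profile_out_{slurm_job_id}.
# Takes the job ID as an argument, falling back to $SLURM_JOB_ID.
gssr_default_outdir() {
    job_id="${1:-${SLURM_JOB_ID:-}}"
    if [ -z "$job_id" ]; then
        echo "no job ID given and SLURM_JOB_ID is unset" >&2
        return 1
    fi
    printf 'profile_out_%s\n' "$job_id"
}
```

Within a job, `outdir=$(gssr_default_outdir)` gives the directory to pass to the analysis step.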
42+
43+
## Analyze
44+
45+
### Metric Output
46+
The profiled output can be analysed as follows:
47+
48+
gssr analyze -i ./profile_out
49+
50+
### PDF File Output with Plots
51+
52+
gssr analyze -i ./profile_out --report
53+
54+
One or more PDF reports will be generated.
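When several profiled jobs have left output in the same place, the report command can be repeated per directory. The sketch below only prints the commands as a dry run so they can be reviewed first; the function name `print_analyze_cmds` is hypothetical, and it assumes the default `profile_out_*` directory naming.

```bash
# print_analyze_cmds: hypothetical helper that prints (does not run) one
# `gssr analyze ... --report` command per profile_out_* directory
# found under the given path (default: .).
# Note: paths containing spaces would need extra quoting before execution.
print_analyze_cmds() {
    base="${1:-.}"
    for d in "$base"/profile_out_*/; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$d" ] || continue
        printf 'gssr analyze -i %s --report\n' "${d%/}"
    done
}
```

Piping the output to `sh`, for example `print_analyze_cmds . | sh`, would run the printed commands once they look right.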
55+

mkdocs.yml

Lines changed: 4 additions & 0 deletions
@@ -82,6 +82,10 @@ nav:
8282
- 'Building uenv': software/uenv/build.md
8383
- 'Deploying uenv': software/uenv/deploy.md
8484
- 'Release notes': software/uenv/release-notes.md
85+
- 'gssr':
86+
- software/gssr/index.md
87+
- 'Quickstart Guide': software/gssr/quickstart.md
88+
- 'Container Guide': software/gssr/containers.md
8589
- 'Debugging and Performance Analysis':
8690
- software/devtools/index.md
8791
- 'Using NVIDIA Nsight': software/devtools/nvidia-nsight.md
