## DS4CG Job Analytics
This repository is a data analytics and reporting platform developed as part of the [Summer 2025 Data Science for the Common Good (DS4CG) program](https://ds.cs.umass.edu/programs/ds4cg/ds4cg-team-2025) in partnership with the Unity Research Computing Platform. It provides tools for analyzing HPC job data, generating interactive reports, and visualizing resource usage and efficiency.
## Motivation
High-performance GPUs are a critical resource on shared clusters, but they are often underutilized due to inefficient job scheduling, over-allocation, or lack of user awareness. Many jobs request more GPU memory or compute than they actually use, leading to wasted resources and longer queue times for others. This project aims to address these issues by providing analytics and reporting tools that help users and administrators understand GPU usage patterns, identify inefficiencies, and make data-driven decisions to improve overall cluster utilization.
## Project Overview
This project includes:
- Python scripts and modules for data preprocessing, analysis, and report generation
- Jupyter notebooks for interactive analysis and visualization
- Automated report generation scripts (see the `feature/reports` branch for the latest versions)
- Documentation built with MkDocs and Quarto
## Example Notebooks
The following notebooks generate comprehensive analyses for two subsets of the data:
- [`notebooks/analysis/No VRAM Use Analysis.ipynb`](notebooks/analysis/No%20VRAM%20Use%20Analysis.ipynb): Analysis of GPU jobs that end up using no VRAM.
- [`notebooks/analysis/Requested and Used VRAM.ipynb`](notebooks/analysis/Requested%20and%20Used%20VRAM.ipynb): Analysis of GPU jobs that request a specific amount of VRAM.

The following notebooks demonstrate key analyses and visualizations:
- [`notebooks/module_demos/Basic Visualization.ipynb`](notebooks/module_demos/Basic%20Visualization.ipynb): Basic plots and metrics
- [`notebooks/module_demos/Efficiency Analysis.ipynb`](notebooks/module_demos/Efficiency%20Analysis.ipynb): Calculation of efficiency metrics and user comparisons
- [`notebooks/module_demos/Resource Hoarding.ipynb`](notebooks/module_demos/Resource%20Hoarding.ipynb): Analysis of CPU core and RAM overallocation

The [`notebooks`](notebooks) directory contains all Jupyter notebooks.
## Documentation
This repository uses [MkDocs](https://www.mkdocs.org/) for project documentation. The documentation source files are located in the `docs/` directory and the configuration is in `mkdocs.yml`.

To build and serve the documentation locally:
```
pip install -r dev-requirements.txt
mkdocs serve
```
To build the static site:
```
mkdocs build
```
To deploy the documentation (e.g., to GitHub Pages):
```
mkdocs gh-deploy
```
See the [MkDocs documentation](https://www.mkdocs.org/user-guide/) for more details and advanced usage.
### Documenting New Features
For any new features, modules, or major changes, please add a corresponding `.md` file under the `docs/` directory. This helps keep the project documentation up to date and useful for all users and contributors.
## The dataset
The primary dataset for this project is a DuckDB database that contains information about jobs on Unity. It is located under ```unity.rc.umass.edu:/modules/admin-resources/reporting/slurm_data.db``` and is updated daily.

The schema is provided below. In addition to the columns in the DuckDB file, this repository contains tools to add a number of useful derived columns for visualization and analysis.

| Column | Type | Description |
| :--- | :--- | :------------ |
| UUID | VARCHAR | Unique identifier |
| JobID | INTEGER | Slurm job ID |
| ArrayID | INTEGER | Position in job array |
| ArrayJobID | INTEGER | Slurm job ID within array |
| JobName | VARCHAR | Name of job |
| IsArray | BOOLEAN | Indicator if job is part of an array |
| Interactive | VARCHAR | Indicator if job was interactive |
| Preempted | BOOLEAN | Was job preempted |
| Account | VARCHAR | Slurm account (PI group) |
| User | VARCHAR | Unity user |
| Constraints | VARCHAR[] | Job constraints |
| QOS | VARCHAR | Job QOS |
| Status | VARCHAR | Job status on termination |
| ExitCode | VARCHAR | Job exit code |
| SubmitTime | TIMESTAMP_NS | Job submission time |
| StartTime | TIMESTAMP_NS | Job start time |
| EndTime | TIMESTAMP_NS | Job end time |
| Elapsed | INTEGER | Job runtime (seconds) |
| TimeLimit | INTEGER | Job time limit (seconds) |
| CPUComputeUsage | FLOAT | CPU compute usage (pct) |
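As a quick orientation, the database can be queried directly with the DuckDB Python API. A minimal sketch, assuming you are on Unity and noting that the table name `jobs` is an assumption (the README does not name the table; check with `SHOW TABLES`):

```python
import duckdb

# Connect read-only to the job database (path as mounted on Unity).
con = duckdb.connect("/modules/admin-resources/reporting/slurm_data.db", read_only=True)

# List the tables actually present in the file.
print(con.sql("SHOW TABLES"))

# Example query: the ten longest-running jobs.
# NOTE: the table name "jobs" is an assumption, not confirmed by this README.
print(con.sql('SELECT JobID, "User", Elapsed FROM jobs ORDER BY Elapsed DESC LIMIT 10'))
```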
## Development Environment
To set up your development environment, use the provided [`dev-requirements.txt`](dev-requirements.txt) for all development dependencies (including linting, testing, and documentation tools).

This project requires **Python 3.11**. Make sure you have Python 3.11 installed before creating the virtual environment.
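A minimal sketch of creating and activating the environment, assuming it lives in a local `duckdb` directory (the reset note below deletes and recreates that directory):

```bash
# Assumption: the project virtual environment is a local directory named "duckdb".
python3.11 -m venv duckdb
source duckdb/bin/activate
```

With the environment active, install the dependencies: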
```
pip install -r requirements.txt
pip install -r dev-requirements.txt
```
If you need to reset your environment, you can delete the `duckdb` directory and recreate it as above.
### Version Control
To provide the path of the git configuration file of this project to git, run:
```
git config --local include.path ../.gitconfig
```
To ensure consistent LF line endings across all platforms, run the following command when developing on Windows machines:
```
git config --local core.autocrlf input
```
### Jupyter notebooks
You can run Jupyter notebooks on Unity through the OpenOnDemand portal. To make your environment visible in Jupyter, register it as a kernel from within the environment. This will add "Duck DB" as a kernel option in the dropdown.
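A typical registration command, assuming `ipykernel` is installed in the environment (the exact invocation may differ):

```bash
# Assumed invocation: register the active environment as a Jupyter kernel
# under the display name "Duck DB" referenced above.
python -m ipykernel install --user --name duckdb --display-name "Duck DB"
```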
By default, Jupyter Notebook outputs are removed via a git filter before the notebook is committed to git. To add an exception and keep the output of a notebook, add the file name of the notebook to [`scripts/strip_notebook_exclude.txt`](scripts/strip_notebook_exclude.txt).
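For example, to keep the outputs of the `Basic Visualization.ipynb` demo notebook, the exclusion file would contain a line like this (the one-file-name-per-line format is an assumption):

```
Basic Visualization.ipynb
```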
## Code Style & Linting
All Python code should use [**Google-style docstrings**](https://google.github.io/styleguide/pyguide.html).
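A minimal sketch of the expected format (the function and its fields are illustrative, not taken from the repository):

```python
def mean_gpu_usage(usages: list[float]) -> float:
    """Compute the mean GPU usage across jobs.

    Args:
        usages: Per-job GPU usage percentages.

    Returns:
        The mean GPU usage as a percentage.
    """
    # ...function code...
    return sum(usages) / len(usages)
```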
## Testing
To run tests, use the provided test scripts or `pytest` (if available):
```
pytest
```
## Support
The Unity documentation (https://docs.unity.rc.umass.edu/) has plenty of useful information about Unity and Slurm that will help in understanding the data. For specific issues with the code in this repo or the DuckDB dataset, feel free to reach out to Benjamin Pachev on the Unity Slack.