
Commit 36945ba

Merge pull request #23 from UnityHPC/feature/add-requested-vram

- Add ability to remove private user data from reports and notebooks
- Polish main README
- Refactor notebooks
- Refactor preprocessing code
- Merge after comments from Ayush

2 parents 64f84c9 + 23ab7aa, commit 36945ba

31 files changed: +13917 −1657 lines

README.md

Lines changed: 86 additions & 105 deletions
@@ -1,72 +1,99 @@
-## Introduction
+# Unity GPU Efficiency Analytics Suite
 
-This repository is a place to contain the tools developed over the course of the DS4CG 2025 summer
-internship project with Unity.
 
-## DS4CG Job Analytics
-
-
-DS4CG Job Analytics is a data analytics and reporting platform developed during the DS4CG 2025 summer internship with Unity. It provides tools for analyzing HPC job data, generating interactive reports, and visualizing resource usage and efficiency.
+This repository is a data analytics and reporting platform developed as part of the [Summer 2025 Data Science for the Common Good (DS4CG) program](https://ds.cs.umass.edu/programs/ds4cg/ds4cg-team-2025) in partnership with the Unity Research Computing Platform. It provides tools for analyzing HPC job data, generating interactive reports, and visualizing resource usage and efficiency.
 
 ## Motivation
 High-performance GPUs are a critical resource on shared clusters, but they are often underutilized due to inefficient job scheduling, over-allocation, or lack of user awareness. Many jobs request more GPU memory or compute than they actually use, leading to wasted resources and longer queue times for others. This project aims to address these issues by providing analytics and reporting tools that help users and administrators understand GPU usage patterns, identify inefficiencies, and make data-driven decisions to improve overall cluster utilization.
 
 ## Project Overview
 This project includes:
 - Python scripts and modules for data preprocessing, analysis, and report generation
-- Jupyter notebooks for interactive exploration and visualization
+- Jupyter notebooks for interactive analysis and visualization
 - Automated report generation scripts (see the `feature/reports` branch for the latest versions)
 - Documentation built with MkDocs and Quarto
 
 ## Example Notebooks
-The following notebooks demonstrate key analyses and visualizations:
-
-- `notebooks/Basic Visualization.ipynb`: Basic plots and metrics
-- `notebooks/Efficiency Analysis.ipynb`: Efficiency metrics and user comparisons
-- `notebooks/Resource Hoarding.ipynb`: Analysis of resource hoarding
-- `notebooks/SlurmGPU.ipynb`: GPU job analysis
+The following notebooks generate comprehensive analyses for two subsets of the data:
 
-See the `notebooks/` directory for more examples.
+- [`notebooks/analysis/No VRAM Use Analysis.ipynb`](notebooks/analysis/No%20VRAM%20Use%20Analysis.ipynb): Analysis of GPU jobs that end up using no VRAM.
+- [`notebooks/analysis/Requested and Used VRAM.ipynb`](notebooks/analysis/Requested%20and%20Used%20VRAM.ipynb): Analysis of GPU jobs that request a specific amount of VRAM.
 
-## Contributing to this repository
+The following notebooks demonstrate key analyses and visualizations:
 
-The following guidelines may prove helpful in maximizing the utility of this repository:
+- [`notebooks/module_demos/Basic Visualization.ipynb`](notebooks/module_demos/Basic%20Visualization.ipynb): Basic plots and metrics
+- [`notebooks/module_demos/Efficiency Analysis.ipynb`](notebooks/module_demos/Efficiency%20Analysis.ipynb): Calculation of efficiency metrics and user comparisons
+- [`notebooks/module_demos/Resource Hoarding.ipynb`](notebooks/module_demos/Resource%20Hoarding.ipynb): Analysis of CPU core and RAM overallocation
 
-- Please avoid committing code unless it is meant to be used by the rest of the team.
-- New code should first be committed in a dedicated branch (```feature/newanalysis``` or ```bugfix/typo```), and later merged into ```main``` following a code review.
-- Shared datasets should usually be managed with a shared folder on Unity, not committed to Git.
-- Prefer committing Python modules with plotting routines like ```scripts/gpu_metrics.py``` instead of Jupyter notebooks, when possible.
+The [`notebooks`](notebooks) directory contains all Jupyter notebooks.
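As a rough illustration of what the "no VRAM use" subset above denotes, the filter can be sketched with pandas over the schema's `GPUs` and `GPUMemUsage` columns. The DataFrame here is a made-up stand-in for the job table, not the notebooks' actual data or code:

```python
import pandas as pd

# Toy stand-in for the Unity job table (schema columns: GPUs, GPUMemUsage).
jobs = pd.DataFrame({
    "JobID": [101, 102, 103, 104],
    "GPUs": [1, 2, 1, 0],
    "GPUMemUsage": [0.0, 8.5e9, 0.0, 0.0],  # bytes
})

# GPU jobs: requested at least one GPU but never touched GPU memory.
no_vram = jobs[(jobs["GPUs"] > 0) & (jobs["GPUMemUsage"] == 0)]
print(no_vram["JobID"].tolist())  # → [101, 103]
```

Jobs with `GPUs == 0` are excluded on purpose: a CPU-only job using no VRAM is expected, not wasteful.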
 
-## Getting started on Unity
 
-You'll need to first install a few dependencies, which include DuckDB, Pandas, and some plotting libraries. More details for running the project will need to be added here later.
+## Documentation
 
-### Version Control
-To provide the path of the git configuration file of this project to git, run:
+This repository uses [MkDocs](https://www.mkdocs.org/) for project documentation. The documentation source files are located in the `docs/` directory and the configuration is in `mkdocs.yml`.
 
-git config --local include.path ../.gitconfig
+To build and serve the documentation locally:
 
-To ensure consistent LF line endings across all platforms, run the following command when developing on Windows machines:
+pip install -r dev-requirements.txt
+mkdocs serve
 
-git config --local core.autocrlf input
+To build the static site:
 
-### Jupyter notebooks
+mkdocs build
 
-You can run Jupyter notebooks on Unity through the OpenOnDemand portal. To make your environment
-visible in Jupyter, run
+To deploy the documentation (e.g., to GitHub Pages):
 
-python -m ipykernel install --user --name "Duck DB"
+mkdocs gh-deploy
 
-from within the environment. This will add "Duck DB" as a kernel option in the dropdown.
+See the [MkDocs documentation](https://www.mkdocs.org/user-guide/) for more details and advanced usage.
+
+### Documenting New Features
+
+For any new features, modules, or major changes, please add a corresponding `.md` file under the `docs/` directory. This helps keep the project documentation up to date and useful for all users and contributors.
 
-By default, Jupyter Notebook outputs are removed via a git filter before the notebook is committed to git. To add an exception and keep the output of a notebook, add the following line to [`notebooks/.gitattributes`](notebooks/.gitattributes):
+## Dataset
 
-<NOTEBOOK_NAME>.ipynb !filter=strip-notebook-output
+The primary dataset for this project is a DuckDB database that contains information about jobs on
+Unity. It is located under ```unity.rc.umass.edu:/modules/admin-resources/reporting/slurm_data.db``` and is updated daily.
+The schema is provided below. In addition to the columns in the DuckDB file, this repository contains tools to add a number of useful derived columns for visualization and analysis.
+
+| Column | Type | Description |
+| :--- | :--- | :------------ |
+| UUID | VARCHAR | Unique identifier |
+| JobID | INTEGER | Slurm job ID |
+| ArrayID | INTEGER | Position in job array |
+| ArrayJobID | INTEGER | Slurm job ID within array |
+| JobName | VARCHAR | Name of job |
+| IsArray | BOOLEAN | Indicator if job is part of an array |
+| Interactive | VARCHAR | Indicator if job was interactive |
+| Preempted | BOOLEAN | Was job preempted |
+| Account | VARCHAR | Slurm account (PI group) |
+| User | VARCHAR | Unity user |
+| Constraints | VARCHAR[] | Job constraints |
+| QOS | VARCHAR | Job QOS |
+| Status | VARCHAR | Job status on termination |
+| ExitCode | VARCHAR | Job exit code |
+| SubmitTime | TIMESTAMP_NS | Job submission time |
+| StartTime | TIMESTAMP_NS | Job start time |
+| EndTime | TIMESTAMP_NS | Job end time |
+| Elapsed | INTEGER | Job runtime (seconds) |
+| TimeLimit | INTEGER | Job time limit (seconds) |
+| Partition | VARCHAR | Job partition |
+| Nodes | VARCHAR | Job nodes as compact string |
+| NodeList | VARCHAR[] | List of job nodes |
+| CPUs | SMALLINT | Number of CPU cores |
+| Memory | INTEGER | Job allocated memory (bytes) |
+| GPUs | SMALLINT | Number of GPUs requested |
+| GPUType | DICT | Dictionary with keys as type of GPU (str) and values as number of GPUs of that type (int) |
+| GPUMemUsage | FLOAT | GPU memory usage (bytes) |
+| GPUComputeUsage | FLOAT | GPU compute usage (pct) |
+| CPUMemUsage | FLOAT | CPU memory usage (bytes) |
+| CPUComputeUsage | FLOAT | CPU compute usage (pct) |
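The derived columns themselves are not enumerated in this diff. Purely as an illustration of the kind of derivation involved, the sketch below computes a hypothetical GPU memory efficiency percentage from the schema's `GPUs` and `GPUMemUsage` columns; the 40 GB-per-GPU capacity is an assumed example figure, not a value taken from the dataset or the repository's code:

```python
import pandas as pd

# Toy rows mirroring a few schema columns (values are illustrative).
df = pd.DataFrame({
    "GPUs": [1, 2],
    "GPUMemUsage": [4.0e9, 16.0e9],  # bytes, per the schema
})

# Hypothetical derived column: share of a nominal 40 GB/GPU actually used.
VRAM_PER_GPU = 40e9  # assumed nominal capacity in bytes (example only)
df["GPUMemEfficiencyPct"] = 100 * df["GPUMemUsage"] / (df["GPUs"] * VRAM_PER_GPU)
print(df["GPUMemEfficiencyPct"].round(1).tolist())  # → [10.0, 20.0]
```

A real derivation would look up each job's actual per-GPU VRAM from the GPU type rather than assume one capacity.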
 
 
 ## Development Environment
 
-To set up your development environment, use the provided `dev-requirements.txt` for all development dependencies (including linting, testing, and documentation tools).
+To set up your development environment, use the provided [`dev-requirements.txt`](dev-requirements.txt) for all development dependencies (including linting, testing, and documentation tools).
 
 This project requires **Python 3.11**. Make sure you have Python 3.11 installed before creating the virtual environment.

@@ -84,7 +111,27 @@ This project requires **Python 3.11**. Make sure you have Python 3.11 installed
 pip install -r requirements.txt
 pip install -r dev-requirements.txt
 
-If you need to reset your environment, you can delete the `duckdb` folder and recreate it as above.
+If you need to reset your environment, you can delete the `duckdb` directory and recreate it as above.
+
+### Version Control
+To provide the path of the git configuration file of this project to git, run:
+
+git config --local include.path ../.gitconfig
+
+To ensure consistent LF line endings across all platforms, run the following command when developing on Windows machines:
+
+git config --local core.autocrlf input
+
+### Jupyter notebooks
+
+You can run Jupyter notebooks on Unity through the OpenOnDemand portal. To make your environment
+visible in Jupyter, run
+
+python -m ipykernel install --user --name "Duck DB"
+
+from within the environment. This will add "Duck DB" as a kernel option in the dropdown.
+
+By default, Jupyter Notebook outputs are removed via a git filter before the notebook is committed to git. To add an exception and keep the output of a notebook, add the file name of the notebook to [`scripts/strip_notebook_exclude.txt`](scripts/strip_notebook_exclude.txt).
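The stripping filter's effect can be sketched in a few lines: a notebook file is JSON, and stripping amounts to clearing each code cell's outputs and execution count. This is an illustration of the idea, not the repository's actual filter implementation:

```python
import json

def strip_outputs(nb: dict) -> dict:
    """Clear outputs and execution counts from a notebook dict in place."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Minimal notebook with one executed code cell.
nb = {"cells": [{"cell_type": "code", "execution_count": 3,
                 "outputs": [{"text": "42"}], "source": ["print(42)"]}]}
stripped = strip_outputs(nb)
print(json.dumps(stripped["cells"][0]["outputs"]))  # → []
```

In git, such a script is typically wired up as the `clean` command of a filter (`git config filter.<name>.clean ...`) and matched to files via `.gitattributes`, as the `*.ipynb filter=strip-notebook-output` line in this commit does.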
 
 ## Code Style & Linting

@@ -142,80 +189,14 @@ All Python code should use [**Google-style docstrings**](https://google.github.i
 """
 # ...function code...
 
-## Documentation
-
-This repository uses [MkDocs](https://www.mkdocs.org/) for project documentation. The documentation source files are located in the `docs/` directory and the configuration is in `mkdocs.yml`.
-
-To build and serve the documentation locally:
-
-pip install -r dev-requirements.txt
-mkdocs serve
-
-To build the static site:
-
-mkdocs build
-
-To deploy the documentation (e.g., to GitHub Pages):
-
-mkdocs gh-deploy
-
-See the [MkDocs documentation](https://www.mkdocs.org/user-guide/) for more details and advanced usage.
-
-### Documenting New Features
-
-For any new features, modules, or major changes, please add a corresponding `.md` file under the `docs/` directory. This helps keep the project documentation up to date and useful for all users and contributors.
-
 ## Testing
 
 To run tests, use the provided test scripts or `pytest` (if available):
 
 pytest
 
 
-### Support
-
-The Unity documentation (https://docs.unity.rc.umass.edu/) has a lot of useful
-background information about Unity in particular and HPC in general. It will help explain a lot of
-the terms used in the dataset schema below. For specific issues with the code in this repo or the
-DuckDB dataset, feel free to reach out to Benjamin Pachev on the Unity Slack.
-
-## The dataset
+## Support
 
-The primary dataset for this project is a DuckDB database that contains information about jobs on
-Unity. It is contained under ```/modules/admin-resources/reporting/slurm_data.db``` and is updated daily.
-A schema is provided below. In addition to the columns in the DuckDB file, ```scripts/gpu_metrics.py```
-contains tools to add a number of useful derived columns for plotting and analysis.
-
-| Column | Type | Description |
-| :--- | :--- | :------------ |
-| UUID | VARCHAR | Unique identifier |
-| JobID | INTEGER | Slurm job ID |
-| ArrayID | INTEGER | Position in job array |
-| ArrayJobID | INTEGER | Slurm job ID within array |
-| JobName | VARCHAR | Name of job |
-| IsArray | BOOLEAN | Indicator if job is part of an array |
-| Interactive | VARCHAR | Indicator if job was interactive |
-| Preempted | BOOLEAN | Was job preempted |
-| Account | VARCHAR | Slurm account (PI group) |
-| User | VARCHAR | Unity user |
-| Constraints | VARCHAR[] | Job constraints |
-| QOS | VARCHAR | Job QOS |
-| Status | VARCHAR | Job status on termination |
-| ExitCode | VARCHAR | Job exit code |
-| SubmitTime | TIMESTAMP_NS | Job submission time |
-| StartTime | TIMESTAMP_NS | Job start time |
-| EndTime | TIMESTAMP_NS | Job end time |
-| Elapsed | INTEGER | Job runtime (seconds) |
-| TimeLimit | INTEGER | Job time limit (seconds) |
-| Partition | VARCHAR | Job partition |
-| Nodes | VARCHAR | Job nodes as compact string |
-| NodeList | VARCHAR[] | List of job nodes |
-| CPUs | SMALLINT | Number of CPU cores |
-| Memory | INTEGER | Job allocated memory (bytes) |
-| GPUs | SMALLINT | Number of GPUs requested |
-| GPUType | DICT | Dictionary with keys as type of GPU (str) and values as number of GPUs of that type (int) |
-| GPUMemUsage | FLOAT | GPU memory usage (bytes) |
-| GPUComputeUsage | FLOAT | GPU compute usage (pct) |
-| CPUMemUsage | FLOAT | CPU memory usage (bytes) |
-| CPUComputeUsage | FLOAT | CPU compute usage (pct) |
+The Unity documentation (https://docs.unity.rc.umass.edu/) has plenty of useful information about Unity and Slurm, which is helpful for understanding the data. For specific issues with the code in this repo or the DuckDB dataset, feel free to reach out to Benjamin Pachev on the Unity Slack.

notebooks/.gitattributes

Lines changed: 1 addition & 4 deletions
@@ -1,4 +1 @@
-# *.ipynb filter=strip-notebook-output
-# # keep the output of the following notebooks when committing
-# SlurmGPU.ipynb -filter=strip-notebook-output
-# notebooks/SlurmGPU.ipynb -filter=strip-notebook-output
+*.ipynb filter=strip-notebook-output

notebooks/Basic Visualization.ipynb

Lines changed: 0 additions & 119 deletions
This file was deleted.
