Merged

Changes from all commits (32 commits)
6d2a83b
Add requested vram and update EfficiencyAnalysis accordingly
MisterArdavan Aug 17, 2025
c2709de
Add analysis notebook for no vram usage
MisterArdavan Aug 18, 2025
66b17a4
Add vram_hours histogram plot
MisterArdavan Aug 18, 2025
d80a4f5
Add 'Requested and Used VRAM' notebook
MisterArdavan Aug 18, 2025
0308142
Add resource hoarding analysis to Requested and Used VRAM notebook
MisterArdavan Aug 18, 2025
c22fce6
Refactor and organize notebooks
MisterArdavan Aug 18, 2025
ffa0aa2
Add PI group visualization code as a new class
MisterArdavan Aug 18, 2025
980732c
Add histogram visualization for score metrics to the visualization class
MisterArdavan Aug 18, 2025
2a72f64
Add directory name changes ignores previously
MisterArdavan Aug 19, 2025
f60ea00
Add histogram visualization for positive metrics
MisterArdavan Aug 19, 2025
a57030a
Update directory names
MisterArdavan Aug 19, 2025
7c29e4a
Fix notebook directory names
MisterArdavan Aug 19, 2025
0ead8e8
Add anonymization option to visualization modules for efficiency anal…
MisterArdavan Aug 19, 2025
d5eeb0d
Add anonymization option to preprocessing
MisterArdavan Aug 19, 2025
4b15924
Update Requested and Used VRAM notebook to anonymize outputs
MisterArdavan Aug 20, 2025
c9e8b58
Update .gitattributes to add a notebook
MisterArdavan Aug 20, 2025
2b861be
Refactor jupyter notebook cleaning script
MisterArdavan Aug 20, 2025
248d36f
Add anonymization option to preprocessing to omit full paths in outpu…
MisterArdavan Aug 20, 2025
bfb003d
Monkeypatch showwarning in preproessing to not include the path where…
MisterArdavan Aug 20, 2025
05bc380
Update GPU VRAM Usage histogram and make number in ranking bar plot e…
MisterArdavan Aug 20, 2025
c02db24
Fix ruff error
MisterArdavan Aug 21, 2025
dddb413
Remove global variables in preprocessing
MisterArdavan Aug 21, 2025
cb119d8
Remove incorrect filter in job cpu core overallocation section of the…
MisterArdavan Aug 21, 2025
f7a9425
Remove methods in efficiency analysis that were dead code and were re…
MisterArdavan Aug 21, 2025
b558c57
Update mvp notebook metadata
MisterArdavan Sep 4, 2025
d4704f4
Resolve merge conflict
MisterArdavan Sep 4, 2025
a333a23
Add anonymization option to new utility functions and update configur…
MisterArdavan Sep 4, 2025
7caf692
Refactor no vram use notebook and keep its anonymized outputs
MisterArdavan Sep 4, 2025
33068e4
Refactor Efficiency Analysis notebook to use the new utility function…
MisterArdavan Sep 5, 2025
14c5a2f
Add Attribute Visualization.ipynb to the list of notebooks to keep pu…
MisterArdavan Sep 5, 2025
7dfdf00
Merge branch 'main' of github.com:UnityHPC/ds4cg-job-analytics into f…
MisterArdavan Sep 10, 2025
23ab7aa
Refactor README
MisterArdavan Sep 11, 2025
191 changes: 86 additions & 105 deletions README.md
@@ -1,72 +1,99 @@
## Introduction
# Unity GPU Efficiency Analytics Suite

This repository is a place to contain the tools developed over the course of the DS4CG 2025 summer
internship project with Unity.

## DS4CG Job Analytics


DS4CG Job Analytics is a data analytics and reporting platform developed during the DS4CG 2025 summer internship with Unity. It provides tools for analyzing HPC job data, generating interactive reports, and visualizing resource usage and efficiency.
This repository is a data analytics and reporting platform developed as part of the [Summer 2025 Data Science for the Common Good (DS4CG) program](https://ds.cs.umass.edu/programs/ds4cg/ds4cg-team-2025) in partnership with the Unity Research Computing Platform. It provides tools for analyzing HPC job data, generating interactive reports, and visualizing resource usage and efficiency.

## Motivation
High-performance GPUs are a critical resource on shared clusters, but they are often underutilized due to inefficient job scheduling, over-allocation, or lack of user awareness. Many jobs request more GPU memory or compute than they actually use, leading to wasted resources and longer queue times for others. This project aims to address these issues by providing analytics and reporting tools that help users and administrators understand GPU usage patterns, identify inefficiencies, and make data-driven decisions to improve overall cluster utilization.

## Project Overview
This project includes:
- Python scripts and modules for data preprocessing, analysis, and report generation
- Jupyter notebooks for interactive exploration and visualization
- Jupyter notebooks for interactive analysis and visualization
- Automated report generation scripts (see the `feature/reports` branch for the latest versions)
- Documentation built with MkDocs and Quarto

## Example Notebooks
The following notebooks demonstrate key analyses and visualizations:

- `notebooks/Basic Visualization.ipynb`: Basic plots and metrics
- `notebooks/Efficiency Analysis.ipynb`: Efficiency metrics and user comparisons
- `notebooks/Resource Hoarding.ipynb`: Analysis of resource hoarding
- `notebooks/SlurmGPU.ipynb`: GPU job analysis
The following notebooks provide comprehensive analyses for two subsets of the data:

See the `notebooks/` directory for more examples.
- [`notebooks/analysis/No VRAM Use Analysis.ipynb`](notebooks/analysis/No%20VRAM%20Use%20Analysis.ipynb): Analysis of GPU jobs that end up using no VRAM.
- [`notebooks/analysis/Requested and Used VRAM.ipynb`](notebooks/analysis/Requested%20and%20Used%20VRAM.ipynb): Analysis of GPU jobs that request a specific amount of VRAM.

## Contributing to this repository
The following notebooks demonstrate key analyses and visualizations:

The following guidelines may prove helpful in maximizing the utility of this repository:
- [`notebooks/module_demos/Basic Visualization.ipynb`](notebooks/module_demos/Basic%20Visualization.ipynb): Basic plots and metrics
- [`notebooks/module_demos/Efficiency Analysis.ipynb`](notebooks/module_demos/Efficiency%20Analysis.ipynb): Calculation of efficiency metrics and user comparisons
- [`notebooks/module_demos/Resource Hoarding.ipynb`](notebooks/module_demos/Resource%20Hoarding.ipynb): Analysis of CPU core and RAM overallocation

- Please avoid committing code unless it is meant to be used by the rest of the team.
- New code should first be committed in a dedicated branch (```feature/newanalysis``` or ```bugfix/typo```), and later merged into ```main``` following a code review.
- Shared datasets should usually be managed with a shared folder on Unity, not committed to Git.
- Prefer committing Python modules with plotting routines like ```scripts/gpu_metrics.py``` instead of Jupyter notebooks, when possible.
The [`notebooks`](notebooks) directory contains all Jupyter notebooks.

## Getting started on Unity

You'll first need to install a few dependencies, including DuckDB, Pandas, and some plotting libraries. More details for running the project will need to be added here later.
## Documentation

### Version Control
To include this project's git configuration file in your local git config, run:
This repository uses [MkDocs](https://www.mkdocs.org/) for project documentation. The documentation source files are located in the `docs/` directory and the configuration is in `mkdocs.yml`.

git config --local include.path ../.gitconfig
To build and serve the documentation locally:

To ensure consistent LF line endings across all platforms, run the following command when developing on Windows machines:
pip install -r dev-requirements.txt
mkdocs serve

git config --local core.autocrlf input
To build the static site:

### Jupyter notebooks
mkdocs build

You can run Jupyter notebooks on Unity through the OpenOnDemand portal. To make your environment
visible in Jupyter, run
To deploy the documentation (e.g., to GitHub Pages):

python -m ipykernel install --user --name "Duck DB"
mkdocs gh-deploy

from within the environment. This will add "Duck DB" as a kernel option in the dropdown.
See the [MkDocs documentation](https://www.mkdocs.org/user-guide/) for more details and advanced usage.

### Documenting New Features

For any new features, modules, or major changes, please add a corresponding `.md` file under the `docs/` directory. This helps keep the project documentation up to date and useful for all users and contributors.

By default, Jupyter Notebook outputs are removed via a git filter before the notebook is committed to git. To add an exception and keep the output of a notebook, add the following line to [`notebooks/.gitattributes`](notebooks/.gitattributes):
## Dataset

<NOTEBOOK_NAME>.ipynb !filter=strip-notebook-output
The primary dataset for this project is a DuckDB database that contains information about jobs on
Unity. It is located under ```unity.rc.umass.edu:/modules/admin-resources/reporting/slurm_data.db``` and is updated daily.
The schema is provided below. In addition to the columns in the DuckDB file, this repository contains tools to add a number of useful derived columns for visualization and analysis.

| Column | Type | Description |
| :--- | :--- | :------------ |
| UUID | VARCHAR | Unique identifier |
| JobID | INTEGER | Slurm job ID |
| ArrayID | INTEGER | Position in job array |
| ArrayJobID | INTEGER | Slurm job ID within array |
| JobName | VARCHAR | Name of job |
| IsArray | BOOLEAN | Indicator if job is part of an array |
| Interactive | VARCHAR | Indicator if job was interactive |
| Preempted | BOOLEAN | Was job preempted |
| Account | VARCHAR | Slurm account (PI group) |
| User | VARCHAR | Unity user |
| Constraints | VARCHAR[] | Job constraints |
| QOS | VARCHAR | Job QOS |
| Status | VARCHAR | Job status on termination |
| ExitCode | VARCHAR | Job exit code |
| SubmitTime | TIMESTAMP_NS | Job submission time |
| StartTime | TIMESTAMP_NS | Job start time |
| EndTime | TIMESTAMP_NS | Job end time |
| Elapsed | INTEGER | Job runtime (seconds) |
| TimeLimit | INTEGER | Job time limit (seconds) |
| Partition | VARCHAR | Job partition |
| Nodes | VARCHAR | Job nodes as compact string |
| NodeList | VARCHAR[] | List of job nodes |
| CPUs | SMALLINT | Number of CPU cores |
| Memory | INTEGER | Job allocated memory (bytes) |
| GPUs | SMALLINT | Number of GPUs requested |
| GPUType | DICT | Dictionary with keys as type of GPU (str) and the values as number of GPUs corresponding to that type (int) |
| GPUMemUsage | FLOAT | GPU memory usage (bytes) |
| GPUComputeUsage | FLOAT | GPU compute usage (pct) |
| CPUMemUsage | FLOAT | CPU memory usage (bytes) |
| CPUComputeUsage | FLOAT | CPU compute usage (pct) |
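
For illustration, a minimal sketch of querying this database from Python. Hedged assumptions not confirmed by this README: the table is named `jobs` and completed jobs carry `Status = 'COMPLETED'`.

```python
# Minimal sketch: query the shared Slurm jobs database with DuckDB.
# Assumptions (not confirmed by this README): the table is named "jobs"
# and completed jobs have Status = 'COMPLETED'.
import duckdb

con = duckdb.connect(
    "/modules/admin-resources/reporting/slurm_data.db", read_only=True
)

# Average GPU memory usage per partition for completed GPU jobs, in GB.
df = con.execute(
    """
    SELECT Partition,
           COUNT(*) AS n_jobs,
           AVG(GPUMemUsage) / 1e9 AS avg_gpu_mem_gb
    FROM jobs
    WHERE GPUs > 0 AND Status = 'COMPLETED'
    GROUP BY Partition
    ORDER BY n_jobs DESC
    """
).df()
print(df.head())
```

Opening the database in read-only mode avoids taking a write lock on a file that is shared and updated daily.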


## Development Environment

To set up your development environment, use the provided `dev-requirements.txt` for all development dependencies (including linting, testing, and documentation tools).
To set up your development environment, use the provided [`dev-requirements.txt`](dev-requirements.txt) for all development dependencies (including linting, testing, and documentation tools).

This project requires **Python 3.11**. Make sure you have Python 3.11 installed before creating the virtual environment.

@@ -84,7 +111,27 @@ This project requires **Python 3.11**. Make sure you have Python 3.11 installed
pip install -r requirements.txt
pip install -r dev-requirements.txt

If you need to reset your environment, you can delete the `duckdb` folder and recreate it as above.
If you need to reset your environment, you can delete the `duckdb` directory and recreate it as above.

### Version Control
To include this project's git configuration file in your local git config, run:

git config --local include.path ../.gitconfig

To ensure consistent LF line endings across all platforms, run the following command when developing on Windows machines:

git config --local core.autocrlf input

### Jupyter notebooks

You can run Jupyter notebooks on Unity through the OpenOnDemand portal. To make your environment
visible in Jupyter, run

python -m ipykernel install --user --name "Duck DB"

from within the environment. This will add "Duck DB" as a kernel option in the dropdown.

By default, Jupyter Notebook outputs are removed via a git filter before the notebook is committed to git. To add an exception and keep the output of a notebook, add the file name of the notebook to [`scripts/strip_notebook_exclude.txt`](scripts/strip_notebook_exclude.txt).
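
For example, the exclude list might contain one notebook file name per line (illustrative contents only; the actual list is consumed by the `clean_notebook.sh` script and may differ):

```
No VRAM Use Analysis.ipynb
Attribute Visualization.ipynb
```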

## Code Style & Linting

@@ -142,80 +189,14 @@ All Python code should use [**Google-style docstrings**](https://google.github.io/styleguide/pyguide.html)
"""
# ...function code...
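
For illustration, a minimal sketch of a complete Google-style docstring on a hypothetical helper (not an actual function in this repository):

```python
def vram_efficiency(used_bytes: float, requested_bytes: float) -> float:
    """Compute the fraction of requested VRAM that a job actually used.

    Args:
        used_bytes: Peak GPU memory usage of the job, in bytes.
        requested_bytes: GPU memory requested by the job, in bytes.

    Returns:
        The ratio of used to requested VRAM, capped at 1.0.

    Raises:
        ValueError: If ``requested_bytes`` is not positive.
    """
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return min(used_bytes / requested_bytes, 1.0)
```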

## Documentation

This repository uses [MkDocs](https://www.mkdocs.org/) for project documentation. The documentation source files are located in the `docs/` directory and the configuration is in `mkdocs.yml`.

To build and serve the documentation locally:

pip install -r dev-requirements.txt
mkdocs serve

To build the static site:

mkdocs build

To deploy the documentation (e.g., to GitHub Pages):

mkdocs gh-deploy

See the [MkDocs documentation](https://www.mkdocs.org/user-guide/) for more details and advanced usage.

### Documenting New Features

For any new features, modules, or major changes, please add a corresponding `.md` file under the `docs/` directory. This helps keep the project documentation up to date and useful for all users and contributors.

## Testing

To run tests, use the provided test scripts or `pytest` (if available):

pytest
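
For example, a hypothetical test file (the `vram_efficiency` helper from the docstring sketch above is inlined so the example is self-contained; it is not an actual module in this repository):

```python
# tests/test_vram_efficiency.py: hypothetical example test.
import pytest


def vram_efficiency(used_bytes: float, requested_bytes: float) -> float:
    """Fraction of requested VRAM actually used, capped at 1.0."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    return min(used_bytes / requested_bytes, 1.0)


def test_half_used():
    assert vram_efficiency(5e9, 10e9) == pytest.approx(0.5)


def test_zero_request_raises():
    with pytest.raises(ValueError):
        vram_efficiency(1e9, 0.0)
```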


### Support

The Unity documentation (https://docs.unity.rc.umass.edu/) has a lot of useful
background information about Unity in particular and HPC in general. It will help explain a lot of
the terms used in the dataset schema below. For specific issues with the code in this repo or the
DuckDB dataset, feel free to reach out to Benjamin Pachev on the Unity Slack.

## The dataset
## Support

The primary dataset for this project is a DuckDB database that contains information about jobs on
Unity. It is contained under ```/modules/admin-resources/reporting/slurm_data.db``` and is updated daily.
A schema is provided below. In addition to the columns in the DuckDB file, ```scripts/gpu_metrics.py```
contains tools to add a number of useful derived columns for plotting and analysis.

| Column | Type | Description |
| :--- | :--- | :------------ |
| UUID | VARCHAR | Unique identifier |
| JobID | INTEGER | Slurm job ID |
| ArrayID | INTEGER | Position in job array |
| ArrayJobID | INTEGER | Slurm job ID within array |
| JobName | VARCHAR | Name of job |
| IsArray | BOOLEAN | Indicator if job is part of an array |
| Interactive | VARCHAR | Indicator if job was interactive |
| Preempted | BOOLEAN | Was job preempted |
| Account | VARCHAR | Slurm account (PI group) |
| User | VARCHAR | Unity user |
| Constraints | VARCHAR[] | Job constraints |
| QOS | VARCHAR | Job QOS |
| Status | VARCHAR | Job status on termination |
| ExitCode | VARCHAR | Job exit code |
| SubmitTime | TIMESTAMP_NS | Job submission time |
| StartTime | TIMESTAMP_NS | Job start time |
| EndTime | TIMESTAMP_NS | Job end time |
| Elapsed | INTEGER | Job runtime (seconds) |
| TimeLimit | INTEGER | Job time limit (seconds) |
| Partition | VARCHAR | Job partition |
| Nodes | VARCHAR | Job nodes as compact string |
| NodeList | VARCHAR[] | List of job nodes |
| CPUs | SMALLINT | Number of CPU cores |
| Memory | INTEGER | Job allocated memory (bytes) |
| GPUs | SMALLINT | Number of GPUs requested |
| GPUType | DICT | Dictionary with keys as type of GPU (str) and the values as number of GPUs corresponding to that type (int) |
| GPUMemUsage | FLOAT | GPU memory usage (bytes) |
| GPUComputeUsage | FLOAT | GPU compute usage (pct) |
| CPUMemUsage | FLOAT | CPU memory usage (bytes) |
| CPUComputeUsage | FLOAT | CPU compute usage (pct) |
The Unity documentation (https://docs.unity.rc.umass.edu/) has plenty of useful information about Unity and Slurm, which is helpful for understanding the data. For specific issues with the code in this repo or the DuckDB dataset, feel free to reach out to Benjamin Pachev on the Unity Slack.

5 changes: 1 addition & 4 deletions notebooks/.gitattributes
@@ -1,4 +1 @@
# *.ipynb filter=strip-notebook-output
# # keep the output of the following notebooks when committing
# SlurmGPU.ipynb -filter=strip-notebook-output
# notebooks/SlurmGPU.ipynb -filter=strip-notebook-output
*.ipynb filter=strip-notebook-output
Collaborator:

why are we not keeping the SlurmGPU output now?

Collaborator Author:

The list of notebooks to not filter out is handled by the script clean_notebook.sh, so this was unnecessary.

119 changes: 0 additions & 119 deletions notebooks/Basic Visualization.ipynb

This file was deleted.
