Merged
19 commits
ecbaf64
Merge branch 'fix/remove-cpu-partition-jobs' of github.com:UnityHPC/d…
MisterArdavan Aug 5, 2025
371198c
Add method to get memory of node from node_info.json alongside testing
MisterArdavan Aug 7, 2025
f35250a
Replace string with enum
MisterArdavan Aug 7, 2025
c68f1d4
Implement calculations for memory hoarding metrics including total ra…
MisterArdavan Aug 12, 2025
6588452
Add tests for new methods in remote_config.py and refactor test direc…
MisterArdavan Aug 12, 2025
ad6880d
Fix ANN errors in test_remote_config.py
MisterArdavan Aug 12, 2025
ef1b3ba
Add metric for sorting jobs based on RAM hoarding. Visualize the metr…
MisterArdavan Aug 12, 2025
7664948
Add metrics for sorting users hoarding too many CPU cores by compari…
MisterArdavan Aug 12, 2025
43f8802
Add static and dynamic type checking for enum types passed to Efficie…
MisterArdavan Aug 12, 2025
b702092
Refactor ResourceHoardingDataFrameNameEnum
MisterArdavan Aug 12, 2025
22abd89
Add tests for new MetricsDataFrameName enums
MisterArdavan Aug 12, 2025
3ae4aa2
Add metrics for the users DataFrame
MisterArdavan Aug 13, 2025
6242b23
Polish ResourceHoarding.ipynb
MisterArdavan Aug 13, 2025
50ac674
Merge branch 'main' of github.com:UnityHPC/ds4cg-job-analytics into f…
MisterArdavan Aug 13, 2025
1652079
Resolve merge conflict
MisterArdavan Aug 17, 2025
0d20681
Refactor preprocessing code
MisterArdavan Aug 17, 2025
de1c659
Refactor preprocessing error
MisterArdavan Aug 17, 2025
61d7257
Resolved merge conflicts.
Sep 3, 2025
829025c
Fix formatting errors.
Sep 3, 2025
1,233 changes: 617 additions & 616 deletions notebooks/Efficiency Analysis.ipynb

Large diffs are not rendered by default.

367 changes: 367 additions & 0 deletions notebooks/Resource Hoarding.ipynb
@@ -0,0 +1,367 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# <a id='toc1_'></a>[Resource Hoarding Analysis](#toc0_)\n",
"This notebook demonstrates the use of `ResourceHoarding` class in `src/analysis/hoarding.py` for analyzing the jobs and users that hoard resources by requesting a disproportionate amount of CPU Memory and Cores."
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {},
"source": [
"**Table of contents**<a id='toc0_'></a> \n",
"- [Resource Hoarding Analysis](#toc1_) \n",
" - [Setup](#toc1_1_) \n",
" - [Filter jobs for resource hoarding analysis](#toc1_1_1_) \n",
" - [Analyze Jobs Hoarding Resources:](#toc1_2_) \n",
" - [Generate all hoarding analysis metrics for jobs:](#toc1_2_1_1_) \n",
" - [Find most inefficient jobs hoarding node RAM based on `ram_hoarding_fraction_diff`](#toc1_2_1_2_) \n",
" - [Find most inefficient jobs hoarding CPU cores based on `core_hoarding_fraction_diff`](#toc1_2_1_3_) \n",
" - [Analyze Users Hoarding Resources:](#toc1_3_) \n",
" - [Generate all hoarding analysis metrics for users:](#toc1_3_1_1_) \n",
" - [Find most inefficient users hoarding node RAM based on `expected_value_ram_hoarding_fraction_diff`](#toc1_3_1_2_) \n",
" - [Find most inefficient users hoarding CPU cores based on `expected_value_core_hoarding_fraction_diff`](#toc1_3_1_3_) \n",
"\n",
"<!-- vscode-jupyter-toc-config\n",
"\tnumbering=false\n",
"\tanchor=true\n",
"\tflat=false\n",
"\tminLevel=1\n",
"\tmaxLevel=6\n",
"\t/vscode-jupyter-toc-config -->\n",
"<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->"
]
},
{
"cell_type": "markdown",
"id": "2",
"metadata": {},
"source": [
"## <a id='toc1_1_'></a>[Setup](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3",
"metadata": {},
"outputs": [],
"source": [
"# Import required modules\n",
"import sys\n",
"from pathlib import Path\n",
"import pandas as pd\n",
"\n",
"# import matplotlib.pyplot as plt\n",
"# import seaborn as sns\n",
"import os"
]
},
{
"cell_type": "markdown",
"id": "4",
"metadata": {},
"source": [
"Jupyter server should be run at the notebook directory, so the output of the following cell would be the project root:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5",
"metadata": {},
"outputs": [],
"source": [
"project_root = str(Path.cwd().resolve().parent)\n",
"print(f\"Project root: {project_root}\")\n",
"os.environ[\"OUTPUT_MODE\"] = \"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6",
"metadata": {},
"outputs": [],
"source": [
"# Automatically reload modules before executing code (set this up BEFORE imports)\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"# Add project root to sys.path for module imports\n",
"if project_root not in sys.path:\n",
" sys.path.insert(0, project_root)\n",
"\n",
"from src.analysis import ResourceHoarding as ResourceHoarding\n",
"from src.analysis import efficiency_analysis as ea\n",
"from src.visualization import JobsWithMetricsVisualizer, UsersWithMetricsVisualizer\n",
"from src.config.enum_constants import ResourceHoardingDataFrameNameEnum"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"metadata": {},
"outputs": [],
"source": [
"# Load the jobs DataFrame from DuckDB\n",
"preprocessed_jobs_df = ea.load_preprocessed_jobs_dataframe_from_duckdb(\n",
" db_path=\"../data/slurm_data.db\",\n",
" table_name=\"Jobs\",\n",
")\n",
"display(preprocessed_jobs_df.head(10))\n",
"print(preprocessed_jobs_df.shape)"
]
},
{
"cell_type": "markdown",
"id": "8",
"metadata": {},
"source": [
"### <a id='toc1_1_1_'></a>[Filter jobs for resource hoarding analysis](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9",
"metadata": {},
"outputs": [],
"source": [
"hoarding_analysis = ResourceHoarding(jobs_df=preprocessed_jobs_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10",
"metadata": {},
"outputs": [],
"source": [
"filtered_jobs = hoarding_analysis.filter_jobs_for_analysis()\n",
"filtered_jobs"
]
},
{
"cell_type": "markdown",
"id": "11",
"metadata": {},
"source": [
"## <a id='toc1_2_'></a>[Analyze Jobs Hoarding Resources:](#toc0_)\n"
]
},
{
"cell_type": "markdown",
"id": "12",
"metadata": {},
"source": [
"#### <a id='toc1_2_1_1_'></a>[Generate all hoarding analysis metrics for jobs:](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13",
"metadata": {},
"outputs": [],
"source": [
"memory_hoarding_jobs = hoarding_analysis.calculate_node_resource_hoarding_for_jobs(filtered_jobs)\n",
"\n",
"# Set option to display all columns\n",
"pd.set_option(\"display.max_columns\", None)\n",
"# Display the DataFrame\n",
"display(memory_hoarding_jobs.head(10))\n",
"# To revert to default settings (optional)\n",
"pd.reset_option(\"display.max_columns\")\n",
"\n",
"print(f\"Jobs found: {len(memory_hoarding_jobs)}\")"
]
},
{
"cell_type": "markdown",
"id": "14",
"metadata": {},
"source": [
"#### <a id='toc1_2_1_2_'></a>[Find most inefficient jobs hoarding node RAM based on `ram_hoarding_fraction_diff`](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15",
"metadata": {},
"outputs": [],
"source": [
"inefficient_jobs_hoarding_ram = hoarding_analysis.sort_and_filter_records_with_metrics(\n",
" metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.JOBS_WITH_RESOURCE_HOARDING_METRICS,\n",
" sorting_key=\"ram_hoarding_fraction_diff\",\n",
" ascending=False, # Sort in descending order\n",
" filter_criteria={\"ram_hoarding_fraction_diff\": {\"min\": 0, \"inclusive\": True}},\n",
")\n",
"# Display top inefficient users by RAM hoarding fraction\n",
"print(\"\\nTop inefficient Jobs by RAM hoarding fraction:\")\n",
"display(inefficient_jobs_hoarding_ram.head(10))\n",
"\n",
"# Plot top inefficient jobs by RAM hoarding fraction, with RAM hoarding fraction as labels\n",
"jobs_with_metrics_visualizer = JobsWithMetricsVisualizer(inefficient_jobs_hoarding_ram.head(20))\n",
"jobs_with_metrics_visualizer.visualize(\n",
" column=\"ram_hoarding_fraction_diff\",\n",
" bar_label_columns=[\"ram_hoarding_fraction_diff\", \"cpu_mem_efficiency\", \"alloc_vram_efficiency\"],\n",
" figsize=(12, 12),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "16",
"metadata": {},
"source": [
"#### <a id='toc1_2_1_3_'></a>[Find most inefficient jobs hoarding CPU cores based on `core_hoarding_fraction_diff`](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17",
"metadata": {},
"outputs": [],
"source": [
"inefficient_jobs_hoarding_cpu_cores = hoarding_analysis.sort_and_filter_records_with_metrics(\n",
" metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.JOBS_WITH_RESOURCE_HOARDING_METRICS,\n",
" sorting_key=\"core_hoarding_fraction_diff\",\n",
" ascending=False, # Sort in descending order\n",
" filter_criteria={\"core_hoarding_fraction_diff\": {\"min\": 0, \"inclusive\": True}},\n",
")\n",
"# Display top inefficient users by CPU core hoarding fraction\n",
"print(\"\\nTop inefficient Jobs by CPU core hoarding fraction:\")\n",
"display(inefficient_jobs_hoarding_cpu_cores.head(10))\n",
"\n",
"# Plot top inefficient jobs by CPU core hoarding fraction, with CPU core hoarding fraction as labels\n",
"jobs_with_metrics_visualizer = JobsWithMetricsVisualizer(inefficient_jobs_hoarding_cpu_cores.head(20))\n",
"jobs_with_metrics_visualizer.visualize(\n",
" column=\"core_hoarding_fraction_diff\",\n",
" bar_label_columns=[\"core_hoarding_fraction_diff\", \"ram_hoarding_fraction_diff\", \"alloc_vram_efficiency\"],\n",
" figsize=(12, 12),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "18",
"metadata": {},
"source": [
"## <a id='toc1_3_'></a>[Analyze Users Hoarding Resources:](#toc0_)\n"
]
},
{
"cell_type": "markdown",
"id": "19",
"metadata": {},
"source": [
"#### <a id='toc1_3_1_1_'></a>[Generate all hoarding analysis metrics for users:](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20",
"metadata": {},
"outputs": [],
"source": [
"memory_hoarding_users = hoarding_analysis.calculate_node_resource_hoarding_for_users(filtered_jobs)\n",
"display(memory_hoarding_users)"
]
},
{
"cell_type": "markdown",
"id": "21",
"metadata": {},
"source": [
"#### <a id='toc1_3_1_2_'></a>[Find most inefficient users hoarding node RAM based on `expected_value_ram_hoarding_fraction_diff`](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "22",
"metadata": {},
"outputs": [],
"source": [
"inefficient_users_hoarding_ram = hoarding_analysis.sort_and_filter_records_with_metrics(\n",
" metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.USERS_WITH_RESOURCE_HOARDING_METRICS,\n",
" sorting_key=\"expected_value_ram_hoarding_fraction_diff\",\n",
" ascending=False, # Sort in descending order\n",
" filter_criteria={\"expected_value_ram_hoarding_fraction_diff\": {\"min\": 0, \"inclusive\": True}},\n",
")\n",
"# Display top inefficient users by RAM hoarding fraction\n",
"\n",
"print(\"\\nTop inefficient Users by RAM hoarding fraction:\")\n",
"display(inefficient_users_hoarding_ram.head(10))\n",
"\n",
"# Plot top inefficient users by RAM hoarding fraction, with RAM hoarding fraction as labels\n",
"users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_hoarding_ram.head(20))\n",
"users_with_metrics_visualizer.visualize(\n",
" column=\"expected_value_ram_hoarding_fraction_diff\",\n",
" bar_label_columns=[\n",
" \"expected_value_ram_hoarding_fraction_diff\",\n",
" \"expected_value_core_hoarding_fraction_diff\",\n",
" \"expected_value_alloc_vram_efficiency\",\n",
" ],\n",
" figsize=(14, 12),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "23",
"metadata": {},
"source": [
"#### <a id='toc1_3_1_3_'></a>[Find most inefficient users hoarding CPU cores based on `expected_value_core_hoarding_fraction_diff`](#toc0_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24",
"metadata": {},
"outputs": [],
"source": [
"inefficient_users_hoarding_cpu_cores = hoarding_analysis.sort_and_filter_records_with_metrics(\n",
" metrics_df_name_enum=ResourceHoardingDataFrameNameEnum.USERS_WITH_RESOURCE_HOARDING_METRICS,\n",
" sorting_key=\"expected_value_core_hoarding_fraction_diff\",\n",
" ascending=False, # Sort in descending order\n",
" filter_criteria={\"expected_value_core_hoarding_fraction_diff\": {\"min\": 0, \"inclusive\": True}},\n",
")\n",
"# Display top inefficient users by CPU core hoarding fraction\n",
"\n",
"print(\"\\nTop inefficient Users by CPU core hoarding fraction:\")\n",
"display(inefficient_users_hoarding_cpu_cores.head(10))\n",
"\n",
"# Plot top inefficient users by CPU core hoarding fraction, with CPU core hoarding fraction as labels\n",
"users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_hoarding_cpu_cores.head(20))\n",
"users_with_metrics_visualizer.visualize(\n",
" column=\"expected_value_core_hoarding_fraction_diff\",\n",
" bar_label_columns=[\n",
" \"expected_value_core_hoarding_fraction_diff\",\n",
" \"expected_value_ram_hoarding_fraction_diff\",\n",
" \"expected_value_alloc_vram_efficiency\",\n",
" ],\n",
" figsize=(14, 12),\n",
")"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
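The notebook sorts jobs by a `ram_hoarding_fraction_diff` metric and filters for non-negative values. As a minimal self-contained sketch of what such a metric might compute — the column names and the formula here are illustrative assumptions, not the repository's actual implementation in `src/analysis/hoarding.py`:

```python
import pandas as pd

# Hypothetical data: per-job requested RAM, RAM actually used, and the RAM
# of the node the job ran on. These column names are assumptions for the sketch.
jobs = pd.DataFrame({
    "job_id": [1, 2, 3],
    "requested_ram_gb": [512.0, 64.0, 256.0],
    "used_ram_gb": [32.0, 60.0, 40.0],
    "node_ram_gb": [512.0, 512.0, 512.0],
})

# One plausible reading of the metric: the gap between requested and used RAM,
# expressed as a fraction of the node's total RAM.
jobs["ram_hoarding_fraction_diff"] = (
    jobs["requested_ram_gb"] - jobs["used_ram_gb"]
) / jobs["node_ram_gb"]

# Mirror the notebook's sort_and_filter_records_with_metrics call: keep
# non-negative diffs and sort in descending order.
hoarders = (
    jobs[jobs["ram_hoarding_fraction_diff"] >= 0]
    .sort_values("ram_hoarding_fraction_diff", ascending=False)
)
print(hoarders[["job_id", "ram_hoarding_fraction_diff"]])
```

Under these assumptions, a job that requests an entire 512 GB node but uses 32 GB tops the list, which matches the notebook's intent of surfacing the worst hoarders first.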
6 changes: 5 additions & 1 deletion src/analysis/__init__.py
@@ -1 +1,5 @@
from .efficiency_analysis import EfficiencyAnalysis as EfficiencyAnalysis
from .efficiency_analysis import (
EfficiencyAnalysis as EfficiencyAnalysis,
load_preprocessed_jobs_dataframe_from_duckdb as load_preprocessed_jobs_dataframe_from_duckdb
)
from .hoarding import ResourceHoarding as ResourceHoarding
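The `Name as Name` form in the diff above is the explicit re-export idiom: strict type checkers (e.g. mypy with `--no-implicit-reexport`) treat only names re-exported via `as` (or listed in `__all__`) as public package API. A throwaway sketch of the mechanism, using a stand-in package rather than the repository's layout:

```python
import sys
import tempfile
from pathlib import Path

# Build a disposable package on disk: a submodule defining a class, and an
# __init__.py that re-exports it with the explicit `Name as Name` form.
pkg = Path(tempfile.mkdtemp()) / "demo_analysis"
pkg.mkdir()
(pkg / "hoarding.py").write_text("class ResourceHoarding:\n    pass\n")
(pkg / "__init__.py").write_text(
    "from .hoarding import ResourceHoarding as ResourceHoarding\n"
)

# Import the class through the package root, as the notebook does with
# `from src.analysis import ResourceHoarding`.
sys.path.insert(0, str(pkg.parent))
from demo_analysis import ResourceHoarding  # resolved via the re-export

print(ResourceHoarding.__module__)  # the class still lives in the submodule
```

The same mechanism lets the PR expose `load_preprocessed_jobs_dataframe_from_duckdb` at the `src.analysis` package root without moving its definition.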