Commits (45 total; changes shown from 43)
703379b
committing changes for a100 analysis generating plot of all request a…
naryasomayaj Jun 25, 2025
ae94ad4
committing changes to a-100 analysis
naryasomayaj Jun 26, 2025
4baadaa
Merge branch 'feature/1a-zero-vram' into feature/a-100-analysis
naryasomayaj Jun 26, 2025
a487bb6
committing changes for analysis of a100 gpus compared to top 3 users,…
naryasomayaj Jun 30, 2025
f6b6a33
committing changes for a100 analysis resolving all ruff checks
naryasomayaj Jun 30, 2025
46b2514
committing changes for a100 analysis
naryasomayaj Jul 8, 2025
9fefe76
resolve merge conflicts
naryasomayaj Jul 8, 2025
0da5e62
Merge branch 'main' into feature/a-100-analysis
naryasomayaj Jul 10, 2025
ace5dd5
merged main into branch
naryasomayaj Jul 10, 2025
a9892ac
committing changes for a100 analysis
naryasomayaj Jul 14, 2025
b3690b7
resolved ruff issues on a-100-analysis branch
naryasomayaj Jul 14, 2025
97eca33
resolved ruff issues on a-100-analysis branch
naryasomayaj Jul 14, 2025
1b7d9f6
committing changes for analysis of a100 gpus compared to top 3 users,…
naryasomayaj Jun 30, 2025
4979449
committing changes for a100 analysis resolving all ruff checks
naryasomayaj Jun 30, 2025
dd291b9
committing changes for a100 analysis
naryasomayaj Jul 8, 2025
bd0a418
merged main into branch
naryasomayaj Jul 10, 2025
56efc92
committing changes for a100 analysis
naryasomayaj Jul 14, 2025
0fbf2ce
committing changes to a100 analysis with refactored EfficiencyAnalysis
naryasomayaj Jul 14, 2025
afa331d
working with updated efficiency analysis
naryasomayaj Jul 15, 2025
d7bbd9d
committing changes for preprocess
naryasomayaj Jul 16, 2025
fe42dbd
committing changes for a100 analysis
naryasomayaj Jul 16, 2025
3eb7559
committing changes for a100 on the new dataset
naryasomayaj Jul 23, 2025
271bbc7
committing changes to a100 analysis
naryasomayaj Jul 28, 2025
6d4eaf6
committing changes to a100 just to switch branch
naryasomayaj Jul 28, 2025
be387c5
Readd dev-requirement file
naryasomayaj Jul 30, 2025
bcc49ed
Up to date efficiency Analysis notebook
naryasomayaj Jul 30, 2025
bff782b
merge remote-tracking branch 'origin/main' into feature/a-100-analysis
naryasomayaj Jul 30, 2025
36c30dc
committing changes to a-100 analysis notebook
naryasomayaj Jul 31, 2025
53d4e91
created A100 .ipynb with same structure as EfficiencyAnalysis
naryasomayaj Jul 31, 2025
0ee8bf9
A100_analysis.ipynb
naryasomayaj Jul 31, 2025
662785a
committing a100 formatting changes
naryasomayaj Aug 1, 2025
4c433a2
committing edits for a100 format
naryasomayaj Aug 3, 2025
25bf316
Merge branch 'main' into feature/a-100-analysis
naryasomayaj Aug 4, 2025
891c120
passing all pytests in a100
naryasomayaj Aug 4, 2025
4c2b94a
resolving all ruff checks for a100
naryasomayaj Aug 4, 2025
0d8279c
resolving all ruff checks for a100
naryasomayaj Aug 4, 2025
e9453de
pyproject.toml
naryasomayaj Aug 4, 2025
6c22061
resolving mypy errors
naryasomayaj Aug 5, 2025
0440cbb
observing vram constraint efficiency categories, looking into users t…
naryasomayaj Aug 6, 2025
62a5be3
vram efficiency > 1 and looking at GPU request types
naryasomayaj Aug 11, 2025
fafd927
vram efficiency > 1 and looking at GPU request types
naryasomayaj Aug 11, 2025
15a6f8c
vram efficiency > 1 and looking at GPU request types
naryasomayaj Aug 11, 2025
c7d52b4
committing changes for a100 visualizations
naryasomayaj Aug 13, 2025
54a9c41
Merge branch 'main' into feature/a-100-analysis
naryasomayaj Aug 16, 2025
dffc6dd
polishing a100 notebook
naryasomayaj Aug 20, 2025
7,363 changes: 7,363 additions & 0 deletions notebooks/A100_Analysis.ipynb
Collaborator commented:
In this file, could we remove the older Efficiency Analysis cells (at least the plots that are irrelevant to A100s)? If we aren't doing anything with users, we don't need to call the user metric functions and show their output here. I had to scroll quite a bit to find the actual A100 plots. It's better to simplify this notebook: a later notebook will combine the Efficiency analysis, time plots, ROC, and A100 analysis for certain groups, so this one should focus only on A100s for easy reference to those functions.

Large diffs are not rendered by default.

678 changes: 664 additions & 14 deletions notebooks/Basic Visualization.ipynb

Large diffs are not rendered by default.

6,376 changes: 6,339 additions & 37 deletions notebooks/Efficiency Analysis.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -130,4 +130,4 @@ exclude = [
     "mvp-scripts/gpu_metrics.py",
     "mvp-scripts/zero_gpu_usage_list.py",
     "notebooks/SlurmGPU.ipynb"
-]
+]
2 changes: 1 addition & 1 deletion src/analysis/__init__.py
@@ -1,4 +1,4 @@
 from .efficiency_analysis import EfficiencyAnalysis as EfficiencyAnalysis
 from .efficiency_analysis import (
     load_preprocessed_jobs_dataframe_from_duckdb as load_preprocessed_jobs_dataframe_from_duckdb,
-)
+)
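With both names re-exported here, downstream code can import them in one line (a minimal sketch; assumes the package is importable as analysis):

from analysis import EfficiencyAnalysis, load_preprocessed_jobs_dataframe_from_duckdb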
155 changes: 153 additions & 2 deletions src/analysis/efficiency_analysis.py
@@ -19,6 +19,7 @@ def load_preprocessed_jobs_dataframe_from_duckdb(
table_name: str = "Jobs",
sample_size: int | None = None,
random_state: pd._typing.RandomState | None = None,
query: str | None = None,
) -> pd.DataFrame:
"""
Load jobs DataFrame from a DuckDB database and preprocess it.
@@ -28,6 +29,7 @@ def load_preprocessed_jobs_dataframe_from_duckdb(
table_name (str, optional): Table name to query. Defaults to 'Jobs'.
sample_size (int, optional): Number of rows to sample from the DataFrame. Defaults to None (no sampling).
random_state (pd._typing.RandomState, optional): Random state for reproducibility. Defaults to None.
query (str, optional): Custom SQL query to fetch data. If provided, overrides the table_name.

Returns:
pd.DataFrame: DataFrame containing the table data.
@@ -40,7 +42,7 @@ def load_preprocessed_jobs_dataframe_from_duckdb(
try:
db = DatabaseConnection(str(db_path))

-        jobs_df = db.fetch_all_jobs(table_name=table_name)
+        jobs_df = db.fetch_all_jobs(table_name=table_name) if query is None else db.fetch_query(query=query)
Collaborator commented:
This looks good, but ideally we wait for Tan's PR to be merged and use those functions. If we don't have time for that, we can stick with this approach.
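For reference, the new parameter would be exercised roughly like this (an illustrative sketch, not part of the diff; the path and SQL are hypothetical):

a100_jobs = load_preprocessed_jobs_dataframe_from_duckdb(
    db_path="data/jobs.duckdb",  # hypothetical database path
    query="SELECT * FROM Jobs WHERE GPUType LIKE '%a100%'",  # hypothetical query; overrides table_name
)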

processed_data = preprocess_data(
jobs_df, min_elapsed_seconds=0, include_failed_cancelled_jobs=False, include_cpu_only_jobs=False
)
@@ -165,6 +167,23 @@ def apply_numeric_filter(
raise ValueError(f"{filter_name} must be a numeric type.")
return mask

def get_unique_gpu_types(self) -> np.ndarray:
"""
Get unique GPU types from the jobs DataFrame.

Returns:
    np.ndarray: Unique GPU types as a NumPy array (lower-cased and whitespace-stripped).
"""
return (
self.jobs_df["GPUType"]
.dropna()
.explode()
.astype(str)
.str.strip()
.str.lower()
.unique()
)

def filter_jobs_for_analysis(
self,
vram_constraint_filter: int | float | list | set | tuple | dict | pd.api.typing.NAType | None = None,
@@ -680,7 +699,7 @@ def find_inefficient_pis_by_vram_hours(
# Sort by the metric descending (higher is worse)
inefficient_pi_accounts = inefficient_pi_accounts.sort_values("pi_acc_vram_hours", ascending=False)
return inefficient_pi_accounts

def sort_and_filter_records_with_metrics(
self,
metrics_df_name_enum: MetricsDataFrameNameEnum,
@@ -750,3 +769,135 @@ def sort_and_filter_records_with_metrics(
filtered_records = filtered_records.sort_values(sorting_key, ascending=ascending)

return filtered_records

def compare_job_metrics_by_gpu_type(self) -> pd.DataFrame:
"""
Aggregate and display metrics for each GPU type for jobs matching a SQL query.

Args:
query (str): SQL query to select jobs.

Returns:
pd.DataFrame: Aggregated metrics by GPU type
"""

# Get unique GPU types
unique_gpu_types = self.get_unique_gpu_types()

metrics = [
"Mean Used GPU Memory (GiB)",
"Median Used GPU Memory (GiB)",
"Mean Requested VRAM Efficiency",
"Median Requested VRAM Efficiency",
"Mean Allocated VRAM Efficiency",
"Median Allocated VRAM Efficiency",
"Total GPU Hours",
"Mean Weighted VRAM Efficiency",
"Median Weighted VRAM Efficiency"
]

job_efficiency_metrics = self.calculate_job_efficiency_metrics(self.jobs_df)

results: dict[str, list] = {gpu_type.upper(): [] for gpu_type in unique_gpu_types}
for gpu_type in unique_gpu_types:
gpu_jobs = job_efficiency_metrics[
job_efficiency_metrics['GPUType'].apply(
lambda x, gpu_type=gpu_type: isinstance(x, dict) and gpu_type in x
)
]

if gpu_jobs.empty:
results[gpu_type.upper()] = [None] * len(metrics)
continue
results[gpu_type.upper()] = [
gpu_jobs["GPUMemUsage"].mean() / (2**30), # Mean Used GPU Memory in GiB
gpu_jobs["GPUMemUsage"].median() / (2**30), # Median Used GPU Memory in GiB
gpu_jobs["vram_constraint_efficiency"].mean(), # Mean VRAM Efficiency
gpu_jobs["vram_constraint_efficiency"].median(), # Median VRAM Efficiency
gpu_jobs["alloc_vram_efficiency"].mean(), # Mean VRAM Efficiency
gpu_jobs["alloc_vram_efficiency"].median(), # Median VRAM Efficiency

gpu_jobs["job_hours"].sum(), # Total GPU Hours
# Mean Weighted VRAM Efficiency
(gpu_jobs["alloc_vram_efficiency"] * gpu_jobs["job_hours"]).sum() / gpu_jobs["job_hours"].sum(),
Collaborator commented:
We should probably use VRAM hours here, since the other weighted job metrics use that, unless there's a reason job_hours works better here.
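For example, a vram_hours-weighted mean would look roughly like this (a sketch; the vram_hours column is an assumption based on the suggestion above):

# Weight each job's allocated-VRAM efficiency by its VRAM hours
weighted_mean = (
    (gpu_jobs["alloc_vram_efficiency"] * gpu_jobs["vram_hours"]).sum()
    / gpu_jobs["vram_hours"].sum()
)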

# Median Weighted VRAM Efficiency
(gpu_jobs["alloc_vram_efficiency"] * gpu_jobs["job_hours"]).median() / gpu_jobs["job_hours"].median()
Collaborator commented:
Same here: should we use vram_hours instead?
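As an aside, median(efficiency * hours) / median(hours) is not a true weighted median. A vram_hours-weighted median could be sketched as follows (again assuming a vram_hours column):

# Sort by efficiency, then take the value at the 50% cumulative-weight point
s = gpu_jobs.sort_values("alloc_vram_efficiency")
half_weight = s["vram_hours"].sum() / 2
weighted_median = s.loc[s["vram_hours"].cumsum() >= half_weight, "alloc_vram_efficiency"].iloc[0]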



]

# Create summary DataFrame
summary_df = pd.DataFrame(results, index=metrics)
return summary_df

def compare_gpu_utilization_patterns(self) -> pd.DataFrame:
"""
Compare GPU utilization patterns across different GPU types.

Returns:
pd.DataFrame: DataFrame with GPU utilization patterns by GPU type.
"""
job_metrics_by_gpu_type = self.compare_job_metrics_by_gpu_type()

# Create a DataFrame to hold the GPU utilization patterns
gpu_utilization_patterns = pd.DataFrame({
Collaborator commented:
Is this just transposing the other df? It looks like we're creating rows out of the column values. Correct me if I'm wrong, but could transposing the df do the same thing more simply? Try it and see.
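A rough sketch of that idea (untested; relies on the metric row names above):

# Transpose so GPU types become rows and metrics become columns
gpu_utilization_patterns = (
    job_metrics_by_gpu_type.T
    .rename_axis("GPU Type")
    .reset_index()
    .sort_values("Total GPU Hours", ascending=False)
)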

"GPU Type": job_metrics_by_gpu_type.columns,
"Mean Used GPU Memory (GiB)": job_metrics_by_gpu_type.loc["Mean Used GPU Memory (GiB)"],
"Median Used GPU Memory (GiB)": job_metrics_by_gpu_type.loc["Median Used GPU Memory (GiB)"],
"Mean Requested VRAM Efficiency": job_metrics_by_gpu_type.loc["Mean Requested VRAM Efficiency"],
"Median Requested VRAM Efficiency": job_metrics_by_gpu_type.loc["Median Requested VRAM Efficiency"],
"Mean Allocated VRAM Efficiency": job_metrics_by_gpu_type.loc["Mean Allocated VRAM Efficiency"],
"Median Allocated VRAM Efficiency": job_metrics_by_gpu_type.loc["Median Allocated VRAM Efficiency"],
"Total GPU Hours": job_metrics_by_gpu_type.loc["Total GPU Hours"],
"Mean Weighted VRAM Efficiency": job_metrics_by_gpu_type.loc["Mean Weighted VRAM Efficiency"],
"Median Weighted VRAM Efficiency": job_metrics_by_gpu_type.loc["Median Weighted VRAM Efficiency"]
})

# Sort by Total GPU Hours in descending order
gpu_utilization_patterns = gpu_utilization_patterns.sort_values(by="Total GPU Hours", ascending=False)

return gpu_utilization_patterns

def categorize_jobs_by_vram_constraint_efficiency(self) -> pd.DataFrame:
"""
Bucketize jobs based on their VRAM constraint efficiency.

This is what your original function was actually doing.

Returns:
pd.DataFrame: DataFrame with jobs categorized into efficiency buckets.
"""
if self.jobs_with_efficiency_metrics is None:
self.calculate_job_efficiency_metrics(self.jobs_df)

df = self.jobs_with_efficiency_metrics.copy()

# Create efficiency bucket
def categorize_efficiency(val: float | pd.api.typing.NAType) -> str:
if pd.isna(val):
return "NA"
if val <= 0.3:
return "0–30%"
elif val <= 0.6:
return "30–60%"
elif val <= 1.0:
return "60–100%"
else:
return ">100%"

df["vram_constraint_efficiency_bucket"] = df["vram_constraint_efficiency"].apply(categorize_efficiency)

# Count jobs in each bucket
bucket_counts = df["vram_constraint_efficiency_bucket"].value_counts(dropna=True).sort_index()

# Add proportion of jobs per bucket
total_jobs = len(df)
bucket_distribution = bucket_counts.to_frame(name="job_count")
bucket_distribution["percentage"] = (bucket_distribution["job_count"] / total_jobs * 100).round(2)

# Update the jobs DataFrame with bucket information
self.jobs_with_efficiency_metrics = df

return bucket_distribution
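Hypothetical usage of this helper (assumes an initialized EfficiencyAnalysis instance named analysis):

distribution = analysis.categorize_jobs_by_vram_constraint_efficiency()
distribution["job_count"].plot(kind="bar", title="Jobs per VRAM constraint efficiency bucket")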

