@naryasomayaj
Collaborator

Here is the analysis for A100 GPU metrics. Currently, this generates metrics as well as plots for the top 3 requested GPUs.

@MisterArdavan MisterArdavan marked this pull request as draft June 30, 2025 21:39
@naryasomayaj naryasomayaj marked this pull request as ready for review August 5, 2025 14:46
@naryasomayaj naryasomayaj requested review from LTan-101104 and MisterArdavan and removed request for LTan-101104 and MisterArdavan August 5, 2025 14:46
db = DatabaseConnection(str(db_path))

jobs_df = db.fetch_all_jobs(table_name=table_name)
jobs_df = db.fetch_all_jobs(table_name=table_name) if query is None else db.fetch_query(query=query)
Collaborator

This looks good, but ideally we would wait for Tan's PR to be merged and use those functions. If we don't have time for that, we can stick with this.


gpu_jobs["job_hours"].sum(), # Total GPU Hours
# Mean Weighted VRAM Efficiency
(gpu_jobs["alloc_vram_efficiency"] * gpu_jobs["job_hours"]).sum() / gpu_jobs["job_hours"].sum(),
Collaborator

We should probably use VRAM hours here, since the other weighted job metrics use that, unless there's a reason why job_hours would work better here.
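
To make the difference concrete, here is a minimal sketch of the same weighted mean computed with each weight. The toy data is invented, and treating `vram_hours` as a column of `gpu_jobs` is an assumption; the `weighted_mean` helper is hypothetical and not part of this PR.

```python
import pandas as pd

# Toy data for illustration only. The column names mirror the diff above,
# but `vram_hours` being a column of gpu_jobs is an assumption.
gpu_jobs = pd.DataFrame({
    "alloc_vram_efficiency": [0.2, 0.5, 0.8],
    "job_hours": [1.0, 1.0, 2.0],
    "vram_hours": [80.0, 40.0, 40.0],
})


def weighted_mean(values: pd.Series, weights: pd.Series) -> float:
    """Return sum(values * weights) / sum(weights)."""
    return float((values * weights).sum() / weights.sum())


# Same efficiency column, two different weights.
by_job_hours = weighted_mean(gpu_jobs["alloc_vram_efficiency"], gpu_jobs["job_hours"])
by_vram_hours = weighted_mean(gpu_jobs["alloc_vram_efficiency"], gpu_jobs["vram_hours"])
```

With this toy data the two results differ (roughly 0.575 versus 0.425), because the job that holds the most VRAM for the longest also has the lowest efficiency. Which weight is appropriate depends on what the metric is meant to summarize.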

# Mean Weighted VRAM Efficiency
(gpu_jobs["alloc_vram_efficiency"] * gpu_jobs["job_hours"]).sum() / gpu_jobs["job_hours"].sum(),
# Median Weighted VRAM Efficiency
(gpu_jobs["alloc_vram_efficiency"] * gpu_jobs["job_hours"]).median() / gpu_jobs["job_hours"].median()
Collaborator

Same here, should we use vram_hours instead?
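
A related point, independent of which weight is chosen: the median of the products divided by the median of the weights is generally not a weighted median. One common definition is the smallest value whose cumulative weight reaches half of the total weight. The sketch below illustrates that definition on invented data; `weighted_median` is a hypothetical helper, not something from this PR, and it assumes the two Series share a unique index.

```python
import pandas as pd


def weighted_median(values: pd.Series, weights: pd.Series) -> float:
    """Smallest value whose cumulative weight reaches half of the total weight.

    Assumes `values` and `weights` share the same unique index.
    """
    order = values.sort_values().index          # index labels in ascending value order
    sorted_values = values.loc[order]
    cumulative = weights.loc[order].cumsum()    # running weight along sorted values
    half = weights.sum() / 2.0
    return float(sorted_values[cumulative >= half].iloc[0])


# Toy data for illustration only.
efficiency = pd.Series([0.9, 0.1, 0.5])
vram_hours = pd.Series([10.0, 30.0, 60.0])

# The expression used in the diff, with vram_hours as the weight.
ratio_of_medians = float((efficiency * vram_hours).median() / vram_hours.median())

# The cumulative-weight definition.
true_weighted_median = weighted_median(efficiency, vram_hours)
```

On this data the ratio of medians is about 0.3 while the cumulative-weight definition gives 0.5, so the two are not interchangeable.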

job_metrics_by_gpu_type = self.compare_job_metrics_by_gpu_type()

# Create a DataFrame to hold the GPU utilization patterns
gpu_utilization_patterns = pd.DataFrame({
Collaborator

Is this just transposing the other df? It looks like we're creating rows out of the column values. Correct me if I'm wrong, but could transposing the df do the same thing more simply? Try it and see.
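
A quick way to check is to compare a manual rebuild against `.T` on a small frame. The layout of the frame returned by `compare_job_metrics_by_gpu_type()` is not shown in the diff, so the stand-in below (metrics as rows, GPU types as columns, invented numbers) is purely an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by compare_job_metrics_by_gpu_type();
# its real layout and values are not shown in the diff.
job_metrics_by_gpu_type = pd.DataFrame(
    {"a100": [120.0, 0.42], "v100": [80.0, 0.35]},
    index=["Total GPU Hours", "Mean Weighted VRAM Efficiency"],
)

# Manual rebuild: each metric row becomes a column, so rows are GPU types.
manual = pd.DataFrame({
    metric: job_metrics_by_gpu_type.loc[metric]
    for metric in job_metrics_by_gpu_type.index
})

# The same reshaping expressed as a transpose.
transposed = job_metrics_by_gpu_type.T

same = manual.equals(transposed)
```

If `same` is true for the real frame as well, the explicit `pd.DataFrame({...})` construction could be replaced by the transpose. If the real code also renames or selects columns, those steps would still need to be applied after `.T`.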

Collaborator

In this file, could we remove the older Efficiency Analysis cells (at least the plots that are irrelevant to A100s)? If we aren't doing anything with users, we don't need to call the user metric functions and show the output here. I had to scroll quite a bit to find the actual A100 plots. It would be better to simplify this notebook, since we will later have another notebook that contains the Efficiency analysis + time plots + ROC + A100 analysis for certain groups. This notebook should focus only on A100s, for easy reference to those functions.

@MisterArdavan MisterArdavan marked this pull request as draft August 20, 2025 18:48
@bpachev bpachev closed this Sep 3, 2025

4 participants