HPC Job Summary Tool

A Python script to analyse and visualise SLURM job performance metrics on HPC clusters.

Overview

get_hpc_job_summary retrieves detailed statistics for SLURM scheduler jobs and provides both tabular summaries and interactive visualisations to help users understand their job resource usage and optimise future submissions.

Features

  • Multiple Input Formats: Accepts job IDs from command line, files, or various log formats (SLURM, DRMAA, Cromwell, Snakemake, Nextflow)
  • Comprehensive Metrics: Tracks CPU time, memory usage, elapsed time, job state, and node allocation
  • Statistical Summary: Provides count, mean, std, min, max, and percentile statistics
  • Interactive Visualisation: Generates plotly-based HTML reports with hover details for each job
  • Flexible Job ID Input: Handles single or multiple job IDs with various delimiters

Execution

# add core bioinformatics software to PATH, if not already done
export PATH=/ei/software/cb/bin:$PATH

source eiutils-0.1.0
get_hpc_job_summary --help

Usage

Basic Usage

Get summary for specific job IDs:

get_hpc_job_summary -l 13268787 23456789 > summary.tsv

From File

Extract job IDs from a log file:

get_hpc_job_summary -i slurm_jobs.log > summary.tsv
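
The input can be a plain list of job IDs (one per line) or a workflow log file. As a purely illustrative example, slurm_jobs.log might contain lines such as the following (hypothetical content, reusing job IDs from elsewhere on this page; see Supported Log Formats below):

Submitted batch job 13268787
Submitted batch job 13617529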

With Interactive Plots

Generate summary with HTML visualisation:

get_hpc_job_summary -l 13268787,13617529 -p > summary.tsv

Combine multiple sources:

get_hpc_job_summary -l 13268787 -i jobs.txt -p > summary.tsv

Command-Line Options

Option           Description
-i, --input      Input file with job IDs (one per line) or log files
-l, --list_jobs  Space- or comma-separated list of job IDs
-p, --plot       Generate an interactive HTML plot
-o, --override   Include job IDs shorter than 3 digits (skipped by default)
-v, --verbose    Enable detailed logging

Detailed Options

get_hpc_job_summary -h
usage: get_hpc_job_summary [-h] [-i INPUT] [-l LIST_JOBS [LIST_JOBS ...]] [-p] [-o] [-v]

        Script to get an HPC job summary from the SLURM scheduler


options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Provide a file with a list of SLURM job IDs (one job ID per line).
                        The script can also extract job IDs if you provide log files with job IDs in one (or a combination) of the following formats:
                        	'Submitted batch job 12345678' (Normal cluster executions),
                        	'Submitted DRMAA job 12345678' (DRMAA executions)
                        	'Submitted job 1 with external jobid 'Submitted batch job 12345678'' (Snakemake executions)
                        	': job id: 12345678' (Cromwell executions)
                        	'> jobId: 12345678;' (Nextflow executions)
                         (default: None)
  -l LIST_JOBS [LIST_JOBS ...], --list_jobs LIST_JOBS [LIST_JOBS ...]
                        Provide job IDs as a list.
                        This can be a single job ID or a list of job IDs separated by spaces or commas.
                        For example:
                        	-l 12345678 23456789 or
                        	-l 12345678,23456789.
                        No need to use quotes for the job IDs. (default: None)
  -p, --plot            Plot a summary of job statistics as an interactive HTML file.
                        Saved as 'HPC_job_summary_YYYYMMDD_HHMMSS.html' in the current directory. (default: False)
  -o, --override        Override the default behaviour of skipping job IDs that are less than 3 digits or all characters are 0.
                        Use this option to include such job IDs in the analysis (default: True).
  -v, --verbose         Enable verbose logging (default: False).

Example commands:
Without plots:
	get_hpc_job_summary.py -l 13268787,13617529 > hpc_job_summary.tsv
	get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt > hpc_job_summary.tsv

With plots:
	get_hpc_job_summary.py -l 13268787,13617529 -p > hpc_job_summary.tsv
	get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt -p > hpc_job_summary.tsv

Output

Tabular Output

Tab-separated values with columns:

  • JobID, Start, End, ReqCPUS, ReqMem(Mb), MaxRSS(Mb), ElapsedRaw(secs), CPUTimeRAW(secs), State, NodeList, JobName

Example Output

$ cat hpc_job_summary.15145984.tsv
JobID     Start                End                  ReqCPUS  ReqMem(Mb)  MaxRSS(Mb)  ElapsedRaw(secs)  CPUTimeRAW(secs)  State          NodeList  JobName
15145984  2025-09-16T11:01:14  2025-09-17T19:13:55  2        20480       119         115961            231922            FAILED         t512n9    eirepeat-run1
15146015  2025-09-16T11:01:44  2025-09-16T11:04:02  4        10240       2054        138               552               COMPLETED      t512n13   eirepeat.clean_genome
15146074  2025-09-16T11:04:45  2025-09-16T11:05:58  1        5120        346         73                73                COMPLETED      t512n13   eirepeat.BuildDatabase
15146075  2025-09-16T11:04:45  2025-09-16T16:08:57  16       25600       13065       18252             292032            COMPLETED      t512n18   eirepeat.RepeatMasker_low
15146076  2025-09-16T11:04:45  2025-09-16T11:20:32  1        10240       7995        947               947               OUT_OF_MEMORY  t512n13   eirepeat.red
15146077  2025-09-16T11:04:45  2025-09-16T11:10:43  16       40960       874         358               5728              FAILED         t512n20   eirepeat.RepeatMasker_interspersed
15146107  2025-09-16T11:06:15  2025-09-17T16:41:32  16       40960       1133        106517            1704272           COMPLETED      t384n3    eirepeat.RepeatModeler
15242670  2025-09-17T16:41:50  2025-09-18T04:15:48  16       40960       40819       41638             666208            OUT_OF_MEMORY  t512n13   eirepeat.RepeatMasker_interspersed_repeatmodeler
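
The column names correspond to SLURM accounting fields. This page does not describe how get_hpc_job_summary queries the scheduler internally, but as an assumption, equivalent raw records could be pulled manually with sacct; a minimal Python sketch (job ID reused from the example above):

import subprocess

# Assumption: the same fields are exposed by SLURM accounting via sacct;
# this is not necessarily how get_hpc_job_summary retrieves them, and the
# raw sacct values (e.g. ReqMem, MaxRSS) are not yet converted to Mb.
fields = ("JobID,Start,End,ReqCPUS,ReqMem,MaxRSS,"
          "ElapsedRaw,CPUTimeRAW,State,NodeList,JobName")
result = subprocess.run(
    ["sacct", "-j", "15145984", "--format", fields, "--parsable2", "--noheader"],
    capture_output=True, text=True, check=True,
)
for record in result.stdout.splitlines():
    print(record.split("|"))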

Statistical Summary

Displays descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max) for:

  • Requested CPUs
  • Requested Memory (Mb)
  • Maximum RSS Memory (Mb)
  • Elapsed Time (seconds)
  • CPU Time (seconds)

Example Output

# Overall
                  count   mean    std  min   25%    50%    75%     max
ReqCPUS               8      9      8    1     2     10     16      16
ReqMem(Mb)            8  24320  15176 5120 10240  23040  40960   40960
MaxRSS(Mb)            8   8301  13918  119   742   1594   9262   40819
ElapsedRaw(secs)      8  35486  48971   73   303   9600  57858  115961
CPUTimeRAW(secs)      8 362717 589791   73   848 118825 385576 1704272
# Summary (total)
Jobs               ReqCPUS  ReqMem(Mb)  MaxRSS(Mb)  ElapsedRaw(secs)  CPUTimeRAW(secs)
8                  72       194560      66405       283884            2901734
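
The same statistics can be reproduced from the saved TSV if you want to slice them further; a minimal pandas sketch, assuming the file name from the example above:

import pandas as pd

# Load the tab-separated summary written by get_hpc_job_summary
df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

# Metrics covered by the statistical summary
metrics = ["ReqCPUS", "ReqMem(Mb)", "MaxRSS(Mb)", "ElapsedRaw(secs)", "CPUTimeRAW(secs)"]

# count / mean / std / min / 25% / 50% / 75% / max per metric ("# Overall")
print(df[metrics].describe().T.round(0))

# Totals across all jobs ("# Summary (total)")
print(df[metrics].sum())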

Interactive HTML Plot

When the -p flag is used, an additional time-stamped HTML file, HPC_job_summary_YYYYMMDD_HHMMSS.html, is written to the current directory, containing:

  • 4 interactive strip plots showing resource usage by job state
  • Hover information: JobID, JobName, Start/End times, NodeList, Requested CPU Time (secs), Elapsed CPU Time (secs), Requested Memory (Mb), Used Memory (MaxRSS (Mb))
  • Visual comparison of requested vs. actual resource usage
  • Easy identification of over- or under-provisioned jobs (a rough sketch of one such panel is shown below)
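
The plotting code itself is not shown on this page; the sketch below is only a rough approximation of one such panel, built with plotly express from the TSV output (file and column names taken from the examples on this page):

import pandas as pd
import plotly.express as px

# Approximate one panel: memory actually used per job, grouped by job state
df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

fig = px.strip(
    df,
    x="State",
    y="MaxRSS(Mb)",
    color="State",
    hover_data=["JobID", "JobName", "Start", "End", "NodeList",
                "ReqMem(Mb)", "ElapsedRaw(secs)", "CPUTimeRAW(secs)"],
    title="Maximum RSS memory (Mb) by job state",
)
fig.write_html("HPC_job_summary_example.html")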

Example Interactive HTML Plot

[Screenshot: example interactive HTML plot]

Example Interactive HTML Plot - Zoomed In and Hovered View

[Screenshot: example interactive HTML plot, zoomed in with hover details]

Supported Log Formats

The script automatically extracts job IDs from the following formats when a log file is supplied via the --input option (a sketch of the pattern matching involved follows the list):

  • Standard SLURM: Submitted batch job 12345678
  • DRMAA: Submitted DRMAA job 12345678
  • Snakemake: Submitted job 1 with external jobid '12345678'
  • Cromwell: : job id: 12345678
  • Nextflow: > jobId: 12345678;
  • Plain text: One job ID per line
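
The exact patterns used by the script are not published here; the sketch below only illustrates the kind of pattern matching involved, with one permissive regular expression per format listed above:

import re

# One permissive pattern per supported format; the actual patterns used by
# get_hpc_job_summary may differ.
PATTERNS = [
    re.compile(r"Submitted batch job (\d+)"),   # standard SLURM (also matches Snakemake wrapper lines)
    re.compile(r"Submitted DRMAA job (\d+)"),   # DRMAA
    re.compile(r"external jobid '(\d+)'"),      # Snakemake
    re.compile(r": job id: (\d+)"),             # Cromwell
    re.compile(r"> jobId: (\d+);"),             # Nextflow
    re.compile(r"^(\d+)$"),                     # plain text, one job ID per line
]

def extract_job_ids(path):
    """Return job IDs found in a log file, in first-seen order."""
    job_ids = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            for pattern in PATTERNS:
                match = pattern.search(line)
                if match:
                    job_ids.append(match.group(1))
                    break
    return job_ids

print(extract_job_ids("slurm_example.15145984.log"))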

Use Cases

  1. Resource Optimisation: Identify jobs requesting excessive memory or CPU
  2. Failure Analysis: Quickly spot failed jobs and their resource profiles
  3. Cost Management: Understand total resource consumption across job sets
  4. Performance Tuning: Compare requested vs. actual usage to right-size future jobs (see the sketch below)
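
As an illustration of use cases 1 and 4, the sketch below compares requested and used memory from the TSV output to flag over-provisioned jobs (the 50% threshold is an arbitrary choice for this example):

import pandas as pd

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

# Fraction of requested memory actually used (MaxRSS vs ReqMem)
df["MemEfficiency(%)"] = 100 * df["MaxRSS(Mb)"] / df["ReqMem(Mb)"]

# Jobs that used less than half of what they asked for are candidates
# for smaller memory requests in future submissions
over_provisioned = df[df["MemEfficiency(%)"] < 50]
print(over_provisioned[["JobID", "JobName", "ReqMem(Mb)", "MaxRSS(Mb)", "MemEfficiency(%)"]])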

Worked Example

# copy example log file below to your work directory
$ cp /ei/software/cb/eiutils/tests/data/slurm_example.15145984.log .

# source eiutils environment
$ source eiutils-0.1.0

# Execute the command
$ get_hpc_job_summary -i slurm_example.15145984.log -l 15145984 -p > hpc_job_summary.15145984.tsv
18-Nov-25 23:16:46 - 10719 - root - WARNING - Skipping job ID '1' as it is less than 3 digits or all characters are 0. Use --override to include it.
18-Nov-25 23:16:47 - 10719 - root - INFO - Plotting job statistics ...
18-Nov-25 23:17:11 - 10719 - root - INFO - Interactive HTML plot saved as '/ei/cb/development/kaithakg/eiutils/dev/HPC_job_summary_20251118_231646.html'

# Review hpc_job_summary.15145984.tsv for tabular data
# Open HPC_job_summary_*.html in browser for interactive exploration
# Adjust resource requests for future jobs based on actual usage patterns

Interactive HTML file - HPC_job_summary_20251118_231646.html
Tabular data - hpc_job_summary.15145984.tsv
