HPC Job Summary Tool
A Python script to analyse and visualise SLURM job performance metrics on HPC clusters.
get_hpc_job_summary retrieves detailed statistics for SLURM scheduler jobs and provides both tabular summaries and interactive visualisations to help users understand their job resource usage and optimise future submissions.
- Multiple Input Formats: Accepts job IDs from command line, files, or various log formats (SLURM, DRMAA, Cromwell, Snakemake, Nextflow)
- Comprehensive Metrics: Tracks CPU time, memory usage, elapsed time, job state, and node allocation
- Statistical Summary: Provides count, mean, std, min, max, and percentile statistics
- Interactive Visualisation: Generates plotly-based HTML reports with hover details for each job
- Flexible Job ID Input: Handles single or multiple job IDs with various delimiters
# add core bioinformatics software to PATH, if not already done
export PATH=/ei/software/cb/bin:$PATH
source eiutils-0.1.0
get_hpc_job_summary --help

Get a summary for specific job IDs:
get_hpc_job_summary -l 13268787 23456789 > summary.tsv

Extract job IDs from a log file:
get_hpc_job_summary -i slurm_jobs.log > summary.tsv

Generate a summary with an HTML visualisation:
get_hpc_job_summary -l 13268787,13617529 -p > summary.tsv

Combine multiple sources:
get_hpc_job_summary -l 13268787 -i jobs.txt -p > summary.tsv

| Option | Description |
|---|---|
| -i, --input | Input file with job IDs (one per line) or log files |
| -l, --list_jobs | Space or comma-separated list of job IDs |
| -p, --plot | Generate interactive HTML plot |
| -o, --override | Include job IDs with fewer than 3 digits |
| -v, --verbose | Enable detailed logging |
get_hpc_job_summary -h
usage: get_hpc_job_summary [-h] [-i INPUT] [-l LIST_JOBS [LIST_JOBS ...]] [-p] [-o] [-v]
Script to get an HPC job summary from the SLURM scheduler
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Provide a file with a list of SLURM job IDs (one job ID per line).
The script can also extract job IDs if you provide log files with job IDs in one (or a combination) of the following formats:
'Submitted batch job 12345678' (Normal cluster executions),
'Submitted DRMAA job 12345678' (DRMAA executions)
'Submitted job 1 with external jobid 'Submitted batch job 12345678'' (Snakemake executions)
': job id: 12345678' (Cromwell executions)
'> jobId: 12345678;' (Nextflow executions)
(default: None)
-l LIST_JOBS [LIST_JOBS ...], --list_jobs LIST_JOBS [LIST_JOBS ...]
Provide job IDs as a list.
This can be a single job ID or a list of job IDs separated by spaces or commas.
For example:
-l 12345678 23456789 or
-l 12345678,23456789.
No need to use quotes for the job IDs. (default: None)
-p, --plot Plot a summary of job statistics as an interactive HTML file.
Saved as 'HPC_job_summary_YYYYMMDD_HHMMSS.html' in the current directory. (default: False)
-o, --override Override the default behaviour of skipping job IDs that are less than 3 digits or all characters are 0.
Use this option to include such job IDs in the analysis (default: True).
-v, --verbose Enable verbose logging (default: False).
Example commands:
Without plots:
get_hpc_job_summary.py -l 13268787,13617529 > hpc_job_summary.tsv
get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt > hpc_job_summary.tsv
With plots:
get_hpc_job_summary.py -l 13268787,13617529 -p > hpc_job_summary.tsv
get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt -p > hpc_job_summary.tsv

The summary output is tab-separated values with the following columns:
- JobID, Start, End, ReqCPUS, ReqMem(Mb), MaxRSS(Mb), ElapsedRaw(secs), CPUTimeRAW(secs), State, NodeList, JobName
$ cat hpc_job_summary.15145984.tsv
JobID Start End ReqCPUS ReqMem(Mb) MaxRSS(Mb) ElapsedRaw(secs) CPUTimeRAW(secs) State NodeList JobName
15145984 2025-09-16T11:01:14 2025-09-17T19:13:55 2 20480 119 115961 231922 FAILED t512n9 eirepeat-run1
15146015 2025-09-16T11:01:44 2025-09-16T11:04:02 4 10240 2054 138 552 COMPLETED t512n13 eirepeat.clean_genome
15146074 2025-09-16T11:04:45 2025-09-16T11:05:58 1 5120 346 73 73 COMPLETED t512n13 eirepeat.BuildDatabase
15146075 2025-09-16T11:04:45 2025-09-16T16:08:57 16 25600 13065 18252 292032 COMPLETED t512n18 eirepeat.RepeatMasker_low
15146076 2025-09-16T11:04:45 2025-09-16T11:20:32 1 10240 7995 947 947 OUT_OF_MEMORY t512n13 eirepeat.red
15146077 2025-09-16T11:04:45 2025-09-16T11:10:43 16 40960 874 358 5728 FAILED t512n20 eirepeat.RepeatMasker_interspersed
15146107 2025-09-16T11:06:15 2025-09-17T16:41:32 16 40960 1133 106517 1704272 COMPLETED t384n3 eirepeat.RepeatModeler
15242670 2025-09-17T16:41:50 2025-09-18T04:15:48 16 40960 40819 41638 666208 OUT_OF_MEMORY t512n13 eirepeat.RepeatMasker_interspersed_repeatmodeler
The script also displays descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max) for:
- Requested CPUs
- Requested Memory (Mb)
- Maximum RSS Memory (Mb)
- Elapsed Time (seconds)
- CPU Time (seconds)
# Overall
count mean std min 25% 50% 75% max
ReqCPUS 8 9 8 1 2 10 16 16
ReqMem(Mb) 8 24320 15176 5120 10240 23040 40960 40960
MaxRSS(Mb) 8 8301 13918 119 742 1594 9262 40819
ElapsedRaw(secs) 8 35486 48971 73 303 9600 57858 115961
CPUTimeRAW(secs) 8 362717 589791 73 848 118825 385576 1704272
# Summary (total)
Jobs ReqCPUS ReqMem(Mb) MaxRSS(Mb) ElapsedRaw(secs) CPUTimeRAW(secs)
8 72 194560 66405 283884 2901734
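
These summaries can also be recomputed or extended downstream with pandas. The sketch below is a minimal, hypothetical example, assuming the redirected file contains only the tab-separated job table shown earlier (the filename is taken from the example output):

```python
import pandas as pd

# Load the tab-separated summary written by get_hpc_job_summary
df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

metrics = ["ReqCPUS", "ReqMem(Mb)", "MaxRSS(Mb)", "ElapsedRaw(secs)", "CPUTimeRAW(secs)"]

# Per-metric descriptive statistics, comparable to the "# Overall" block above
print(df[metrics].describe().T.round(0))

# Column totals, comparable to the "# Summary (total)" block above
print(df[metrics].sum().to_frame("Total").T)
```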
When the -p flag is used, the script generates an additional time-stamped HTML file named HPC_job_summary_YYYYMMDD_HHMMSS.html with:
- 4 interactive strip plots showing resource usage by job state
- Hover information: JobID, JobName, Start/End times, NodeList, Requested CPU Time (secs), Elapsed CPU Time (secs), Requested Memory (Mb), Used Memory (MaxRSS (Mb))
- Visual comparison of requested vs. actual resource usage
- Easy identification of over/under-provisioned jobs
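
The HTML report itself is produced by the tool, but a comparable view can be built by hand from the TSV output. The snippet below is an illustrative sketch using plotly express, not the script's actual plotting code; the filename and column names are taken from the example output above:

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

# Strip plot of used memory per job, grouped by job state, with
# per-job details on hover (similar in spirit to one panel of the report)
fig = px.strip(
    df,
    x="State",
    y="MaxRSS(Mb)",
    color="State",
    hover_data=["JobID", "JobName", "Start", "End", "NodeList", "ReqMem(Mb)"],
)
fig.write_html("memory_strip_plot.html")  # hypothetical output filename
```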
The script automatically extracts job IDs from the following formats (when provided with the option --input):
- Standard SLURM: Submitted batch job 12345678
- DRMAA: Submitted DRMAA job 12345678
- Snakemake: Submitted job 1 with external jobid '12345678'
- Cromwell: : job id: 12345678
- Nextflow: > jobId: 12345678;
- Plain text: One job ID per line
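
For illustration only, lines in these formats can be matched with regular expressions along the lines below; the patterns and the helper function are hypothetical approximations, not the script's actual extraction code:

```python
import re

# Hypothetical patterns approximating the supported log formats above
PATTERNS = [
    r"Submitted batch job (\d+)",   # standard SLURM (also inside Snakemake lines)
    r"Submitted DRMAA job (\d+)",   # DRMAA
    r"external jobid '(\d+)'",      # Snakemake
    r": job id: (\d+)",             # Cromwell
    r"> jobId: (\d+);",             # Nextflow
    r"^(\d+)$",                     # plain text, one job ID per line
]

def extract_job_ids(path):
    """Collect candidate SLURM job IDs from a log file (illustrative only)."""
    job_ids = set()
    with open(path) as fh:
        for line in fh:
            for pattern in PATTERNS:
                job_ids.update(re.findall(pattern, line.strip()))
    return sorted(job_ids)
```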
- Resource Optimisation: Identify jobs requesting excessive memory or CPU
- Failure Analysis: Quickly spot failed jobs and their resource profiles
- Cost Management: Understand total resource consumption across job sets
- Performance Tuning: Compare requested vs. actual usage to right-size future jobs
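
As a concrete example of the resource optimisation and performance tuning points above, requested versus used memory can be compared directly from the TSV output. This is a minimal, hypothetical pandas sketch (the filename and threshold are illustrative):

```python
import pandas as pd

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

# Fraction of requested memory actually used by each job
df["MemUsedFrac"] = df["MaxRSS(Mb)"] / df["ReqMem(Mb)"]

# Jobs using under 25% of requested memory are candidates for smaller requests
over_provisioned = df[df["MemUsedFrac"] < 0.25]
print(over_provisioned[["JobID", "JobName", "ReqMem(Mb)", "MaxRSS(Mb)", "MemUsedFrac"]])
```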
# copy example log file below to your work directory
$ cp /ei/software/cb/eiutils/tests/data/slurm_example.15145984.log .
# source eiutils environment
$ source eiutils-0.1.0
# Execute the command
$ get_hpc_job_summary -i slurm_example.15145984.log -l 15145984 -p > hpc_job_summary.15145984.tsv
18-Nov-25 23:16:46 - 10719 - root - WARNING - Skipping job ID '1' as it is less than 3 digits or all characters are 0. Use --override to include it.
18-Nov-25 23:16:47 - 10719 - root - INFO - Plotting job statistics ...
18-Nov-25 23:17:11 - 10719 - root - INFO - Interactive HTML plot saved as '/ei/cb/development/kaithakg/eiutils/dev/HPC_job_summary_20251118_231646.html'
# Review hpc_job_summary.15145984.tsv for tabular data
# Open HPC_job_summary_*.html in browser for interactive exploration
# Adjust resource requests for future jobs based on actual usage patterns

Interactive HTML file - HPC_job_summary_20251118_231646.html
Tabular data - hpc_job_summary.15145984.tsv