HPC Job Summary Tool
A Python script to analyse and visualise SLURM job performance metrics on HPC clusters.
get_hpc_job_summary retrieves detailed statistics for SLURM scheduler jobs and provides both tabular summaries and interactive visualisations to help users understand their job resource usage and optimise future submissions.
- Multiple Input Formats: Accepts job IDs from command line, files, or various log formats (SLURM, DRMAA, Cromwell, Snakemake, Nextflow)
- Comprehensive Metrics: Tracks CPU time, memory usage, elapsed time, job state, and node allocation
- Statistical Summary: Provides count, mean, std, min, max, and percentile statistics
- Interactive Visualisation: Generates plotly-based HTML reports with hover details for each job
- Flexible Job ID Input: Handles single or multiple job IDs with various delimiters
# add core bioinformatics software to PATH, if not already done
export PATH=/ei/software/cb/bin:$PATH
source eiutils-latest
get_hpc_job_summary --help

Get summary for specific job IDs:

get_hpc_job_summary -l 13268787 23456789 > summary.tsv

Extract job IDs from a log file:

get_hpc_job_summary -i slurm_jobs.log > summary.tsv

Generate summary with HTML visualisation:

get_hpc_job_summary -l 13268787,13617529 -p > summary.tsv

Combine multiple sources:
get_hpc_job_summary -l 13268787 -i jobs.txt -p > summary.tsv

| Option | Description |
|---|---|
| -i, --input | Input file with job IDs (one per line) or log files |
| -l, --list_jobs | Space- or comma-separated list of job IDs |
| -p, --plot | Generate an interactive HTML plot |
| -o, --override | Include job IDs with fewer than 3 digits or all zeros |
| -v, --verbose | Enable detailed logging |
get_hpc_job_summary -h
usage: get_hpc_job_summary [-h] [-i INPUT] [-l LIST_JOBS [LIST_JOBS ...]] [-p] [-o] [-v]
Script to get an HPC job summary from the SLURM scheduler
options:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Provide a file with a list of SLURM job IDs (one job ID per line).
The script can also extract job IDs if you provide log files with job IDs in one (or a combination) of the following formats:
'Submitted batch job 12345678' (Normal cluster executions),
'Submitted DRMAA job 12345678' (DRMAA executions)
'Submitted job 1 with external jobid 'Submitted batch job 12345678'' (Snakemake executions)
': job id: 12345678' (Cromwell executions)
'> jobId: 12345678;' (Nextflow executions)
(default: None)
-l LIST_JOBS [LIST_JOBS ...], --list_jobs LIST_JOBS [LIST_JOBS ...]
Provide job IDs as a list.
This can be a single job ID or a list of job IDs separated by spaces or commas.
For example:
-l 12345678 23456789 or
-l 12345678,23456789.
No need to use quotes for the job IDs. (default: None)
-p, --plot Plot a summary of job statistics as an interactive HTML file.
Saved as 'HPC_job_summary_YYYYMMDD_HHMMSS.html' in the current directory. (default: False)
-o, --override Override the default behaviour of skipping job IDs that are less than 3 digits or all characters are 0.
Use this option to include such job IDs in the analysis (default: True).
-v, --verbose Enable verbose logging (default: False).
Example commands:
Without plots:
get_hpc_job_summary.py -l 13268787,13617529 > hpc_job_summary.tsv
get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt > hpc_job_summary.tsv
With plots:
get_hpc_job_summary.py -l 13268787,13617529 -p > hpc_job_summary.tsv
get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt -p > hpc_job_summary.tsv

Tab-separated values with columns:
- JobID, Start, End, ReqCPUS, ReqMem(Mb), MaxRSS(Mb), ElapsedRaw(secs), CPUTimeRAW(secs), State, NodeList, JobName
$ cat hpc_job_summary.15145984.tsv
JobID Start End ReqCPUS ReqMem(Mb) MaxRSS(Mb) ElapsedRaw(secs) CPUTimeRAW(secs) State NodeList JobName
15145984 2025-09-16T11:01:14 2025-09-17T19:13:55 2 20480 119 115961 231922 FAILED t512n9 eirepeat-run1
15146015 2025-09-16T11:01:44 2025-09-16T11:04:02 4 10240 2054 138 552 COMPLETED t512n13 eirepeat.clean_genome
15146074 2025-09-16T11:04:45 2025-09-16T11:05:58 1 5120 346 73 73 COMPLETED t512n13 eirepeat.BuildDatabase
15146075 2025-09-16T11:04:45 2025-09-16T16:08:57 16 25600 13065 18252 292032 COMPLETED t512n18 eirepeat.RepeatMasker_low
15146076 2025-09-16T11:04:45 2025-09-16T11:20:32 1 10240 7995 947 947 OUT_OF_MEMORY t512n13 eirepeat.red
15146077 2025-09-16T11:04:45 2025-09-16T11:10:43 16 40960 874 358 5728 FAILED t512n20 eirepeat.RepeatMasker_interspersed
15146107 2025-09-16T11:06:15 2025-09-17T16:41:32 16 40960 1133 106517 1704272 COMPLETED t384n3 eirepeat.RepeatModeler
15242670 2025-09-17T16:41:50 2025-09-18T04:15:48 16 40960 40819 41638 666208 OUT_OF_MEMORY t512n13 eirepeat.RepeatMasker_interspersed_repeatmodeler
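These per-job columns make quick efficiency checks straightforward. Below is a minimal post-processing sketch (a hypothetical helper script, not part of eiutils) that reads the TSV with pandas and flags completed jobs that used only a small fraction of their requested memory; the 25% cut-off is an illustrative assumption.

```python
# check_memory_efficiency.py - hypothetical sketch, not part of eiutils
import pandas as pd

# Read the tool's TSV output; column names follow the example above
df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

# Fraction of requested memory actually used per job
df["mem_efficiency"] = df["MaxRSS(Mb)"] / df["ReqMem(Mb)"]

# Flag completed jobs that used under 25% of requested memory (illustrative cut-off)
flagged = df[(df["State"] == "COMPLETED") & (df["mem_efficiency"] < 0.25)]
print(flagged[["JobID", "JobName", "ReqMem(Mb)", "MaxRSS(Mb)", "mem_efficiency"]]
      .to_string(index=False))
```

On the example above this flags, among others, eirepeat.RepeatModeler, which peaked at 1133 Mb of the 40960 Mb it requested.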
The script also prints descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max) for:
- Requested CPUs
- Requested Memory (Mb)
- Maximum RSS Memory (Mb)
- Elapsed Time (seconds)
- CPU Time (seconds)
# Overall
count mean std min 25% 50% 75% max
ReqCPUS 8 9 8 1 2 10 16 16
ReqMem(Mb) 8 24320 15176 5120 10240 23040 40960 40960
MaxRSS(Mb) 8 8301 13918 119 742 1594 9262 40819
ElapsedRaw(secs) 8 35486 48971 73 303 9600 57858 115961
CPUTimeRAW(secs) 8 362717 589791 73 848 118825 385576 1704272
# Summary (total)
Jobs ReqCPUS ReqMem(Mb) MaxRSS(Mb) ElapsedRaw(secs) CPUTimeRAW(secs)
8 72 194560 66405 283884 2901734
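The Overall and Summary (total) tables map directly onto standard pandas aggregations. Here is a minimal sketch of how equivalent figures could be reproduced from the TSV; the script name and rounding are assumptions, as the tool computes these internally:

```python
# reproduce_summary.py - hypothetical sketch of the equivalent pandas calls
import pandas as pd

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")
metrics = ["ReqCPUS", "ReqMem(Mb)", "MaxRSS(Mb)", "ElapsedRaw(secs)", "CPUTimeRAW(secs)"]

# Overall: count, mean, std, min, 25%, 50%, 75%, max for each metric
print(df[metrics].describe().T.round(0))

# Summary (total): number of jobs plus column sums across all jobs
totals = df[metrics].sum()
totals["Jobs"] = len(df)
print(totals.to_string())
```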
When the -p flag is used, the script generates an additional time-stamped HTML file named HPC_job_summary_YYYYMMDD_HHMMSS.html with:
- 4 interactive strip plots showing resource usage by job state
- Hover information: JobID, JobName, Start/End times, NodeList, Requested CPU Time (secs), Elapsed CPU Time (secs), Requested Memory (Mb), Used Memory (MaxRSS (Mb))
- Visual comparison of requested vs. actual resource usage
- Easy identification of over/under-provisioned jobs
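The strip plots can be approximated with plotly express. Below is a minimal sketch of a single panel, assuming the TSV output above; the tool's actual figure layout, panel count, and hover fields may differ:

```python
# plot_sketch.py - illustrative approximation of one panel, not the tool's code
import pandas as pd
import plotly.express as px

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")

# Strip plot of used memory per job, grouped by job state, with per-job hover details
fig = px.strip(
    df,
    x="State",
    y="MaxRSS(Mb)",
    color="State",
    hover_data=["JobID", "JobName", "Start", "End", "NodeList", "ReqMem(Mb)"],
    title="Used memory (MaxRSS) by job state",
)
fig.write_html("HPC_job_summary_sketch.html")
```

Including ReqMem(Mb) in the hover alongside the plotted MaxRSS makes under-used allocations stand out at a glance.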
The script automatically extracts job IDs from the following formats when log files are supplied via the --input option (a pattern-matching sketch follows the list):
- Standard SLURM: Submitted batch job 12345678
- DRMAA: Submitted DRMAA job 12345678
- Snakemake: Submitted job 1 with external jobid '12345678'
- Cromwell: : job id: 12345678
- Nextflow: > jobId: 12345678;
- Plain text: one job ID per line
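A minimal sketch of how such lines could be matched with Python regular expressions; the patterns are inferred from the examples above (not taken from the tool's source), and the ID filter mirrors the default behaviour described under --override:

```python
# extract_job_ids.py - hypothetical sketch; patterns inferred from the formats above
import re

# One pattern per supported log format
PATTERNS = [
    re.compile(r"Submitted batch job (\d+)"),  # standard SLURM
    re.compile(r"Submitted DRMAA job (\d+)"),  # DRMAA
    re.compile(r"external jobid '(?:Submitted batch job )?(\d+)'"),  # Snakemake
    re.compile(r": job id: (\d+)"),            # Cromwell
    re.compile(r"> jobId: (\d+);"),            # Nextflow
]

def looks_valid(job_id):
    """Default filter: drop IDs under 3 digits or made up entirely of zeros."""
    return len(job_id) >= 3 and set(job_id) != {"0"}

def extract_job_ids(path):
    """Yield the first job ID matched on each line, trying the patterns in order."""
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            if stripped.isdigit():  # plain text: one job ID per line
                if looks_valid(stripped):
                    yield stripped
                continue
            for pattern in PATTERNS:
                match = pattern.search(line)
                if match:
                    if looks_valid(match.group(1)):
                        yield match.group(1)
                    break

print(list(extract_job_ids("slurm_example.15145984.log")))
```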
- Resource Optimisation: Identify jobs requesting excessive memory or CPU
- Failure Analysis: Quickly spot failed jobs and their resource profiles
- Cost Management: Understand total resource consumption across job sets
- Performance Tuning: Compare requested vs. actual usage to right-size future jobs
# copy example log file below to your work directory
$ cp /ei/software/cb/eiutils/tests/data/slurm_example.15145984.log .
# source eiutils environment
$ source eiutils-latest
# Execute the command
$ get_hpc_job_summary -i slurm_example.15145984.log -l 15145984 -p > hpc_job_summary.15145984.tsv
18-Nov-25 23:16:46 - 10719 - root - WARNING - Skipping job ID '1' as it is less than 3 digits or all characters are 0. Use --override to include it.
18-Nov-25 23:16:47 - 10719 - root - INFO - Plotting job statistics ...
18-Nov-25 23:17:11 - 10719 - root - INFO - Interactive HTML plot saved as '/path/to/eiutils/0.1.0/HPC_job_summary_20251118_231646.html'
# Review hpc_job_summary.15145984.tsv for tabular data
# Open HPC_job_summary_*.html in browser for interactive exploration
# Adjust resource requests for future jobs based on actual usage patterns

The run produces two outputs:
- Interactive HTML file: HPC_job_summary_20251118_231646.html
- Tabular data: hpc_job_summary.15145984.tsv