HPC Job Summary Tool

A Python script to analyse and visualise SLURM job performance metrics on HPC clusters.

Overview

get_hpc_job_summary retrieves detailed statistics for SLURM scheduler jobs and provides both tabular summaries and interactive visualisations to help users understand their job resource usage and optimise future submissions.

Features

  • Multiple Input Formats: Accepts job IDs from command line, files, or various log formats (SLURM, DRMAA, Cromwell, Snakemake, Nextflow)
  • Comprehensive Metrics: Tracks CPU time, memory usage, elapsed time, job state, and node allocation
  • Statistical Summary: Provides count, mean, std, min, max, and percentile statistics
  • Interactive Visualisation: Generates plotly-based HTML reports with hover details for each job
  • Flexible Job ID Input: Handles single or multiple job IDs with various delimiters

Execution

# add core bioinformatics software to PATH, if not already done
export PATH=/ei/software/cb/bin:$PATH

source eiutils-latest
get_hpc_job_summary --help

Usage

Basic Usage

Get summary for specific job IDs:

get_hpc_job_summary -l 13268787 23456789 > summary.tsv

From File

Extract job IDs from a log file:

get_hpc_job_summary -i slurm_jobs.log > summary.tsv

With Interactive Plots

Generate summary with HTML visualisation:

get_hpc_job_summary -l 13268787,13617529 -p > summary.tsv

Combine multiple sources:

get_hpc_job_summary -l 13268787 -i jobs.txt -p > summary.tsv

Command-Line Options

Option            Description
-i, --input       Input file with job IDs (one per line) or log files
-l, --list_jobs   Space- or comma-separated list of job IDs
-p, --plot        Generate an interactive HTML plot
-o, --override    Include job IDs shorter than 3 digits or consisting entirely of zeros
-v, --verbose     Enable detailed logging

Detailed Options

get_hpc_job_summary -h
usage: get_hpc_job_summary [-h] [-i INPUT] [-l LIST_JOBS [LIST_JOBS ...]] [-p] [-o] [-v]

        Script to get an HPC job summary from the SLURM scheduler


options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Provide a file with a list of SLURM job IDs (one job ID per line).
                        The script can also extract job IDs if you provide log files with job IDs in one (or a combination) of the following formats:
                        	'Submitted batch job 12345678' (Normal cluster executions),
                        	'Submitted DRMAA job 12345678' (DRMAA executions)
                        	'Submitted job 1 with external jobid 'Submitted batch job 12345678'' (Snakemake executions)
                        	': job id: 12345678' (Cromwell executions)
                        	'> jobId: 12345678;' (Nextflow executions)
                         (default: None)
  -l LIST_JOBS [LIST_JOBS ...], --list_jobs LIST_JOBS [LIST_JOBS ...]
                        Provide job IDs as a list.
                        This can be a single job ID or a list of job IDs separated by spaces or commas.
                        For example:
                        	-l 12345678 23456789 or
                        	-l 12345678,23456789.
                        No need to use quotes for the job IDs. (default: None)
  -p, --plot            Plot a summary of job statistics as an interactive HTML file.
                        Saved as 'HPC_job_summary_YYYYMMDD_HHMMSS.html' in the current directory. (default: False)
  -o, --override        Override the default behaviour of skipping job IDs that are less than 3 digits or all characters are 0.
                        Use this option to include such job IDs in the analysis (default: True).
  -v, --verbose         Enable verbose logging (default: False).

Example commands:
Without plots:
	get_hpc_job_summary.py -l 13268787,13617529 > hpc_job_summary.tsv
	get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt > hpc_job_summary.tsv

With plots:
	get_hpc_job_summary.py -l 13268787,13617529 -p > hpc_job_summary.tsv
	get_hpc_job_summary.py -l 13268787,13617529 -i another_job_list.txt -p > hpc_job_summary.tsv

Output

Tabular Output

Tab-separated values with columns:

  • JobID, Start, End, ReqCPUS, ReqMem(Mb), MaxRSS(Mb), ElapsedRaw(secs), CPUTimeRAW(secs), State, NodeList, JobName

Example Output

$ cat hpc_job_summary.15145984.tsv
JobID     Start                End                  ReqCPUS  ReqMem(Mb)  MaxRSS(Mb)  ElapsedRaw(secs)  CPUTimeRAW(secs)  State          NodeList  JobName
15145984  2025-09-16T11:01:14  2025-09-17T19:13:55  2        20480       119         115961            231922            FAILED         t512n9    eirepeat-run1
15146015  2025-09-16T11:01:44  2025-09-16T11:04:02  4        10240       2054        138               552               COMPLETED      t512n13   eirepeat.clean_genome
15146074  2025-09-16T11:04:45  2025-09-16T11:05:58  1        5120        346         73                73                COMPLETED      t512n13   eirepeat.BuildDatabase
15146075  2025-09-16T11:04:45  2025-09-16T16:08:57  16       25600       13065       18252             292032            COMPLETED      t512n18   eirepeat.RepeatMasker_low
15146076  2025-09-16T11:04:45  2025-09-16T11:20:32  1        10240       7995        947               947               OUT_OF_MEMORY  t512n13   eirepeat.red
15146077  2025-09-16T11:04:45  2025-09-16T11:10:43  16       40960       874         358               5728              FAILED         t512n20   eirepeat.RepeatMasker_interspersed
15146107  2025-09-16T11:06:15  2025-09-17T16:41:32  16       40960       1133        106517            1704272           COMPLETED      t384n3    eirepeat.RepeatModeler
15242670  2025-09-17T16:41:50  2025-09-18T04:15:48  16       40960       40819       41638             666208            OUT_OF_MEMORY  t512n13   eirepeat.RepeatMasker_interspersed_repeatmodeler
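
These columns correspond to standard fields in SLURM's sacct accounting database. As a rough illustration only (not the script's actual implementation), a comparable table could be assembled manually; the helper fetch_jobs below is hypothetical, and the unit conversions the script applies (e.g. to Mb) are omitted:

# Sketch: query sacct for the same fields and load them into a DataFrame.
# fetch_jobs is a hypothetical helper, not part of get_hpc_job_summary.
import io
import subprocess

import pandas as pd

FIELDS = ("JobID,Start,End,ReqCPUS,ReqMem,MaxRSS,"
          "ElapsedRaw,CPUTimeRAW,State,NodeList,JobName")

def fetch_jobs(job_ids):
    """Return accounting records for the given job IDs as a DataFrame."""
    cmd = ["sacct", "-j", ",".join(job_ids),
           "--format", FIELDS, "--parsable2", "--noheader"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Note: sacct emits one row per job step (batch, extern, ...) as well.
    return pd.read_csv(io.StringIO(out), sep="|", names=FIELDS.split(","))

df = fetch_jobs(["15145984", "15146015"])
print(df.to_string(index=False))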

Statistical Summary

Displays descriptive statistics (count, mean, std, min, 25%, 50%, 75%, max) for:

  • Requested CPUs
  • Requested Memory (Mb)
  • Maximum RSS Memory (Mb)
  • Elapsed Time (seconds)
  • CPU Time (seconds)

Example Output

# Overall
                  count   mean    std  min   25%    50%    75%     max
ReqCPUS               8      9      8    1     2     10     16      16
ReqMem(Mb)            8  24320  15176 5120 10240  23040  40960   40960
MaxRSS(Mb)            8   8301  13918  119   742   1594   9262   40819
ElapsedRaw(secs)      8  35486  48971   73   303   9600  57858  115961
CPUTimeRAW(secs)      8 362717 589791   73   848 118825 385576 1704272
# Summary (total)
Jobs               ReqCPUS  ReqMem(Mb)  MaxRSS(Mb)  ElapsedRaw(secs)  CPUTimeRAW(secs)
8                  72       194560      66405       283884            2901734
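
The statistics in the '# Overall' block are the same set that pandas' describe() produces; whether the script uses pandas internally is an assumption. A minimal sketch that recomputes both blocks from the saved TSV:

# Sketch: recompute the summary statistics from the tool's TSV output.
import pandas as pd

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")
metrics = ["ReqCPUS", "ReqMem(Mb)", "MaxRSS(Mb)",
           "ElapsedRaw(secs)", "CPUTimeRAW(secs)"]

# Per-metric count/mean/std/min/quartiles/max, as in the '# Overall' block
print(df[metrics].describe().T.round(0))

# Column totals, as in the '# Summary (total)' block
print(df[metrics].sum())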

Interactive HTML Plot

When the -p flag is used, the script generates an additional time-stamped HTML file, HPC_job_summary_YYYYMMDD_HHMMSS.html, in the current directory, containing:

  • 4 interactive strip plots showing resource usage by job state
  • Hover information: JobID, JobName, Start/End times, NodeList, Requested CPU Time (secs), Elapsed CPU Time (secs), Requested Memory (Mb), Used Memory (MaxRSS (Mb))
  • Visual comparison of requested vs. actual resource usage
  • Easy identification of over/under-provisioned jobs
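
A minimal Plotly sketch in the same spirit, plotting one metric by job state from the saved TSV; the report's actual four-panel layout and exact hover fields are not reproduced here:

# Sketch: one strip plot of memory usage by job state (the real report
# has four such panels and richer hover details).
import pandas as pd
import plotly.express as px

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")
fig = px.strip(
    df, x="State", y="MaxRSS(Mb)",
    hover_data=["JobID", "JobName", "Start", "End", "NodeList"],
    title="Maximum RSS memory (Mb) by job state",
)
fig.write_html("HPC_job_summary_example.html")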

Example Interactive HTML Plot

[Screenshot: overview of the interactive HTML report]

Example Interactive HTML Plot - Zoomed In and Hovered View

[Screenshot: zoomed-in view showing hover details for a job]

Supported Log Formats

The script automatically extracts job IDs from the following formats when log files are supplied via --input (a sketch of matching patterns follows the list):

  • Standard SLURM: Submitted batch job 12345678
  • DRMAA: Submitted DRMAA job 12345678
  • Snakemake: Submitted job 1 with external jobid '12345678'
  • Cromwell: : job id: 12345678
  • Nextflow: > jobId: 12345678;
  • Plain text: One job ID per line
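
The script's actual regular expressions are not shown here; the patterns below are assumptions that match the formats listed above:

# Sketch: patterns that would match the supported log formats above.
# These are illustrative; the script's own regexes may differ.
import re

PATTERNS = [
    re.compile(r"Submitted batch job (\d+)"),   # standard SLURM
    re.compile(r"Submitted DRMAA job (\d+)"),   # DRMAA
    re.compile(r"external jobid '(?:Submitted batch job )?(\d+)'"),  # Snakemake
    re.compile(r": job id: (\d+)"),             # Cromwell
    re.compile(r"> jobId: (\d+);"),             # Nextflow
    re.compile(r"^(\d+)$"),                     # plain text, one ID per line
]

def extract_job_ids(path):
    """Return unique job IDs found in a log file, in order of appearance."""
    found = {}
    with open(path) as fh:
        for line in fh:
            for pat in PATTERNS:
                for match in pat.finditer(line.strip()):
                    found.setdefault(match.group(1), None)
    return list(found)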

Use Cases

  1. Resource Optimisation: Identify jobs requesting excessive memory or CPU
  2. Failure Analysis: Quickly spot failed jobs and their resource profiles
  3. Cost Management: Understand total resource consumption across job sets
  4. Performance Tuning: Compare requested vs. actual usage to right-size future jobs (see the sketch below)
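
For use case 4, a minimal sketch that flags memory over-provisioning from the TSV output; the 0.5 efficiency threshold is an arbitrary illustration, not a site recommendation:

# Sketch: flag jobs that used less than half of their requested memory.
# The 0.5 cut-off is illustrative only.
import pandas as pd

df = pd.read_csv("hpc_job_summary.15145984.tsv", sep="\t")
df["MemEfficiency"] = df["MaxRSS(Mb)"] / df["ReqMem(Mb)"]

over_provisioned = df[df["MemEfficiency"] < 0.5]
print(over_provisioned[["JobID", "JobName", "ReqMem(Mb)",
                        "MaxRSS(Mb)", "MemEfficiency"]])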

Worked Example

# copy the example log file to your working directory
$ cp /ei/software/cb/eiutils/tests/data/slurm_example.15145984.log .

# source eiutils environment
$ source eiutils-latest

# Execute the command
$ get_hpc_job_summary -i slurm_example.15145984.log -l 15145984 -p > hpc_job_summary.15145984.tsv
18-Nov-25 23:16:46 - 10719 - root - WARNING - Skipping job ID '1' as it is less than 3 digits or all characters are 0. Use --override to include it.
18-Nov-25 23:16:47 - 10719 - root - INFO - Plotting job statistics ...
18-Nov-25 23:17:11 - 10719 - root - INFO - Interactive HTML plot saved as '/path/to/eiutils/0.1.0/HPC_job_summary_20251118_231646.html'

# Review hpc_job_summary.15145984.tsv for tabular data
# Open HPC_job_summary_*.html in browser for interactive exploration
# Adjust resource requests for future jobs based on actual usage patterns

Interactive HTML file - HPC_job_summary_20251118_231646.html
Tabular data - hpc_job_summary.15145984.tsv
