This project is deliberately small: one Streamlit app that shells out to Slurm, split into three Python modules plus a helper script to start it safely on a login node.
- Process: one Streamlit process running `swc_slurm_dashboard.py` on an HPC login node (e.g. `hpc-gw2`).
- Data source: read-only Slurm CLI commands (`squeue`, `sacct`, `scontrol show job`).
- Python modules:
  - `read_slurm_data.py` – read layer: shell helpers and Slurm parsers that turn `squeue` / `sacct` / `scontrol show job` output into pandas DataFrames or text.
  - `shape_slurm_data.py` – shape layer: pure helpers that aggregate and reshape those DataFrames into the summaries used by the UI.
  - `swc_slurm_dashboard.py` – view layer: Streamlit UI, cached `get_*` wrappers over the read/shape layers, sidebar, and tabs.
- UI: a single Streamlit app with three main tabs:
  - `Overview` – live queue summary, recent finished jobs, and failures.
  - `Job inspector` – detailed view of a single job via `scontrol`.
  - `Help` – documentation on how jobs / arrays / names map onto the dashboard.
Users normally:
- Start the app on a login node via `run_dashboard.sh` (often in `tmux`).
- Open an SSH tunnel from their laptop.
- Visit `http://localhost:<LOCAL_PORT>` in a browser.
```mermaid
flowchart LR
  A[Laptop Browser] -->|SSH tunnel| B[Streamlit UI: swc_slurm_dashboard.py]
  B --> C[read_slurm_data.py<br/>parse_squeue / parse_sacct / scontrol_show_job]
  B --> D[shape_slurm_data.py<br/>summarise_live_by_name / summarise_failures_by_name]
  C --> F[squeue]
  C --> G[sacct]
  C --> H[scontrol show job]
  D --> I[Live summary + Jobs by name]
  D --> J[Finished jobs + failures]
  H --> K[Job inspector output]
```
Responsibility
- Start the Streamlit app with a sensible port and clear tunnel instructions.
Key behavior
- Picks a port:
  - If you pass a port: uses it directly.
  - Else: finds the first free port in `8501`–`8510` with a tiny Python snippet.
- Prints:
  - Hostname and chosen port.
  - A ready-to-copy SSH command:
    `ssh -J <user>@ssh.swc.ucl.ac.uk <user>@<host> -N -L <LOCAL_PORT>:127.0.0.1:<PORT>`
  - The browser URL to open: `http://localhost:<LOCAL_PORT>`.
- Runs:
  `streamlit run swc_slurm_dashboard.py --server.port "$PORT" --server.address 0.0.0.0`
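The "first free port" step can be sketched in a few lines of Python. This is an illustration of the approach, not the actual snippet embedded in `run_dashboard.sh`:

```python
import socket

def first_free_port(start: int = 8501, end: int = 8510) -> int:
    """Return the first port in [start, end] that can be bound on localhost."""
    for port in range(start, end + 1):
        with socket.socket() as s:
            try:
                s.bind(("127.0.0.1", port))
            except OSError:
                continue  # port already in use; try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")
```

Binding and immediately releasing the socket is a cheap way to test availability; there is a small race window between the check and Streamlit claiming the port, which is acceptable for an interactive launcher.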
This script encodes the “official” way to start the portal and is the only place you should need to touch for port / tunnelling conventions.
Responsibility
- Provide a small, typed API for reading Slurm data into pandas DataFrames or raw text using the standard CLI tools.
Key pieces
- Shell helpers:
  - `sh(cmd)` – thin wrapper around `subprocess.check_output`.
  - `safe_sh(cmd)` – calls `sh` but returns error text instead of raising.
- Column definitions for `squeue` / `sacct`.
- Parsers:
  - `parse_squeue(user)` – returns a fixed-shape live-queue DataFrame.
  - `parse_sacct(user, start)` – returns a fixed-shape history DataFrame.
- Other helpers:
  - `list_squeue_users()` – distinct users in `squeue` plus `$USER`.
  - `scontrol_show_job(job_id)` – validation + raw `scontrol show job` output.
Only read-only, fixed-format Slurm commands ever reach `safe_sh`; no free-form user shell is executed.
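A minimal sketch of what the two shell helpers might look like (the names come from the module; the exact bodies are assumptions):

```python
import subprocess

def sh(cmd: list[str]) -> str:
    """Run a fixed, read-only command and return its stdout; raises on failure."""
    return subprocess.check_output(cmd, text=True, stderr=subprocess.STDOUT)

def safe_sh(cmd: list[str]) -> str:
    """Like sh(), but returns error text instead of raising,
    so the UI can show a message rather than crash."""
    try:
        return sh(cmd)
    except FileNotFoundError:
        return f"ERROR: command not found: {cmd[0]}"
    except subprocess.CalledProcessError as exc:
        return f"ERROR (exit {exc.returncode}): {exc.output}"
```

Passing commands as argument lists (never `shell=True`) is what makes the "no free-form user shell" guarantee easy to uphold.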
Responsibility
- Take the raw DataFrames from `read_slurm_data.py` and turn them into the higher-level summaries the UI needs.
Key pieces
- `summarise_live_by_name(df)` – live queue grouped by job name with per-name counts, status summary, elapsed time, and a representative sample JobID.
- `summarise_failures_by_name(dfh)` – failures grouped by JobName with a count and "last failure" details (JobID, State, ExitCode, Elapsed, Node, MaxRSS, and optional fields such as ReqMem / Timelimit / WorkDir).
- `derive_history_start_from_squeue(df)` – approximates a sensible `--starttime` for `sacct` based on the oldest running job's elapsed time, or the start of today if nothing is running.
- `_parse_maxrss_to_gb(value)` – converts Slurm MaxRSS strings into GiB.
- `_derive_array_or_job_id(job_id)` – maps `12345_3` → `12345` to group array elements.
All of these helpers are pure transformations (no IO).
`swc_slurm_dashboard.py` is the Streamlit entrypoint and is structured into clear sections (in order):
- Styles and page config
- Cached wrappers over the read/shape layers
- Refresh timer helper
- Sidebar (user + refresh)
- Tabs (`Overview`, `Job inspector`, `Help`)
- `st.set_page_config(...)`:
  - Title: `SWC Slurm Dashboard`.
  - Layout: wide; expanded sidebar.
- CSS injected via `st.markdown(..., unsafe_allow_html=True)`:
  - Section title color and typography.
  - Status colors: RUNNING, WAITING, FAILED, DONE (matching the legend).
  - Health banner: OK (green), ATTENTION NEEDED (orange).
  - Subtle input focus ring (neutral, not "error red").
This gives a consistent dark theme without depending on external CSS files.
To avoid hammering the scheduler and to give Streamlit stable entrypoints, the view layer wraps the read/shape functions in `@st.cache_data(ttl=...)`-decorated helpers:

- `get_squeue_users()` – uses `list_squeue_users()`.
- `get_squeue(user)` – uses `parse_squeue(user)`.
- `get_sacct(user, start)` – uses `parse_sacct(user, start)`.
- `get_live_by_name(df)` – uses `summarise_live_by_name(df)`.
- `get_failures_by_name(dfh)` – uses `summarise_failures_by_name(dfh)`.
- `get_scontrol_job(job_id)` – uses `scontrol_show_job(job_id)`.
The Refresh now button in the sidebar does a full refresh by calling `st.cache_data.clear()` before rerunning, so you always get fresh data when you ask for it.
The sidebar manages:
- User selection:
  - `selected_user` from a `selectbox` backed by `get_squeue_users()`.
- Refresh control:
  - `last_manual_refresh_ts` stored in `st.session_state`.
  - A small `render_refresh_age(...)` helper shows `Elapsed since refresh: HH:MM:SS` in the sidebar.
  - Refresh now button:
    - Clears caches via `st.cache_data.clear()`.
    - Updates `last_manual_refresh_ts`.
    - Calls `st.rerun()`.
This keeps the user context and manual refresh behavior in a single, predictable place.
Rendered inside the Overview tab.
Header + meta
- Title: `SWC Slurm Dashboard`.
- Meta line: `User: <user> · Last updated: <UTC timestamp>`.
Live summary
- `df = get_squeue(selected_user)`.
- If empty:
  - Metrics = 0, info message "No jobs in queue."
- Else:
  - Metrics:
    - TOTAL jobs
    - RUNNING jobs
    - WAITING jobs
    - DEP problems (DependencyNeverSatisfied count)
  - Health banner:
    - OK (green) if `dep_bad == 0`.
    - ATTENTION NEEDED (orange) otherwise.
Jobs by name
When the queue isn't empty:

- Section title: `QUEUED JOBS (by name)`.
- `How to read this section` expander explains:
  - Grouping by job name.
  - Meaning of each column.
  - Importance of `BLOCKED (dependency never satisfied)`.
  - Status color legend (RUNNING, WAITING, FAILED, DONE).
- Table:
  - Data: `df_by_name = get_live_by_name(df)` → `df_display`.
  - Rendered with `st.dataframe(...)` and a style function that colors the STATUS (summary) column to match the legend.
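The grouping behind this table can be sketched with a plain pandas `groupby`. Column names here are hypothetical; the actual summariser lives in `shape_slurm_data.py`:

```python
import pandas as pd

def live_by_name(df: pd.DataFrame) -> pd.DataFrame:
    """Group a live-queue DataFrame by job name, with counts and a sample JobID."""
    return (
        df.groupby("NAME")
        .agg(
            TOTAL=("JOBID", "size"),
            RUNNING=("STATE", lambda s: int((s == "RUNNING").sum())),
            SAMPLE_JOB_ID=("JOBID", "first"),
        )
        .reset_index()
    )
```

Named aggregations keep the output columns fixed-shape regardless of input, which is what lets the view layer style them unconditionally.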
Finished jobs
- Section title: `FINISHED JOBS (since: <date>)`.
- `How to read this section` expander explains:
  - The since date is the start of the history window, derived from the live queue:
    - It starts roughly when your longest-running current task started (based on the elapsed time reported by `squeue`), or
    - From the beginning of today (UTC) if nothing is running.
  - Only successful tasks are included (state contains `COMPLETED` and `ExitCode` starts with `0:`).
  - The table is split into:
    - Related to running jobs (jobs whose array job ID matches an array that currently has at least one RUNNING job).
    - Other finished jobs (all other successful tasks in the window).
  - Each row is one `JobID` from Slurm (which, for job arrays, may be a specific array element such as `12345_0`), with its array-or-job identifier, name, state, exit code, elapsed time, and node list.
- Data flow:
  - A start time and label are derived from `squeue` via `derive_history_start_from_squeue(df)`.
  - `dfh_window = get_sacct(selected_user, start_time)`.
  - A filtered subset of successful tasks is rendered as two `st.dataframe(...)` tables (related vs other).
Failures
- Section title: `FAILURES (since: <date>)`.
- `How to read this section` expander explains:
  - The same history window as Finished jobs is used.
  - Included rows:
    - States matching `FAILED`, `CANCELLED`, `TIMEOUT`, `OUT_OF_MEMORY`, or
    - Any row with a non-zero `ExitCode`.
  - The table is split into:
    - Related to running job names.
    - Other failures.
  - Each row is grouped by `JobName` and includes: `Count`, last failing `JobID` (for arrays this is a specific array element such as `12345_0`), state, exit code, elapsed time, node, `MaxRSS`, and optional resource columns (e.g. `ReqMem`, `Timelimit`, `CPUTime`, `WorkDir`) when present.
- Data flow:
  - `df_fail_all = get_failures_by_name(dfh_window)`.
  - Two `st.dataframe(...)` tables are rendered (related vs other).
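The row filter described above can be sketched as a boolean mask over the history DataFrame (column names as used in this section; an illustration, not the exact implementation):

```python
import pandas as pd

FAILURE_STATES = ("FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY")

def failure_mask(dfh: pd.DataFrame) -> pd.Series:
    """True for rows whose State matches a failure state, or whose
    ExitCode is non-zero (i.e. does not start with '0:')."""
    bad_state = dfh["State"].str.contains("|".join(FAILURE_STATES), na=False)
    bad_exit = ~dfh["ExitCode"].str.startswith("0:", na=False)
    return bad_state | bad_exit
```

Using `str.contains` rather than equality matters because Slurm emits compound states such as `CANCELLED by <uid>`.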
Rendered inside the Job inspector tab.
Purpose
- Let the user run `scontrol show job <JobID>` via a simple form, and see the raw Slurm output for that job.
UI
- Help text explaining what the tool does and how to use it.
- Two columns:
  - Left: free text input `Job ID` (e.g. 12345 or 12345_3).
  - Right: dropdown `Or pick from your queue`, backed by live `get_squeue(...)`.
- Resolution:
  - Chooses the picked ID if present, otherwise the typed ID.
  - Validates the ID via `get_scontrol_job(job_id)`.
Output
- If a valid job ID is provided:
  - `st.code(..., language="text")` showing the raw `scontrol` output.
- Otherwise:
  - An info message asking for a job ID.
The Job inspector is intentionally thin; it delegates all Slurm semantics to `scontrol`.
Rendered inside the Help tab.
- Displays the contents of `SLURM_DASHBOARD_HELP.md` using `st.markdown`.
- Explains how Slurm jobs / arrays / job names map onto:
  - SUMMARY
  - QUEUED JOBS
  - FINISHED JOBS
  - FAILURES
- Intended deployment is a trusted HPC environment:
- App runs on a login node.
- Users access it via SSH tunnelling from their own machines.
- The portal is read-only by design:
  - It calls `squeue`, `sacct`, and `scontrol show job`.
  - It never submits, cancels, or modifies jobs.
- The app is not designed as a public internet service:
- Keep access scoped to your cluster/network policies.
- Prefer SSH forwarding over exposing Streamlit directly.
- Depends on local Slurm CLI tooling and permissions:
  - If `squeue`, `sacct`, or `scontrol` are unavailable or misconfigured, sections may show empty or error outputs.
- Data freshness is cache-based:
  - `@st.cache_data` TTLs reduce scheduler load but can delay updates.
  - Refresh now clears cached data and reruns immediately.
- Parsing depends on Slurm output behavior:
  - JSON is preferred when available; legacy fallback parsers are best-effort.
- Refresh now clears Streamlit data caches for this app session:
  - This is intentional for manual "fetch latest now" behavior.
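The "JSON preferred, legacy fallback" idea can be sketched as follows. The JSON field names below are assumptions for illustration; real `squeue --json` keys vary between Slurm versions, which is exactly why the fallback parsers exist:

```python
import json
import pandas as pd

def squeue_json_to_df(raw: str) -> pd.DataFrame:
    """Flatten JSON queue output into a small DataFrame.
    Field names here are hypothetical; check your Slurm version's schema."""
    records = json.loads(raw).get("jobs", [])
    return pd.DataFrame(
        [
            {
                "JOBID": str(job.get("job_id", "")),
                "NAME": job.get("name", ""),
                "STATE": job.get("job_state", ""),
            }
            for job in records
        ]
    )
```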
Some natural extension points:
- New summary tables:
  - For example, grouping by user, partition, or node: mirror `summarise_live_by_name` with a different `groupby`.
- Additional history views:
  - Configurable `start` for `get_sacct` (e.g. "last 7 days").
- Job detail panels:
  - When clicking a row in QUEUED JOBS (by name), pre-fill the Job inspector with its `SAMPLE JOB ID`.
The current structure (parsers → cached wrappers → summarizers → pages) is meant to keep these additions straightforward.