Shelper is a Go library designed to help with Slurm-related operations, particularly for GPU management in a cluster environment. It provides tools to map GPUs to Slurm jobs and retrieve metadata about those jobs.
Shelper provides functionality to:
- Map GPUs to Slurm jobs running on a host
- Retrieve metadata about Slurm jobs
- Query Slurm controllers (slurmctld) for job information
- Cache Slurm metadata to avoid excessive queries
- Parse Slurm hostlists and GPU allocations
- Retrieve information about running processes and their associated Slurm jobs
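
To illustrate what hostlist parsing involves, here is a minimal sketch that expands a compressed Slurm hostlist such as node[01-03,05],gpu7 into individual hostnames. The helper names expandHostlist and splitTopLevel are hypothetical and are not Shelper's API. The sketch assumes at most one bracket group per entry and plain numeric ranges.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitTopLevel splits s on commas that are not inside square brackets.
func splitTopLevel(s string) []string {
	var parts []string
	depth, start := 0, 0
	for i, r := range s {
		switch r {
		case '[':
			depth++
		case ']':
			depth--
		case ',':
			if depth == 0 {
				parts = append(parts, s[start:i])
				start = i + 1
			}
		}
	}
	return append(parts, s[start:])
}

// expandHostlist expands a simple hostlist into individual hostnames.
// Hypothetical helper: it handles at most one bracket group per entry.
func expandHostlist(list string) ([]string, error) {
	var hosts []string
	for _, entry := range splitTopLevel(list) {
		open := strings.IndexByte(entry, '[')
		if open < 0 {
			hosts = append(hosts, entry) // plain hostname, nothing to expand
			continue
		}
		end := strings.IndexByte(entry, ']')
		if end < open {
			return nil, fmt.Errorf("unbalanced brackets in %q", entry)
		}
		prefix, suffix := entry[:open], entry[end+1:]
		for _, r := range strings.Split(entry[open+1:end], ",") {
			lo, hi, isRange := strings.Cut(r, "-")
			if !isRange {
				hi = lo
			}
			first, err1 := strconv.Atoi(lo)
			last, err2 := strconv.Atoi(hi)
			if err1 != nil || err2 != nil || last < first {
				return nil, fmt.Errorf("bad range %q in %q", r, entry)
			}
			for n := first; n <= last; n++ {
				// Preserve zero padding by reusing the width of the lower bound.
				hosts = append(hosts, fmt.Sprintf("%s%0*d%s", prefix, len(lo), n, suffix))
			}
		}
	}
	return hosts, nil
}

func main() {
	hosts, err := expandHostlist("node[01-03,05],gpu7")
	if err != nil {
		panic(err)
	}
	fmt.Println(hosts)
}
```

Checking whether the local hostname belongs to a job then reduces to a membership test on the expanded slice.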
The core functionality of Shelper is to map GPU devices to the Slurm jobs that
are using them. This is done through the GetGPU2Slurm function, which can:
- Query the Slurm controller for job information
- Use local process information to determine GPU allocations
- Cache results to improve performance
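
The mapping step can be pictured as inverting per-job allocations into a per-GPU view. The sketch below is only an illustration under stated assumptions: the jobAlloc type and the mapGPUsToJobs function are hypothetical, the job IDs are made up, and the code does not reproduce how GetGPU2Slurm gathers its input.

```go
package main

import (
	"fmt"
	"sort"
)

// jobAlloc is a hypothetical record: one Slurm job and the GPU indices
// it holds on the local host.
type jobAlloc struct {
	JobID string
	GPUs  []int
}

// mapGPUsToJobs inverts per-job allocations into a GPU index -> job IDs map.
// A GPU may list several jobs if the cluster allows GPU sharing.
func mapGPUsToJobs(allocs []jobAlloc) map[int][]string {
	byGPU := make(map[int][]string)
	for _, a := range allocs {
		for _, g := range a.GPUs {
			byGPU[g] = append(byGPU[g], a.JobID)
		}
	}
	return byGPU
}

func main() {
	// Illustrative data only; real allocations would come from Slurm
	// or from local process information.
	allocs := []jobAlloc{
		{JobID: "1001", GPUs: []int{0, 1}},
		{JobID: "1002", GPUs: []int{2}},
	}
	byGPU := mapGPUsToJobs(allocs)

	// Sort the GPU indices so the output order is stable.
	gpus := make([]int, 0, len(byGPU))
	for g := range byGPU {
		gpus = append(gpus, g)
	}
	sort.Ints(gpus)
	for _, g := range gpus {
		fmt.Printf("GPU %d -> jobs %v\n", g, byGPU[g])
	}
}
```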
To avoid excessive queries to the Slurm controller, Shelper includes caching functionality:
- Cache Slurm metadata for a configurable duration
- Save and load cache from the local filesystem
- Automatically refresh cache when it expires
- GetGPU2Slurm: Maps GPUs to Slurm jobs
- GetJobMetadata: Retrieves metadata for a specific Slurm job
- GetJob2Pid: Gets a mapping of Slurm job IDs to PIDs
- GetGPU2SlurmFromPIDs: Maps GPUs to Slurm jobs using PID information
- GetSlurmMetadataCache: Retrieves cached Slurm metadata
- SaveSlurmToJSON: Saves Slurm metadata to a JSON file
- AttributeGPU2SlurmMetadata: Populates the GPU to Slurm metadata mapping
- GetHostList: Extracts the host list from job metadata
- HostnameInList: Checks if a hostname is in a Slurm host list
- parseGRES: Parses GPU indices from GRES (Generic Resource) output
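
As an illustration of the GRES parsing step, the sketch below extracts GPU indices from a string of the form gpu:a100:4(IDX:0,2-4). That input format is an assumption about what detailed scontrol job output looks like on a given Slurm version, and the function name gresIndices is hypothetical. It is not the library's parseGRES.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gresIndices extracts the GPU indices from a GRES string that carries an
// "(IDX:...)" list, e.g. "gpu:a100:4(IDX:0,2-4)". Hypothetical helper;
// the input format is assumed, not taken from Shelper.
func gresIndices(gres string) ([]int, error) {
	_, rest, ok := strings.Cut(gres, "(IDX:")
	if !ok {
		return nil, fmt.Errorf("no IDX list in %q", gres)
	}
	list, _, ok := strings.Cut(rest, ")")
	if !ok {
		return nil, fmt.Errorf("unterminated IDX list in %q", gres)
	}
	var idx []int
	for _, part := range strings.Split(list, ",") {
		// Each part is either a single index ("0") or a range ("2-4").
		lo, hi, isRange := strings.Cut(part, "-")
		if !isRange {
			hi = lo
		}
		first, err1 := strconv.Atoi(lo)
		last, err2 := strconv.Atoi(hi)
		if err1 != nil || err2 != nil || last < first {
			return nil, fmt.Errorf("bad index range %q in %q", part, gres)
		}
		for i := first; i <= last; i++ {
			idx = append(idx, i)
		}
	}
	return idx, nil
}

func main() {
	idx, err := gresIndices("gpu:a100:4(IDX:0,2-4)")
	if err != nil {
		panic(err)
	}
	fmt.Println(idx)
}
```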
- Access to Slurm commands (scontrol, squeue)
- Appropriate permissions to read process information
Shelper is licensed under the MIT License.