Releases: SchedMD/slurm
v25.11.3
Changes in 25.11.3
- Fix regression from af2c0bd which caused usercpu and systemcpu to be missing for job steps.
- Fixed issue where RestrictedCoresPerGPU with shared gres are limited to using restricted cores on one job per sharing gres.
- slurmd - Fix regression that could cause thread limits to not be enforced for handling incoming RPCs.
- Fix "sacctmgr show conf" to properly display CommitDelay in seconds instead of as a boolean.
- Fix cron/requeued jobs being incorrectly reported as runaway
- slurmctld - Prevent the double-removal of accounting usage for jobs being requeued that are in the COMPLETED or COMPLETING state.
- When deleting a QOS from the DB, also remove it from partition QOS, AllowQOS and DenyQOS fields.
- Fixed bug that could cause the detected CPU count to be lower than actual available CPU count. This bug could have resulted in the default value for conmgr_threads being lower than the number of available CPUs in sackd, scrun, slurmctld, slurmscriptd, slurmd, slurmstepd, slurmdbd, and slurmrestd when the assigned CPUs are not sequential.
- slurmdbd - Prevent the following slurmdbd.conf options from overriding the default values of any in the list not specified: AllowNoDefAcct, AllResourcesAbsolute, DisableCoordDBD, DisableArchiveCommands.
- salloc/sbatch - Nesting a non-stepmgr salloc or sbatch inside an existing job allocation that enabled the stepmgr will no longer result in the inner job's steps failing to launch.
- Prevent slurmd -G from initializing sack processing thread.
- Added SLURM_CLUSTER_NAME, SLURM_JOB_ACCOUNT and SLURM_JOB_GROUP environment variables when a step is launched.
- slurmctld - Prevent marking external nodes as being unresponsive when reconfiguring if SlurmctldParameters=enable_configless is used.
- Fix potential segfault when attempting to look up the controller address via DNS in configless mode.
- Fix "undefined symbol: gpu_common_underscorify_tolower" when gpu/nrt plugin in use.
- slurmrestd - Avoid memory leak on authentication failures with invalid bearer tokens.
- Fix potential deadlock in _x11_signal_handler() during stepd_cleanup().
- slurmctld - Fix reservations AllowedPartitions logic leading to incorrect purge of valid reservations in some use-cases.
- slurmctld - Avoid persistent connection hangs when enable_async_reply is configured.
- Prevent potential controller segfault when reconfiguring after gres file updates.
- Reparent slurmd to a subcgroup to avoid conflicting with systemd.
- Fix sprio regression not handling comma separated list of jobids.
- slurmctld,slurmd - Fix memory leak when container ID is populated.
- slurmd - Fix P-core detection on processors with varying P-core frequencies and in cpuset-restricted environments.
- namespace/linux - add disable_bpf_token option.
- slurmctld - Avoid expedited requeue triggering a job to requeue when job exit code was zero.
- slurmctld - Avoid expedited requeue of jobs while waiting for job epilog script to complete.
- slurmctld - Prevent removing cloud nodes from the topology when putting them in the POWERED_DOWN state if they are present in topology.conf or topology.yaml and their node configuration did not specify the Topology option.
- interfaces/topology - When modifying a nodes topology with the Topology option in slurm.conf or the slurmd --conf Topology, change the topology to fully match the new topology.
- slurmctld - Allow changes to topology.conf or topology.yaml, and slurm.conf node configuration Topology option to take effect on a reconfigure or restart when power saving is enabled.
- slurmctld - Prevent backfill from combining future timeslots if they have different license reservations.
- Fix CLOUD nodes infrequently becoming FUTURE on slurmctld restart.
- slurmdbd - Avoid race condition that could cause a hang during shutdown when incoming connection fails.
- slurmdbd - Avoid crash during shutdown due to sacctmgr shutdown request.
- Fix slurmctld assertion when using "enable_async_reply" and certmgr is used for a TLS enabled cluster.
- Fix potential slurmd process leak when handling --get-user-env.
- slurmctld - Avoid race condition that could cause the StateSaveLocation updates to be missed during shutdown.
- slurmctld - Avoid race condition that could cause slurmctld to hang during shutdown before updating StateSaveLocation.
- slurmctld - Avoid race condition that could cause shutdown to wait on the wrong thread.
- Fix handling of 0 node test allocations in topology/block.
- slurmctld - In backfill, prevent unnecessarily testing jobs at future times using the select plugin if it is guaranteed to fail.
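The CPU-count fix in the list above concerns hosts whose assigned CPUs are not sequential. As a rough illustration only, the sketch below counts the CPUs named in a Linux-style CPU list string, so that gaps between IDs do not affect the total. This is not Slurm's code: the function name `count_cpus` and the sample list are hypothetical.

```python
def count_cpus(cpu_list: str) -> int:
    """Count CPUs in a Linux-style list such as "0-3,8,10-11".

    Hypothetical helper for illustration; not part of Slurm.
    """
    total = 0
    for part in cpu_list.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty input or stray commas
        if "-" in part:
            # Inclusive range "start-end".
            start, end = (int(x) for x in part.split("-", 1))
            total += end - start + 1
        else:
            int(part)  # raises ValueError on malformed IDs
            total += 1
    return total


if __name__ == "__main__":
    # IDs 0-3, 8, 10 and 11 are seven CPUs, although the highest ID is 11.
    print(count_cpus("0-3,8,10-11"))
```

Counting the listed IDs, rather than inferring a count from the lowest or highest ID, is what keeps the result correct when the set has gaps.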
v25.11.2
Changes in 25.11.2
- slurmstepd - Revert regression that would apply job environment to container runtime invocation.
- Fix issue where reservations may start while required GRES resources are still being used by jobs.
- Fix slurmctld segfault when using --consolidate-segments.
- Expose slurm.CONSOLIDATE_SEGMENTS flag in lua.
- Expose the job record's segment_size in lua.
- job_submit/lua - Expose the job_desc's segment_size in lua.
- Prevent PMIx 5.0.8 and 5.0.9 clients from hanging when connecting to the PMIx server.
- Clarify warning when BPF tokens are not supported.
- slurmctld - Ensure we close already accepted conn before RPC flush check
- slurmctld - Fix rpc_queue feature causing statesave corruption while shutdown
- slurmctld - Ensure backfill has finished before saving state.
- slurmctld - Ensure main scheduler has finished before saving state.
- slurmctld - Fix error message while shutting down and state cannot be saved.
- Fix slurmctld double free that occurs when purging array jobs from memory only when using the topology/block plugin.
- Fix steps being rejected inside a batch job when using --cpus-per-task and --mem-per-cpu, and the job was submitted to multiple partitions, but not all of them had the same MaxMemPerCPU limit in place.
- slurmctld - Fix crash after failed reconfiguration while running jobs and priority/multifactor enabled.
- slurmctld - Fix jobs' QOS/association usage leading to potential underflow errors after a failed reconfiguration attempt.
- Guess NodeName with gethostname instead of gethostname_short
- Fix allowing job submissions when EnforcePartLimits=NO and the requested minimum number of nodes exceeds the total nodes in the specified partition(s).
- Fix double unlock issue in _slurm_rpc_job_sbcast_cred()
- srun - fix bug where some input/output/error filename format identifiers were not expanded.
- Fix detecting restricted cores with SlurmdSpecOverride in nodes with more than one socket.
- slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
- Fix average calculation in latency timers to show more accurate timing logs.
v25.05.6
Changes in 25.05.6
- Updating a job's qos will always replace the previous timelimit with the new qos' timelimit, unless another time limit is explicitly specified in the update command.
- slurmctld - Prevent memory corruption when fanning out messages to the slurmds if TreeWidth is greater than or equal to 46341 and the number of nodes in the cluster is greater than or equal to (TreeWidth + 1).
- Fix slurmctld potential deadlock when trying to schedule jobs starting many years in the future. Slurm only supports one year time limits.
- Fix accounting for memory on steps without pids, like the extern step, which caused them to be killed if OvermemoryKill was set.
- slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.42/job/submit'.
- slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.43/job/submit'.
- slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.44/job/submit'.
- slurmctld - Fixed segfault when running configless and a malformed REQUEST_CONFIG RPC is received.
- slurmctld - Fixed segfault when using newly added remote licenses.
- Fix memory leak on slurmctld for jobs that use --exclusive=topo
- Fix double unlock issue in _slurm_rpc_job_sbcast_cred()
- slurmctld/slurmdbd - Prevent segfaulting if a persistent connection closes right before reconfiguring or shutting down.
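The TreeWidth threshold of 46341 in the memory-corruption entry above happens to be the smallest integer whose square no longer fits in a signed 32-bit integer. The changelog does not state the root cause, so any link to integer overflow is an assumption. The sketch below only checks the arithmetic, and the helper name is hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2147483647


def square_fits_int32(n: int) -> bool:
    """Return True if n * n is representable as a signed 32-bit integer."""
    return n * n <= INT32_MAX


if __name__ == "__main__":
    print(46340 * 46340, square_fits_int32(46340))  # 2147395600 True
    print(46341 * 46341, square_fits_int32(46341))  # 2147488281 False
```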
v25.11.1
Changes in 25.11.1
- data_parser/v0.0.41 - Prevent memory leaks when freeing parsed lists.
- data_parser/v0.0.42 - Prevent memory leaks when freeing parsed lists.
- data_parser/v0.0.43 - Prevent memory leaks when freeing parsed lists.
- data_parser/v0.0.44 - Prevent memory leaks when freeing parsed lists.
- slurmctld - Prevent a fatal when min_exempt_priority is not the last option listed in PreemptParameters.
- Updating a job's qos will always replace the previous timelimit with the new qos' timelimit, unless another time limit is explicitly specified in the update command.
- When debugflags=script is set in slurm.conf, Lua runtime error message will be logged with backtrace.
- slurmctld - Prevent memory corruption when fanning out messages to the slurmds if TreeWidth is greater than or equal to 46341 and the number of nodes in the cluster is greater than or equal to (TreeWidth + 1).
- When GrpTRES and MaxTRESPU are set on different QOSes and both QOSes are applied to a job, ensure that both limits are honored.
- Fix issue where a cli command or process could get stuck indefinitely when trying to retrieve a slurm.conf from slurmctld.
- Fix slurmctld potential deadlock when trying to schedule jobs starting many years in the future. Slurm only supports one year time limits.
- Fix pam_slurm_adopt when using namespace/linux plugin.
- topology/tree - Prevent overflow error when calculating fanout depth.
- The state string for nodes in the MIXED+FAIL state will now appear as "FAILING" rather than just "FAIL", similar to what is already done for nodes in the ALLOCATED+FAIL state.
- slurmctld - Prevent a divide by zero crash by fataling if the following SlurmctldParameters have a value of less than or equal to 0: rl_table_size, rl_bucket_size, rl_refill_rate, and rl_refill_period.
- Fix missing updates to reservation TRES and accounting when node(s) replaced due to REPLACE or REPLACE_DOWN flags.
- slurmctld - Cancel interactive job if prolog RPC never reaches its receiver.
- slurmctld - Cancel interactive jobs that never ran the prolog in the purge jobs logic.
- Fix accounting for memory on steps without pids, like the extern step, which caused them to be killed if OvermemoryKill was set.
- NO_NORMAL_ALL will only be printed if all NO_NORMAL_* flags are set.
- slurmctld - Prevent the controller from believing it has a job's federation cluster lock when it does not.
- Fix jobs incorrectly stuck waiting for resources when launched with specific client flag combinations containing "--hint=nomultithread".
- Fix allocated licenses still showing after removing all allocated licenses.
- accounting_storage/mysql - Disallow creating users if requested user list is empty or usernames are empty strings.
- slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.42/job/submit'.
- slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.43/job/submit'.
- slurmrestd - Revert tagging .script field as deprecated in 'POST /slurm/v0.0.44/job/submit'.
- slurmrestd - Revert regression that changed the error from "Authentication failure" to "Authentication does not apply to request" when a HTTP request lacks any authentication credentials.
- When a job requests multiple partitions and cannot run in one of them due to topology, allow the main scheduler to evaluate jobs in the other requested partitions.
- slurmctld - Acquire the node write lock instead of the node read lock when querying 'GET /metrics/nodes' and 'GET /metrics/partitions' endpoints.
- slurmctld - Fixed segfault when running configless and a malformed REQUEST_CONFIG RPC is received.
- Remove error output for missing optional spank plugin.
- slurmctld - when unable to schedule a job with preferred node features, don't exclude the partition from further scheduling attempts in the same iteration.
- Fix issue with RestrictedCoresPerGPU with shared gres.
- Fix rpmbuild --with libcurl option.
- Add new JobAcctGatherParams=no_file_cache to change how memory usage (RSS) is reported when using cgroup/v2. With this flag set we will subtract active_file and inactive_file from the value reported in memory.current to avoid counting the file cache. memory.peak will then not be used to get the MaxRSS and getting memory spikes will depend on the JobAcctGatherFrequency parameter.
- namespace/linux - fix bug that could leave defunct processes in the jobs namespace.
- namespace/linux - kill and reap the namespace process during job teardown.
- namespace/linux - Fix issue with user_ns_script that may result in STDIN closing, which may result in 'Unable to receive "ok ack"' error on slurmstepd or other undefined behavior.
- Fix error reading /proc/0/* when calling the api outside the step namespace.
- slurmctld - Fixed segfault when using newly added remote licenses.
- Fix SIGCHLD not being sent to tasks.
- bitmap2node_name() is not cleaned up properly when reservation logging is enabled.
- Fix issue with jobs running on slurmd's with version 25.05.x or older getting aborted when slurmd re-registers with slurmctld.
- Fix memory leak on slurmctld for jobs that use --exclusive=topo
- Prevent jobs that cannot fit in the reservation's time limit from being attracted to a magnetic reservation.
- Fix slurmstepd segfault for older versioned batch jobs (25.05 and older) submitted without using -o/--output on submission.
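The JobAcctGatherParams=no_file_cache entry in the list above describes subtracting active_file and inactive_file from the value reported in memory.current. A minimal sketch of that arithmetic follows, assuming the usual cgroup v2 `memory.stat` layout of one "key value" pair per line. The function name and sample numbers are hypothetical, and this is not Slurm's implementation.

```python
def rss_without_file_cache(memory_current: int, memory_stat_text: str) -> int:
    """Subtract file-cache pages from a cgroup v2 memory.current reading.

    memory_stat_text is the content of memory.stat ("key value" per line).
    """
    stats = {}
    for line in memory_stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    file_cache = stats.get("active_file", 0) + stats.get("inactive_file", 0)
    # Clamping at zero is an assumption of this sketch, not documented behavior.
    return max(memory_current - file_cache, 0)


if __name__ == "__main__":
    sample_stat = "anon 100\nactive_file 30\ninactive_file 20\n"
    print(rss_without_file_cache(200, sample_stat))  # 200 - (30 + 20) = 150
```

On a real node the two inputs would come from the step's cgroup directory (`memory.current` and `memory.stat`); the exact path depends on the local cgroup configuration.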
v25.05.5
Changes in 25.05.5
- Fix slurmdbd error triggered by "sreport user topusage" when trying to get data from monthly usage tables.
- scontrol - fix regression where "scontrol update jobid= qos=" was not considered a valid command.
- slurmstepd - Prevent the slurmstepd from segfaulting if the switch/hpe_slingshot plugin is enabled and SwitchParameters is not specified.
- Avoid deadlock that occurs on a failed reconfigure when there are issues with slurmdbd connections and AccountingStoreFlags is set with job_script or job_env.
- slurmctld - Avoid regression that caused POSIX signals to be ignored after quiesce timeout triggers.
- Fix potential file descriptor leak to child processes.
- slurmctld - Prevent a fatal when min_exempt_priority is not the last option listed in PreemptParameters.
v25.11.0
Changes in 25.11.0
- namespace/linux - move directory creation for bind mounts to before the init script is called.
- namespace/linux - add SLURM_JOB_MEM to script environments when able.
- Fix an error when printing sdiag rpc stats in json format when hostlists strings are too long.
- Add --no-trunc argument to sdiag. That will output long hostlists that default to being truncated to 80 characters.
- Add infinite (-1) layer support to HRes mode 3.
- Fix ESLURM_RETRY_EVAL handling in common_topo_choose_nodes().
- Fix HRes MODE_3 when using with --gpus.
- Fix enforcing of MODE_3 with --distribution=arbitrary.
- slurmrestd - Fix regression that caused rejected HTTP requests to not include a descriptive error message.
- slurmrestd - Fix regression that caused requests for unknown or unsupported URL paths to not include a descriptive error.
v25.11.0rc2
Changes in 25.11.0rc2
- Avoid deadlock that occurs on a failed reconfigure when there are issues with slurmdbd connections and AccountingStoreFlags is set with job_script or job_env.
- Use rename() to atomically replace the heartbeat state file.
- scrun - Fix memory leak from invalid incoming messages.
- scrun - Avoid regression that would cause shutdown to hang.
- scrun - Fix race condition that could cause scrun to crash during shutdown.
- Set SLURM_JOB_SELINUX_CONTEXT in Prolog, Epilog, PrologSlurmctld, and EpilogSlurmctld with the selinux_context.
- Avoid printing "JobID=Invalid" or "SLUID=Invalid" to the logs. Print both when both are set, otherwise print whichever is set.
- slurmctld - Avoid regression that caused POSIX signals to be ignored after quiesce timeout triggers.
- Fix potential file descriptor leak to child processes.
- Add expediting state to job metrics.
- Fix federated jobs not getting SLUID set.
- Fix memory corruption on federated sibling submissions.
- Add SLURM_JOB_QOS to PrologSlurmctld/EpilogSlurmctld environment.
- namespace/linux - fix potential error with chown at job startup.
- Fix use after free in namespace/linux on an error condition.
- namespace/linux - fix potential invalid close() of file descriptors.
- slurmctld,slurmd - Reject incoming RPC connections with TLS required error to help misconfigured clients.
- Add requeue_delay option to SchedulerParameters.
- RPCs that are keyed by SLUID no longer fall-back to looking up the job by JobId. This should avoid (rare) edge cases where a node reconnects to the cluster and attempts to cancel requeued jobs.
- Add %S as a filename replacement pattern for SLUID.
- Add %r as a filename replacement pattern for restart count for batch jobs.
- Add topology.yaml manpage to debian packages.
- Add GET /metrics endpoint to list all metric-related endpoints.
- Export SLURM_JOB_SLUID in the environment for Prolog/Epilog. Remove the undocumented SLURM_SLUID environment variable.
- Export SLURM_JOB_SLUID in the environment for PrologSlurmctld/EpilogSlurmctld.
- namespace/linux - Default to 10 seconds for clone_ns_script_wait and clone_ns_epilog_wait if their values are not configured.
- namespace/linux - The namespace/linux plugin no longer reads job_container.conf. Instead it parses namespace.yaml.
- Prevent potential segfault when providing hostlist_push() with an incorrectly formatted hostlist string.
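The heartbeat entry in the list above relies on rename() to replace a state file atomically. The general pattern can be sketched as follows: write the new content to a temporary file in the same directory, then rename it over the target. This is an illustration under stated assumptions, not Slurm's code; `atomic_write` is a hypothetical helper.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the content to disk before renaming
        os.replace(tmp_path, path)  # rename over the target
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "heartbeat")
        atomic_write(target, b"first")
        atomic_write(target, b"second")
        with open(target, "rb") as handle:
            print(handle.read())
```

A reader that opens the file at any moment sees either the previous complete content or the new complete content, never a partially written file.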
v25.11.0rc1
Changes in 25.11.0rc1
- slurmd - Avoid segfault during startup when /sys/ is not mounted correctly and gpu/nvidia plugin is configured.
- Fix compilation when building with SLURMSTEPD_MEMCHECK == 1.
- slurmrestd - Catch use-after-free bugs for closed connections.
- Fix building with libyaml in a non-standard location.
- Suppress false error messages when system.slice cgroup is not present and EnableControllers is set.
- Adjust the OOM score of slurmstepd processes from -1000 to -999 to make them killable. This change addresses cases where slurmstepd consumes more memory than what is allocated to the job's cgroup. Previously, being unkillable could cause the process to get stuck during memory allocation, fully occupy a CPU core, and flood the kernel logs.
- slurmrestd - Prevent crash when an empty request is submitted to 'POST slurm/*/job/submit' endpoints.
- Allow tres-bind on steps with only one task.
- Fix INVALID nodes going DRAIN after a slurmd restart without any gres.conf modification.
- slurmrestd - Prevent potential crash when using the 'POST /slurmdb/*/accounts_association' endpoints.
- data_parser/v0.0.44 - Make field 'association_condition' required to prevent a slurmrestd crash when the field is not provided. This affects the following endpoints: 'POST /slurmdb/v0.0.44/accounts_association/'
- Add no_tag as a possible parameter to environment SLURMRESTD_YAML or environment SLURM_YAML to disable dumping YAML datatype !!tags for CLI commands supporting --yaml output.
- Fix non-fatal "No such file or directory" build errors when building on *EL systems.
- Slurmd will now unload gpu plugin after configuration is over, unless acct_gather_energy/gpu is set.
- Improve stepd termination messages for non-zero return codes.
- Fix double-deducting licenses when recovering COMPLETING jobs from state.
- Remove AccountingStorageUser option from slurm.conf.
- data_parser/v0.0.44 - Add five reservation flags that were missing from the RESERVATION_FLAGS array. This affects scontrol show reservation --{json|yaml} and the following REST API endpoints:
- 'GET /slurm/v0.0.44/reservation/{reservation_name}'
- 'GET /slurm/v0.0.44/reservations'
- 'POST /slurm/v0.0.44/reservation'
- 'POST /slurm/v0.0.44/reservations'
- Fix Slurm components that depend on libjson-c not setting RUNPATH when requested, which could cause them to fail at runtime.
- Fix Slurm components that depend on libyaml not setting RUNPATH when requested, which could cause them to fail at runtime.
- slurm.spec - Remove duplicated --with-freeipmi
- slurmrestd - Fix crash for /reservations endpoint when a valid reservation_desc_msg body with a specified partition but no node_list string array is submitted.
- Fix memory leak in configless mode when resolving the slurmctld address.
- slurmrestd - Avoid segfault with -d list args when no data_parser plugins can be read by process which requires removal of plugins from LD_LIBRARY_PATH or RPATH or some other administrative action as they are always installed with Slurm.
- Change pam_slurm_adopt install location for the Debian package to the multiarch location.
- Support conversion of JSON/YAML dictionary/object to list/array where automatic type inferencing is supported.
- slurmrestd - Avoid giving database connection hex address in warnings when a slurmdbd query "found nothing". All 'GET /slurmdb/v0.0.*/*' endpoints are affected by this change.
- slurmrestd - Avoid giving database connection hex address in warnings when a slurmdbd query "reports nothing changed". All 'GET /slurmdb/v0.0.*/*' endpoints are affected by this change.
- slurmrestd - Avoid giving database connection hex address in warnings when a slurmdbd query "failed". All 'GET /slurmdb/v0.0.*/*' endpoints are affected by this change.
- slurmrestd - Avoid giving database connection hex address in errors when a slurmdbd query "failed". All 'GET /slurmdb/v0.0.*/*' endpoints are affected by this change.
- slurmrestd - Avoid giving database connection hex address in errors when a slurmdbd query "failed". All 'POST /slurmdb/v0.0.*/*' endpoints are affected by this change.
- slurmrestd - Avoid giving database connection hex address in errors when a slurmdbd query "failed" to commit changes. All 'POST /slurmdb/v0.0.*/*' endpoints are affected by this change.
- Add log_user() to be callable by all Lua scripts.
- slurmctld - Relock the controller's pidfile on a reconfigure.
- Add --with cgroupv2 option to slurm.spec to assist building rpms on systems that require cgroupv2 support.
- Pass existing cluster id from slurmctld to slurmdbd when registering the cluster with accounting for the first time
- scontrol - Improve error message when attempting to perform an invalid state update on a node, e.g. RESUME a node that is currently in a state from which resuming is not possible.
- Set exit_code when "scontrol listjobs/liststeps" does not find jobs/steps.
- Fix rejecting some valid jobs that requested --sockets-per-node > 1 and --gres-flags=enforce-binding.
- For some jobs that use --gres-flags=enforce-binding, allocate the lowest numbered socket first. The previous behavior allocated the highest numbered socket first.
- Fix some situations where --gres-flags=enforce-binding was not respected for jobs that request --gpus-per-task and multiple nodes.
- sacctmgr - Catch and reject attempts to pass invalid account flags= arguments.
- Avoid returning an "INVALID" Accounting flag in JSON or YAML output which was never a valid flag.
- sacctmgr - Catch and reject attempts to pass invalid Cluster flags= arguments.
- Avoid returning an "INVALID" Cluster flag in JSON or YAML output which was never a valid flag.
- Avoid returning an "INVALID" Association flag in JSON or YAML output which was never a valid flag.
- sacctmgr - Catch and reject attempts to pass invalid QOS flags.
- Avoid returning an "INVALID" QOS flag in JSON or YAML output which was never a valid flag.
- slurmdbd.conf can have permissions of 640 or 600.
- Avoid possible race condition that could cause a hang if process should shutdown before all conmgr worker threads have started.
- Add HttpParserType parameter to slurm.conf
- Add UrlParserType parameter to slurm.conf
- Add http_parser/libhttp_parser plugin.
- slurmrestd - Switch to using http_parser/libhttp_parser plugin for http processing.
- topology/block - Add new TopologyParam=BlockAsNodeRank option to reorder nodes based on block layout. This can be useful if the naming convention for the nodes does not naturally map to the network topology.
- Fix memory leak in acct_gather_profile_influxdb.c.
- Allow --segment to be bigger than base block size.
- sacctmgr - Allow an operator to alter the allocated TRES of a job.
- sacctmgr - add new option to set fixed runaway jobs as FAILED or COMPLETED.
- Job option --hint is now mutually exclusive with both --cores-per-socket and --sockets-per-node.
- Add HealthCheckNodeState=START_ONLY option.
- Add a new CliFilterParameters=cli_filter_lua_path= to slurm.conf, enabling the configuration of an absolute path to the cli_filter.lua script when cli_filter/lua is configured.
- Allow removing a user's default association when AllowNoDefAcct=yes.
- scontrol - Fix "KillOnInvalidDependent" typo in show jobs output.
- scontrol - Use whitespace to separate all key-value pairs in show jobs output. Previously, commas were used between job flags.
- scontrol - Fix GresAllowTaskSharing not always appearing in show jobs output.
- slurmrestd - Log URLs that fail to parse under debugflag=data instead of as errors or debug5, since invalid user-provided input is not directly an error of the slurmrestd daemon itself.
- Fix use-cases incorrectly rejecting job requests when MaxCPUsPer[Socket|Node] applied and CPUSpecList/CoreSpecCount configured.
- slurmrestd - Specify listening on UNIX socket by giving the "unix://" prefix instead of "unix:" prefix for the URL scheme.
- Don't subtract 128 from sacct returned exit codes for the codes above 128.
- Fix issue resolving socket max segment size (MSS) via kernel to more accurately size buffered communications.
- mpi/pmix - Restore the ability to follow symlinks on PMIx cli/lib temporal directory creation when considered trusted (i.e. SlurmdSpoolDir, TmpFS, PMIxCliTmpDirBase). This ability was lost in Slurm 23.02.6.
- Invalid use of -= or += aborts sacctmgr commands with an error.
- Add SLURM_JOB_SEGMENT_SIZE environment variable to salloc and srun job environments when --segment is set.
- Remove the trailing space after AdminComment, SystemComment, Comment, and Extra fields in scontrol show jobs output.
- slurmdbd - Make sure locks are appropriate when modifying or removing associations while filtering by QOS.
- defaultqos is not set when loaded from dump file
- Add display for default qos when listing users and not specifying "withassoc"
- slurmctld - Always close rejected incoming RPC connections to avoid reading another incoming RPC request.
- slurmctld - Log runtime errors from global section of Lua scripts during loading.
- Allow gpu shared gres to use RestrictedCoresPerGPU.
- slurmrestd - Fix memory leak that happened when submitting a request body containing the "warnings", "errors", or "meta" field. This affects the following endpoints: 'POST /slurmdb/v0.0.4*/qos'
- slurmrestd - Prevent triggering a fatal abort when parsing a non-empty group id string. This affects all endpoints with request bodies containing openapi_meta_client group field. It also affects the following endpoints:
- 'GET /slurmdb/v0.0.4*/jobs'
- 'POST /slurm/v0.0.4*/job/submit'
- 'POST /slurm/v0.0.4*/job/{job_id}'
- 'POST /slurm/v0.0.4*/job/allocate'
- Add the ability to gate access to a reservation by QOS.
- Add the ability to gate access to a reservation by Partition.
- scontrol - Add "SubmitLine" field to show job subcommand's output.
- slurmctld - Prevent an invalid read and a possible crash by rejectin...
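One entry in the list above states that slurmdbd.conf can have permissions of 640 or 600. A quick way to check a file against those two modes is sketched below. The helper names are hypothetical, and the check mirrors only the sentence in the changelog, not any other validation slurmdbd may perform.

```python
import os
import stat

# Permission modes named in the changelog entry.
ALLOWED_MODES = {0o600, 0o640}


def mode_is_allowed(st_mode: int) -> bool:
    """Return True if the permission bits of st_mode are 600 or 640."""
    return stat.S_IMODE(st_mode) in ALLOWED_MODES


def file_mode_is_allowed(path: str) -> bool:
    """Apply the same check to a file on disk."""
    return mode_is_allowed(os.stat(path).st_mode)


if __name__ == "__main__":
    # 0o100640 is a regular file with rw-r----- permissions.
    print(mode_is_allowed(0o100640))  # True
    print(mode_is_allowed(0o100644))  # False
```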
v25.05.4
Changes in 25.05.4
- scontrol/sacct/sacctmgr - Prevent hitting a double free when slurm is compiled with --enable-memory-leak-debug and using the --json or --yaml options.
- Prevent possible truncation of the CPU_IDs list in "scontrol --details show job" on high-core-count systems.
- slurmd - Fix potential memory leak when incoming RPC is rejected or fails to unpack successfully.
- Demote certain plugin loading "error" messages to "debug" messages. This prevents unnecessary errors from being logged when slurmrestd tries to load the tls/s2n plugin to handle https requests. Failures to load any other plugins will still result in informative "error" messages being logged.
- Respect time_min_as_soft_limit when calculating the projected job end time.
- Fix a regression added in 25.05.2 that broke compatibility with PMIx v2.x through v3.1.0rc1.
- slurmrestd - Fix a bug in the "GET /slurm/v0.0.4*/node/{node_name}" endpoint where the node's partitions field would be incorrectly populated.
- Fix regression that caused slurmctld to wait forever on shutdown until all powersave scripts completed. Now the slurmctld waits up to 10 seconds, as documented.
- Improve consistency of invalid node name errors.
- Prevent potential memory corruption while forwarding messages that require addresses to be packed.
- slurmctld - Increased the default maximum number of incoming connections from 50 to 512 as configured in slurm.conf with SlurmctldParameters=conmgr_max_connections=512 to reduce amount of connections getting deferred responses.
- Avoid allowing one extra connection above the configured conmgr_max_connections limit.
- Docs - Change man page footer to display the current Slurm release instead of the last-changed "Month Year".
- Docs - Remove "Last modified" dates from HTML documentation.
- slurmstepd - Fix deadlock when PMIx receives an event during step termination, which caused stuck stepd processes after job completion.
v25.05.3
Changes in 25.05.3
- slurmctld.service - Set LimitMEMLOCK=infinity by default to avoid slurmctld crashes due to default for locked memory being too low.
- slurmdbd.service - Set LimitMEMLOCK=infinity by default to avoid slurmdbd crashes due to default for locked memory being too low.
- slurmrestd.service - Set LimitMEMLOCK=infinity by default to avoid slurmrestd crashes due to default for locked memory being too low.
- Fix a segfault in the slurmctld caused by invalid core affinity for GPUs on a node.
- Fix a node not being set to the invalid state when GPU core affinity is invalid.
- A cluster will start the MaxJobCount of jobs and not one less.
- Allow QOS usage to be purged and optionally archived as part of a Usage purge and optional archive.
- Fix slurmctld crash caused by accessing job_desc.assoc_qos in job_submit.lua for an association that doesn't exist.
- Fix slurmctld segfault when SIGUSR2 is received early and jobcomp plugin is enabled.
- Fix use-cases incorrectly rejecting job requests when MaxCPUsPer[Socket|Node] applied and CPUSpecList/CoreSpecCount configured.
- tls/s2n - Fix heterogeneous jobs failing to run in a TLS enabled environment.
- sbatch - Fix a regression where SLURM_NETWORK would not be exported for non-Cray systems when using --network.
- REGEXP_REPLACE() was not supported before MySQL 8.0.4 and MariaDB 10, and the regex syntax used previously was not supported for both MySQL and MariaDB (not all POSIX syntax is supported in both)
- fatal() if the SQL server does not support REGEXP_REPLACE(). This was introduced in MySQL 8.0.4 or MariaDB 10.0.5.
- Pass environment variables to container when using Apptainer/Singularity OCI runtimes.
- slurmscriptd,slurmstepd - Fix use-after-free issue with the "ident" string when logging to syslog.
- Fix bug where the backfill scheduler changed the specified --time of a job and incorrectly reset it to --time-min.
- Prevent healthy nodes being marked as unresponsive due to forwarding message timeouts increasing as the tree is traversed. The issue occurred if Slurm was running with a mix of 24.05- and 24.11+ slurmds. This only fixes 25.05+ slurmds.
- Fix crash while using the wckeys rest endpoint.
- Fix cases of job updates incorrectly rejected when specifying modifications on fields unrelated to tasks computation (i.e. changing JobName).
- slurmrestd - Prevent triggering a fatal abort when parsing a non-empty group id string by replacing it with an error. This affects all endpoints with request bodies containing openapi_meta_client group field. It also affects the following endpoints: 'GET /slurmdb/v0.0.4[1-3]/jobs' 'POST /slurm/v0.0.4[1-3]/job/submit' 'POST /slurm/v0.0.4[1-3]/job/{job_id}' 'POST /slurm/v0.0.4[1-3]/job/allocate'
- slurmrestd - Fix memory leak that happened when submitting a request body containing the meta.plugin.accounting_storage field.
- slurmrestd - Fix memory leak that happened when submitting a request body containing the "warnings", "errors", or "meta" field. This affects the following endpoints: 'POST /slurmdb/v0.0.4*/qos'
- slurmctld - Fix how gres with cores or a type defined are selected to prevent jobs not using reservations from being allocated reserved gres and vice versa.
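The LimitMEMLOCK=infinity entries in this release's list set the locked-memory limit in the slurmctld, slurmdbd and slurmrestd service units. To see which limit a process actually received, one option is to query RLIMIT_MEMLOCK from inside that process. The sketch below does this for the current Python process on a Unix system; the helper name is hypothetical and the snippet is not part of Slurm.

```python
import resource


def memlock_is_unlimited() -> bool:
    """Return True if the soft locked-memory limit of this process is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft:", soft, "hard:", hard, "unlimited:", memlock_is_unlimited())
```

The result reflects the limits of the process that runs the snippet, so it only says something about a daemon when it is run in the same environment, for example through the same service unit settings.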