Conversation
Good afternoon @cgetzen,
This is correct, and is a known limitation of Slurm-bridge that the Slinky team is actively working to resolve. Setting Presently, the Slinky team is working on integrating DRA capabilities into Slurm-bridge. In doing so, we will have the capability to accurately de-conflict the resource requirements of Kubernetes and Slurm workloads for both CPUs and GPUs. This capability will enable multiple Kubernetes or Kubernetes/Slurm workloads on a node, without the degradation in the end-user experience that would be provided through the use of the OverSubscribe configuration parameter. Additionally, this should enable "group workloads" to take advantage of this capability. At the time that these integrations are complete, the ability to enable node-packing/sharing in Slurm will be exposed on a system level. However, we do not at this time intend to expose the shared policy via an annotation. Please let me know if you have any further questions on the matter. Best regards, |
|
I appreciate the detailed response. This PR had two errors; I have updated the branch accordingly.

@vivian-hafener, DRAExtendedResources are still in alpha. I agree that it's important for all code paths to support bin-packing, and I believe this PR takes an incremental step by safely enabling the non-DRA paths that are already usable in production. Does the Slinky team plan to add (Kubernetes-only) bin-packing support for the non-DRA code paths? If so, I am happy to maintain this branch until that feature is developed. Otherwise, it may be worth reconsidering this PR with the fixes I have added.

Non-DRA, Kubernetes-only bin-packing is needed for our use case: a cluster where researchers use Slurm to schedule whole-node/multi-node workloads, and SREs use Kubernetes to schedule autoscaling inference workloads consuming single GPUs.

Thanks for your work on this project!
Good afternoon @cgetzen,

I apologize for closing your PR prematurely. When the Slinky team implements node sharing with the CPU DRA driver, we intend to fully drop our existing exclusive-allocation limitation. After the integration of Slurm-bridge with DRA-Driver-CPU, I will re-evaluate this PR and discuss it with the rest of the team. I think that adding an annotation as you have done here may indeed make sense.

Best regards,
Summary
Problem
Slurm-bridge does not support colocating multiple pods on a single multi-GPU node, resulting in underutilization when workloads require fewer GPUs than the node provides.
Solution
This adds an optional workload annotation, `slurmjob.slinky.slurm.net/shared`, accepting a subset of Slurm's shared policy values (`none`, `user`) on workloads that have a 1:1 relationship between Slurm jobs and pods. This excludes PodGroup and LeaderWorkerSet resources. The admission controller ensures correctness.
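As a minimal sketch of what that admission check could look like (the annotation key and allowed values are from this PR; the function name and structure are illustrative, not slurm-bridge's actual webhook code):

```go
package main

import "fmt"

// Annotation key proposed by this PR.
const sharedAnnotation = "slurmjob.slinky.slurm.net/shared"

// validateSharedAnnotation rejects values outside the supported subset of
// Slurm's shared policy ("none", "user"). A missing annotation is valid:
// the feature is opt-in and absence keeps today's default behavior.
// (Hypothetical helper; the real admission controller does more than this.)
func validateSharedAnnotation(annotations map[string]string) error {
	val, ok := annotations[sharedAnnotation]
	if !ok {
		return nil
	}
	switch val {
	case "none", "user":
		return nil
	default:
		return fmt.Errorf("unsupported value %q for %s: must be \"none\" or \"user\"",
			val, sharedAnnotation)
	}
}

func main() {
	for _, v := range []string{"none", "user", "exclusive"} {
		err := validateSharedAnnotation(map[string]string{sharedAnnotation: v})
		fmt.Printf("%s -> valid=%v\n", v, err == nil)
	}
}
```

Rejecting unknown values at admission time, rather than at job submission, surfaces the error to the user before the pod ever reaches the scheduler.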
The scheduler then applies the `shared` setting when creating the Slurm job.
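For illustration, a single-pod workload opting in might look like the following (the annotation key and values are from this PR; the surrounding Pod spec, names, and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker                      # hypothetical name
  annotations:
    slurmjob.slinky.slurm.net/shared: "user"  # or "none"; omit to keep default behavior
spec:
  containers:
    - name: worker
      image: registry.example.com/inference:latest  # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1   # consumes one GPU; remaining GPUs stay schedulable
```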
Limitations
Allowing group workloads to use the shared annotation is out of scope.
Group workloads use a single placeholder job for multiple pods with a fixed node count and one-node-per-pod assignment. Allowing `shared` on them would require supporting Slurm packing (fewer nodes than pods), which would require changes to `PostFilter`, the `submitJob` node count, and `annotatePodsWithNodes`.
Using group workloads with DRA poses additional challenges. Slurm-bridge currently assumes one pod per node per job: PreBind is called per pod with `(pod, nodeName)`, and `GetResources(ctx, pod, nodeName)` returns the job's allocation on that node from Slurm's NodeResourceLayout. One ResourceClaim is created per pod for that full allocation. With multiple pods on the same node, each pod should only receive a portion of the job's allocation.
Breaking Changes
All existing behavior is maintained by default. Only workloads that opt in to using `slurmjob.slinky.slurm.net/shared` are affected.
Testing Notes
Unit tests have been added, and manual tests have been performed to confirm scheduling placement.
Additional Context