
feat: shared annotation #14

Open
cgetzen wants to merge 7 commits into SlinkyProject:main from taichi-dev:feat-shared-annotation

Conversation

Contributor

cgetzen commented Feb 3, 2026

Summary

Problem

Slurm-bridge does not support colocating multiple pods on a single multi-GPU node, resulting in underutilization when workloads require fewer GPUs than the node provides.

Solution

This adds an optional workload annotation, slurmjob.slinky.slurm.net/shared, which accepts a subset of Slurm's shared policy values (none, user). It applies only to workloads with a 1:1 relationship between Slurm jobs and pods, which excludes PodGroup and LeaderWorkerSet resources.
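For illustration, a workload opting in might look like the following. Only the annotation key and its accepted values come from this PR; the Pod shape, names, image, and scheduler name are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: single-gpu-task              # hypothetical workload name
  annotations:
    # Opt in to Slurm node sharing for this workload's placeholder job.
    # Values accepted by this PR: "none", "user".
    slurmjob.slinky.slurm.net/shared: "user"
spec:
  schedulerName: slurm-bridge-scheduler   # assumed scheduler name
  containers:
    - name: task
      image: example.com/task:latest      # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
```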

The admission controller ensures correctness:

  • validates the annotation value
  • makes the annotation immutable once the placeholder Slurm job is running
  • ensures the annotation is only applied to supported workload types
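A minimal sketch of the value check, written in Go (slurm-bridge's language); the function name and shape are illustrative, not the PR's actual code:

```go
package main

import "fmt"

// Annotation key and accepted values from this PR.
const sharedAnnotation = "slurmjob.slinky.slurm.net/shared"

var allowedSharedValues = map[string]bool{
	"none": true,
	"user": true,
}

// validateSharedAnnotation is a hypothetical admission check: it rejects
// unknown values but allows workloads that omit the annotation entirely,
// preserving default behavior.
func validateSharedAnnotation(annotations map[string]string) error {
	value, ok := annotations[sharedAnnotation]
	if !ok {
		return nil // annotation absent: existing behavior is unchanged
	}
	if !allowedSharedValues[value] {
		return fmt.Errorf("invalid value %q for %s: must be one of [none user]",
			value, sharedAnnotation)
	}
	return nil
}

func main() {
	fmt.Println(validateSharedAnnotation(map[string]string{sharedAnnotation: "user"}))
	fmt.Println(validateSharedAnnotation(map[string]string{sharedAnnotation: "exclusive"}))
}
```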

The scheduler then applies the "shared" setting when creating the Slurm job.

Limitations

Allowing group workloads to use the shared annotation is out of scope.

Group workloads use a single placeholder job for multiple pods, with a fixed node count and a one-node-per-pod assignment. Allowing shared on them would require supporting Slurm packing (fewer nodes than pods), which in turn would require changes to PostFilter, the submitJob node count, and annotatePodsWithNodes.

Using group workloads with DRA poses additional challenges. Slurm-bridge currently assumes one pod per node per job: PreBind is called per-pod with (pod, nodeName), and GetResources(ctx, pod, nodeName) returns the job’s allocation on that node from Slurm’s NodeResourceLayout. One ResourceClaim is created per pod for that full allocation. With multiple pods on the same node, each pod should only receive a portion of the job's allocation.
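To make the problem concrete, here is a hedged sketch in Go. The types and function names are invented for illustration and are not slurm-bridge's API; the point is that today the per-node allocation maps to one pod's claim, whereas sharing would require splitting it:

```go
package main

import "fmt"

// nodeAllocation is an invented stand-in for the per-node slice of a
// Slurm job's allocation (e.g. as reported via NodeResourceLayout).
type nodeAllocation struct {
	GPUs int
}

// claimForPodToday mirrors the current one-pod-per-node assumption: the
// single pod on the node receives the job's entire allocation there.
func claimForPodToday(alloc nodeAllocation) nodeAllocation {
	return alloc
}

// claimForPodShared sketches what sharing would require: each of the
// podsOnNode pods gets only a portion of the node allocation. Even this
// naive even split shows the extra bookkeeping PreBind would need.
func claimForPodShared(alloc nodeAllocation, podsOnNode int) nodeAllocation {
	return nodeAllocation{GPUs: alloc.GPUs / podsOnNode}
}

func main() {
	alloc := nodeAllocation{GPUs: 8}
	fmt.Println(claimForPodToday(alloc).GPUs)     // whole-node allocation
	fmt.Println(claimForPodShared(alloc, 4).GPUs) // per-pod share
}
```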

Breaking Changes

All existing behavior is maintained by default. Only workloads that opt in to using slurmjob.slinky.slurm.net/shared are affected.

Testing Notes

Unit tests have been added, and manual tests have been performed to confirm scheduling placement.

Additional Context

cgetzen force-pushed the feat-shared-annotation branch from eca58f2 to 62efa0b on February 3, 2026
cgetzen force-pushed the feat-shared-annotation branch from ba2f425 to b63288e on February 16, 2026
vivian-hafener self-assigned this Feb 17, 2026
Contributor

vivian-hafener commented Feb 17, 2026

Good afternoon @cgetzen,

> Slurm-bridge does not support colocating multiple pods on a single multi-GPU node, resulting in underutilization when workloads require fewer GPUs than the node provides.

This is correct, and is a known limitation of Slurm-bridge that the Slinky team is actively working to resolve. Setting OverSubscribe=YES has severe implications for the quality of service provided to end-users when using Slurm, especially on multitenant, highly utilized systems. As such, I am not comfortable with officially recommending this parameter as the means by which multiple Slurm jobs should be run on a single node on production clusters.

Presently, the Slinky team is working on integrating DRA capabilities into Slurm-bridge. In doing so, we will have the capability to accurately de-conflict the resource requirements of Kubernetes and Slurm workloads for both CPUs and GPUs. This capability will enable multiple Kubernetes or Kubernetes/Slurm workloads on a node, without the degradation in the end-user experience that would be provided through the use of the OverSubscribe configuration parameter. Additionally, this should enable "group workloads" to take advantage of this capability. At the time that these integrations are complete, the ability to enable node-packing/sharing in Slurm will be exposed on a system level. However, we do not at this time intend to expose the shared policy via an annotation.

Please let me know if you have any further questions on the matter.

Best regards,
Vivian Hafener

Contributor Author

cgetzen commented Feb 17, 2026

I appreciate the detailed response. This PR has two errors:

  • OverSubscribe is not required on the partition in order for shared=user to schedule multiple workloads on a node. This was a documentation error.
  • Setting shared to mcs, oversubscribe, or topo carries risks when not running with DRA, since Slurm and Kubernetes workloads can be scheduled on the same node. These options have been removed from the scope of this PR.

I have updated the branch accordingly.

@vivian-hafener DRAExtendedResources is still in alpha. I agree that it's important for all code paths to support bin-packing, and I believe this PR takes an incremental step by safely enabling the non-DRA paths that are already usable in production. Does the Slinky team plan to add (Kubernetes-only) bin-packing support for the non-DRA code paths? If so, I am happy to maintain this branch until that feature is developed. Otherwise, it may be worth reconsidering this PR with the fixes I have added.

Non-DRA, Kubernetes-only bin-packing is needed for our use case: a cluster where researchers use Slurm to schedule whole-node and multi-node workloads, and SREs use Kubernetes to schedule autoscaling inference workloads that consume single GPUs.

Thanks for your work on this project!

@vivian-hafener
Contributor

Good afternoon @cgetzen,

I apologize for closing your PR prematurely.

At the time that the Slinky team implements node sharing with the CPU DRA driver, we intend to fully drop our existing limitation that forces the application of the --exclusive flag to all workloads.

After the integration of Slurm-bridge with DRA-Driver-CPU, I will re-evaluate this PR and discuss it with the rest of the team. I think that adding an annotation as you have done here may indeed make sense.

Best regards,
Vivian Hafener

