
Support GPU Sharing Between Pods in a PodCliqueScalingGroup #390

@julienmancuso

Description


What would you like to be added?

We need to define groups of pods within a PodCliqueScalingGroup (PCSG) that share the exact same GPU(s). When scaling the PCSG, all GPU-sharing groups should be replicated together.

Use Case
Within a single PCSG, define multiple "GPU-sharing groups" where pods share GPUs:
- Each sharing group has two or more pods (e.g., a primary-shadow pair).
- All pods within a sharing group access the same GPU(s).
- Different sharing groups use different, non-overlapping GPU sets.
- When the PCSG is scaled, all sharing groups are replicated together.
Example:

PCSG (replicas: 2)
├── Replica 0
│   ├── Sharing Group "worker1": primary-1 + shadow-1 → share GPU 0-7
│   └── Sharing Group "worker2": primary-2 + shadow-2 → share GPU 8-15
└── Replica 1
    ├── Sharing Group "worker3": primary-3 + shadow-3 → share GPU 16-23
    └── Sharing Group "worker4": primary-4 + shadow-4 → share GPU 24-31
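
The replication rule in this example can be sketched as a small Python model. This is a hypothetical illustration, not part of the PCSG API. `SharingGroup` and `expand_pcsg` are made-up names, every group is assumed to be a primary-shadow pair, and GPU indices are assumed to be handed out as contiguous, globally numbered ranges exactly as in the tree above. In a real cluster, the scheduler and device plugin would pick the actual devices.

```python
from dataclasses import dataclass

# Hypothetical model of the example layout; these names are NOT part of
# the PodCliqueScalingGroup API.

@dataclass(frozen=True)
class SharingGroup:
    name: str
    pods: tuple[str, ...]
    gpus: range  # GPU indices shared by every pod in this group


def expand_pcsg(replicas: int, groups_per_replica: int, gpus_per_group: int):
    """Replicate every sharing group once per PCSG replica.

    Groups are numbered globally (worker1, worker2, ...) and each group gets
    its own contiguous GPU range that does not overlap any other group's.
    Assumes each group is a primary-shadow pair.
    """
    layout = {}
    for r in range(replicas):
        groups = []
        for g in range(groups_per_replica):
            idx = r * groups_per_replica + g + 1  # 1-based global group number
            start = (idx - 1) * gpus_per_group
            groups.append(SharingGroup(
                name=f"worker{idx}",
                pods=(f"primary-{idx}", f"shadow-{idx}"),
                gpus=range(start, start + gpus_per_group),
            ))
        layout[r] = groups
    return layout


for replica, groups in expand_pcsg(replicas=2, groups_per_replica=2, gpus_per_group=8).items():
    for grp in groups:
        print(replica, grp.name, grp.pods, f"GPU {grp.gpus.start}-{grp.gpus.stop - 1}")
```

Running it prints one line per sharing group, matching the tree: replica 0 holds worker1 (GPU 0-7) and worker2 (GPU 8-15), and replica 1 holds worker3 (GPU 16-23) and worker4 (GPU 24-31). Scaling to three replicas would append worker5 and worker6 as one unit.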

Current Gap
- There is no way to specify which pods within a PCSG should share GPUs.
- There is no mechanism to define multiple independent GPU-sharing groups.
- Traditional GPU requests allocate separate GPUs to each pod.
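
As a rough illustration of the last point, the arithmetic below compares how many GPUs the example layout consumes when the pods of each sharing group share one GPU set versus when every pod requests its own GPUs. The function name and parameters are hypothetical, and the count ignores scheduling constraints. Note that the per-pod case only reserves twice the hardware; it still gives a shadow no way to reach its primary's devices.

```python
# Hypothetical back-of-the-envelope comparison for the example layout:
# 2 PCSG replicas x 2 sharing groups x 2 pods (primary + shadow), 8 GPUs per group.

def gpus_needed(replicas: int, groups_per_replica: int, pods_per_group: int,
                gpus_per_group: int, shared: bool) -> int:
    groups = replicas * groups_per_replica
    if shared:
        # Every pod in a group maps onto the same GPU set.
        return groups * gpus_per_group
    # Per-pod requests: each pod receives its own, separate GPUs.
    return groups * pods_per_group * gpus_per_group


print(gpus_needed(2, 2, 2, 8, shared=True))   # 32
print(gpus_needed(2, 2, 2, 8, shared=False))  # 64
```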

Why is this needed?

To support the Dynamo GPU memory service.

Metadata


Labels

enhancement (New feature or request)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
