[RFC]: Support Exclusive Topology in StormService for Pod Colocation #1842


### **Description**

#### **Problem Statement**
In distributed inference and high-performance computing scenarios (e.g., disaggregated LLM serving with Prefill/Decode roles), it is critical that all Pods belonging to the same logical unit (a single RoleSet) are scheduled in the same topology domain, such as the same node (`kubernetes.io/hostname`) or the same availability zone (`topology.kubernetes.io/zone`). Co-locating them minimizes network latency and improves data locality.

Currently, `StormService` lacks native support for enforcing such co-location constraints across multiple roles within a RoleSet.

#### **Proposed Solution**
Introduce an optional field `exclusiveTopology` in `RoleSetSpec`:

```go
// ExclusiveTopology specifies a Kubernetes topology key (e.g., "kubernetes.io/hostname")
// that all Pods in this RoleSet must share. When set, the StormService controller
// will automatically inject required pod affinity rules to ensure co-location.
// +optional
ExclusiveTopology string `json:"exclusiveTopology,omitempty"`
```

When `exclusiveTopology` is specified:
- The controller adds a `requiredDuringSchedulingIgnoredDuringExecution` pod affinity rule to every role’s PodTemplate.
- The label selector targets all Pods in the same RoleSet using stable labels like:
  - `storm-service-name`
  - `roleset-name` or a unique RoleSet identifier
- All roles within the RoleSet are guaranteed to land on nodes sharing the same value for the given topology key.
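
The injection step described above could look roughly like the following sketch. It is not the controller's actual code: `buildExclusiveTopologyAffinity` and the RoleSet name `pd-inference-roleset-0` are hypothetical, the label keys are taken verbatim from the list above, and the structs are minimal local stand-ins that mirror the field names of the Kubernetes `core/v1` affinity types so the example compiles without `k8s.io/api`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the Kubernetes core/v1 and meta/v1 types.
// Field names and JSON tags mirror the upstream API; a real controller
// would import k8s.io/api/core/v1 instead.
type LabelSelector struct {
	MatchLabels map[string]string `json:"matchLabels,omitempty"`
}

type PodAffinityTerm struct {
	LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
	TopologyKey   string         `json:"topologyKey"`
}

type PodAffinity struct {
	RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type Affinity struct {
	PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
}

// buildExclusiveTopologyAffinity (hypothetical helper) returns the affinity
// that would be set on every role's PodTemplate in one RoleSet. It returns
// nil when exclusiveTopology is unset, leaving the template untouched.
func buildExclusiveTopologyAffinity(stormService, roleSet, topologyKey string) *Affinity {
	if topologyKey == "" {
		return nil
	}
	return &Affinity{
		PodAffinity: &PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{{
				// Select every Pod of the same RoleSet, regardless of role.
				LabelSelector: &LabelSelector{MatchLabels: map[string]string{
					"storm-service-name": stormService,
					"roleset-name":       roleSet,
				}},
				TopologyKey: topologyKey,
			}},
		},
	}
}

func main() {
	aff := buildExclusiveTopologyAffinity(
		"pd-inference", "pd-inference-roleset-0", "kubernetes.io/hostname")
	out, err := json.MarshalIndent(aff, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // JSON form of the affinity stanza
}
```

Because the selector includes a per-RoleSet label, each RoleSet forms its own affinity group. A real implementation would also need to decide how to merge this term with any affinity the user already set on a role's template.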

#### **Example Usage**
```yaml
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: pd-inference
spec:
  replicas: 2
  template:
    spec:
      exclusiveTopology: "kubernetes.io/hostname"  # ← enforce per-RoleSet node co-location
      roles:
        - name: prefill
          replicas: 1
          template: { ... }
        - name: decode
          replicas: 2
          template: { ... }
```

Result:
- 2 RoleSets are created (due to `replicas: 2`).
- Each RoleSet’s 3 Pods (1 prefill + 2 decode) are scheduled on **one node**.
- The two RoleSets are usually placed on **different nodes** by the scheduler's default spreading; the pod affinity rule alone does not guarantee this separation.
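
The co-location invariant in the result above can be stated as a small check, which might be useful in an e2e test for this feature. This is a sketch under assumptions: the `Pod` struct, the `coLocationViolations` helper, and the node and RoleSet names are all hypothetical and only model the data needed for the check.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a scheduled Pod.
type Pod struct {
	Name    string
	RoleSet string
	// TopologyValue is the value of the node label named by
	// exclusiveTopology, e.g. the node name for kubernetes.io/hostname.
	TopologyValue string
}

// coLocationViolations returns the RoleSets whose Pods span more than one
// topology value, in order of first detection.
func coLocationViolations(pods []Pod) []string {
	first := map[string]string{} // RoleSet -> first topology value seen
	reported := map[string]bool{}
	var violations []string
	for _, p := range pods {
		v, seen := first[p.RoleSet]
		if !seen {
			first[p.RoleSet] = p.TopologyValue
			continue
		}
		if v != p.TopologyValue && !reported[p.RoleSet] {
			reported[p.RoleSet] = true
			violations = append(violations, p.RoleSet)
		}
	}
	return violations
}

func main() {
	// Placement matching the example: 2 RoleSets, 3 Pods each, one node per RoleSet.
	pods := []Pod{
		{"prefill-0", "roleset-a", "node-1"},
		{"decode-0", "roleset-a", "node-1"},
		{"decode-1", "roleset-a", "node-1"},
		{"prefill-0", "roleset-b", "node-2"},
		{"decode-0", "roleset-b", "node-2"},
		{"decode-1", "roleset-b", "node-2"},
	}
	fmt.Println(len(coLocationViolations(pods))) // prints 0
}
```

The check deliberately says nothing about whether different RoleSets share a node, since the affinity rule only constrains Pods within one RoleSet.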

#### **Benefits**
- Enables low-latency communication between roles (e.g., Prefill ↔ Decode)
- Improves resource efficiency via data locality

