
InferencePool's pod selector cannot select multiple ports within a pod needed for data parallel attention #1336

@smarterclayton

Description


Data parallel attention in vLLM involves multiple processes on each host, typically one per GPU, communicating collectively with the other accelerators both on and off the host (depending on the size of the wide expert-parallelism config, this can span as many as 20 nodes). Each process is a serving replica in its own right and may experience non-uniform traffic, so it is desirable for IGW to load balance based on the discrete load of each.

The natural way to run data parallel attention in Python would be a single pod per node using all GPUs, with standard Python fork/spawn used to launch child processes. The processes communicate over NVLINK and CUDA IPC (and thus need shared memory), and at least today they are expected to be started and restarted by the launching process.
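A minimal sketch of what that pod shape could look like (image tag, names, and port numbers are illustrative, not a recommended deployment): one container whose launcher forks a DP rank per GPU, with one container port per rank.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-dp-node-0        # hypothetical name
  labels:
    app: vllm-dp              # hypothetical label used by the pool selector
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest   # illustrative tag
    # One launcher process forks a DP rank per GPU;
    # each rank serves on its own port.
    ports:
    - containerPort: 8000     # DP rank 0
    - containerPort: 8001     # DP rank 1
    - containerPort: 8002     # DP rank 2
    - containerPort: 8003     # DP rank 3
    resources:
      limits:
        nvidia.com/gpu: 4
```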

While we have discussed a few options in llm-d involving one pod per rank, the lowest-friction way to implement this in Kubernetes would be one pod containing multiple processes, each listening on a different port. That requires multiple container ports per pod, and the load balancers must then be programmed to treat all ports equally (which may require multiple Services or other fixes).

The current design of the InferencePool pod selector cannot select multiple container ports per pod.
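For context, the relevant part of the InferencePool spec as of v1alpha2 looks roughly like the following (names are illustrative): the selector matches pods, but a single target port field applies to every selected pod, so there is no way to express "all of ports 8000-8003 on each pod".

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-dp-pool          # hypothetical name
spec:
  selector:
    app: vllm-dp              # selects pods, not ports
  targetPortNumber: 8000      # one port for the whole pool;
                              # ranks on 8001-8003 are invisible to the gateway
  extensionRef:
    name: epp                 # hypothetical EPP service name
```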

Going forward, I would expect to see

  • more multi-process / multi-rank coordination within pods
  • potentially disjoint port sets across variants within the pool
  • a need to associate a port -> a container -> a GPU for failure detection and two way linking
  • health checks per port (today, per-port health checks are not possible without splitting the ranks across multiple containers)
  • port metadata so the gateway can correlate port -> ranks for algorithms
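One hypothetical shape for an API that covers the points above (this is a sketch of a possible extension, not the current API or a concrete proposal): replace the single port field with a list of per-port entries, each carrying rank metadata and a health check for the gateway.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-dp-pool          # hypothetical name
spec:
  selector:
    app: vllm-dp
  # Hypothetical: a list of target ports instead of a single number,
  # so the gateway can correlate port -> rank and health-check each rank.
  targetPorts:
  - number: 8000
    metadata:
      rank: "0"
    healthCheck:
      path: /health
  - number: 8001
    metadata:
      rank: "1"
    healthCheck:
      path: /health
```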

This is likely a blocker for data-parallel attention support in the inference gateway. We could work around it inside the EPP, but at the cost of higher-level systems not having the right data (anyone building on top of the pool, metrics, etc.).

Metadata


Assignees

No one assigned

Labels

  • kind/bug — Categorizes issue or PR as related to a bug.
  • triage/accepted — Indicates an issue or PR is ready to be actively worked on.
