
InferencePool's pod selector cannot select multiple ports within a pod needed for data parallel attention #1336

@smarterclayton

Description


Data parallel attention in vLLM involves multiple processes on each host, typically one per GPU, communicating collectively with the other accelerators both on and off the host (depending on the size of the wide expert-parallelism config, this can span as many as 20 nodes). Each process is a serving replica in its own right and may experience non-uniform traffic, so it is desirable for IGW to load balance based on the discrete load of each.

The natural way to run data parallel attention in Python would be a single pod per node using all GPUs, with standard Python fork/spawn used to launch child processes. The processes communicate over NVLINK and CUDA IPC (and thus need shared memory), and at least today they are expected to be started and restarted by the launching process.
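A minimal sketch of what that pod shape could look like (image tag, names, and port numbers are illustrative, not a recommended deployment): one container whose launcher forks a DP rank per GPU, with one container port per rank.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-dp-node-0        # hypothetical name
  labels:
    app: vllm-dp              # hypothetical label used by the pool selector
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest   # illustrative tag
    # One launcher process forks a DP rank per GPU;
    # each rank serves on its own port.
    ports:
    - containerPort: 8000     # DP rank 0
    - containerPort: 8001     # DP rank 1
    - containerPort: 8002     # DP rank 2
    - containerPort: 8003     # DP rank 3
    resources:
      limits:
        nvidia.com/gpu: 4
```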

While we have discussed a few options in llm-d involving one pod per rank, the lowest-friction way to implement this in Kubernetes would be one pod containing multiple processes, each listening on a different port. That requires multiple container ports per pod, and the load balancers must then be programmed to treat all ports equally (which may require multiple Services or other fixes).

The current design of the InferencePool pod selector cannot select multiple container ports per pod.
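For context, the relevant part of the InferencePool spec as of v1alpha2 looks roughly like the following (names are illustrative): the selector matches pods, but a single target port field applies to every selected pod, so there is no way to express "all of ports 8000-8003 on each pod".

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-dp-pool          # hypothetical name
spec:
  selector:
    app: vllm-dp              # selects pods, not ports
  targetPortNumber: 8000      # one port for the whole pool;
                              # ranks on 8001-8003 are invisible to the gateway
  extensionRef:
    name: epp                 # hypothetical EPP service name
```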

Going forward, I would expect to see

  • more multi-process / multi-rank coordination within pods
  • potentially disjoint port sets across variants within the pool
  • a need to associate a port -> a container -> a GPU for failure detection and two way linking
  • health checks per port (today, per-port health checks are not possible without splitting the ranks across multiple containers)
  • port metadata so the gateway can correlate port -> ranks for algorithms
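One hypothetical shape for an API that covers the points above (this is a sketch of a possible extension, not the current API or a concrete proposal): replace the single port field with a list of per-port entries, each carrying rank metadata and a health check for the gateway.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-dp-pool          # hypothetical name
spec:
  selector:
    app: vllm-dp
  # Hypothetical: a list of target ports instead of a single number,
  # so the gateway can correlate port -> rank and health-check each rank.
  targetPorts:
  - number: 8000
    metadata:
      rank: "0"
    healthCheck:
      path: /health
  - number: 8001
    metadata:
      rank: "1"
    healthCheck:
      path: /health
```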

This is likely a blocker for data-parallel attention support in the inference gateway. We could work around it inside the EPP, but at the cost of higher-level systems not having the right data (anyone building on top of the pool, metrics, etc.).

Metadata


Assignees

No one assigned

Labels

  • kind/bug — Categorizes issue or PR as related to a bug.
  • triage/accepted — Indicates an issue or PR is ready to be actively worked on.
