diff --git a/docs/proposals/1374-multi-cluster-inference/README.md b/docs/proposals/1374-multi-cluster-inference/README.md new file mode 100644 index 000000000..321b0d2a4 --- /dev/null +++ b/docs/proposals/1374-multi-cluster-inference/README.md @@ -0,0 +1,493 @@ +# Multi-Cluster InferencePools + +Author(s): @danehans, @bexxmodd, @robscott + +## Proposal Status + + ***Draft*** + +## Summary + +An Inference Gateway (IG) provides efficient routing to LLM workloads in Kubernetes by sending requests to an Endpoint Picker (EPP) associated with +an [InferencePool](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/) and routing the request to a backend model server +based on the EPP-provided endpoint. Although other multi-cluster inference approaches may exist, this proposal extends the current model to support +multi-cluster routing so capacity in one cluster can serve traffic originating in another cluster or outside the clusters. + +### Why Multi-Cluster? + +GPU capacity is scarce and fragmented. Many users operate multiple clusters across regions and providers. A single cluster rarely satisfies peak or +sustained demand, so a prescribed approach is required to share GPU capacity across clusters by: + +- Exporting an InferencePool from a source (“exporting”) cluster. +- Importing the exported InferencePool into one or more destination (“importing”) clusters with enough detail for IGs to route requests to the associated + remote model server Pods. + +### Goals + +- Enable IGs to route to a group of common model server Pods, e.g. InferencePools, that exist in different clusters. +- Align the UX with familiar [Multi-Cluster Services (MCS)](https://multicluster.sigs.k8s.io/concepts/multicluster-services-api/) concepts (export/import). +- Keep the API simple and implementation-agnostic. + +### Non-Goals + +- Managing DNS or automatic naming. +- Over-specifying implementation details to satisfy a single approach to Multi-Cluster InferencePools. + +## Design Proposal + +The Multi-Cluster InferencePools (MCIP) model will largely follow the Multi-Cluster Services (MCS) model, with a few key differences: + +- DNS and ClusterIP resolution will be omitted, e.g. ClusterSetIP. +- A separate export resource will be avoided, e.g. ServiceExport, by inlining the concept within InferencePool. + +An InferencePoolImport resource is introduced that is meant to be fully managed by a controller. This resource provides the information +required for IGs to route LLM requests to model server endpoints of an InferencePool in remote clusters. How the IG routes the request to the remote +cluster is implementation-specific. + +### Routing Modes + +An implementation must support at least one of the following routing modes: + +- Endpoint Mode: An IG of an importing cluster routes to endpoints selected by the EPP of the exported InferencePool. Pod and Service network connectivity + MUST exist between cluster members. +- Parent Mode: An IG of an importing cluster routes to parents, e.g. Gateways, of the exported InferencePool. Parent connectivity MUST exist between cluster + members. + +### Sync Topology (Implementation-Specific) + +An implementation must support at least one of the following distribution topologies. The API does not change between them (same export annotation and InferencePoolImport). + +1. **Hub/Spoke** + - A hub controller has visibility into member clusters. + - It watches exported InferencePools and creates/updates the corresponding InferencePoolImport (same namespace/name) in each member cluster. + - Typical when a central control plane has K8s API server access for each member cluster. + - Consider [KEP-5339-style](https://github.com/kubernetes/enhancements/tree/master/keps/sig-multicluster/5339-clusterprofile-plugin-credentials) pluggable credential issuance to avoid hub-stored long-lived secrets. + +2. **Push/Pull** + - A cluster-local controller watches exported InferencePools and publishes export state to a central hub. + - A cluster-local controller watches the central hub and CRUDs the local InferencePoolImport. + - Typical when you want no hub-stored member credentials, looser coupling, and fleet-scale fan-out. + +### Workflow + +1. **Export an InferencePool:** An [Inference Platform Owner](https://gateway-api-inference-extension.sigs.k8s.io/concepts/roles-and-personas/) + exports an InferencePool by annotating it. +2. **Distribution (topology-dependent, API-agnostic):** + - **Hub/Spoke:** A central hub controller watches exported InferencePools and mirrors a same-name/namespace InferencePoolImport into each member cluster, updating `status.controllers[]` to reflect the managing controller, exporting clusters, etc.. + - **Push/Pull:** A cluster-local controller watches exported InferencePools and publishes export records to a central hub. In each member cluster, a controller watches the hub and CRUDs the local InferencePoolImport (same name/namespace) and maintains `status.controllers[]`. +3. **Importing Controller (common):** + - Watches local InferencePoolImport and: + - Programs the IG dataplane based on the supported routing mode. + - If this controller differs from the Exporting Controller, populate `status.controllers[]`. + - Manages `status.controllers[].parentRefs` with the Group, Kind, and Name of the local parent resource, e.g. Gateway, of the InferencePoolImport. +4. **Data Path:** + The data path is dependent on the export mode selected by the implementation. + - Endpoint Mode: Client → local IG → (make scheduling decision) → local/remote EPP → selected model server endpoint → response. + - Parent Mode: Client → local IG → (make scheduling decision) → local EPP/remote parent → remote EPP → selected model server endpoint → response. + +### InferencePoolImport Naming + +The exporting controller will create an InferencePoolImport resource using the exported InferencePool namespace and name. A cluster name entry in +`status.controllers[]` is added for each cluster that exports an InferencePool with the same ns/name. + +**Note:** EPP ns/name sameness is not required. + +### InferencePool Selection + +InferencePool selection is implementation-specific. The following are examples of how an IG may select one exported InferencePool over another: + +- **Metrics-based:** Scrape EPP-exposed metrics (e.g., ready pods) to bias InferencePool choice. +- **Active-Passive:** Basic EPP readiness checks (gRPC health). + +**Note:** When an exported InferencePool is selected by an IG, standard EPP semantics are used to select endpoints of that pool. + +### API Changes + +#### InferencePool Annotation + +The following annotation is being proposed to indicate the desire to export the InferencePool to member clusters of a ClusterSet. + +The `inference.networking.x-k8s.io/export` annotation key indicates a desire to export the InferencePool: + +```yaml +inference.networking.x-k8s.io/export: "" +``` + +Supported Values: + +- `ClusterSet` – export to all members of the current [ClusterSet](https://multicluster.sigs.k8s.io/api-types/cluster-set/). + +**Note:** Additional annotations, e.g. region/domain scoping, filter clusters in the ClusterSet, routing mode configuration, etc. and +potentially adding an InferencePoolExport resource may be considered in the future. + +#### InferencePool Status + +An implementation MUST set a parent status entry with a parentRef of kind `InferencePoolImport` and the ns/name of the exported InferencePool. +This informs the user that the request to export the InferencePool has been recognized by the implementation, along with the implementation's +unique `ControllerName`. `ControllerName` is a domain/path string that indicates the name of the controller that wrote the status entry. + +An `Exported` parent condition type is being added to surface status of the exported InferencePool. An implementation MUST set +this status condition to `True` when the annotated InferencePool has been exported to all member clusters of the ClusterSet and `False` +for all other reasons. When the export annotation is removed from the InferencePool, an implementation MUST remove this condition type. + +#### InferencePoolImport + +A cluster-local, controller-managed resource that represents an imported InferencePool. It primarily communicates a relationship between an exported +InferencePool and the exporting cluster name. It is not user-authored; status carries the effective import. Inference Platform Owners can reference +the InferencePoolImport, even if the local cluster does not have an InferencePool. In the context of Gateway API, it means that an HTTPRoute can be +configured to reference an InferencePoolImport to route matching requests to remote InferencePool endpoints. This API will be used almost exclusively +for tracking endpoints, but unlike MCS, we actually have two distinct sets of endpoints to track: + +1. Endpoint Picker Extensions (EPPs) +2. InferencePool parents, e.g. Gateways + +Key ideas: + +- Map exported InferencePool to exporting controller and cluster. +- Name/namespace sameness with the exported InferencePool (avoids extra indirection). +- Conditions: Surface a controller-level status condition to indicate that the InferencePoolImport is ready to be used. +- Conditions: Surface parent-level status conditions to indicate that the InferencePoolImport is referenced by a parent, e.g. Gateway. + +See the full Go type below for additional details. + +## Controller Responsibilities + +**Export Controller:** + +- Discover exported InferencePools. +- For each ClusterSet member cluster, CRUD InferencePoolImport (mirrored namespace/name). +- Populate the exported InferencePool with status to indicate a unique name of the managing controller and an `Exported` condition that + indicates the status of the exported pool. +- Populate an InferencePoolImport `status.controllers[]` entry with the managing controller name, cluster, and status conditions associated + with the exported InferencePool. + +**Import Controller:** + +- Watch InferencePoolImports. +- Program the IG data plane to route matching requests based on the supported routing mode. +- Manage InferencePoolImport `status.controllers[].parentRefs` with a cluster-local parentRef, e.g. Gateway, when the InferencePoolImport + is referenced by a managed HTTProute. + +## Examples + +### Exporting Cluster (Cluster A) Manifests + +In this example, Cluster A exports the InferencePool to all clusters in the ClusterSet. This will +cause the exporting controller to create an InferencePoolImport resource in all clusters. + +```yaml +apiVersion: inference.networking.k8s.io/v1 +kind: InferencePool +metadata: + name: llm-pool + namespace: example + annotations: + inference.networking.x-k8s.io/export: "ClusterSet" # Export the pool to all clusters in the ClusterSet +spec: + endpointPickerRef: + name: epp + portNumber: 9002 + selector: + matchLabels: + app: my-model + targetPorts: + - number: 8080 +--- +apiVersion: v1 +kind: Service +metadata: + name: epp + namespace: example +spec: + selector: + app: epp + ports: + - name: ext-proc + port: 9002 + targetPort: 9002 + appProtocol: http2 + type: LoadBalancer # EPP exposed via LoadBalancer +``` + +### Importing Cluster (Cluster B) Manifests + +In this example, the Inference Platform Owner has configured an HTTPRoute to route to endpoints of the Cluster A InferencePool +by referencing the InferencePoolImport as a `backendRef`. The parent IG(s) of the HTTPRoute are responsible for routing to the +endpoints selected by the exported InferencePool's EPP. + +The InferencePoolImport is controller-managed; shown here only to illustrate the expected status shape. + +```yaml +apiVersion: inference.networking.x-k8s.io/v1alpha1 +kind: InferencePoolImport +metadata: + name: llm-pool # mirrors exporting InferencePool name + namespace: example # mirrors exporting InferencePool namespace +status: + controllers: + - controllerName: example.com/mcip-controller + type: MultiCluster + exportingClusters: + - name: cluster-a + - controllerName: example.com/ig-controller + type: GatewayClass + parents: + - parentRef: + group: gateway.networking.k8s.io + kind: Gateway + name: inf-gw # Cluster-local parent, e.g. gateway + conditions: + - type: Accepted + status: "True" +--- +# Route in the importing cluster that targets the imported pool +apiVersion: gateway.networking.k8s.io/v1beta1 +kind: HTTPRoute +metadata: + name: llm-route + namespace: example +spec: + parentRefs: + - name: inf-gw + hostnames: + - my.model.com + rules: + - matches: + - path: + type: PathPrefix + value: /completions + backendRefs: + - group: inference.networking.x-k8s.io + kind: InferencePoolImport + name: llm-pool +``` + +An implementation MUST conform to Gateway API specifications, including when the HTTPRoute contains InferencePool and InferencePoolImport `backendRefs`, +e.g. `weight`-based load balancing. In the following example, traffic MUST be split equally between Cluster A and B InferencePool endpoints when +using the following `backendRefs`: + +```yaml + backendRefs: + - group: inference.networking.k8s.io + kind: InferencePool + name: llm-pool + weight: 50 + - group: inference.networking.x-k8s.io + kind: InferencePoolImport + name: llm-pool + weight: 50 +``` + +**Note:** The above example does not export the local "llm-pool" InferencePool. If this InferencePool was exported, it would be included in +the example InferencePoolImport and the implementation would be responsible for balancing the traffic between the two pools. + +### Go Types + +The following Go types define the InferencePoolImport API being introduced by this proposal. Note that a separate Pull Request will be used +to finalize and merge all Go types into the repository. + +```go +package v1alpha1 + +import ( + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + + v1 "sigs.k8s.io/gateway-api-inference-extension/api/v1" +) + +// +kubebuilder:object:root=true +// +kubebuilder:resource:scope=Namespaced,shortName=ipimp +// +kubebuilder:subresource:status +// +// InferencePoolImport represents an imported InferencePool from another cluster. +// This resource is controller-managed; users typically do not author it directly. +type InferencePoolImport struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + + // Spec defines the desired state of the InferencePoolImport. + Spec InferencePoolImportSpec `json:"spec,omitempty"` + + // Status defines the current state of the InferencePoolImport. + Status InferencePoolImportStatus `json:"status,omitempty"` +} + +// Unused but defined for potential future use. +type InferencePoolImportSpec struct{} + +type InferencePoolImportStatus struct { + // Controllers is a list of controllers that are responsible for managing this InferencePoolImport. + // + // +kubebuilder:validation:Required + Controllers []ImportController `json:"controllers"` +} + +// ImportController defines a controller that is responsible for managing this InferencePoolImport. +type ImportController struct { + // Name is a domain/path string that indicates the name of the controller that manages this + // InferencePoolImport. This corresponds to the GatewayClass controllerName field when the + // controller will manage parents of type "Gateway". Otherwise, the name is implementation-specific. + // + // Example: "example.net/import-controller". + // + // The format of this field is DOMAIN "/" PATH, where DOMAIN and PATH are valid Kubernetes + // names (https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names). + // + // A controller MUST populate this field when writing status and ensure that entries to status + // populated with their controller name are removed when they are no longer necessary. + // + // +required + Name ControllerName `json:"name"` + + // ExportingClusters is a list of clusters that exported the InferencePool associated with this + // InferencePoolImport. Required when the controller is responsible for CRUD'ing the InferencePoolImport + // from the exported InferencePool(s). + // + // +optional + ExportingClusters []ExportingCluster `json:"exportingClusters"` + + // Parents is a list of parent resources, typically Gateways, that are associated with the + // InferencePoolImport, and the status of the InferencePoolImport with respect to each parent. + // + // Required when the controller manages the InferencePoolImport as an HTTPRoute backendRef. The controller + // must add an entry for each parent it manages and remove the parent entry when the controller no longer + // considers the InferencePoolImport to be associated with that parent. + // + // +optional + // +listType=atomic + Parents []v1.ParentStatus `json:"parents,omitempty"` + + // Conditions track the state of the InferencePoolImport. + // + // Known condition types are: + // + // * "Accepted" + // + // +optional + // +listType=map + // +listMapKey=type + // +kubebuilder:validation:MaxItems=8 + Conditions []metav1.Condition `json:"conditions,omitempty"` +} + +// ControllerName is the name of a controller that manages a resource. It must be a domain prefixed path. +// +// Valid values include: +// +// * "example.com/bar" +// +// Invalid values include: +// +// * "example.com" - must include path +// * "foo.example.com" - must include path +// +// +kubebuilder:validation:MinLength=1 +// +kubebuilder:validation:MaxLength=253 +// +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*\/[A-Za-z0-9\/\-._~%!$&'()*+,;=:]+$` +type ControllerName string + +// ExportingCluster defines a cluster that exported the InferencePool associated to this InferencePoolImport. +type ExportingCluster struct { + // Name of the exporting cluster (must be unique within the list). + // + // +kubebuilder:validation:Required + Name string `json:"name"` +} + +// +kubebuilder:object:root=true +type InferencePoolImportList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata,omitempty"` + Items []InferencePoolImport `json:"items"` +} +``` + +### Failure Mode + +EPP failure modes continue to work as-is and are independent of MCIP. + +#### EPP Selection + +Since an IG decides which EPP to use for endpoint selection when multiple InferencePool/InferencePoolImport `backendRefs` exist, +an implementation MAY use EPP metrics and/or health data to make a load-balancing decision. + +## Alternatives + +### Option 1: Reuse MCS API for EPP + +Reuse MCS to export EPP Services. This approach provides simple infra, but may be confusing to users (you “export EPPs” not pools) and +requires a separate MCS parent export for parent-based inter-cluster routing. + +**Pros**: + +- Reuses existing MCS infrastructure. +- Relatively simple to implement. + +**Cons**: + +- Referencing InferencePools in other clusters requires you to create an InferencePool locally. +- In this model, you don’t actually choose to export an InferencePool, you export the EPP or InferencePool parent(s) service, that could lead to confusion. +- InferencePool is meant to be a replacement for a Service so it may seem counterintuitive for a user to create a Service to achieve multi-cluster inference. + +## Option 2: New MCS API + +One of the key pain points we’re seeing here is that the current iteration of the MCS API requires a tight coupling between name/namespace and kind, with Service being the only kind of backend supported right now. This goes against the broader SIG-Network direction of introducing more focused kinds of backends (like InferencePool). To address this, we could create a resource that has an `exportRef` that allows for exporting different types of resources. + +While we were at it, we could combine the separate `export` and `import` resources that exist today, with `export` acting as the (optional) spec of this new resource, and `import` acting as `status` of the resource. Instead of `import` resources being automatically created, users would create them wherever they wanted to reference or export something to a MultiClusterService. + +Here’s a very rough example: + +```yaml +apiVersion: networking.k8s.io/v1 +kind: MultiClusterService +metadata: + name: epp + namespace: example +spec: + exportRef: + group: v1 + kind: Service + name: epp + scope: ClusterSet +status: + conditions: + - type: Accepted + status: "True" + message: "MultiClusterService has been accepted" + lastTransitionTime: "2025-03-30T01:33:51Z" + targetCount: 1 + ports: + - protocol: TCP + appProtocol: HTTP + port: 8080 +``` + +### Open Questions + +#### EPP Discovery + +- Should EPP Deployment/Pod discovery be standardized (labels/port names) for health/metrics auto-discovery? + +#### Security + +- Provide a standard way to bootstrap mTLS between importing IG and exported EPP/parents, e.g. use BackendTLSPolicy? + +#### Ownership and Lifecycle + +- Garbage collection when export is withdrawn (delete import?) and how to drain traffic safely. See [this comment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1374#discussion_r2357523781) for additional context. + +### Prior Art + +- [GEP-1748: Gateway API Interaction with Multi-Cluster Services](https://gateway-api.sigs.k8s.io/geps/gep-1748/) +- [Envoy Gateway with Multi-Cluster Services](https://gateway.envoyproxy.io/latest/tasks/traffic/multicluster-service/) +- [Multi-Cluster Service API](https://multicluster.sigs.k8s.io/concepts/multicluster-services-api/) + +### References + +- [Initial Multi-Cluster Inference Design Doc](https://docs.google.com/document/d/1QGvG9ToaJ72vlCBdJe--hmrmLtgOV_ptJi9D58QMD2w/edit?tab=t.0#heading=h.q6xiq2fzcaia) + +### Notes for reviewers + +- The InferencePoolImport CRD is intentionally status-only to keep the UX simple and controller-driven. +- The InferencePool namespace sameness simplifies identity and lets HTTPRoute authors reference imports without new indirection.