An Inference Gateway (IG) provides efficient routing to LLM workloads in Kubernetes by sending requests to an Endpoint Picker (EPP) associated with an [InferencePool](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool/) and routing the request to a backend model server based on the EPP-provided endpoint. This proposal extends the current model to support multi-cluster routing so capacity in one cluster can serve traffic originating in another.
### Why Multi-Cluster?
GPU capacity is scarce and fragmented. Many users operate multiple clusters across regions and providers. A single cluster rarely satisfies peak or sustained demand, so a prescribed approach is required to share GPU capacity across clusters by:
- Exporting an InferencePool from a source (“exporting”) cluster.
- Importing the exported InferencePool into one or more destination (“importing”) clusters with enough detail for IGs to route requests to the associated remote model server Pods.
### Goals
The Multi-Cluster Inference (MCI) model will largely follow the Multi-Cluster Services (MCS) model:
- A separate export resource will be avoided, e.g. ServiceExport, by inlining the concept within InferencePool.

An InferencePoolImport resource is introduced that is meant to be fully managed by an MCI controller. This resource provides the information required for IGs to route LLM requests to model server endpoints of an InferencePool in remote clusters. How the IG routes the request to the remote cluster is implementation-specific.
### Routing Modes
The proposal supports the following routing modes:
- Endpoint Mode: An IG of an importing cluster routes to endpoints selected by the EPP of the exported InferencePool. Pod and Service network connectivity MUST exist between cluster members.
- Parent Mode: An IG of an importing cluster routes to parents, e.g. Gateways, of the exported InferencePool. Parent connectivity MUST exist between cluster members.
### Workflow
1. **Export an InferencePool:** An [Inference Platform Owner](https://gateway-api-inference-extension.sigs.k8s.io/concepts/roles-and-personas/) marks an InferencePool for export by adding the export annotation described in the API Changes section below.
2. **MCI Controller (Per ClusterSet):**
   - Watches all ClusterSet member clusters for exported InferencePool resources (must have access to the K8s API server).
   - CRUDs an InferencePoolImport in each member cluster if:
     - The cluster contains the namespace of the exported InferencePool (namespace sameness).
     - When an InferencePoolImport with the same ns/name already exists in the cluster, update the associated `inferencepoolimport.status.clusters[]` entry.
   - Populates InferencePoolImport status with the information required for importing IGs to route requests to exported InferencePool endpoints, e.g. address of the referenced EPP or remote parent, so the importing IG can route to remote endpoints using either routing mode.
3. **IG Management Plane (Per Cluster):**
   - Watches InferencePoolImport resources.
   - Programs the managed IG data plane to either:
     - Connect to the exported EPP via gRPC ext-proc for endpoint selection and optionally EPP health/metrics endpoints. Connect directly to exported InferencePool endpoints (Endpoint Mode).
     - Connect to remote parents, e.g. IGs, of the exported InferencePool (Parent Mode).
4. **Data Path:**
   The data path depends on the routing mode in use:
   - Endpoint Mode: Client → local IG → (make scheduling decision) → local/remote EPP → selected model server endpoint → response.
   - Parent Mode: Client → local IG → (make scheduling decision) → local EPP/remote parent → remote EPP → selected model server endpoint → response.
### InferencePoolImport Naming
The MCI controller will create an InferencePoolImport resource using the exported InferencePool namespace and name. A cluster name entry in `inferencepoolimport.status.clusters[]` is added for each cluster that exports an InferencePool with the same ns/name.
### InferencePool Selection
InferencePool selection is implementation-specific.
### API Changes
#### Export Annotation
The following annotation is being proposed to indicate the desire to export the InferencePool to member clusters of a ClusterSet.

The `inference.networking.x-k8s.io/export` annotation key indicates a desire to export the InferencePool:

Supported Values:
- `ClusterSet` – export to all members of the current ClusterSet.
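
For illustration, a minimal sketch of what an exported InferencePool might look like. Only the annotation key and `ClusterSet` value come from this proposal; the pool name, namespace, selector, port, and EPP reference are hypothetical and assume the existing v1alpha2 InferencePool schema.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool          # hypothetical pool name
  namespace: inference      # must also exist in importing clusters (namespace sameness)
  annotations:
    # Requests export of this InferencePool to all members of the ClusterSet.
    inference.networking.x-k8s.io/export: ClusterSet
spec:
  targetPortNumber: 8000    # model server port (illustrative)
  selector:
    app: llama-server       # selects the model server Pods (illustrative)
  extensionRef:
    name: llama-epp         # the Endpoint Picker (EPP) for this pool (illustrative)
```
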
**Note:** Additional annotations, e.g. region/domain scoping, filtering clusters in the ClusterSet, routing mode configuration, etc., and potentially adding an InferencePoolExport resource, may be considered in the future.
#### InferencePoolImport
A cluster-local, controller-managed resource that represents an imported InferencePool. It primarily communicates the EPP or parents of the exported InferencePool(s) to the importing IG controller. It is not user-authored; status carries the effective import. Inference Platform Owners can reference the InferencePoolImport, even if the local cluster does not have an InferencePool. In the context of Gateway API, it means that an HTTPRoute can be configured to reference an InferencePoolImport to route matching requests to remote InferencePool endpoints.

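As a hedged sketch of that Gateway API usage: an HTTPRoute in the importing cluster might reference the InferencePoolImport as a backend. The route and Gateway names are hypothetical, and the exact `group`/`kind` used in the backend reference is an assumption based on this proposal's API group.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                    # hypothetical
  namespace: inference
spec:
  parentRefs:
    - name: inference-gateway        # the local IG (hypothetical)
  rules:
    - backendRefs:
        # Route matching requests to the imported (remote) InferencePool.
        - group: inference.networking.x-k8s.io
          kind: InferencePoolImport
          name: llama-pool
```
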
This API will be used almost exclusively for tracking endpoints, but unlike MCS, we actually have two distinct sets of endpoints to track:
1. Endpoint Picker Extensions (EPPs)
2. Model server endpoints

Key ideas:
- Name/namespace sameness with the exported InferencePool (avoids extra indirection).
- Routing mode: whether the IG should route directly to remote EPP-selected endpoints or through parents of the exported InferencePool.
- EPP details: network coordinates and optional health/metrics hints.
- Conditions: Accepted, Ready, etc.
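
To make the shape concrete, a purely illustrative InferencePoolImport reflecting the key ideas above. The `status.clusters[]` structure and condition types come from this proposal, but the API version and the per-cluster field names (e.g. `epp.address`, `epp.port`) are not defined here and are assumptions.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1   # assumed API version
kind: InferencePoolImport
metadata:
  name: llama-pool        # mirrors the exported InferencePool name
  namespace: inference    # namespace sameness with the exporting clusters
status:
  clusters:
    - name: cluster-east            # one entry per exporting cluster
      epp:
        address: 10.10.0.15         # EPP network coordinates (illustrative)
        port: 9002
    - name: cluster-west
      epp:
        address: 10.20.0.22
        port: 9002
  conditions:
    - type: Accepted
      status: "True"
      reason: Accepted
```
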
**MCI Controller (Per ClusterSet):**
- Discover exported InferencePools.
- For each ClusterSet member cluster, CRUD InferencePoolImport (mirrored namespace/name).
- Populate `status.clusters[]` entries with:
  - EPP service endpoints/ports (and optional health/metrics),
  - Optional remote parents (Gateway Services) for Parent Mode.
- Finalize annotation key/values and ClusterSet discovery. Should we introduce an annotation to filter exporting to specific clusters in the ClusterSet?

#### InferencePool Status
- Should EPP Deployment/Pod discovery be standardized (labels/port names) for health/metrics auto-discovery?
#### Security
- Provide a standard way to bootstrap mTLS between importing IG and exported EPP/parents, e.g. use BackendTLSPolicy (see the sketch after this list)?
- Should the MCI controller mirror secrets into the importing cluster, e.g. secure metric scraping (Endpoint Mode)?
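
As a rough sketch of the BackendTLSPolicy suggestion above (not a settled design): in a single-cluster setup, a BackendTLSPolicy can require TLS from the Gateway to a backend such as the EPP Service. All names below are hypothetical, and whether or how this extends to imported, remote EPPs or parents is exactly the open question.

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
  name: epp-tls                      # hypothetical
  namespace: inference
spec:
  targetRefs:
    - group: ""
      kind: Service
      name: llama-epp                # the EPP Service (hypothetical)
  validation:
    hostname: epp.inference.svc.cluster.local
    caCertificateRefs:
      - group: ""
        kind: ConfigMap
        name: epp-ca                 # CA bundle used to validate the EPP certificate
```
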
#### Scheduling and Policy
- Should we define a standard cluster preference knob (e.g., PreferLocal, Any, region-affinity, weights) on InferencePoolImport status or IG-local policy CRD?
#### EPP Scale
- If the EPP has multiple replicas, should the MCI controller publish per-replica addresses, e.g. service subsetting, for health/metrics scraping?
#### Ownership and Lifecycle
- Should the MCI controller be owned by the Gateway API Inference Extension project?