The intention of the Gateway API Inference Extension is to optimize load-balancing for self-hosted GenAI models in Kubernetes. These workloads can serve massive amounts of data, and their performance is influenced greatly by the infrastructure that they are running on (e.g. GPUs). Routing client requests to these types of backends requires specialized decision making to ensure the best performance and responses.
To make these routing decisions, a component known as the [Endpoint Picker (EPP)](https://gateway-api-inference-extension.sigs.k8s.io/#endpoint-picker) is deployed. The EPP uses configuration and metrics to determine which AI workload in an InferencePool should receive the request. It returns this endpoint in a header to the `inference Gateway` (which would be NGINX in our case), which then forwards the request to that endpoint. The model name that the request should be routed to can be supplied either in the request body or in a header.
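
As an illustration of that last point, the sketch below shows one way the model name could be pulled from an OpenAI-style JSON body, with a header fallback. It is written in Go purely for readability; in the proposed design this logic would live in the NJS module, and the `model` field and `X-Model-Name` header are assumptions made for illustration rather than anything mandated by the extension.

```go
// Minimal sketch (assumptions: OpenAI-style "model" field in the JSON body,
// hypothetical "X-Model-Name" header as a fallback).
package inference

import (
	"encoding/json"
	"io"
	"net/http"
)

// extractModelName returns the model name from the request body if present,
// otherwise from a header, otherwise an empty string.
func extractModelName(r *http.Request) string {
	if r.Body != nil {
		// Cap the read at 1 MiB; a real implementation would also need to buffer
		// the body so the request can still be proxied upstream afterwards.
		body, err := io.ReadAll(io.LimitReader(r.Body, 1<<20))
		if err == nil {
			var payload struct {
				Model string `json:"model"`
			}
			if json.Unmarshal(body, &payload) == nil && payload.Model != "" {
				return payload.Model
			}
		}
	}
	return r.Header.Get("X-Model-Name") // hypothetical header name
}
```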
Check out the [request flow](https://gateway-api-inference-extension.sigs.k8s.io/#request-flow) section of the Gateway API Inference Extension documentation to learn more.
## Use Cases
Because of this, NGF should watch the endpoints associated with an InferencePool. **The main point of concern with this is how we can fall back to the upstream servers if the EPP is unavailable to give us an endpoint.** This may have to be discovered during implementation.
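
One hedged possibility, sketched below, is to bound the EPP call with a short deadline and treat failure as "no decision", so that NGINX routes across the InferencePool's endpoints as it would for any other upstream group; whether that behavior is acceptable is exactly what needs to be validated during implementation. The `EndpointPicker` type, the timeout value, and the empty-string convention are assumptions made for illustration.

```go
// Sketch of one possible fallback strategy, not a settled design: if the EPP does
// not answer in time, return no endpoint so NGINX uses its configured upstreams.
package inference

import (
	"context"
	"log"
	"time"
)

// EndpointPicker asks the EPP for an endpoint for the given model name
// (for example over the ext_proc stream shown after the flow diagram).
type EndpointPicker func(ctx context.Context, modelName string) (string, error)

// pickEndpointWithFallback bounds the EPP call with a deadline. An empty result
// signals "no EPP decision", and the NGINX side would then load balance across
// the InferencePool's endpoints directly.
func pickEndpointWithFallback(ctx context.Context, pick EndpointPicker, modelName string) string {
	ctx, cancel := context.WithTimeout(ctx, 250*time.Millisecond) // hypothetical budget
	defer cancel()

	endpoint, err := pick(ctx, modelName)
	if err != nil {
		log.Printf("EPP unavailable, falling back to default upstreams: %v", err)
		return ""
	}
	return endpoint
}
```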
### Flow Diagram

```mermaid
flowchart TD
    A[Client Request] --> B[NGINX]
    subgraph NGINX Pod
        subgraph NGINX Container
            B --1--> C[NJS Module: extract model name if needed]
            C --2--> B
            B --3--> D[NJS Module: Subrequest to Go App]
        end
        subgraph Go Application Container
            E[Go Application]
        end
        D -- 4. subrequest --> E
    end
    E -- 5. gRPC ext_proc protocol --> F[Endpoint Picker Pod]
    F -- 6. Endpoint in Header --> E
    E --7--> D
    D --8--> B
    B --9--> G[AI Workload Endpoint]
```
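
To make steps 4–6 more concrete, below is a minimal sketch of how the Go application might ask the EPP for an endpoint over the ext_proc gRPC stream (step 5) and read it back from the response header mutation (step 6). The go-control-plane types, the single headers-only exchange, and the `x-gateway-destination-endpoint` header name are assumptions for illustration, not a description of the final implementation.

```go
// Sketch only: assumes the EPP implements Envoy's ExternalProcessor service and
// returns the picked endpoint as a request-header mutation.
package inference

import (
	"context"
	"fmt"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// pickEndpoint sends the client's request headers (carrying the model name) to the
// EPP over an ext_proc stream and returns the endpoint the EPP selects.
func pickEndpoint(ctx context.Context, eppAddr, modelName string) (string, error) {
	conn, err := grpc.NewClient(eppAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return "", err
	}
	defer conn.Close()

	stream, err := extprocv3.NewExternalProcessorClient(conn).Process(ctx)
	if err != nil {
		return "", err
	}

	// Step 5: send the request headers to the EPP.
	req := &extprocv3.ProcessingRequest{
		Request: &extprocv3.ProcessingRequest_RequestHeaders{
			RequestHeaders: &extprocv3.HttpHeaders{
				Headers: &corev3.HeaderMap{
					Headers: []*corev3.HeaderValue{
						// Hypothetical header carrying the model name extracted earlier.
						{Key: "x-model-name", Value: modelName},
					},
				},
				EndOfStream: true,
			},
		},
	}
	if err := stream.Send(req); err != nil {
		return "", err
	}

	// Step 6: the EPP answers with a header mutation containing the chosen endpoint.
	resp, err := stream.Recv()
	if err != nil {
		return "", err
	}
	for _, h := range resp.GetRequestHeaders().GetResponse().GetHeaderMutation().GetSetHeaders() {
		if h.GetHeader().GetKey() == "x-gateway-destination-endpoint" { // assumed header name
			return h.GetHeader().GetValue(), nil
		}
	}
	return "", fmt.Errorf("EPP response did not contain an endpoint header")
}
```

A real implementation would presumably also need to stream the request body when the model name only appears there, reuse the gRPC connection across requests, and handle the fallback case described above.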
## API, Customer Driven Interfaces, and User Experience
The infrastructure provider or cluster operator would first need to install the Gateway API Inference Extension CRDs, similar to how they install the Gateway API CRDs today. Two new CRDs are introduced: `InferencePool` and `InferenceObjective`.