
Commit c826589

Add diagram

1 parent dbb55d5 commit c826589

1 file changed: +26 -1 lines changed

docs/proposals/gateway-inference-extension.md

Lines changed: 26 additions & 1 deletion
@@ -30,7 +30,9 @@ The terms AI workload, LLM workload, and model workload are used interchangeably

The intention of the Gateway API Inference Extension is to optimize load-balancing for self-hosted GenAI models in Kubernetes. These workloads can serve massive amounts of data, and their performance is influenced greatly by the infrastructure that they are running on (e.g. GPUs). Routing client requests to these types of backends requires specialized decision making to ensure the best performance and responses.

- In order to make this routing decision, a component known as the Endpoint Picker (EPP) is deployed. The EPP uses configuration and metrics to determine which AI workload in an InferencePool should receive the request. It returns this endpoint in a header to the `inference Gateway` (which would be NGINX in our case) to forward the request to that endpoint. The model name that the request should be routed to can be contained in the body of the request, or in a header.
+ To make these routing decisions, a component known as the [Endpoint Picker (EPP)](https://gateway-api-inference-extension.sigs.k8s.io/#endpoint-picker) is deployed. The EPP uses configuration and metrics to determine which AI workload in an InferencePool should receive the request. It returns this endpoint in a header to the `Inference Gateway` (NGINX, in our case), which then forwards the request to that endpoint. The model name that the request should be routed to can be contained in the request body or in a header.
+
+ Check out the [request flow](https://gateway-api-inference-extension.sigs.k8s.io/#request-flow) section of the Gateway API Inference Extension documentation to learn more.

## Use Cases
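
To make the "body or header" behavior described above concrete, here is a minimal Go sketch of the model-name extraction step, assuming an OpenAI-style request body with a `model` field. The `X-Gateway-Model-Name` header and the function names are illustrative assumptions, not part of the extension's API.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// openAIRequest captures only the field needed for routing decisions from an
// OpenAI-style completion request body.
type openAIRequest struct {
	Model string `json:"model"`
}

// extractModelName prefers an explicit header and falls back to the "model"
// field in the JSON body. The header name is a hypothetical example.
func extractModelName(r *http.Request, body []byte) (string, error) {
	if name := r.Header.Get("X-Gateway-Model-Name"); name != "" {
		return name, nil
	}
	var req openAIRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", fmt.Errorf("parsing request body: %w", err)
	}
	if req.Model == "" {
		return "", fmt.Errorf("no model name found in header or body")
	}
	return req.Model, nil
}

func main() {
	body := []byte(`{"model": "llama3-8b-instruct", "prompt": "hello"}`)
	r, _ := http.NewRequest(http.MethodPost, "http://localhost/v1/completions", bytes.NewReader(body))

	name, err := extractModelName(r, body)
	if err != nil {
		panic(err)
	}
	fmt.Println("route by model:", name)
}
```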

@@ -140,6 +142,29 @@ Because of this, NGF should watch the endpoints associated with an InferencePool

**The main point of concern with this is: how can we fall back to the upstream servers if the EPP is unavailable to give us an endpoint?** This may have to be discovered during implementation.
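
One possible answer to this concern, sketched minimally in Go: ask the EPP for an endpoint with a short timeout and, if it cannot provide one, fall back to plain round-robin over the InferencePool's known endpoints. The names here (`endpointPicker`, `resolveEndpoint`, `fallbackPool`) are hypothetical, not an existing NGF or EPP API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// endpointPicker abstracts the call to the EPP.
type endpointPicker interface {
	PickEndpoint(ctx context.Context) (string, error)
}

// fallbackPool round-robins across the endpoints already watched for the
// InferencePool, so traffic keeps flowing when the EPP is unavailable.
type fallbackPool struct {
	endpoints []string
	next      atomic.Uint64
}

func (p *fallbackPool) pick() (string, error) {
	if len(p.endpoints) == 0 {
		return "", errors.New("no endpoints available")
	}
	i := p.next.Add(1)
	return p.endpoints[int(i%uint64(len(p.endpoints)))], nil
}

// resolveEndpoint asks the EPP first and degrades to the fallback pool on failure.
func resolveEndpoint(ctx context.Context, epp endpointPicker, pool *fallbackPool) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, 250*time.Millisecond)
	defer cancel()

	if endpoint, err := epp.PickEndpoint(ctx); err == nil {
		return endpoint, nil
	}
	// EPP unavailable or too slow: degrade to plain load balancing.
	return pool.pick()
}

// unavailablePicker simulates an EPP that cannot be reached.
type unavailablePicker struct{}

func (unavailablePicker) PickEndpoint(context.Context) (string, error) {
	return "", errors.New("EPP unreachable")
}

func main() {
	pool := &fallbackPool{endpoints: []string{"10.0.0.10:8000", "10.0.0.11:8000"}}
	endpoint, err := resolveEndpoint(context.Background(), unavailablePicker{}, pool)
	fmt.Println(endpoint, err)
}
```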

### Flow Diagram

```mermaid
flowchart TD
    A[Client Request] --> B[NGINX]
    subgraph NGINX Pod
        subgraph NGINX Container
            B --1--> C[NJS Module: extract model name if needed]
            C --2--> B
            B --3--> D[NJS Module: Subrequest to Go App]
        end
        subgraph Go Application Container
            E[Go Application]
        end
        D -- 4. subrequest --> E
    end
    E -- 5. gRPC ext_proc protocol --> F[Endpoint Picker Pod]
    F -- 6. Endpoint in Header --> E
    E --7--> D
    D --8--> B
    B --9--> G[AI Workload Endpoint]
```
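
As a rough sketch of steps 3-4 and 7-8 above, the Go application could expose an internal HTTP endpoint for the NJS subrequest, obtain an endpoint from the EPP, and hand it back to NGINX in a response header. The `/pick` path, the header names, and the `pickEndpoint` helper are assumptions for illustration; the actual exchange with the EPP happens over the gRPC ext_proc protocol and is elided here.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// pickEndpoint stands in for the gRPC ext_proc exchange with the EPP (steps 5-6).
// A real implementation would stream the request headers/body to the EPP and
// read the chosen endpoint from its response.
func pickEndpoint(ctx context.Context, modelName string) (string, error) {
	return "10.0.0.12:8000", nil // placeholder result
}

func main() {
	http.HandleFunc("/pick", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
		defer cancel()

		// The NJS module forwards the model name it extracted (steps 1-3).
		model := r.Header.Get("X-Model-Name")

		endpoint, err := pickEndpoint(ctx, model)
		if err != nil {
			http.Error(w, "endpoint picker unavailable", http.StatusServiceUnavailable)
			return
		}

		// Step 7: return the chosen endpoint to the NJS subrequest in a header,
		// so NGINX can proxy the original request to it (steps 8-9).
		w.Header().Set("X-Destination-Endpoint", endpoint)
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":9000", nil))
}
```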

## API, Customer Driven Interfaces, and User Experience

The infrastructure provider or cluster operator would first need to install the Gateway API Inference Extension CRDs, similar to how they install the Gateway API CRDs today. Two new CRDs are introduced: the `InferencePool` and the `InferenceObjective`.
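
NGF would also need to watch these new resources once the CRDs are installed (see the endpoint-watching discussion above). Here is a minimal controller-runtime sketch, assuming an unstructured client and the `inference.networking.x-k8s.io/v1alpha2` group/version, which may differ in other releases; the reconciler itself is a placeholder.

```go
package main

import (
	"context"
	"log"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// inferencePoolGVK identifies the InferencePool CRD; group and version are
// assumptions based on the v1alpha2 API and should be verified.
var inferencePoolGVK = schema.GroupVersionKind{
	Group:   "inference.networking.x-k8s.io",
	Version: "v1alpha2",
	Kind:    "InferencePool",
}

// inferencePoolReconciler is a placeholder that only logs observed objects.
type inferencePoolReconciler struct {
	client client.Client
}

func (r *inferencePoolReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	pool := &unstructured.Unstructured{}
	pool.SetGroupVersionKind(inferencePoolGVK)
	if err := r.client.Get(ctx, req.NamespacedName, pool); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}
	log.Printf("observed InferencePool %s/%s", req.Namespace, req.Name)
	return reconcile.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}

	pool := &unstructured.Unstructured{}
	pool.SetGroupVersionKind(inferencePoolGVK)

	if err := ctrl.NewControllerManagedBy(mgr).
		For(pool).
		Complete(&inferencePoolReconciler{client: mgr.GetClient()}); err != nil {
		log.Fatal(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```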
