Some clarifications were made around traffic splitting and model redirects: the design for these isn't clear right now, and therefore they aren't supported. This updates our design doc to remove that requirement.

Also added a section regarding setting status on InferencePool resources.
`docs/proposals/gateway-inference-extension.md` (3 additions, 77 deletions)
````diff
@@ -57,81 +57,6 @@ The Go application could be built into the existing `nginx-gateway` binary, and
 
 See the [Alternatives section](#alternatives) for a future improvement to this workflow.
 
-### Model Name extraction
-
-When a client sends a request to an AI workload, the desired model name (e.g. gpt-4o, llama, etc.) is included in the request body.
-
-By default, the EPP gets the model name from the request body, and then picks the proper endpoint for that model name. However, the model name could also be provided via header (`X-Gateway-Model-Name`). For example, a user could specify a desire for a traffic split or model name redirect, and therefore NGINX would need to change the model name by setting the header.
-
-Example that redirects requests to model name `food-review` to `food-review-v1`:
-
-```yaml
-kind: HTTPRoute
-apiVersion: gateway.networking.k8s.io/v1
-metadata:
-  name: my-route
-spec:
-  parentRefs:
-  - name: my-inference-gateway
-  rules:
-  - matches:
-    - headers:
-      - type: Exact
-        name: X-Gateway-Model-Name
-        value: food-review
-    backendRefs:
-    - name: vllm-llama3-8b-instruct
-      kind: InferencePool
-      group: inference.networking.x-k8s.io
-      filters:
-      - type: RequestHeaderModifier
-        requestHeaderModifier:
-          set:
-          - name: X-Gateway-Model-Name
-            value: food-review-v1
-```
-
-Example with traffic splitting:
-
-```yaml
-kind: HTTPRoute
-apiVersion: gateway.networking.k8s.io/v1
-metadata:
-  name: my-route
-spec:
-  parentRefs:
-  - name: my-inference-gateway
-  rules:
-  - matches:
-    - headers:
-      - type: Exact
-        name: X-Gateway-Model-Name
-        value: food-review
-    backendRefs:
-    - name: vllm-llama3-8b-instruct
-      kind: InferencePool
-      group: inference.networking.x-k8s.io
-      weight: 90
-      filters:
-      - type: RequestHeaderModifier
-        requestHeaderModifier:
-          set:
-          - name: X-Gateway-Model-Name
-            value: food-review-v1
-    - name: vllm-llama3-8b-instruct
-      kind: InferencePool
-      group: inference.networking.x-k8s.io
-      weight: 10
-      filters:
-      - type: RequestHeaderModifier
-        requestHeaderModifier:
-          set:
-          - name: X-Gateway-Model-Name
-            value: food-review-v2
-```
-
-In both cases, NGINX would need to extract the model name from the request body. This will probably require an NJS module. If that model name matches the condition set in the Route, then NGINX sets the header appropriately when sending the request to the EPP. For the redirect example, NGINX would set the header to `food-review-v1`. For the traffic splitting example, NGINX would set the header to either `food-review-v1` or `food-review-v2` depending on the weighted traffic decision.
-
 ### Managing InferencePools
 
 By default, the EPP should know which endpoints are a part of an InferencePool, and then pick the correct endpoint to send to. This means that NGINX does not need to have an upstream for the AI workload servers, since it just gets the endpoint it needs to send to from the EPP.
````
```diff
@@ -140,7 +65,9 @@ However, there could still be a valid use case for NGF to track and configure NG
 
 Because of this, NGF should watch the endpoints associated with an InferencePool, and create an upstream. One way to accomplish this is for NGF to create a Headless "shadow" Service that encompasses those endpoints. By defining this Service, NGF can use all of its existing Service/EndpointSlice logic to build the upstreams as if it was a normal Service.
 
-**The main point of concern with this is how can we fallback to use the upstream servers if the EPP is unavailable to give us an endpoint?** This may have to be discovered during implementation.
+#### Status
+
+Status conditions also need to be set on the InferencePool resources, per the API spec requirements and recommendations.
 
 ### Flow Diagram
 
```
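To make the Headless "shadow" Service idea in the hunk above more concrete, here is a minimal sketch of the kind of Service NGF might create for an InferencePool. This is illustration only, not part of the proposal's diff: the name, labels, selector, and port shown are hypothetical and in practice would be derived from the InferencePool's spec.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct-shadow  # hypothetical name derived from the InferencePool name
  labels:
    gateway.nginx.org/owned-by: nginx-gateway-fabric  # hypothetical ownership label
spec:
  clusterIP: None  # headless: no virtual IP, only EndpointSlices
  selector:
    app: vllm-llama3-8b-instruct  # assumed to be copied from the InferencePool's selector
  ports:
  - port: 8000  # assumed to match the InferencePool's target port for the model servers
    protocol: TCP
```

Because the Service is headless and selector-based, Kubernetes generates EndpointSlices for the matching model-server Pods, which NGF can consume with its existing Service/EndpointSlice logic to build the upstream.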
```diff
@@ -233,4 +160,3 @@ If this Inference Extension feature gains traction and usage, it could be worth
```
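For the Status section added in the second hunk above, a rough sketch of the kind of condition NGF might set on an InferencePool follows. The exact field names, condition types, and reasons are defined by the Inference Extension API spec and are not confirmed here; this is only an assumed, Gateway API-style shape with a hypothetical parent Gateway name.

```yaml
# Assumed, Gateway API-style shape; actual fields and condition types come from
# the Inference Extension API spec.
status:
  parent:
  - parentRef:
      name: my-inference-gateway  # hypothetical Gateway that references this InferencePool via an HTTPRoute
    conditions:
    - type: Accepted              # assumed condition type
      status: "True"
      reason: Accepted
      message: InferencePool has been accepted by the Gateway
      observedGeneration: 1
      lastTransitionTime: "2024-01-01T00:00:00Z"
```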