@@ -172,22 +254,39 @@ type InferenceModelSpec struct {
   // If not specified, the target model name is defaulted to the ModelName parameter.
   // ModelName is often in reference to a LoRA adapter.
   TargetModels []TargetModel
-  // Reference to the InferencePool that the model registers to. It must exist in the same namespace.
-  PoolReference *LocalObjectReference
+  // PoolRef is a reference to the inference pool, the pool must exist in the same namespace.
+  PoolRef PoolObjectReference
+}
+
+// PoolObjectReference identifies an API object within the namespace of the
+// referrer.
+type PoolObjectReference struct {
+  // Group is the group of the referent.
+  Group Group
+
+  // Kind is kind of the referent. For example "InferencePool".
+  Kind Kind
+
+  // Name is the name of the referent.
+  Name ObjectName
 }
 
 // Defines how important it is to serve the model compared to other models.
 // Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field should ALWAYS be optional (use a pointer), and set no default.
 // This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
 type Criticality string
 const (
-  // Most important. Requests to this band will be shed last.
-  Critical Criticality = "Critical"
-  // More important than Sheddable, less important than Critical.
-  // Requests in this band will be shed before critical traffic.
-  Default Criticality = "Default"
-  // Least important. Requests to this band will be shed before all other bands.
-  Sheddable Criticality = "Sheddable"
+  // Critical defines the highest level of criticality. Requests to this band will be shed last.
+  Critical Criticality = "Critical"
+
+  // Standard defines the base criticality level and is more important than Sheddable but less
+  // important than Critical. Requests in this band will be shed before critical traffic.
+  // Most models are expected to fall within this band.
+  Standard Criticality = "Standard"
+
+  // Sheddable defines the lowest level of criticality. Requests to this band will be shed before
+  // all other bands.
+  Sheddable Criticality = "Sheddable"
 )
 
 // TargetModel represents a deployed model or a LoRA adapter. The
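Because any reference to Criticality must stay an optional pointer with no default, each consumer of the API decides for itself how to resolve an unset value. A minimal sketch of that pattern, assuming the enclosing InferenceModelSpec declares a `Criticality *Criticality` field above this hunk, and choosing, for illustration only, to treat nil as Standard (the band most models are expected to fall within):

```go
// effectiveCriticality resolves an optional criticality for one consumer.
// The nil-handling policy here (fall back to Standard) is an illustrative
// assumption of this sketch, not a default mandated by the API.
func effectiveCriticality(c *Criticality) Criticality {
  if c == nil {
    return Standard
  }
  return *c
}
```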
@@ -200,64 +299,62 @@ const (
 type TargetModel struct {
   // The name of the adapter as expected by the ModelServer.
   Name string
-  // Weight is used to determine the percentage of traffic that should be
+  // Weight is used to determine the percentage of traffic that should be
   // sent to this target model when multiple versions of the model are specified.
-  Weight *int
+  Weight *int32
 }
 
-// LocalObjectReference identifies an API object within the namespace of the
-// referrer.
-type LocalObjectReference struct {
-  // Group is the group of the referent.
-  Group Group
-
-  // Kind is kind of the referent. For example "InferencePool".
-  Kind Kind
-
-  // Name is the name of the referent.
-  Name ObjectName
+// InferenceModelStatus defines the observed state of InferenceModel
+type InferenceModelStatus struct {
+  // Conditions track the state of the InferenceModel.
+  Conditions []metav1.Condition
 }
-
 ```
 
 ### Yaml Examples
 
 #### InferencePool(s)
 Here we create a pool that selects the appropriate pods
 ```yaml
-apiVersion: inference.x-k8s.io/v1alpha1
+apiVersion: inference.x-k8s.io/v1alpha2
 kind: InferencePool
 metadata:
   name: base-model-pool
-  modelServerSelector:
-  - app: llm-server
+spec:
+  selector:
+    app: llm-server
+  targetPortNumber: 8080
+  extensionRef:
+    name: infra-backend-v1-app
 ```
 
 #### InferenceModel
 
 Here we consume the pool with two InferenceModels, where `sql-code-assist` is both the name of the model and the name of the LoRA adapter on the model server, and `npc-bot` has a layer of indirection for those names, as well as a specified criticality. Both `sql-code-assist` and `npc-bot` have available LoRA adapters on the InferencePool, and routing to each InferencePool happens earlier (at the K8s Gateway).
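A sketch of how these two manifests could look against the v1alpha2 API above; the adapter names, the weights, and the choice of `Critical` for `npc-bot` are illustrative assumptions, only the field names come from the spec:

```yaml
apiVersion: inference.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: sql-code-assist
spec:
  modelName: sql-code-assist   # also the adapter name, so no targetModels needed
  poolRef:
    name: base-model-pool
---
apiVersion: inference.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: npc-bot
spec:
  modelName: npc-bot
  criticality: Critical        # illustrative; any Criticality value could be set
  poolRef:
    name: base-model-pool
  targetModels:                # indirection: adapter names differ from modelName
  - name: npc-bot-v1           # hypothetical adapter names and weights
    weight: 60
  - name: npc-bot-v2
    weight: 40
```

Omitting `targetModels` on `sql-code-assist` relies on the defaulting noted in the spec comments: the target model name falls back to the ModelName parameter.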