Revisiting the InferenceModel API and replacing it with InferenceObjectives

The `InferencePool` API is fairly lightweight right now, and what it conceptually represents seems to be well understood: “it is a special type of Service for inference deployments”.

The `InferenceModel` API, however, is less well understood. This issue proposes to revise the semantics and name of the `InferenceModel` API for the following reasons:
- The API name gives the impression that it represents and manages the deployment of a model/adapter, which is not true.
- The API specifies the model name as the only request matching rule. This means only one `InferenceModel` object can apply to the requests targeting a given model within a single `InferencePool`, which limits the ability to define different policies for different users or applications targeting the same model.
- Defining an `InferenceModel` for each adapter served by the pool can be an operational overhead. Consider the case where the pool is serving hundreds of adapters.

The proposal is to introduce a new API named `InferenceSchedulingObjective` that is focused on defining scheduling policies for a matching request flow. Concretely:
- Drop the `InferenceModel` API; the `InferencePool` API stays unchanged.
- Create a new API named `InferenceSchedulingObjective`. The API will be limited to defining endpoint-picking scheduling policies (aka serving objectives) for matching requests. The expectation is that the inference-scheduler (run by the EPP) will be the main controller acting on this API.
- Extend request matching beyond model name to include headers. This allows defining different scheduling policies for different request flows (apps or users) while targeting the same model.
- Allow defining a default `InferenceSchedulingObjective` per `InferencePool` as a fallback policy when no other objective matches the request.
- Traffic splitting is not part of the `InferenceSchedulingObjective` API. Traffic splitting is not an endpoint scheduling objective; it is a request routing objective. As we describe below, with some creativity, we can offload traffic splitting to `HTTPRoute`. An intended side effect of this is that users will be able to define different scheduling policies for different target models, something they can’t do with the current API.
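
To make the matching semantics above concrete, here is a minimal Python sketch of how a scheduler might pick an objective for a request. Everything in it is an assumption for illustration: the `Objective` class, its field names, the `select_objective` helper, the policy strings, the exact-match header comparison, and the "most header matches wins" tie-break are hypothetical and are not part of the proposed API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Objective:
    """Hypothetical stand-in for an InferenceSchedulingObjective."""
    name: str
    model: str = ""                      # model name to match (unused for the pool default)
    headers: Dict[str, str] = field(default_factory=dict)  # required header values
    policy: str = "placeholder"          # opaque placeholder for a scheduling policy

    def matches(self, model: str, headers: Dict[str, str]) -> bool:
        # A request matches when the model name is equal and every header
        # listed on the objective is present with the same value.
        # Simplification: header names are compared case-sensitively here.
        if self.model != model:
            return False
        return all(headers.get(k) == v for k, v in self.headers.items())


def select_objective(objectives: List[Objective], default: Objective,
                     model: str, headers: Dict[str, str]) -> Objective:
    """Return the matching objective, or the pool default when none matches."""
    candidates = [o for o in objectives if o.matches(model, headers)]
    if not candidates:
        return default
    # Assumption: the objective with the most header constraints is the most specific.
    return max(candidates, key=lambda o: len(o.headers))


if __name__ == "__main__":
    pool_default = Objective(name="pool-default", policy="policy-c")
    objectives = [
        Objective(name="chat-any", model="llama", policy="policy-a"),
        Objective(name="chat-premium", model="llama",
                  headers={"x-tier": "premium"}, policy="policy-b"),
    ]
    for req_model, req_headers in [("llama", {"x-tier": "premium"}),
                                   ("llama", {}),
                                   ("mistral", {})]:
        chosen = select_objective(objectives, pool_default, req_model, req_headers)
        print(req_model, req_headers, "->", chosen.name)
```

The sketch shows the two properties the list asks for: two request flows targeting the same model can resolve to different objectives based on headers, and a request that matches nothing falls back to the pool-level default.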


Please check the following doc for a more detailed discussion of the proposal: https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0

Revisiting the InferenceModel API and replacing it with InferenceObjectives #892
