Revisiting the InferenceModel API and replacing it with InferenceObjectives #892

@ahg-g

The InferencePool API is fairly lightweight right now, and what it conceptually represents seems to be well understood: “it is a special type of Service for inference deployments”.

The InferenceModel API, however, is less well understood. This issue proposes to revise the semantics and name of the InferenceModel API for the following reasons:

  • The API name gives the impression that it represents and manages the deployment of a model/adapter, which is not true.
  • The API specifies the model name as the only request matching rule. This means only one InferenceModel object can apply to the requests targeting a given model within a single InferencePool, which limits the ability to define different policies for different users or applications targeting the same model.
  • Defining an InferenceModel for each adapter served by the pool can be an operational overhead. Consider the case where the pool is serving hundreds of adapters.

The proposal is to introduce a new API named InferenceSchedulingObjective that is focused on defining scheduling policies for a matching request flow, concretely:

  • Drop the InferenceModel API; the InferencePool API stays unchanged.
  • Create a new API named InferenceSchedulingObjective. The API will be limited to defining endpoint-picking scheduling policies (aka serving objectives) for matching requests. The expectation is that the inference-scheduler (run by the EPP) will be the main controller acting on this API.
  • Extend request matching beyond model name to include headers. This allows defining different scheduling policies for different request flows (apps or users) while targeting the same model.
  • Allow defining a default InferenceSchedulingObjective per InferencePool as a fallback policy when no other objective matches the request.
  • Traffic splitting is not part of the InferenceSchedulingObjective API. Traffic splitting is a request routing objective, not an endpoint scheduling objective. As we describe below, with some creativity, we can offload traffic splitting to HTTPRoute. An intended side effect of this is that users will be able to define different scheduling policies for different target models, something they can’t do with the current API.
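
To make the matching semantics above concrete, here is a minimal Go sketch of how a scheduler might pick an objective for a request: match on model name plus headers, and fall back to the pool default when nothing else matches. Everything in it is an assumption for illustration: the `Objective` and `Request` types, their field names, the `Policy` values, the first-match-wins ordering, and the `x-app` header are hypothetical and are not the proposed API schema.

```go
package main

import "fmt"

// Request is a hypothetical, simplified view of an incoming inference request.
type Request struct {
	Model   string
	Headers map[string]string
}

// Objective is a hypothetical stand-in for an InferenceSchedulingObjective.
// Field names are illustrative assumptions, not the proposed schema.
type Objective struct {
	Name    string
	Model   string            // empty means "any model"
	Headers map[string]string // every listed header must match exactly
	Policy  string            // placeholder for the scheduling policy
	Default bool              // marks the per-pool fallback objective
}

// matches reports whether the request satisfies the model and header rules.
// Header names are compared case-sensitively here, which is a simplification.
func (o Objective) matches(r Request) bool {
	if o.Model != "" && o.Model != r.Model {
		return false
	}
	for name, want := range o.Headers {
		got, ok := r.Headers[name]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// selectObjective returns the first non-default objective that matches the
// request. If none matches, it returns the first default objective, if any.
func selectObjective(objs []Objective, r Request) (Objective, bool) {
	var fallback *Objective
	for i := range objs {
		if objs[i].Default {
			if fallback == nil {
				fallback = &objs[i]
			}
			continue
		}
		if objs[i].matches(r) {
			return objs[i], true
		}
	}
	if fallback != nil {
		return *fallback, true
	}
	return Objective{}, false
}

func main() {
	objs := []Objective{
		{Name: "chat-app", Model: "model-a", Headers: map[string]string{"x-app": "chat"}, Policy: "latency-sensitive"},
		{Name: "batch-app", Model: "model-a", Headers: map[string]string{"x-app": "batch"}, Policy: "best-effort"},
		{Name: "pool-default", Default: true, Policy: "standard"},
	}
	reqs := []Request{
		{Model: "model-a", Headers: map[string]string{"x-app": "chat"}},
		{Model: "model-a", Headers: map[string]string{"x-app": "batch"}},
		{Model: "model-b"},
	}
	for _, r := range reqs {
		if o, ok := selectObjective(objs, r); ok {
			fmt.Printf("model=%s -> objective=%s policy=%s\n", r.Model, o.Name, o.Policy)
		}
	}
}
```

In this sketch, two request flows targeting the same model get different policies because they carry different headers, while a request that matches no specific objective lands on the pool default. How ties between several matching objectives should be resolved is left open here.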

Please check the following doc for a more detailed discussion of the proposal: https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0

Labels

triage/accepted: Indicates an issue or PR is ready to be actively worked on.
