
Commit 1698146

docs: use inference gateway terminology (#891)

* use inference gateway terminology
* fixed comments
* fixed comments

1 parent 33cda4b commit 1698146

5 files changed, +62 −23 lines changed

site-src/concepts/api-overview.md

Lines changed: 14 additions & 2 deletions

@@ -1,12 +1,24 @@
 # API Overview

 ## Background
-The Gateway API Inference Extension project is an extension of the Kubernetes Gateway API for serving Generative AI models on Kubernetes. Gateway API Inference Extension facilitates standardization of APIs for Kubernetes cluster operators and developers running generative AI inference, while allowing flexibility for underlying gateway implementations (such as Envoy Proxy) to iterate on mechanisms for optimized serving of models.
+Gateway API Inference Extension optimizes self-hosting Generative AI Models on Kubernetes.
+It provides optimized load-balancing for self-hosted Generative AI Models on Kubernetes.
+The project’s goal is to improve and standardize routing to inference workloads across the ecosystem.

-<img src="/images/inference-overview.svg" alt="Overview of API integration" class="center" width="1000" />
+This is achieved by leveraging Envoy's [External Processing](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) to extend any gateway that supports both ext-proc and [Gateway API](https://github.com/kubernetes-sigs/gateway-api) into an [inference gateway](../index.md#concepts-and-definitions).
+It extends popular gateways like Envoy Gateway, kgateway, and GKE Gateway into an [Inference Gateway](../index.md#concepts-and-definitions),
+supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes.
+This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat)
+to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers
+in higher-level **AI Gateways** like [LiteLLM](https://www.litellm.ai/), [Gloo AI Gateway](https://www.solo.io/products/gloo-ai-gateway), or [Apigee](https://cloud.google.com/apigee).

 ## API Resources

+Gateway API Inference Extension introduces two inference-focused API resources with distinct responsibilities,
+each aligning with a specific user persona in the Generative AI serving workflow.
+
+<img src="/images/inference-overview.svg" alt="Overview of API integration" class="center" width="1000" />
+
 ### InferencePool

 InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. With InferencePool, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool) or go directly to the [InferencePool spec](/reference/spec/#inferencepool).
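
For orientation (and not part of this commit), an InferencePool manifest along the lines of the api-overview description above might look roughly like the following sketch. The pool name, labels, port, and Endpoint Picker service name are hypothetical, and field names can differ between API versions, so treat it as illustrative only.

```yaml
# Hypothetical sketch, not taken from this commit. Selects the model-server Pods
# by label, exposes their serving port, and points at an Endpoint Picker (EPP)
# extension that makes the routing decision for each request.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  selector:
    app: vllm-llama3-8b-instruct      # labels of the model-server Pods (assumed)
  targetPortNumber: 8000              # port the model servers listen on (assumed)
  extensionRef:
    name: vllm-llama3-8b-instruct-epp # Service of the Endpoint Picker (assumed)
```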

site-src/guides/index.md

Lines changed: 2 additions & 2 deletions

@@ -1,4 +1,4 @@
-# Getting started with Gateway API Inference Extension
+# Getting started with an Inference Gateway

 ??? example "Experimental"

@@ -98,7 +98,7 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencepool-resources.yaml
 ```

-### Deploy Inference Gateway
+### Deploy an Inference Gateway

 Choose one of the following options to deploy an Inference Gateway.

site-src/guides/serve-multiple-genai-models.md

Lines changed: 3 additions & 3 deletions

@@ -2,12 +2,12 @@
 A company wants to deploy multiple large language models (LLMs) to serve different workloads.
 For example, they might want to deploy a Gemma3 model for a chatbot interface and a Deepseek model for a recommendation application.
 The company needs to ensure optimal serving performance for these LLMs.
-Using Gateway API Inference Extension, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
+By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
 You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.

 ## How
-The following diagram illustrates how Gateway API Inference Extension routes requests to different models based on the model name.
-The model name is extarcted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
+The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
+The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
 from the request body to the header. The header is then matched to dispatch
 requests to different `InferencePool` (and their EPPs) instances.
 ![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)
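
As an illustrative sketch (not part of this commit), the header-based dispatch described above might be wired up with an HTTPRoute similar to the following. It assumes a Gateway named `inference-gateway`, that Body-Based routing copies the model name into an `X-Gateway-Model-Name` header, and two InferencePools named `gemma3-pool` and `deepseek-pool`; all of these names are assumptions, so adjust them to your deployment.

```yaml
# Hypothetical sketch: route by the model-name header that Body-Based routing
# extracts from the request body, sending each model to its own InferencePool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-by-model-name
spec:
  parentRefs:
  - name: inference-gateway           # assumed Gateway name
  rules:
  - matches:
    - headers:
      - name: X-Gateway-Model-Name    # assumed header populated by BBR
        value: chatbot
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: gemma3-pool               # assumed pool serving the chatbot model
  - matches:
    - headers:
      - name: X-Gateway-Model-Name
        value: recommender
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: deepseek-pool             # assumed pool serving the recommender model
```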

site-src/guides/serve-multiple-lora-adapters.md

Lines changed: 2 additions & 2 deletions

@@ -1,11 +1,11 @@
 # Serve LoRA adapters on a shared pool
 A company wants to serve LLMs for document analysis and focuses on audiences in multiple languages, such as English and Spanish.
 They have a fine-tuned LoRA adapter for each language, but need to efficiently use their GPU and TPU capacity.
-You can use Gateway API Inference Extension to deploy dynamic LoRA fine-tuned adapters for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator.
+You can use an Inference Gateway to deploy dynamic LoRA fine-tuned adapters for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator.
 This lets you reduce the number of required accelerators by densely packing multiple models in a shared pool.

 ## How
-The following diagram illustrates how Gateway API Inference Extension serves multiple LoRA adapters on a shared pool.
+The following diagram illustrates how an Inference Gateway serves multiple LoRA adapters on a shared pool.
 ![Serving LoRA adapters on a shared pool](../images/serve-LoRA-adapters.png)
 This example illustrates how you can densely serve multiple LoRA adapters with distinct workload performance objectives on a common InferencePool.
 ```yaml

site-src/index.md

Lines changed: 41 additions & 14 deletions

@@ -1,8 +1,6 @@
 # Introduction

-Gateway API Inference Extension is an official Kubernetes project focused on
-extending [Gateway API](https://gateway-api.sigs.k8s.io/) with inference
-specific routing extensions.
+Gateway API Inference Extension is an official Kubernetes project that optimizes self-hosting Generative Models on Kubernetes.

 The overall resource model focuses on 2 new inference-focused
 [personas](/concepts/roles-and-personas) and corresponding resources that
@@ -11,20 +9,49 @@ they are expected to manage:
 <!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
 <img src="/images/resource-model.png" alt="Gateway API Inference Extension Resource Model" class="center" width="550" />

+## Concepts and Definitions
+
+The following terms are specific to this project:
+
+- **Inference Gateway**: A proxy/load-balancer that has been coupled with the
+  Endpoint Picker extension. It provides optimized routing and load balancing for
+  serving Kubernetes self-hosted generative Artificial Intelligence (AI)
+  workloads. It simplifies the deployment, management, and observability of AI
+  inference workloads.
+- **Inference Scheduler**: An extendable component that makes decisions about which endpoint is optimal (best cost /
+  best performance) for an inference request based on `Metrics and Capabilities`
+  from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).
+- **Metrics and Capabilities**: Data provided by model serving platforms about
+  performance, availability and capabilities to optimize routing. Includes
+  things like [Prefix Cache] status or [LoRA Adapters] availability.
+- **Endpoint Picker (EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
+
+[Inference Gateway]:#concepts-and-definitions
+
 ## Key Features
-Gateway API Inference Extension, along with a reference implementation in Envoy Proxy, provides the following key features:
+Gateway API Inference Extension optimizes self-hosting Generative AI Models on Kubernetes.
+It provides optimized load-balancing for self-hosted Generative AI Models on Kubernetes.
+The project’s goal is to improve and standardize routing to inference workloads across the ecosystem.
+
+This is achieved by leveraging Envoy's [External Processing](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) to extend any gateway that supports both ext-proc and [Gateway API](https://github.com/kubernetes-sigs/gateway-api) into an [inference gateway](../index.md#concepts-and-definitions).
+It extends popular gateways like Envoy Gateway, kgateway, and GKE Gateway into an [Inference Gateway](../index.md#concepts-and-definitions),
+supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes.
+This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat)
+to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers
+in higher-level **AI Gateways** like [LiteLLM](https://www.litellm.ai/), [Gloo AI Gateway](https://www.solo.io/products/gloo-ai-gateway), or [Apigee](https://cloud.google.com/apigee).

-- **Model-aware routing**: Instead of simply routing based on the path of the request, Gateway API Inference Extension allows you to route to models based on the model names. This is enabled by support for GenAI Inference API specifications (such as OpenAI API) in the gateway implementations such as in Envoy Proxy. This model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.

-- **Serving priority**: Gateway API Inference Extension allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which is more latency sensitive) have a higher [*Criticality*](/reference/spec/#criticality) than a model for latency tolerant tasks such as a summarization.
+- **Model-aware routing**: Instead of simply routing based on the path of the request, an **[inference gateway]** allows you to route to models based on the model names. This is enabled by support for GenAI Inference API specifications (such as OpenAI API) in the gateway implementations such as in Envoy Proxy. This model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.

-- **Model rollouts**: Gateway API Inference Extension allows you to incrementally roll out new model versions by traffic splitting definitions based on the model names.
+- **Serving priority**: an **[inference gateway]** allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which is more latency sensitive) have a higher [*Criticality*](/reference/spec/#criticality) than a model for latency tolerant tasks such as a summarization.

-- **Extensibility for Inference Services**: Gateway API Inference Extension defines extensibility pattern for additional Inference services to create bespoke routing capabilities should out of the box solutions not fit your needs.
+- **Model rollouts**: an **[inference gateway]** allows you to incrementally roll out new model versions by traffic splitting definitions based on the model names.

+- **Extensibility for Inference Services**: an **[inference gateway]** defines an extensibility pattern for additional Inference services to create bespoke routing capabilities should out of the box solutions not fit your needs.

-- **Customizable Load Balancing for Inference**: Gateway API Inference Extension defines a pattern for customizable load balancing and request routing that is optimized for Inference. Gateway API Inference Extension provides a reference implementation of model endpoint picking leveraging metrics emitted from the model servers. This endpoint picking mechanism can be used in lieu of traditional load balancing mechanisms. Model Server-aware load balancing ("smart" load balancing as its sometimes referred to in this repo) has been proven to reduce the serving latency and improve utilization of accelerators in your clusters.
+- **Customizable Load Balancing for Inference**: an **[inference gateway]** defines a pattern for customizable load balancing and request routing that is optimized for Inference. An **[inference gateway]** provides a reference implementation of model endpoint picking leveraging metrics emitted from the model servers. This endpoint picking mechanism can be used in lieu of traditional load balancing mechanisms. Model Server-aware load balancing ("smart" load balancing as it's sometimes referred to in this repo) has been proven to reduce the serving latency and improve utilization of accelerators in your clusters.

+By achieving these, the project aims to reduce latency and improve accelerator (GPU) utilization for AI workloads.

 ## API Resources

@@ -42,7 +69,7 @@ that are relevant to this project:
 Gateway API has [more than 25
 implementations](https://gateway-api.sigs.k8s.io/implementations/). As this
 pattern stabilizes, we expect a wide set of these implementations to support
-this project.
+this project to become an **[inference gateway]**.

 ### Endpoint Picker

@@ -71,16 +98,16 @@ to any Gateway API users or implementers.
 2. If the request should be routed to an InferencePool, the Gateway will forward
 the request information to the endpoint selection extension for that pool.

-3. The extension will fetch metrics from whichever portion of the InferencePool
+3. The inference gateway will fetch metrics from whichever portion of the InferencePool
 endpoints can best achieve the configured objectives. Note that this kind of
-metrics probing may happen asynchronously, depending on the extension.
+metrics probing may happen asynchronously, depending on the inference gateway.

-4. The extension will instruct the Gateway which endpoint the request should be
+4. The inference gateway will instruct the Gateway which endpoint the request should be
 routed to.

 5. The Gateway will route the request to the desired endpoint.

-<img src="/images/request-flow.png" alt="Gateway API Inference Extension Request Flow" class="center" />
+<img src="/images/request-flow.png" alt="Inference Gateway Request Flow" class="center" />


 ## Who is working on Gateway API Inference Extension?
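
For illustration (not part of this commit), the serving-priority and model-rollout features listed above could be expressed with an InferenceModel resource roughly like the sketch below, which marks a model as Critical and splits its traffic between two versions. The resource shape reflects the v1alpha2 API as an assumption, and every name and weight here is hypothetical.

```yaml
# Hypothetical sketch: a "chatbot" model with high serving priority whose
# traffic is gradually shifted from v1 to v2 by weighted target models.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot            # model name clients put in their requests
  criticality: Critical         # serving priority relative to other models
  poolRef:
    name: gemma3-pool           # assumed InferencePool backing this model
  targetModels:
  - name: chatbot-v1
    weight: 90                  # 90% of traffic stays on the current version
  - name: chatbot-v2
    weight: 10                  # 10% canaries the new version
```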
