
Commit ddee4c7

Update guide, README, and values.yaml
1 parent 678f608 commit ddee4c7

File tree

config/charts/inferencepool/README.md
config/charts/inferencepool/values.yaml
site-src/guides/slo-aware-routing.md

3 files changed: +44 −21 lines

config/charts/inferencepool/README.md

Lines changed: 1 addition & 14 deletions
@@ -123,20 +123,7 @@ $ helm install triton-llama3-8b-instruct \
 
 ### Install with SLO-Aware Routing
 
-To enable SLO-aware routing, you must enable the latency predictor, which is deployed as a set of sidecar containers alongside the Endpoint Picker. When the latency predictor is enabled, the `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
-
-You can enable the latency predictor by setting `inferenceExtension.latencyPredictor.enabled` to `true` in your `values.yaml` file, or by using the `--set` flag on the command line.
-
-Here is an example of how to install the chart with SLO-aware routing enabled:
-
-```txt
-$ helm install vllm-llama3-8b-instruct . \
-  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
-  --set inferenceExtension.monitoring.gke.enabled=true \
-  --set inferenceExtension.latencyPredictor.enabled=true \
-  --set provider.name=gke \
-  -f values.yaml
-```
+For full details, see the dedicated [SLO-Aware Routing Guide](../../../site-src/guides/slo-aware-routing.md).
 
 #### SLO-Aware Router Environment Variables
 
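For reference, the `inferenceExtension.latencyPredictor.enabled` flag mentioned in the removed text above can also be set in a values file. The snippet below is a minimal sketch of that values entry, assuming only the key path given by the flag; it is not an excerpt of the full chart:

```yaml
# Minimal sketch of a values override; only the key path
# inferenceExtension.latencyPredictor.enabled comes from the docs above.
inferenceExtension:
  latencyPredictor:
    enabled: true   # deploys the latency predictor sidecars and enables the SLO-aware plugins
```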

config/charts/inferencepool/values.yaml

Lines changed: 2 additions & 2 deletions
@@ -79,7 +79,7 @@ inferenceExtension:
     trainingServer:
       image:
         hub: path/to/your/docker/repo # NOTE: Update with your Docker repository path for sidecars
-        name: latencypredictor-v3-training-server
+        name: latencypredictor-training-server
         tag: latest
         pullPolicy: Always
       port: 8000
@@ -120,7 +120,7 @@ inferenceExtension:
       startPort: 8001
       image:
         hub: path/to/your/docker/repo # NOTE: Update with your Docker repository path for sidecars
-        name: latencypredictor-v3-prediction-server
+        name: latencypredictor-prediction-server
         tag: latest
         pullPolicy: Always
       resources:
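For orientation, the values touched above with the repository placeholder filled in might look like the sketch below. The `us-docker.pkg.dev/PROJECT_ID/REPOSITORY` format is taken from the SLO-Aware Routing Guide; the `predictionServer` key name and the exact nesting under `inferenceExtension.latencyPredictor` are assumptions, not confirmed chart structure:

```yaml
# Hypothetical excerpt of values.yaml after replacing the repository placeholder.
# PROJECT_ID/REPOSITORY must match the repository used to build the sidecar images.
inferenceExtension:
  latencyPredictor:
    trainingServer:
      image:
        hub: us-docker.pkg.dev/PROJECT_ID/REPOSITORY
        name: latencypredictor-training-server
        tag: latest
    predictionServer:   # key name assumed from the prediction server image name
      image:
        hub: us-docker.pkg.dev/PROJECT_ID/REPOSITORY
        name: latencypredictor-prediction-server
        tag: latest
```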

site-src/guides/slo-aware-routing.md

Lines changed: 41 additions & 5 deletions
@@ -22,9 +22,9 @@ The SLO-aware routing feature is implemented as a plugin for the Endpoint Picker
 
 To use SLO-aware routing, you need to include the following headers in your inference requests:
 
-- `x-prediction-based-scheduling`: Set to `true` to enable SLO-aware routing for the request.
+- `x-prediction-based-scheduling`: Set to `true` to enable SLO-aware routing for the request. Setting this to `false` or omitting the header uses non-SLO routing, but the request's latency data is still used to train the predictor.
 - `x-slo-ttft-ms`: The Time to First Token SLO in milliseconds.
-- `x-slo-tpot-ms`: The Time Per Output Token SLO in milliseconds.
+- `x-slo-tpot-ms`: The Time Per Output Token SLO in milliseconds (this is vLLM's equivalent of ITL; it is **not** NTPOT).
 
 ## Headroom Selection Strategies
 
@@ -42,13 +42,49 @@ The selection strategy can be configured via the `HEADROOM_SELECTION_STRATEGY` e
 
 ### Prerequisites
 
-Before you begin, ensure you have a functional Inference Gateway with at least one model server deployed. If you haven't set this up yet, please follow the [Getting started guide](../index.md).
+Before you begin, ensure you have a functional Inference Gateway with at least one model server deployed. If you haven't set this up yet, please follow the [Getting Started Guide](./getting-started-latest.md).
 
 ### Deployment
 
-To use SLO-aware routing, you must deploy the Endpoint Picker with the latency predictor sidecars. This can be done via the Helm chart by setting the `inferenceExtension.latencyPredictor.enabled` flag to `true`. When this flag is set, the necessary `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
+To enable SLO-aware routing, you must build the images for the training and prediction sidecars and enable the latency predictor in the chart; the sidecars are then deployed as containers alongside the Endpoint Picker. When the latency predictor is enabled, the `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
 
-For specific deployment instructions and details on configuring environment variables for SLO-aware routing, refer to the [InferencePool Helm Chart README](../../config/charts/inferencepool/README.md#slo-aware-router-environment-variables).
+#### Steps
+
+1. Build the prediction and training server sidecar images from inside the `latencypredictor` package. See the [Latency Predictor - Build Guide](../../../latencypredictor/README.md) for instructions.
+
+2. Set your Docker repository path by replacing the placeholders in the Helm chart [values.yaml](../../../config/charts/inferencepool/values.yaml) with the repository you used to build the sidecars in step 1, in the format `us-docker.pkg.dev/PROJECT_ID/REPOSITORY`.
+
+3. Deploy the chart with the latency predictor enabled by setting `inferenceExtension.latencyPredictor.enabled` to `true` in your `values.yaml` file, or by using the `--set` flag on the command line:
+
+```txt
+helm install vllm-llama3-8b-instruct . \
+  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+  --set inferenceExtension.monitoring.gke.enabled=true \
+  --set inferenceExtension.latencyPredictor.enabled=true \
+  --set provider.name=gke \
+  -f values.yaml
+```
+
+After these steps, the Inference Gateway will be prepared to predict, train, and route requests based on their SLOs.
+
+For details on configuring specific environment variables for SLO-aware routing, refer to the [InferencePool Helm Chart README](../../config/charts/inferencepool/README.md#slo-aware-router-environment-variables).
+
+### Sending Requests
+
+To send a request with SLO-aware routing, specify the request's SLOs and whether to route on them in the request headers. See the [Request Headers](#request-headers) section above.
+
+If you have a standard setup from the [Getting Started Guide](./getting-started-latest.md) and have followed the steps above, the following is an example inference request with SLOs specified and routing enabled:
+
+```txt
+export GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'):80
+
+curl -v $GW_IP/v1/completions -H 'Content-Type: application/json' -H 'x-slo-ttft-ms: 100' -H 'x-slo-tpot-ms: 100' -H 'x-prediction-based-scheduling: true' -d '{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "prompt": "Write as if you were a critic: San Francisco where the ",
+  "max_tokens": 100,
+  "temperature": 0, "stream": true, "stream_options": {"include_usage": true}
+}'
+```
 
 ## Monitoring
 
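As a companion to the request example added above, a request that opts out of SLO-based scheduling might look like the sketch below. Per the header description in this commit, setting `x-prediction-based-scheduling` to `false` (or omitting it) falls back to non-SLO routing while the observed latencies still feed the predictor's training. It reuses the `GW_IP` variable from the example above and assumes the SLO headers may be omitted when scheduling is disabled; the payload is illustrative only:

```txt
# Hypothetical variant of the request above: SLO-based scheduling disabled,
# but the measured latencies are still reported to the latency predictor.
curl -v $GW_IP/v1/completions \
  -H 'Content-Type: application/json' \
  -H 'x-prediction-based-scheduling: false' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Write as if you were a critic: San Francisco where the ",
    "max_tokens": 100,
    "temperature": 0
  }'
```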