To enable SLO-aware routing, you must enable the latency predictor, which is deployed as a set of sidecar containers alongside the Endpoint Picker. When the latency predictor is enabled, the `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
You can enable the latency predictor by setting `inferenceExtension.latencyPredictor.enabled` to `true` in your `values.yaml` file, or by using the `--set` flag on the command line.
Here is an example of how to install the chart with SLO-aware routing enabled:
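A minimal sketch, assuming you are installing the `inferencepool` chart from this repository's `config/charts/inferencepool` directory (the release name is a placeholder, and any other values your environment requires are omitted):

```bash
# Install the InferencePool chart with the latency predictor enabled,
# which also turns on the SLO-aware routing plugins.
helm install my-inferencepool ./config/charts/inferencepool \
  --set inferenceExtension.latencyPredictor.enabled=true
```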
File: `site-src/guides/slo-aware-routing.md` (41 additions, 5 deletions)
The SLO-aware routing feature is implemented as a plugin for the Endpoint Picker.

### Request Headers

To use SLO-aware routing, you need to include the following headers in your inference requests:
- `x-prediction-based-scheduling`: Set to `true` to enable SLO-aware routing for the request. Setting this to `false` or omitting the header uses non-SLO routing, but the request's latency data is still used to train the predictor.
- `x-slo-ttft-ms`: The Time to First Token SLO in milliseconds.
- `x-slo-tpot-ms`: The Time Per Output Token SLO in milliseconds (this is vLLM's equivalent of ITL; it is **not** NTPOT).
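For illustration, a request that opts into SLO-aware routing might carry headers such as the following (the millisecond values are arbitrary examples, not recommendations):

```txt
x-prediction-based-scheduling: true
x-slo-ttft-ms: 200
x-slo-tpot-ms: 50
```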
## Headroom Selection Strategies

The selection strategy can be configured via the `HEADROOM_SELECTION_STRATEGY` environment variable.
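As a rough sketch only (the surrounding manifest is omitted and the strategy value is a placeholder; see the chart README referenced later in this guide for the supported options), the variable is set like any other container environment variable on the Endpoint Picker:

```yaml
# Placeholder snippet: an ordinary Kubernetes env entry on the EPP container.
env:
  - name: HEADROOM_SELECTION_STRATEGY
    value: "<strategy-name>"  # replace with a supported strategy
```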
### Prerequisites
Before you begin, ensure you have a functional Inference Gateway with at least one model server deployed. If you haven't set this up yet, please follow the [Getting Started Guide](./getting-started-latest.md).
### Deployment

To enable SLO-aware routing, you must enable the latency predictor in the chart and build the images for the training and prediction sidecars, which are deployed as containers alongside the Endpoint Picker. When the latency predictor is enabled, the `slo-aware-routing` and `slo-aware-profile-handler` plugins are automatically configured.
#### Steps:
1. Build the predictor and sidecar images from inside the `latencypredictor` package. See the [Latency Predictor - Build Guide](../../../latencypredictor/README.md) for instructions.
2. Set your Docker repository path by replacing the placeholders (in the format `us-docker.pkg.dev/PROJECT_ID/REPOSITORY`) in the Helm chart [values.yaml](../../../config/charts/inferencepool/values.yaml) with the repository you used to build the sidecars in step 1.
3. Deploy the chart with the latency predictor enabled by setting `inferenceExtension.latencyPredictor.enabled` to `true` in your `values.yaml` file, or by using the `--set` flag on the command line:
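    For example, a minimal sketch of the relevant `values.yaml` entry (other settings your deployment needs are omitted):

    ```yaml
    # Enable the latency predictor sidecars; this also configures the
    # slo-aware-routing and slo-aware-profile-handler plugins automatically.
    inferenceExtension:
      latencyPredictor:
        enabled: true
    ```

    Then install or upgrade the chart as usual, for example with `helm upgrade --install`.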
After these steps, the Inference Gateway will be ready to predict latencies, train the predictor, and route requests based on their SLOs.
For details on configuring specific environment variables for SLO-aware routing, refer to the [InferencePool Helm Chart README](../../config/charts/inferencepool/README.md#slo-aware-router-environment-variables).
### Sending Requests
To send a request with SLO-aware routing, specify the request SLOs and whether to enable prediction-based scheduling in the request headers. See the [Request Headers](#request-headers) section above.
If you have a standard setup from the [Getting Started Guide](./getting-started-latest.md) and have followed the steps above, the following is an example inference request with SLOs specified and routing enabled:
```bash
export GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'):80
```
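A full request might then look like the following sketch (the endpoint path and model name follow the Getting Started setup and are placeholders here, and the SLO values are arbitrary examples):

```bash
curl -i "http://${GW_IP}/v1/completions" \
  -H 'Content-Type: application/json' \
  -H 'x-prediction-based-scheduling: true' \
  -H 'x-slo-ttft-ms: 200' \
  -H 'x-slo-tpot-ms: 50' \
  -d '{
    "model": "<your-model-name>",
    "prompt": "Write a short poem about inference gateways.",
    "max_tokens": 100
  }'
```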