diff --git a/site-src/guides/multi-pool-setup.md b/site-src/guides/multi-pool-setup.md
new file mode 100644
index 000000000..c68f4afa1
--- /dev/null
+++ b/site-src/guides/multi-pool-setup.md
@@ -0,0 +1,92 @@
+# Multi-Pool Setup
+
+This guide describes how to set up multiple InferencePools within a gateway. It assumes familiarity with the concepts and resources covered in the [Getting Started](index.md) guide.
+
+## **Prerequisites**
+
+- A fork of https://github.com/kubernetes-sigs/gateway-api-inference-extension
+
+## **Steps**
+
+In this guide we will create 2 InferencePools, but the same steps apply to any number.
+
+1. Create 2 model server deployments with _distinct_ labels. If acquiring adequate accelerators is a blocker, consider the lightweight [sim server](https://github.com/llm-d/llm-d-inference-sim). A sketch of such a pair of deployments appears at the end of this guide.
+
+2. Using helm, we will install the chart *twice*, once per pool:
+
+**NOTE:** To confirm requests are sent to each pool, it is recommended to use the [values](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/charts/inferencepool/values.yaml) file and set the log verbosity flag to 3.
+
+```bash
+helm install <pool-1-name> \
+  --set inferencePool.modelServers.matchLabels.<pool-1-label-key>=<pool-1-label-value> \
+  --set provider.name=$GATEWAY_PROVIDER \
+  --version $IGW_CHART_VERSION \
+  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
+```
+
+```bash
+helm install <pool-2-name> \
+  --set inferencePool.modelServers.matchLabels.<pool-2-label-key>=<pool-2-label-value> \
+  --set provider.name=$GATEWAY_PROVIDER \
+  --version $IGW_CHART_VERSION \
+  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
+```
+
+3. Using HTTPRoute, we will create 2 match rules, each corresponding to an InferencePool.
+
+**NOTE:** You could create an HTTPRoute per InferencePool; in this example, a single HTTPRoute is used. The yaml below is not complete, but shows how 2 pools can be routed to. Both pools use the same path, but pool 2 additionally requires that a `pool-name: <pool-2-name>` header be present.
+
+This allows catch-all behavior for pool 1.
+
+```yaml
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+
+...
+
+rules:
+- backendRefs:
+  - group: inference.networking.k8s.io
+    kind: InferencePool
+    name: <pool-1-name>
+  matches:
+  - path:
+      type: PathPrefix
+      value: /
+- backendRefs:
+  - group: inference.networking.k8s.io
+    kind: InferencePool
+    name: <pool-2-name>
+  matches:
+  - path:
+      type: PathPrefix
+      value: /
+    headers:
+    - type: Exact
+      name: pool-name
+      value: <pool-2-name>
+```
+
+4. Run requests to each pool.
+
+Each InferencePool should now be able to serve traffic. To validate, we can run the following curl commands:
+
+```bash
+curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"prompt": "Write as if you were a critic: San Francisco",
+"max_tokens": 100,
+"temperature": 0
+}'
+```
+
+```bash
+curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -H 'pool-name: <pool-2-name>' -d '{
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"prompt": "Write as if you were a critic: San Francisco",
+"max_tokens": 100,
+"temperature": 0
+}'
+```
+
+Each of these requests should have gone to a different pool, which can be validated by checking the logs of each EPP that was deployed, as shown below.
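+To check those logs, a minimal sketch: the deployment names below are placeholders, since the EPP deployment name created by each helm release may differ in your cluster, so list the deployments first and substitute the real names.
+
+```bash
+# List deployments to find the EPP deployment created by each helm release.
+kubectl get deployments
+
+# Tail each EPP's logs while sending the curl requests above.
+# <pool-1-epp-deployment> and <pool-2-epp-deployment> stand in for the
+# names found in the previous command.
+kubectl logs deployment/<pool-1-epp-deployment> -f
+kubectl logs deployment/<pool-2-epp-deployment> -f
+```
+
+With the chart's log verbosity raised to 3 (see the NOTE in step 2), each EPP logs the requests it handles, so the first curl should show up only in pool 1's EPP logs and the second only in pool 2's.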
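+Finally, for step 1, here is a minimal sketch of a pair of model server deployments with distinct labels, using the sim server. All names and labels are illustrative, and the image path and arguments are assumptions; consult the [sim server](https://github.com/llm-d/llm-d-inference-sim) repository for the current image and flags. The `app` label on each Deployment is what you would pass to `inferencePool.modelServers.matchLabels` in step 2.
+
+```yaml
+# Pool 1's model server: the app label is what the first InferencePool selects.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: pool-1-model-server
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: pool-1-model-server
+  template:
+    metadata:
+      labels:
+        app: pool-1-model-server
+    spec:
+      containers:
+      - name: sim
+        # Assumed image path and flags; check the llm-d-inference-sim repo.
+        image: ghcr.io/llm-d/llm-d-inference-sim:latest
+        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
+        ports:
+        - containerPort: 8000
+---
+# Pool 2's model server: identical apart from the name and label.
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: pool-2-model-server
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: pool-2-model-server
+  template:
+    metadata:
+      labels:
+        app: pool-2-model-server
+    spec:
+      containers:
+      - name: sim
+        # Assumed image path and flags; check the llm-d-inference-sim repo.
+        image: ghcr.io/llm-d/llm-d-inference-sim:latest
+        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
+        ports:
+        - containerPort: 8000
+```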