## Quickstart

This quickstart guide is intended for engineers familiar with Kubernetes and model servers (vLLM in this instance). The goal of this guide is to get your first, single InferencePool up and running!

### Requirements
 - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
 - A cluster with:
   - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running; see the check below.) For example, with Kind,
     you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
   - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed.
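
To confirm these requirements are met, you can check that the Envoy Gateway controller is running and that your cluster hands out addresses for `LoadBalancer` Services. The deployment namespace below is the Envoy Gateway default (the same one used later in this guide):

```bash
# The Envoy Gateway controller should be Running in its default namespace.
kubectl get pods -n envoy-gateway-system

# LoadBalancer Services should receive an external IP rather than staying <pending>.
kubectl get svc -A | grep LoadBalancer
```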

### Steps

1. **Deploy Sample Model Server**

   Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model.
   Then deploy a sample vLLM Deployment that speaks the proper protocol to work with the LLM Instance Gateway.
   ```bash
   kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face token with access to Llama 2
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/vllm/deployment.yaml
   ```
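
   The model weights are pulled from Hugging Face at startup, so the pods may take several minutes to become Ready. You can watch them come up before continuing:

   ```bash
   # Wait until all model server replicas report Running and Ready.
   kubectl get pods -w
   ```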

1. **Install the Inference Extension CRDs:**

   ```sh
   kubectl apply -k https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd
   ```
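
   You can confirm the InferencePool and InferenceModel CRDs were registered before moving on (a quick sanity check; both CRD names contain "inference"):

   ```bash
   kubectl get crds | grep -i inference
   ```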

1. **Deploy InferenceModel**

   Deploy the sample InferenceModel, which is configured to load balance traffic between the `tweet-summary-0` and `tweet-summary-1`
   [LoRA adapters](https://docs.vllm.ai/en/latest/features/lora.html) of the sample model server.
   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/inferencemodel.yaml
   ```
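
   Once applied, the resource should be listed via the CRDs installed earlier (assuming the standard plural resource name):

   ```bash
   kubectl get inferencemodels
   ```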

1. **Update Envoy Gateway Config to enable Patch Policy**

   Our custom LLM Gateway ext-proc is patched into the existing Envoy Gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway ConfigMap. To do this, simply run:
   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/enable_patch_policy.yaml
   kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
   ```
   Additionally, if you would like to enable the admin interface, you can uncomment the admin lines in that manifest and run this again.
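
   To verify the feature took effect after the restart, you can inspect the Envoy Gateway configuration. This assumes the default ConfigMap name from the Envoy Gateway installation; you should see the patch policy feature flag (e.g. `enableEnvoyPatchPolicy: true`) in the rendered config:

   ```bash
   kubectl get configmap envoy-gateway-config -n envoy-gateway-system -o yaml
   ```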

1. **Deploy Gateway**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/gateway.yaml
   ```
   > **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy is very useful.***

   Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status:
   ```bash
   $ kubectl get gateway inference-gateway
   NAME                CLASS               ADDRESS         PROGRAMMED   AGE
   inference-gateway   inference-gateway   <MY_ADDRESS>    True         22s
   ```

1. **Deploy the Inference Extension and InferencePool**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/ext_proc.yaml
   ```
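
   As with the model server, you can confirm that the extension pods come up and that the InferencePool resource exists (again assuming the standard plural resource name):

   ```bash
   # The inference extension (ext-proc) pods should reach Running/Ready.
   kubectl get pods
   # The InferencePool created by the manifest should be listed.
   kubectl get inferencepools
   ```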

1. **Deploy Envoy Gateway Custom Policies**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/extension_policy.yaml
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/patch_policy.yaml
   ```
   > **_NOTE:_** These policies are also per-InferencePool and will need to be configured to support a new pool should you wish to experiment further.
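
   To check that Envoy Gateway accepted the policies, inspect their status conditions. This assumes the manifests above create an `EnvoyExtensionPolicy` and the `EnvoyPatchPolicy` mentioned earlier:

   ```bash
   kubectl describe envoyextensionpolicy
   kubectl describe envoypatchpolicy
   ```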

1. **OPTIONALLY**: Apply Traffic Policy

   For high-traffic benchmarking, you can apply this manifest to override defaults that can otherwise cause timeouts or errors.

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/traffic_policy.yaml
   ```

1. **Try it out**

   Wait until the gateway is ready, then send a test request to the deployed model.
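
   One way to wait is to block on the Gateway's `Programmed` condition (a sketch; adjust the timeout to your environment):

   ```bash
   kubectl wait gateway/inference-gateway --for=condition=Programmed --timeout=120s
   ```

   Then send a completion request to the model server through the gateway:
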
   ```bash
   IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
   PORT=8081

   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "tweet-summary",
   "prompt": "Write as if you were a critic: San Francisco",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```

For more detail, please refer to our Getting started guide here: https://gateway-api-inference-extension.sigs.k8s.io/guides/