# Launching vLLM endpoints

This folder contains a Kubernetes deployment example (`k8.yaml`) and guidance
for launching vLLM endpoints that can serve LALMs.

Use either of the two recommended approaches below:
- Local: run a vLLM server on your workstation or VM (good for development).
- Kubernetes: deploy the provided `k8.yaml` to a GPU-capable cluster.

Keep these high-level notes in mind:
- Do NOT commit real secrets (Hugging Face tokens) into source control. Use
  Kubernetes Secrets or environment variables stored securely.
- The `k8.yaml` file uses a placeholder image (`<YOUR_VLLM_IMAGE_WITH_AUDIO_DEPENDANCIES_INSTALLED>`).
  Replace that with an image that has the required audio dependencies (ffmpeg, soundfile, librosa, torchaudio,
  any model-specific libs) before applying.
- The example exposes ports 8000..8007. If you only need a single instance,
  reducing the number of containers/ports in the Pod is fine.

**Useful links**

- vLLM docs (overview & quickstart): https://docs.vllm.ai/en/latest/getting_started/quickstart/
- vLLM CLI `serve` docs: https://docs.vllm.ai/en/latest/cli/serve/
- vLLM Kubernetes / deployment docs: https://docs.vllm.ai/en/latest/deployment/k8s/
- vLLM audio / multimodal docs and examples:
  - Audio assets API: https://docs.vllm.ai/en/latest/api/vllm/assets/audio/
  - Audio example (offline / language + audio): https://docs.vllm.ai/en/latest/examples/offline_inference/audio_language/

These audio-specific links describe how vLLM handles audio assets, required
dependencies, and example code for audio-language workflows.


## **A. Local (development)**

1) Prerequisites

- GPU node or a machine with a compatible PyTorch/CUDA setup (or CPU only for small models).
- Python 3.10+; a virtual environment is recommended.
- A Hugging Face token with access to the model, set in `HUGGING_FACE_HUB_TOKEN`.

2) Install vLLM (recommended minimal steps)

```bash
# create & activate a venv (the vLLM docs show uv; plain python -m venv works too)
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
# install vllm and choose a torch backend if needed
pip install vllm --upgrade

# macOS (Homebrew):
brew install ffmpeg libsndfile
pip install soundfile librosa torchaudio

# Ubuntu/Debian:
sudo apt-get update && sudo apt-get install -y ffmpeg libsndfile1
pip install soundfile librosa torchaudio
```
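
A quick sanity check that the install picked up both vLLM and the audio dependencies (a minimal sketch, assuming the venv above is still active):

```bash
# confirm vLLM and the audio libraries import cleanly
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import soundfile, librosa, torchaudio; print('audio deps OK')"
# confirm the system ffmpeg binary is on PATH
ffmpeg -version | head -n 1
```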

3) Start the server

The vLLM CLI provides a `serve` entrypoint that starts an OpenAI-compatible HTTP
server. Example:

```bash
# serve a HF model on localhost:8000
export HUGGING_FACE_HUB_TOKEN="<YOUR_HF_TOKEN>"
vllm serve microsoft/Phi-4-multimodal-instruct --port 8000 --host 0.0.0.0
```
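
Once the server reports it is ready, you can sanity-check it through the OpenAI-compatible API (the prompt text below is just an example; the `model` value must match what you passed to `vllm serve`):

```bash
# list the served model(s)
curl http://localhost:8000/v1/models

# minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-4-multimodal-instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```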

Notes:
- Use `--api-key` or set `VLLM_API_KEY` if you want the server to require an API key.
- Many LALMs need additional Python packages or system
  libraries. Commonly required packages: `soundfile`, `librosa`, `torchaudio`,
  and system `ffmpeg`/`libsndfile`. The exact requirements depend on the model
  and any tokenizer/preprocessor it uses. Check the model's Hugging Face page
  and the vLLM audio docs linked above.
- If you plan to use GPU acceleration, ensure a compatible PyTorch/CUDA
  combination is installed in the environment (or use vLLM Docker images with
  prebuilt CUDA support). If you run into missing symbols, check CUDA/PyTorch
  compatibility and rebuild or pick a different image.
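
If you do enable an API key as noted above, clients send it as an OpenAI-style bearer token. A minimal sketch (the key value is a placeholder):

```bash
# start the server with an API key required
export VLLM_API_KEY="<YOUR_API_KEY>"
vllm serve microsoft/Phi-4-multimodal-instruct --port 8000 --host 0.0.0.0

# call it with the matching Authorization header
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer <YOUR_API_KEY>"
```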

4) Point `run_configs` to the local endpoint

Update your run config to use the local server URL (example YAML snippet):

```yaml
# example run_configs entry
# For OpenAI-compatible API calls use endpoints like /v1/completions or /v1/chat/completions
url: "http://localhost:8000/v1/completions"
```
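
To confirm the URL you put in the run config actually answers, a plain completions call works (a sketch only; the `model` value must match the served model):

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-4-multimodal-instruct",
        "prompt": "Hello",
        "max_tokens": 16
      }'
```
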
## **B. Kubernetes — use the provided `k8.yaml`**

What the example does:

- Launches a single Pod template containing multiple vLLM containers (ports 8000..8007).
- Each container is configured with the same model and listens on a distinct port.
- A `Service` of type `NodePort` exposes the Pod ports on the cluster nodes.

Pre-apply checklist (LALMs)

1. Replace the placeholder image in `k8.yaml`:

   - Find and replace `<YOUR_VLLM_IMAGE_WITH_AUDIO_DEPENDANCIES_INSTALLED>` with an image
     that includes:
     - vLLM installed
     - Python audio libs used by your model: `soundfile`, `librosa`, `torchaudio`, etc.
     - System binaries: `ffmpeg` and `libsndfile` (or equivalents).

2. Secrets: create a Kubernetes Secret for your Hugging Face token, e.g.:

```bash
kubectl -n <namespace> create secret generic hf-token \
  --from-literal=HUGGING_FACE_HUB_TOKEN='<YOUR_HF_TOKEN>'
```

Then update `k8.yaml` container env to use `valueFrom.secretKeyRef` instead of a plain `value`.
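
After that change, the container env block would look roughly like this (a sketch; the secret and key names assume the `hf-token` secret created above):

```yaml
env:
  - name: HUGGING_FACE_HUB_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: HUGGING_FACE_HUB_TOKEN
```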

3. Cluster requirements

- GPU-enabled nodes and drivers (matching the image / CUDA version).
- If using Run:AI or a custom scheduler, ensure `schedulerName` matches your cluster. Remove
  or edit `schedulerName` if not applicable.
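
As a rough reference, the fields these requirements touch look like the fragment below (illustrative only, not the full `k8.yaml`; the scheduler name is an example and should match whatever your cluster actually runs):

```yaml
# per-container GPU request (illustrative fragment)
resources:
  limits:
    nvidia.com/gpu: 1
# Pod-level field; remove if you use the default scheduler
schedulerName: runai-scheduler
```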

Apply the example

```bash
# make any replacements (image, secret references), then:
kubectl apply -f vllm_configs/k8.yaml

# monitor rollout
kubectl -n <namespace> rollout status deployment/infer-phi4-multimodal-instruct
kubectl -n <namespace> get pods -l app=infer-phi4-multimodal-instruct
```
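
Before relying on the NodePort, it can be worth port-forwarding to one container and checking that it answers (a quick sketch; substitute a real pod name from the `get pods` output above):

```bash
# forward local port 8000 to one vLLM container's port
kubectl -n <namespace> port-forward pod/<pod-name> 8000:8000

# in another shell, confirm the OpenAI-compatible API responds
curl http://localhost:8000/v1/models
```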

Accessing the service

- The `Service` in `k8.yaml` is of type `NodePort`. To see which node ports your cluster assigned,
  run:

```bash
kubectl -n <namespace> get svc infer-phi4-multimodal-instruct-service -o wide
```

- You can then use `http://<node-ip>:<nodePort>` for the port you want (8000..8007 map to
  cluster node ports). For production, consider exposing via `LoadBalancer` or an Ingress.
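
For a quick end-to-end check against the NodePort service, something along these lines works (a sketch; it assumes the first service port is the one you want and that the node's InternalIP is reachable from where you run `curl`):

```bash
# pick a node IP and the nodePort assigned to the first service port
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
NODE_PORT=$(kubectl -n <namespace> get svc infer-phi4-multimodal-instruct-service \
  -o jsonpath='{.spec.ports[0].nodePort}')

curl "http://${NODE_IP}:${NODE_PORT}/v1/models"
```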


Troubleshooting:
- Check container logs: `kubectl -n <namespace> logs <pod> -c deployment0` (replace the container name as needed).
- If the model fails to load: check `HUGGING_FACE_HUB_TOKEN`, the image's CUDA/PyTorch compatibility, and
  that `--trust-remote-code` is set only when you trust the model repo.
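
If the pod never becomes Ready (image pull, scheduling, or GPU allocation problems), the pod events usually show why:

```bash
# inspect pod status, resource requests, and recent events
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> get events --sort-by=.metadata.creationTimestamp
```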