# Launching vLLM endpoints

This folder contains a Kubernetes deployment example (`k8.yaml`) and guidance
for launching vLLM endpoints that can serve LALMs.

Use either of the two recommended approaches below:

- Local: run a vLLM server on your workstation or VM (good for development).
- Kubernetes: deploy the provided `k8.yaml` to a GPU-capable cluster.

Keep these high-level notes in mind:
- Do NOT commit real secrets (Hugging Face tokens) into source control. Use
  Kubernetes Secrets or environment variables stored securely.
- The `k8.yaml` file uses a placeholder image (`<YOUR_VLLM_IMAGE_WITH_AUDIO_DEPENDANCIES_INSTALLED>`).
  Replace it with an image that has the required audio dependencies (ffmpeg, soundfile, librosa,
  torchaudio, and any model-specific libs) before applying.
- The example exposes ports 8000..8007. If you only need a single instance,
  reducing the number of containers/ports in the Pod is fine.
**Useful links**

- vLLM docs (overview & quickstart): https://docs.vllm.ai/en/latest/getting_started/quickstart/
- vLLM CLI `serve` docs: https://docs.vllm.ai/en/latest/cli/serve/
- vLLM Kubernetes / deployment docs: https://docs.vllm.ai/en/latest/deployment/k8s/
- vLLM audio / multimodal docs and examples:
  - Audio assets API: https://docs.vllm.ai/en/latest/api/vllm/assets/audio/
  - Audio example (offline / language + audio): https://docs.vllm.ai/en/latest/examples/offline_inference/audio_language/

These audio-specific links describe how vLLM handles audio assets, the required
dependencies, and example code for audio-language workflows.
## **A. Local (development)**

1) Prerequisites

- A GPU node or a machine with a compatible PyTorch/CUDA setup (or CPU-only for small models).
- Python 3.10+ and a virtual environment are recommended.
- A Hugging Face token with access to the model, set in `HUGGING_FACE_HUB_TOKEN`.
2) Install vLLM (recommended minimal steps)

```bash
# create & activate a venv (the vLLM docs use uv; plain python -m venv works too)
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

# install vllm and choose a torch backend if needed
pip install vllm --upgrade

# audio dependencies (macOS, Homebrew):
brew install ffmpeg libsndfile
pip install soundfile librosa torchaudio

# audio dependencies (Ubuntu/Debian):
sudo apt-get update && sudo apt-get install -y ffmpeg libsndfile1
pip install soundfile librosa torchaudio
```
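
After installing, a quick sanity check can save debugging time later. A minimal sketch; it only confirms the imports resolve in this environment:

```bash
# confirm vllm and the audio libs import cleanly
python -c "import vllm, soundfile, librosa, torchaudio; print(vllm.__version__)"
```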
3) Start the server

The vLLM CLI provides a `serve` entrypoint that starts an OpenAI-compatible HTTP
server. Example:

```bash
# serve a HF model on localhost:8000
export HUGGING_FACE_HUB_TOKEN="<YOUR_HF_TOKEN>"
vllm serve microsoft/Phi-4-multimodal-instruct --port 8000 --host 0.0.0.0
```
Notes:

- Use `--api-key` or set `VLLM_API_KEY` if you want the server to require an API key
  (see the sketch after these notes).
- Many LALMs need additional Python packages or system libraries. Commonly required
  packages: `soundfile`, `librosa`, `torchaudio`, plus system `ffmpeg`/`libsndfile`.
  The exact requirements depend on the model and any tokenizer/preprocessor it uses.
  Check the model's Hugging Face page and the vLLM audio docs linked above.
- If you plan to use GPU acceleration, ensure a compatible PyTorch/CUDA combination
  is installed in the environment (or use vLLM Docker images with prebuilt CUDA
  support). If you run into missing symbols, check CUDA/PyTorch compatibility and
  rebuild or pick a different image.
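
For example, a minimal sketch of serving behind an API key (the key value is a placeholder):

```bash
# require clients to present an API key on every request
export VLLM_API_KEY="<YOUR_API_KEY>"
vllm serve microsoft/Phi-4-multimodal-instruct --port 8000 --host 0.0.0.0
```

Clients then pass the key as an OpenAI-style `Authorization: Bearer <YOUR_API_KEY>` header.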
4) Point `run_configs` to the local endpoint

Update your run config to use the local server URL (example YAML snippet):

```yaml
# example run_configs entry
# For OpenAI-compatible API calls, use endpoints like /v1/completions or /v1/chat/completions
url: "http://localhost:8000/v1/completions"
```
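
Before wiring the endpoint into a run, you can smoke-test it directly. A minimal sketch; the `audio_url` content type follows vLLM's multimodal chat format (see the audio docs linked above), and the sample audio URL is a placeholder:

```bash
# plain text chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-4-multimodal-instruct",
       "messages": [{"role": "user", "content": "Say hello."}]}'

# text + audio chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-4-multimodal-instruct",
       "messages": [{"role": "user", "content": [
         {"type": "text", "text": "Transcribe this audio."},
         {"type": "audio_url", "audio_url": {"url": "https://<your-host>/sample.wav"}}
       ]}]}'
```

If the server was started with an API key, add `-H "Authorization: Bearer <YOUR_API_KEY>"` to each request.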
## **B. Kubernetes — use the provided `k8.yaml`**

What the example does:

- Launches a single Pod template containing multiple vLLM containers (ports 8000..8007).
- Each container is configured with the same model and listens on a distinct port.
- A `Service` of type `NodePort` exposes the Pod ports on the cluster nodes.
Pre-apply checklist (LALMs)

1. Replace the placeholder image in `k8.yaml`: find and replace
   `<YOUR_VLLM_IMAGE_WITH_AUDIO_DEPENDANCIES_INSTALLED>` with an image that includes:

   - vLLM installed
   - the Python audio libs used by your model: `soundfile`, `librosa`, `torchaudio`, etc.
   - the system binaries `ffmpeg` and `libsndfile` (or equivalents)
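
   A quick way to sanity-check a candidate image before editing `k8.yaml` (a sketch; `<your-image>` is a placeholder, and it assumes bash and Python exist in the image):

   ```bash
   # verify the image has ffmpeg and the Python audio libs baked in
   docker run --rm <your-image> bash -lc \
     "ffmpeg -version >/dev/null && python -c 'import soundfile, librosa, torchaudio' && echo OK"
   ```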
2. Secrets: create a Kubernetes Secret for your Hugging Face token, e.g.:

   ```bash
   kubectl -n <namespace> create secret generic hf-token \
     --from-literal=HUGGING_FACE_HUB_TOKEN='<YOUR_HF_TOKEN>'
   ```

   Then update the `k8.yaml` container env to use `valueFrom.secretKeyRef` instead of a plain `value`.
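
   For instance, a minimal sketch of the env entry, assuming the `hf-token` Secret created above:

   ```yaml
   env:
     - name: HUGGING_FACE_HUB_TOKEN
       valueFrom:
         secretKeyRef:
           name: hf-token
           key: HUGGING_FACE_HUB_TOKEN
   ```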
3. Cluster requirements

   - GPU-enabled nodes and drivers (matching the image / CUDA version).
   - If you use Run:AI or a custom scheduler, ensure `schedulerName` matches your
     cluster; remove or edit `schedulerName` if not applicable.
Apply the example

```bash
# make any replacements (image, secret references), then:
kubectl apply -f vllm_configs/k8.yaml

# monitor the rollout
kubectl -n <namespace> rollout status deployment/infer-phi4-multimodal-instruct
kubectl -n <namespace> get pods -l app=infer-phi4-multimodal-instruct
```
Accessing the service

- The `Service` in `k8.yaml` is `NodePort`. To see which node ports your cluster assigned, run:

```bash
kubectl -n <namespace> get svc infer-phi4-multimodal-instruct-service -o wide
```

- You can then use `http://<node-ip>:<nodePort>` for the port you want (8000..8007 map to
  cluster node ports). For production, consider exposing via `LoadBalancer` or an Ingress.
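
Once you have a node IP and port, a quick reachability check (a sketch; `/v1/models` is the OpenAI-compatible model-listing endpoint vLLM serves, and `<nodePort>` is a placeholder):

```bash
# resolve a node's internal IP, then list the models one endpoint serves
NODE_IP=$(kubectl get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
curl http://$NODE_IP:<nodePort>/v1/models
```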
Troubleshooting:

- Check container logs: `kubectl -n <namespace> logs <pod> -c deployment0` (replace the container name as needed).
- If the model fails to load: check `HUGGING_FACE_HUB_TOKEN`, image CUDA/PyTorch compatibility, and
  that `--trust-remote-code` is set only when you trust the model repo.
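
Two more standard kubectl checks that often help (a sketch; the pod and container names follow the examples above):

```bash
# inspect scheduling / image-pull events for a stuck pod
kubectl -n <namespace> describe pod <pod>

# view logs from the previous (crashed) container instance
kubectl -n <namespace> logs <pod> -c deployment0 --previous
```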