222 changes: 21 additions & 201 deletions vllm/README.md
@@ -1,213 +1,33 @@
# vLLM Truss to deploy chat completion model
# Deploying a Chat Completion Model with vLLM Truss

## What this Truss example does
This repository provides two approaches for deploying OpenAI-compatible chat completion models using vLLM and Truss. Select the option that best suits your use case.

This is a general-purpose [Truss](https://truss.baseten.co/) that can deploy an asynchronous vLLM engine ([AsyncLLMEngine](https://docs.vllm.ai/en/latest/dev/engine/async_llm_engine.html#asyncllmengine)) with any custom configuration and [any compatible chat completion model](https://docs.vllm.ai/en/latest/models/supported_models.html). We created this example to be as codeless as possible: for most use cases, you can configure all vLLM engine parameters in `config.yaml` without changing `model.py`.
---

## Configure your Truss by modifying the config.yaml
## Deployment Options

### Basic options using 1 GPU
### 1. **vLLM Server via `vllm serve` (Strongly Recommended)**

Here is the minimal config file you need to deploy a model with vLLM on a single GPU.
The only parameters you need to change are:
- `model_name`
- `repo_id`
- `accelerator`
**Overview:**
Leverage the built-in vLLM server for an OpenAI-compatible, codeless deployment. This is the recommended method for most users who want a fast, production-ready setup.

```yaml
model_name: "Llama 3.1 8B Instruct VLLM"
python_version: py311
model_metadata:
  example_model_input: {"prompt": "what is the meaning of life"}
  repo_id: meta-llama/Llama-3.1-8B-Instruct
  openai_compatible: true
  vllm_config: null
requirements:
  - vllm==0.5.4
resources:
  accelerator: A100
  use_gpu: true
runtime:
  predict_concurrency: 128
secrets:
  hf_access_token: null
```
**How to Use:**
- See the [`vllm_server`](./vllm_server) directory for more details and instructions.

### Basic options using multiple GPUs
**Why use this?**
- Minimal setup, codeless solution
- OpenAI-compatible
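
To make Option 1 concrete, here is a minimal local sketch (assuming vLLM is installed on your machine and using the same Llama 3.1 8B Instruct checkpoint referenced elsewhere in this README); the deployed `vllm_server` Truss exposes an OpenAI-compatible API in the same way:

```sh
# Start vLLM's built-in OpenAI-compatible server locally (listens on port 8000 by default).
vllm serve meta-llama/Llama-3.1-8B-Instruct

# In another shell, hit the chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "what is the meaning of life"}]
     }'
```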

If your model needs more than one GPU to run with tensor parallelism, change `accelerator` and set `tensor_parallel_size` and `distributed_executor_backend` accordingly.
---

```yaml
model_name: "Llama 3.1 8B Instruct VLLM"
python_version: py311
model_metadata:
  example_model_input: {"prompt": "what is the meaning of life"}
  repo_id: meta-llama/Llama-3.1-8B-Instruct
  openai_compatible: false
  vllm_config:
    tensor_parallel_size: 4
    max_model_len: 4096
    distributed_executor_backend: mp
requirements:
  - vllm==0.5.4
resources:
  accelerator: A10G:4
  use_gpu: true
runtime:
  predict_concurrency: 128
secrets:
  hf_access_token: null
```
### 2. **vLLM with Truss Server**

### Use vLLM's OpenAI compatible server
**Overview:**
This option is for advanced users who need custom inference logic, additional pre/post-processing, or more flexibility than the codeless setup provides.

To use vLLM in [OpenAI compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) mode, simply set `openai_compatible: true` under `model_metadata`.
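
For example, the relevant excerpt of `config.yaml` (mirroring the single-GPU config above) looks like this:

```yaml
model_metadata:
  repo_id: meta-llama/Llama-3.1-8B-Instruct
  openai_compatible: true  # expose an OpenAI-compatible chat completions API
```
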
**How to Use:**
- Refer to the [`truss_server`](./truss_server) directory for details and configuration examples.

### Customize vLLM engine parameters

For advanced users who want to override [vLLM engine arguments](https://docs.vllm.ai/en/latest/models/engine_args.html), you can add any of these arguments to `vllm_config` under `model_metadata`.

#### Example 1: using model quantization

```yaml
model_name: Mistral 7B v2 vLLM AWQ - T4
environment_variables: {}
external_package_dirs: []
model_metadata:
  repo_id: TheBloke/Mistral-7B-Instruct-v0.2-AWQ
  vllm_config:
    quantization: "awq"
    dtype: "float16"
    max_model_len: 8000
    max_num_seqs: 8
python_version: py310
requirements:
  - vllm==0.5.4
resources:
  accelerator: T4
  use_gpu: true
secrets:
  hf_access_token: null
system_packages: []
runtime:
  predict_concurrency: 128
```

#### Example 2: using customized vLLM image

You can even override the base image with your own customized vLLM Docker image to work with models that vanilla vLLM does not yet support.

```yaml
model_name: Ultravox v0.2
base_image:
  image: vshulman/vllm-openai-fixie:latest
  python_executable_path: /usr/bin/python3
model_metadata:
  repo_id: fixie-ai/ultravox-v0.2
  vllm_config:
    audio_token_id: 128002
environment_variables: {}
external_package_dirs: []
python_version: py310
runtime:
  predict_concurrency: 512
requirements:
  - httpx
resources:
  accelerator: A100
  use_gpu: true
secrets:
  hf_access_token: null
system_packages:
  - python3.10-venv
```

## Deploy your Truss

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`
3. With `vllm` as your working directory, you can deploy the model with:

```sh
truss push --trusted
```

Paste your Baseten API key if prompted.

For more information, see [Truss documentation](https://truss.baseten.co).

## Call your model

Once your deployment is up, there are [many ways](https://docs.baseten.co/invoke/quickstart) to call your model.

### curl command

#### If you are NOT using OpenAI compatible server

```sh
curl -X POST https://model-<YOUR_MODEL_ID>.api.baseten.co/development/predict \
     -H "Authorization: Api-Key $BASETEN_API_KEY" \
     -d '{"prompt": "what is the meaning of life"}'
```


#### If you are using OpenAI compatible server

```sh
curl -X POST "https://model-<YOUR_MODEL_ID>.api.baseten.co/development/predict" \
     -H "Content-Type: application/json" \
     -H "Authorization: Api-Key $BASETEN_API_KEY" \
     -d '{
       "messages": [{"role": "user", "content": "What even is AGI?"}],
       "max_tokens": 256
     }'
```

To access [production metrics](https://docs.vllm.ai/en/latest/serving/metrics.html) reported by the OpenAI-compatible server, simply add `metrics: true` to the request.

```sh
curl -X POST "https://model-<YOUR_MODEL_ID>.api.baseten.co/development/predict" \
     -H "Content-Type: application/json" \
     -H "Authorization: Api-Key $BASETEN_API_KEY" \
     -d '{
       "metrics": true
     }'
```

### OpenAI SDK (if you are using OpenAI compatible server)

```python
from openai import OpenAI
import os

model_id = "abcd1234"  # Replace with your model ID
deployment_id = "4321cbda"  # [Optional] Replace with your deployment ID

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url=f"https://bridge.baseten.co/{model_id}/v1/direct",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"},
    ],
    extra_body={
        "baseten": {
            "model_id": model_id,
            "deployment_id": deployment_id,
        }
    },
)
print(response.choices[0].message.content)
```

For more information, see [API reference](https://docs.baseten.co/api-reference/openai).

## Support

If you have any questions or need assistance, please open an issue in this repository or contact our support team.
**Why use this?**
- Fully customizable inference and server logic
- OpenAI-compatible with minimal client changes
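
As a rough sketch of what the `truss_server` approach enables (illustrative only; the actual implementation lives in the [`truss_server`](./truss_server) directory, and the config keys and response shape below are assumptions), a Truss `model.py` wraps vLLM's `AsyncLLMEngine` and gives you hooks for pre- and post-processing around generation:

```python
# Illustrative sketch only; see ./truss_server for the real implementation.
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from vllm.utils import random_uuid


class Model:
    def __init__(self, **kwargs):
        # Truss passes the parsed config.yaml in here; we assume vllm_config mirrors engine args.
        self._config = kwargs["config"]
        self._engine = None

    def load(self):
        engine_args = AsyncEngineArgs(
            model=self._config["model_metadata"]["repo_id"],
            **(self._config["model_metadata"].get("vllm_config") or {}),
        )
        self._engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def predict(self, model_input):
        # Custom pre-processing: build the prompt however you like.
        prompt = model_input["prompt"].strip()
        sampling = SamplingParams(max_tokens=model_input.get("max_tokens", 256))

        final = None
        async for output in self._engine.generate(prompt, sampling, random_uuid()):
            final = output  # each yield carries the cumulative generation so far

        # Custom post-processing: shape the response to whatever your clients expect.
        return {"completion": final.outputs[0].text}
```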