diff --git a/vllm/README.md b/vllm/README.md
index 17fc3f903..d1fc66528 100644
--- a/vllm/README.md
+++ b/vllm/README.md
@@ -1,213 +1,33 @@
-# vLLM Truss to deploy chat completion model
+# Deploying a Chat Completion Model with vLLM Truss
-## What is this Truss example doing
+This repository provides two approaches for deploying OpenAI-compatible chat completion models using vLLM and Truss. Select the option that best suits your use case.
-This is a general purpose [Truss](https://truss.baseten.co/) that can deploy an asynchronous vLLM engine([AsyncLLMEngine](https://docs.vllm.ai/en/latest/dev/engine/async_llm_engine.html#asyncllmengine)) of any customized configuration with [all compatible chat completion models](https://docs.vllm.ai/en/latest/models/supported_models.html). We create this example to give you the most codeless experience, so you can configure all vLLM engine parameters in `config.yaml`, without making code changes in `model.py` for most of the use cases.
+---
-## Configure your Truss by modifying the config.yaml
+## Deployment Options
-### Basic options using 1 GPU
+### 1. **vLLM Server via `vllm serve` (Strongly Recommended)**
-Here is the minimum config file you will need to deploy a model using vLLM on 1 GPU.
-The only parameters you need to touch are:
-- `model_name`
-- `repo_id`
-- `accelerator`
+**Overview:**
+Leverage the built-in vLLM server for an OpenAI-compatible, codeless deployment. This is the recommended method for most users who want a fast, production-ready setup.
-```
-model_name: "Llama 3.1 8B Instruct VLLM"
-python_version: py311
-model_metadata:
-  example_model_input: {"prompt": "what is the meaning of life"}
-  repo_id: meta-llama/Llama-3.1-8B-Instruct
-  openai_compatible: true
-  vllm_config: null
-requirements:
-  - vllm==0.5.4
-resources:
-  accelerator: A100
-  use_gpu: true
-runtime:
-  predict_concurrency: 128
-secrets:
-  hf_access_token: null
-```
+**How to Use:**
+- See the [`vllm_server`](./vllm_server) directory for more details and instructions.
-### Basic options using multiple GPUs
+**Why use this?**
+- Minimal setup, codeless solution
+- OpenAI-compatible
-If your model needs more than 1 GPU to run using tensor parallel, you will need to change `accelerator`, and to set `tensor_parallel_size` and `distributed_executor_backend` accordingly.
+---
-```
-model_name: "Llama 3.1 8B Instruct VLLM"
-python_version: py311
-model_metadata:
-  example_model_input: {"prompt": "what is the meaning of life"}
-  repo_id: meta-llama/Llama-3.1-8B-Instruct
-  openai_compatible: false
-  vllm_config:
-    tensor_parallel_size: 4
-    max_model_len: 4096
-    distributed_executor_backend: mp
-requirements:
-  - vllm==0.5.4
-resources:
-  accelerator: A10G:4
-  use_gpu: true
-runtime:
-  predict_concurrency: 128
-secrets:
-  hf_access_token: null
-```
+### 2. **vLLM with Truss Server**
-### Use vLLM's OpenAI compatible server
+**Overview:**
+For advanced users who need custom inference logic, additional pre- or post-processing, or more flexibility.
-To use vLLM in [OpenAI compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) mode, simply set `openai_compatible: true` under `model_metadata`.
+**How to Use:**
+- Refer to the [`truss_server`](./truss_server) directory for details and configuration examples.
-### Customize vLLM engine parameters
-
-For advanced users who want to override [vLLM engine arguments](https://docs.vllm.ai/en/latest/models/engine_args.html), you can add all arguments to `vllm_config` under `model_metadata`.
-
-#### Example 1: using model quantization
-
-```
-model_name: Mistral 7B v2 vLLM AWQ - T4
-environment_variables: {}
-external_package_dirs: []
-model_metadata:
-  repo_id: TheBloke/Mistral-7B-Instruct-v0.2-AWQ
-  vllm_config:
-    quantization: "awq"
-    dtype: "float16"
-    max_model_len: 8000
-    max_num_seqs: 8
-python_version: py310
-requirements:
-  - vllm==0.5.4
-resources:
-  accelerator: T4
-  use_gpu: true
-secrets:
-  hf_access_token: null
-system_packages: []
-runtime:
-  predict_concurrency: 128
-```
-
-#### Example 2: using customized vLLM image
-
-You can even override with your own customized vLLM docker image to work with models that are not supported yet by vanilla vLLM.
-
-```
-model_name: Ultravox v0.2
-base_image:
-  image: vshulman/vllm-openai-fixie:latest
-  python_executable_path: /usr/bin/python3
-model_metadata:
-  repo_id: fixie-ai/ultravox-v0.2
-  vllm_config:
-    audio_token_id: 128002
-environment_variables: {}
-external_package_dirs: []
-python_version: py310
-runtime:
-  predict_concurrency: 512
-requirements:
-  - httpx
-resources:
-  accelerator: A100
-  use_gpu: true
-secrets:
-  hf_access_token: null
-system_packages:
-- python3.10-venv
-```
-
-## Deploy your Truss
-
-1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
-2. Install the latest version of Truss: `pip install --upgrade truss`
-3. With `vllm` as your working directory, you can deploy the model with:
-
-   ```sh
-   truss push --trusted
-   ```
-
-   Paste your Baseten API key if prompted.
-
-For more information, see [Truss documentation](https://truss.baseten.co).
-
-## Call your model
-
-Once your deployment is up, there are [many ways](https://docs.baseten.co/invoke/quickstart) to call your model.
-
-### curl command
-
-#### If you are NOT using OpenAI compatible server
-
-```
-curl -X POST https://model-.api.baseten.co/development/predict \
-  -H "Authorization: Api-Key $BASETEN_API_KEY" \
-  -d '{"prompt": "what is the meaning of life"}'
-```
-
-
-#### If you are using OpenAI compatible server
-
-```
-curl -X POST "https://model-.api.baseten.co/development/predict" \
-  -H "Content-Type: application/json" \
-  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
-  -d '{
-    "messages": [{"role": "user", "content": "What even is AGI?"}],
-    "max_tokens": 256
-  }'
-```
-
-To access [production metrics](https://docs.vllm.ai/en/latest/serving/metrics.html) reported by OpenAI compatible server, simply add `metrics: true` to the request.
-
-```
-curl -X POST "https://model-.api.baseten.co/development/predict" \
-  -H "Content-Type: application/json" \
-  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
-  -d '{
-    "metrics": true
-  }'
-```
-
-### OpenAI SDK (if you are using OpenAI compatible server)
-
-```
-from openai import OpenAI
-import os
-
-model_id = "abcd1234" # Replace with your model ID
-deployment_id = "4321cbda" # [Optional] Replace with your deployment ID
-
-client = OpenAI(
-    api_key=os.environ["BASETEN_API_KEY"],
-    base_url=f"https://bridge.baseten.co/{model_id}/v1/direct"
-)
-
-response = client.chat.completions.create(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    messages=[
-        {"role": "user", "content": "Who won the world series in 2020?"},
-        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
-        {"role": "user", "content": "Where was it played?"}
-    ],
-    extra_body={
-        "baseten": {
-            "model_id": model_id,
-            "deployment_id": deployment_id
-        }
-    }
-)
-print(response.choices[0].message.content)
-
-```
-
-For more information, see [API reference](https://docs.baseten.co/api-reference/openai).
-
-## Support
-
-If you have any questions or need assistance, please open an issue in this repository or contact our support team.
+**Why use this?**
+- Fully customizable inference and server logic
+- OpenAI-compatible with minimal client changes
diff --git a/vllm/truss_server/README.md b/vllm/truss_server/README.md
new file mode 100644
index 000000000..ffc67c820
--- /dev/null
+++ b/vllm/truss_server/README.md
@@ -0,0 +1,229 @@
+# vLLM Truss: Deploy Chat Completion Models
+
+## Overview
+
+This example demonstrates how to deploy [vLLM](https://github.com/vllm-project/vllm) using a Truss server.
+**Use this approach only if you need custom inference logic or extra flexibility.**
+For most users, we recommend the easier [vLLM server example](https://github.com/basetenlabs/truss-examples/tree/main/vllm/vllm_server), which is also OpenAI-compatible.
+
+This Truss works with asynchronous vLLM engines ([AsyncLLMEngine](https://docs.vllm.ai/en/v0.6.5/dev/engine/async_llm_engine.html#asyncllmengine)) and [all supported chat completion models](https://docs.vllm.ai/en/latest/models/supported_models.html).
+
+---
+
+## Configure Your Truss (`config.yaml`)
+
+### Single GPU Example
+
+To deploy on a single GPU, update these fields:
+- `model_name`
+- `repo_id`
+- `accelerator`
+
+<details>
+<summary>Minimal config example</summary>
+
+```yaml
+model_name: "Llama 3.1 8B Instruct VLLM"
+python_version: py311
+model_metadata:
+  example_model_input: {"prompt": "what is the meaning of life"}
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+  openai_compatible: true
+  vllm_config: null
+requirements:
+  - vllm==0.5.4
+resources:
+  accelerator: A100
+  use_gpu: true
+runtime:
+  predict_concurrency: 128
+secrets:
+  hf_access_token: null
+```
+</details>
+
+---
+
+### Multi-GPU Example (Tensor Parallelism)
+
+For multi-GPU deployments, set:
+- `accelerator` (e.g., `A10G:4`; the GPU count must match `tensor_parallel_size`)
+- `model_metadata.vllm_config.tensor_parallel_size`
+- `model_metadata.vllm_config.distributed_executor_backend`
+
+<details>
+<summary>Multi-GPU config example</summary>
+
+```yaml
+model_name: "Llama 3.1 8B Instruct VLLM"
+python_version: py311
+model_metadata:
+  example_model_input: {"prompt": "what is the meaning of life"}
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+  openai_compatible: false
+  vllm_config:
+    tensor_parallel_size: 4
+    max_model_len: 4096
+    distributed_executor_backend: mp
+requirements:
+  - vllm==0.5.4
+resources:
+  accelerator: A10G:4
+  use_gpu: true
+runtime:
+  predict_concurrency: 128
+secrets:
+  hf_access_token: null
+```
+</details>
+
+---
+
+### Customization
+
+Override any [vLLM engine argument](https://docs.vllm.ai/en/latest/models/engine_args.html) by adding it to `vllm_config` in `model_metadata`.
+
+<details>
+<summary>Example: Model Quantization</summary>
+
+```yaml
+model_name: Mistral 7B v2 vLLM AWQ - T4
+model_metadata:
+  repo_id: TheBloke/Mistral-7B-Instruct-v0.2-AWQ
+  vllm_config:
+    quantization: "awq"
+    dtype: "float16"
+    max_model_len: 8000
+    max_num_seqs: 8
+python_version: py310
+requirements:
+  - vllm==0.5.4
+resources:
+  accelerator: T4
+  use_gpu: true
+runtime:
+  predict_concurrency: 128
+secrets:
+  hf_access_token: null
+```
+</details>
+
+You can even override with your own customized vLLM Docker image to work with models that are not yet supported by vanilla vLLM.
+
+<details>
+<summary>Example: Custom Docker Image</summary>
+
+```yaml
+model_name: Ultravox v0.2
+base_image:
+  image: vshulman/vllm-openai-fixie:latest
+  python_executable_path: /usr/bin/python3
+model_metadata:
+  repo_id: fixie-ai/ultravox-v0.2
+  vllm_config:
+    audio_token_id: 128002
+python_version: py310
+requirements:
+  - httpx
+resources:
+  accelerator: A100
+  use_gpu: true
+runtime:
+  predict_concurrency: 512
+secrets:
+  hf_access_token: null
+system_packages:
+  - python3.10-venv
+```
+</details>
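+
+Under the hood, the Truss server in `model/model.py` turns these settings into a vLLM engine. The sketch below is illustrative only (the actual code in `model/model.py` is authoritative, and the values shown are examples):
+
+```python
+from vllm.engine.arg_utils import AsyncEngineArgs
+from vllm.engine.async_llm_engine import AsyncLLMEngine
+
+# Everything under model_metadata.vllm_config in config.yaml is forwarded to
+# vLLM's engine arguments, so any supported engine argument can be set there.
+vllm_config = {"tensor_parallel_size": 4, "max_model_len": 4096}  # example values
+engine_args = AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct", **vllm_config)
+engine = AsyncLLMEngine.from_engine_args(engine_args)
+```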
+
+---
+
+## Deploy Your Truss
+
+First [sign up for Baseten](https://app.baseten.co/signup) and get an [API key](https://app.baseten.co/settings/account/api_keys).
+
+```sh
+# Install Truss
+pip install --upgrade truss
+
+# Deploy the model from the `vllm/truss_server` directory
+truss push
+```
+
+Paste your Baseten API key if prompted.
+
+---
+
+## Call Your Model
+
+Once your deployment is up, you can call your model in the following ways.
+
+### Curl: Non-OpenAI-Compatible
+
+```sh
+curl -X POST https://model-.api.baseten.co/development/predict \
+  -H "Authorization: Api-Key $BASETEN_API_KEY" \
+  -d '{"prompt": "what is the meaning of life"}'
+```
+
+### Curl: OpenAI-Compatible
+
+```sh
+curl -X POST "https://model-.api.baseten.co/development/predict" \
+  -H "Content-Type: application/json" \
+  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
+  -d '{
+    "messages": [{"role": "user", "content": "What even is AGI?"}],
+    "max_tokens": 256
+  }'
+```
+
+**Production Metrics:**
+Add `"metrics": true` to your request for detailed metrics:
+
+```sh
+curl -X POST "https://model-.api.baseten.co/development/predict" \
+  -H "Content-Type: application/json" \
+  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
+  -d '{"metrics": true}'
+```
+
+---
+
+### OpenAI SDK (OpenAI-Compatible Only)
+
+```python
+from openai import OpenAI
+import os
+
+model_id = "abcd1234"  # Replace with your model ID
+deployment_id = "4321cbda"  # [Optional]
+
+client = OpenAI(
+    api_key=os.environ["BASETEN_API_KEY"],
+    base_url=f"https://bridge.baseten.co/{model_id}/v1/direct"
+)
+
+response = client.chat.completions.create(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    messages=[
+        {"role": "user", "content": "Who won the world series in 2020?"},
+        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
+        {"role": "user", "content": "Where was it played?"}
+    ],
+    extra_body={
+        "baseten": {
+            "model_id": model_id,
+            "deployment_id": deployment_id
+        }
+    }
+)
+print(response.choices[0].message.content)
+```
+
+---
+
+## Support
+
+Need help? [Contact Baseten support](https://www.baseten.co/talk-to-us/).
diff --git a/vllm/config.yaml b/vllm/truss_server/config.yaml
similarity index 100%
rename from vllm/config.yaml
rename to vllm/truss_server/config.yaml
diff --git a/vllm/model/__init__.py b/vllm/truss_server/model/__init__.py
similarity index 100%
rename from vllm/model/__init__.py
rename to vllm/truss_server/model/__init__.py
diff --git a/vllm/model/helper.py b/vllm/truss_server/model/helper.py
similarity index 100%
rename from vllm/model/helper.py
rename to vllm/truss_server/model/helper.py
diff --git a/vllm/model/model.py b/vllm/truss_server/model/model.py
similarity index 100%
rename from vllm/model/model.py
rename to vllm/truss_server/model/model.py
diff --git a/vllm/vllm_server/README.md b/vllm/vllm_server/README.md
new file mode 100644
index 000000000..ff21ca3ab
--- /dev/null
+++ b/vllm/vllm_server/README.md
@@ -0,0 +1,40 @@
+# vLLM Truss: Deploy a Chat Completion Model
+
+## Overview
+
+This Truss example offers a **codeless, OpenAI-compatible solution** to run a vLLM server within a Truss container. With minimal configuration, you can deploy powerful language models on Baseten: just update your settings and Truss handles the rest.
+
+---
+
+## Configuration Guide
+
+All deployment options are controlled via the `config.yaml` file. Follow the instructions below based on your GPU requirements.
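+
+As a quick preview, a single-GPU deployment touches only a few top-level settings; the snippet below is excerpted from the full `config.yaml` in this directory:
+
+```yaml
+model_name: Llama 3.1 8B Instruct
+model_metadata:
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+resources:
+  accelerator: H100_40GB
+  use_gpu: true
+```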
+
+### 🚀 Basic: Single GPU Deployment
+
+To deploy a model using a single GPU, simply modify the following parameters in `config.yaml`:
+- `model_name`
+- `repo_id`
+- `accelerator`
+
+No additional changes are required.
+
+---
+
+### 🖥️ Advanced: Multi-GPU Deployment (Tensor Parallelism)
+
+If your model requires multiple GPUs (for example, to run with tensor parallelism), you’ll need to configure:
+
+- `accelerator`
+  Example for 4 H100 GPUs:
+  ```yaml
+  accelerator: H100:4
+  ```
+- `--tensor-parallel-size`
+- `--distributed-executor-backend`
+
+The last two are flags on the `vllm serve` command in the `docker_server.start_command` entry of `config.yaml`. Append them to the command, for example: `--tensor-parallel-size 4 --distributed-executor-backend mp`.
+
+## Support
+
+Need help? [Contact Baseten support](https://www.baseten.co/talk-to-us/).
diff --git a/vllm/vllm_server/config.yaml b/vllm/vllm_server/config.yaml
new file mode 100644
index 000000000..56dd7a754
--- /dev/null
+++ b/vllm/vllm_server/config.yaml
@@ -0,0 +1,36 @@
+description: Llama 3.1 8B Instruct model is lightweight, multilingual and fine-tuned on human preferences for safety and helpfulness.
+base_image:
+  image: vllm/vllm-openai:v0.9.2
+model_metadata:
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+  example_model_input: {
+    "model": "",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "What is the meaning of life?"
+          }
+        ]
+      }
+    ]
+  }
+  tags:
+    - openai-compatible
+docker_server:
+  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype half --max-model-len 65536 --port 8000 --served-model-name llama --tensor-parallel-size 1 --gpu-memory-utilization 0.95"
+  readiness_endpoint: /health
+  liveness_endpoint: /health
+  predict_endpoint: /v1/chat/completions
+  server_port: 8000
+resources:
+  accelerator: H100_40GB
+  use_gpu: true
+runtime:
+  predict_concurrency: 16
+model_name: Llama 3.1 8B Instruct
+secrets:
+  hf_access_token: null
+requirements: []