diff --git a/vllm/README.md b/vllm/README.md
index 17fc3f903..d1fc66528 100644
--- a/vllm/README.md
+++ b/vllm/README.md
@@ -1,213 +1,33 @@
-# vLLM Truss to deploy chat completion model
+# Deploying a Chat Completion Model with vLLM Truss
-## What is this Truss example doing
+This repository provides two approaches for deploying OpenAI-compatible chat completion models using vLLM and Truss. Select the option that best suits your use case.
-This is a general purpose [Truss](https://truss.baseten.co/) that can deploy an asynchronous vLLM engine([AsyncLLMEngine](https://docs.vllm.ai/en/latest/dev/engine/async_llm_engine.html#asyncllmengine)) of any customized configuration with [all compatible chat completion models](https://docs.vllm.ai/en/latest/models/supported_models.html). We create this example to give you the most codeless experience, so you can configure all vLLM engine parameters in `config.yaml`, without making code changes in `model.py` for most of the use cases.
+---
-## Configure your Truss by modifying the config.yaml
+## Deployment Options
-### Basic options using 1 GPU
+### 1. **vLLM Server via `vllm serve` (Strongly Recommended)**
-Here is the minimum config file you will need to deploy a model using vLLM on 1 GPU.
-The only parameters you need to touch are:
-- `model_name`
-- `repo_id`
-- `accelerator`
+**Overview:**
+Leverage the built-in vLLM server for an OpenAI-compatible, codeless deployment. This is the recommended method for most users who want a fast, production-ready setup.
-```
-model_name: "Llama 3.1 8B Instruct VLLM"
-python_version: py311
-model_metadata:
-  example_model_input: {"prompt": "what is the meaning of life"}
-  repo_id: meta-llama/Llama-3.1-8B-Instruct
-  openai_compatible: true
-  vllm_config: null
-requirements:
-  - vllm==0.5.4
-resources:
-  accelerator: A100
-  use_gpu: true
-runtime:
-  predict_concurrency: 128
-secrets:
-  hf_access_token: null
-```
+**How to Use:**
+- See the [`vllm_server`](./vllm_server) directory for more details and instructions.
-### Basic options using multiple GPUs
+**Why use this?**
+- Minimal setup, codeless solution
+- OpenAI-compatible
-If your model needs more than 1 GPU to run using tensor parallel, you will need to change `accelerator`, and to set `tensor_parallel_size` and `distributed_executor_backend` accordingly.
+---
-```
-model_name: "Llama 3.1 8B Instruct VLLM"
-python_version: py311
-model_metadata:
-  example_model_input: {"prompt": "what is the meaning of life"}
-  repo_id: meta-llama/Llama-3.1-8B-Instruct
-  openai_compatible: false
-  vllm_config:
-    tensor_parallel_size: 4
-    max_model_len: 4096
-    distributed_executor_backend: mp
-requirements:
-  - vllm==0.5.4
-resources:
-  accelerator: A10G:4
-  use_gpu: true
-runtime:
-  predict_concurrency: 128
-secrets:
-  hf_access_token: null
-```
+### 2. **vLLM with Truss Server**
-### Use vLLM's OpenAI compatible server
+**Overview:**
+For advanced users who need custom inference logic, additional pre- or post-processing, or more flexibility.
-To use vLLM in [OpenAI compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) mode, simply set `openai_compatible: true` under `model_metadata`.
+**How to Use:**
+- Refer to the [`truss_server`](./truss_server) directory for details and configuration examples.
-### Customize vLLM engine parameters
-
-For advanced users who want to override [vLLM engine arguments](https://docs.vllm.ai/en/latest/models/engine_args.html), you can add all arguments to `vllm_config` under `model_metadata`.
-
-#### Example 1: using model quantization
-
-```
-model_name: Mistral 7B v2 vLLM AWQ - T4
-environment_variables: {}
-external_package_dirs: []
-model_metadata:
-  repo_id: TheBloke/Mistral-7B-Instruct-v0.2-AWQ
-  vllm_config:
-    quantization: "awq"
-    dtype: "float16"
-    max_model_len: 8000
-    max_num_seqs: 8
-python_version: py310
-requirements:
-  - vllm==0.5.4
-resources:
-  accelerator: T4
-  use_gpu: true
-secrets:
-  hf_access_token: null
-system_packages: []
-runtime:
-  predict_concurrency: 128
-```
-
-#### Example 2: using customized vLLM image
-
-You can even override with your own customized vLLM docker image to work with models that are not supported yet by vanilla vLLM.
-
-```
-model_name: Ultravox v0.2
-base_image:
-  image: vshulman/vllm-openai-fixie:latest
-  python_executable_path: /usr/bin/python3
-model_metadata:
-  repo_id: fixie-ai/ultravox-v0.2
-  vllm_config:
-    audio_token_id: 128002
-environment_variables: {}
-external_package_dirs: []
-python_version: py310
-runtime:
-  predict_concurrency: 512
-requirements:
-  - httpx
-resources:
-  accelerator: A100
-  use_gpu: true
-secrets:
-  hf_access_token: null
-system_packages:
-- python3.10-venv
-```
-
-## Deploy your Truss
-
-1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
-2. Install the latest version of Truss: `pip install --upgrade truss`
-3. With `vllm` as your working directory, you can deploy the model with:
-
-   ```sh
-   truss push --trusted
-   ```
-
-   Paste your Baseten API key if prompted.
-
-For more information, see [Truss documentation](https://truss.baseten.co).
-
-## Call your model
-
-Once your deployment is up, there are [many ways](https://docs.baseten.co/invoke/quickstart) to call your model.
-
-### curl command
-
-#### If you are NOT using OpenAI compatible server
-
-```
-curl -X POST https://model-.api.baseten.co/development/predict \
-  -H "Authorization: Api-Key $BASETEN_API_KEY" \
-  -d '{"prompt": "what is the meaning of life"}'
-```
-
-
-#### If you are using OpenAI compatible server
-
-```
-curl -X POST "https://model-.api.baseten.co/development/predict" \
-  -H "Content-Type: application/json" \
-  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
-  -d '{
-    "messages": [{"role": "user", "content": "What even is AGI?"}],
-    "max_tokens": 256
-  }'
-```
-
-To access [production metrics](https://docs.vllm.ai/en/latest/serving/metrics.html) reported by OpenAI compatible server, simply add `metrics: true` to the request.
-
-```
-curl -X POST "https://model-.api.baseten.co/development/predict" \
-  -H "Content-Type: application/json" \
-  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
-  -d '{
-    "metrics": true
-  }'
-```
-
-### OpenAI SDK (if you are using OpenAI compatible server)
-
-```
-from openai import OpenAI
-import os
-
-model_id = "abcd1234" # Replace with your model ID
-deployment_id = "4321cbda" # [Optional] Replace with your deployment ID
-
-client = OpenAI(
-    api_key=os.environ["BASETEN_API_KEY"],
-    base_url=f"https://bridge.baseten.co/{model_id}/v1/direct"
-)
-
-response = client.chat.completions.create(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    messages=[
-        {"role": "user", "content": "Who won the world series in 2020?"},
-        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
-        {"role": "user", "content": "Where was it played?"}
-    ],
-    extra_body={
-        "baseten": {
-            "model_id": model_id,
-            "deployment_id": deployment_id
-        }
-    }
-)
-print(response.choices[0].message.content)
-
-```
-
-For more information, see [API reference](https://docs.baseten.co/api-reference/openai).
-
-## Support
-
-If you have any questions or need assistance, please open an issue in this repository or contact our support team.
+**Why use this?**
+- Fully customizable inference and server logic
+- OpenAI-compatible with minimal client changes
diff --git a/vllm/truss_server/README.md b/vllm/truss_server/README.md
new file mode 100644
index 000000000..ffc67c820
--- /dev/null
+++ b/vllm/truss_server/README.md
@@ -0,0 +1,229 @@
+# vLLM Truss: Deploy Chat Completion Models
+
+## Overview
+
+This example demonstrates how to deploy [vLLM](https://github.com/vllm-project/vllm) using a Truss server.
+**Use this approach only if you need custom inference logic or extra flexibility.**
+For most users, we recommend the easier [vLLM server example](https://github.com/basetenlabs/truss-examples/tree/main/vllm/vllm_server), which is also OpenAI-compatible.
+
+This Truss works with asynchronous vLLM engines ([AsyncLLMEngine](https://docs.vllm.ai/en/v0.6.5/dev/engine/async_llm_engine.html#asyncllmengine)) and [all supported chat completion models](https://docs.vllm.ai/en/latest/models/supported_models.html).
+
+---
+
+## Configure Your Truss (`config.yaml`)
+
+### Single GPU Example
+
+To deploy on a single GPU, update these fields:
+- `model_name`
+- `repo_id`
+- `accelerator`
+
+<details>
+<summary>Minimal config example</summary>
+
+```yaml
+model_name: "Llama 3.1 8B Instruct VLLM"
+python_version: py311
+model_metadata:
+  example_model_input: {"prompt": "what is the meaning of life"}
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+  openai_compatible: true
+  vllm_config: null
+requirements:
+  - vllm==0.5.4
+resources:
+  accelerator: A100
+  use_gpu: true
+runtime:
+  predict_concurrency: 128
+secrets:
+  hf_access_token: null
+```
+</details>
+
+---
+
+### Multi-GPU Example (Tensor Parallelism)
+
+For multi-GPU deployments, set:
+- `accelerator` (e.g., `A10G:4`; the GPU count must match `tensor_parallel_size`)
+- `model_metadata.vllm_config.tensor_parallel_size`
+- `model_metadata.vllm_config.distributed_executor_backend`
+
+<details>
+<summary>Multi-GPU config example</summary>
+
+```yaml
+model_name: "Llama 3.1 8B Instruct VLLM"
+python_version: py311
+model_metadata:
+  example_model_input: {"prompt": "what is the meaning of life"}
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+  openai_compatible: false
+  vllm_config:
+    tensor_parallel_size: 4
+    max_model_len: 4096
+    distributed_executor_backend: mp
+requirements:
+  - vllm==0.5.4
+resources:
+  accelerator: A10G:4
+  use_gpu: true
+runtime:
+  predict_concurrency: 128
+secrets:
+  hf_access_token: null
+```
+</details>
+
+---
+
+### Customization
+
+Override any [vLLM engine argument](https://docs.vllm.ai/en/latest/models/engine_args.html) by adding it to `vllm_config` in `model_metadata`.
+
+<details>
+<summary>Example: Model Quantization</summary>
+
+```yaml
+model_name: Mistral 7B v2 vLLM AWQ - T4
+model_metadata:
+  repo_id: TheBloke/Mistral-7B-Instruct-v0.2-AWQ
+  vllm_config:
+    quantization: "awq"
+    dtype: "float16"
+    max_model_len: 8000
+    max_num_seqs: 8
+python_version: py310
+requirements:
+  - vllm==0.5.4
+resources:
+  accelerator: T4
+  use_gpu: true
+runtime:
+  predict_concurrency: 128
+secrets:
+  hf_access_token: null
+```
+</details>
+
+You can even override with your own customized vLLM Docker image to work with models that are not yet supported by vanilla vLLM.
+
+<details>
+<summary>Example: Custom Docker Image</summary>
+
+```yaml
+model_name: Ultravox v0.2
+base_image:
+  image: vshulman/vllm-openai-fixie:latest
+  python_executable_path: /usr/bin/python3
+model_metadata:
+  repo_id: fixie-ai/ultravox-v0.2
+  vllm_config:
+    audio_token_id: 128002
+python_version: py310
+requirements:
+  - httpx
+resources:
+  accelerator: A100
+  use_gpu: true
+runtime:
+  predict_concurrency: 512
+secrets:
+  hf_access_token: null
+system_packages:
+  - python3.10-venv
+```
+</details>
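+
+Under the hood, the Truss server in `model/model.py` turns these settings into a vLLM engine. The sketch below is illustrative only (the actual code in `model/model.py` is authoritative, and the values shown are examples):
+
+```python
+from vllm.engine.arg_utils import AsyncEngineArgs
+from vllm.engine.async_llm_engine import AsyncLLMEngine
+
+# Everything under model_metadata.vllm_config in config.yaml is forwarded to
+# vLLM's engine arguments, so any supported engine argument can be set there.
+vllm_config = {"tensor_parallel_size": 4, "max_model_len": 4096}  # example values
+engine_args = AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct", **vllm_config)
+engine = AsyncLLMEngine.from_engine_args(engine_args)
+```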
+
+---
+
+## Deploy Your Truss
+
+First [sign up for Baseten](https://app.baseten.co/signup) and get an [API key](https://app.baseten.co/settings/account/api_keys).
+
+```sh
+# Install Truss
+pip install --upgrade truss
+
+# Deploy the model from the `vllm/truss_server` directory
+truss push
+```
+
+Paste your Baseten API key if prompted.
+
+---
+
+## Call Your Model
+
+Once your deployment is up, you can call your model in the following ways.
+
+### Curl: Non-OpenAI-Compatible
+
+```sh
+curl -X POST https://model-.api.baseten.co/development/predict \
+  -H "Authorization: Api-Key $BASETEN_API_KEY" \
+  -d '{"prompt": "what is the meaning of life"}'
+```
+
+### Curl: OpenAI-Compatible
+
+```sh
+curl -X POST "https://model-.api.baseten.co/development/predict" \
+  -H "Content-Type: application/json" \
+  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
+  -d '{
+    "messages": [{"role": "user", "content": "What even is AGI?"}],
+    "max_tokens": 256
+  }'
+```
+
+**Production Metrics:**
+Add `"metrics": true` to your request for detailed metrics:
+
+```sh
+curl -X POST "https://model-.api.baseten.co/development/predict" \
+  -H "Content-Type: application/json" \
+  -H 'Authorization: Api-Key {BASETEN_API_KEY}' \
+  -d '{"metrics": true}'
+```
+
+---
+
+### OpenAI SDK (OpenAI-Compatible Only)
+
+```python
+from openai import OpenAI
+import os
+
+model_id = "abcd1234"  # Replace with your model ID
+deployment_id = "4321cbda"  # [Optional]
+
+client = OpenAI(
+    api_key=os.environ["BASETEN_API_KEY"],
+    base_url=f"https://bridge.baseten.co/{model_id}/v1/direct"
+)
+
+response = client.chat.completions.create(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    messages=[
+        {"role": "user", "content": "Who won the world series in 2020?"},
+        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
+        {"role": "user", "content": "Where was it played?"}
+    ],
+    extra_body={
+        "baseten": {
+            "model_id": model_id,
+            "deployment_id": deployment_id
+        }
+    }
+)
+print(response.choices[0].message.content)
+```
+
+---
+
+## Support
+
+Need help? [Contact Baseten support](https://www.baseten.co/talk-to-us/).
diff --git a/vllm/config.yaml b/vllm/truss_server/config.yaml
similarity index 100%
rename from vllm/config.yaml
rename to vllm/truss_server/config.yaml
diff --git a/vllm/model/__init__.py b/vllm/truss_server/model/__init__.py
similarity index 100%
rename from vllm/model/__init__.py
rename to vllm/truss_server/model/__init__.py
diff --git a/vllm/model/helper.py b/vllm/truss_server/model/helper.py
similarity index 100%
rename from vllm/model/helper.py
rename to vllm/truss_server/model/helper.py
diff --git a/vllm/model/model.py b/vllm/truss_server/model/model.py
similarity index 100%
rename from vllm/model/model.py
rename to vllm/truss_server/model/model.py
diff --git a/vllm/vllm_server/README.md b/vllm/vllm_server/README.md
new file mode 100644
index 000000000..ff21ca3ab
--- /dev/null
+++ b/vllm/vllm_server/README.md
@@ -0,0 +1,40 @@
+# vLLM Truss: Deploy a Chat Completion Model
+
+## Overview
+
+This Truss example offers a **codeless, OpenAI-compatible solution** to run a vLLM server within a Truss container. With minimal configuration, you can deploy powerful language models on Baseten: just update your settings and Truss handles the rest.
+
+---
+
+## Configuration Guide
+
+All deployment options are controlled via the `config.yaml` file. Follow the instructions below based on your GPU requirements.
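+
+As a quick preview, a single-GPU deployment touches only a few top-level settings; the snippet below is excerpted from the full `config.yaml` in this directory:
+
+```yaml
+model_name: Llama 3.1 8B Instruct
+model_metadata:
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+resources:
+  accelerator: H100_40GB
+  use_gpu: true
+```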
+
+### 🚀 Basic: Single GPU Deployment
+
+To deploy a model using a single GPU, simply modify the following parameters in `config.yaml`:
+- `model_name`
+- `repo_id`
+- `accelerator`
+
+No additional changes are required.
+
+---
+
+### 🖥️ Advanced: Multi-GPU Deployment (Tensor Parallelism)
+
+If your model requires multiple GPUs (for example, to run with tensor parallelism), you’ll need to configure:
+
+- `accelerator`
+  Example for 4 H100 GPUs:
+  ```yaml
+  accelerator: H100:4
+  ```
+- `--tensor-parallel-size`
+- `--distributed-executor-backend`
+
+The last two are flags on the `vllm serve` command in the `docker_server.start_command` entry of `config.yaml`. Append them to the command, for example: `--tensor-parallel-size 4 --distributed-executor-backend mp`.
+
+## Support
+
+Need help? [Contact Baseten support](https://www.baseten.co/talk-to-us/).
diff --git a/vllm/vllm_server/config.yaml b/vllm/vllm_server/config.yaml
new file mode 100644
index 000000000..56dd7a754
--- /dev/null
+++ b/vllm/vllm_server/config.yaml
@@ -0,0 +1,36 @@
+description: Llama 3.1 8B Instruct model is lightweight, multilingual and fine-tuned on human preferences for safety and helpfulness.
+base_image:
+  image: vllm/vllm-openai:v0.9.2
+model_metadata:
+  repo_id: meta-llama/Llama-3.1-8B-Instruct
+  example_model_input: {
+    "model": "",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "What is the meaning of life?"
+          }
+        ]
+      }
+    ]
+  }
+  tags:
+    - openai-compatible
+docker_server:
+  start_command: sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype half --max-model-len 65536 --port 8000 --served-model-name llama --tensor-parallel-size 1 --gpu-memory-utilization 0.95"
+  readiness_endpoint: /health
+  liveness_endpoint: /health
+  predict_endpoint: /v1/chat/completions
+  server_port: 8000
+resources:
+  accelerator: H100_40GB
+  use_gpu: true
+runtime:
+  predict_concurrency: 16
+model_name: Llama 3.1 8B Instruct
+secrets:
+  hf_access_token: null
+requirements: []