Commit 429d897

deep-diver and ywang96 authored
add doc about serving option on dstack (#3074)
Co-authored-by: Roger Wang <[email protected]>
1 parent a9bcc7a commit 429d897

File tree

docs/source/serving/deploying_with_dstack.rst
docs/source/serving/integrations.rst

2 files changed: +104 -0 lines changed
docs/source/serving/deploying_with_dstack.rst

Lines changed: 103 additions & 0 deletions

@@ -0,0 +1,103 @@
.. _deploying_with_dstack:

Deploying with dstack
============================

.. raw:: html

    <p align="center">
        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
    </p>

vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas in your cloud environment.
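If you still need to point dstack at your cloud account, credentials are typically declared in `~/.dstack/server/config.yml` before starting the server. The snippet below is only a minimal sketch of a GCP backend entry; the exact schema depends on your dstack version and cloud provider, and `my-gcp-project` is a placeholder:

.. code-block:: yaml

    projects:
      - name: main
        backends:
          - type: gcp
            project_id: my-gcp-project  # placeholder project id
            creds:
              type: default             # use application default credentials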
13+
14+
To install dstack client, run:
15+
16+
.. code-block:: console
17+
18+
$ pip install "dstack[all]
19+
$ dstack server
20+
21+
Next, to configure your dstack project, run:
22+
23+
.. code-block:: console
24+
25+
$ mkdir -p vllm-dstack
26+
$ cd vllm-dstack
27+
$ dstack init
28+
29+
Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
30+
31+
.. code-block:: yaml
32+
33+
type: service
34+
35+
python: "3.11"
36+
env:
37+
- MODEL=NousResearch/Llama-2-7b-chat-hf
38+
port: 8000
39+
resources:
40+
gpu: 24GB
41+
commands:
42+
- pip install vllm
43+
- python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
44+
model:
45+
format: openai
46+
type: chat
47+
name: NousResearch/Llama-2-7b-chat-hf
48+
49+
Then, run the following CLI for provisioning:
50+
51+
.. code-block:: console
52+
53+
$ dstack run . -f serve.dstack.yml
54+
55+
⠸ Getting run plan...
56+
Configuration serve.dstack.yml
57+
Project deep-diver-main
58+
User deep-diver
59+
Min resources 2..xCPU, 8GB.., 1xGPU (24GB)
60+
Max price -
61+
Max duration -
62+
Spot policy auto
63+
Retry policy no
64+
65+
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
66+
1 gcp us-central1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
67+
2 gcp us-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
68+
3 gcp us-west1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100GB (disk) yes $0.223804
69+
...
70+
Shown 3 of 193 offers, $5.876 max
71+
72+
Continue? [y/n]: y
73+
⠙ Submitting run...
74+
⠏ Launching spicy-treefrog-1 (pulling)
75+
spicy-treefrog-1 provisioning completed (running)
76+
Service is published at ...
77+
78+
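Once the run is up, you can check on it from another terminal and, when you are done, shut it down. The commands below are a small illustrative sketch using the run name from the output above; `dstack ps` lists your runs and `dstack stop` terminates one:

.. code-block:: console

    $ dstack ps
    $ dstack stop spicy-treefrog-1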
After the provisioning, you can interact with the model by using the OpenAI SDK:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)

.. note::

    dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. A `Task` is meant for development purposes only. For more hands-on material on serving vLLM with dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__.
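For reference, a gateway-free `Task` variant of the configuration above might look roughly like the following. This is an illustrative sketch adapted from the `Service` example rather than an official configuration, so verify the field names against your dstack version; the key differences are the `type`, the port being listed under `ports` for forwarding instead of being published behind a gateway, and the `model` mapping being dropped since it only applies to a `Service`:

.. code-block:: yaml

    type: task

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    ports:
      - 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000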

docs/source/serving/integrations.rst

Lines changed: 1 addition & 0 deletions
@@ -9,4 +9,5 @@ Integrations
   deploying_with_triton
   deploying_with_bentoml
   deploying_with_lws
+  deploying_with_dstack
   serving_with_langchain
