
Commit 76d33b4

fix doc (#330)
1 parent d2a7812 commit 76d33b4

File tree

1 file changed (+51 / -13 lines)

docs/source/cources/deployment.md

Lines changed: 51 additions & 13 deletions
@@ -63,25 +63,63 @@ VLLM supports inference acceleration for the vast majority of LLM models. It uses the following approach to greatly
 pip install vllm
 ```
 
-Then simply run:
-
 ```shell
-VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server --model qwen/Qwen-1_8B-Chat --trust-remote-code
+import os
+os.environ['VLLM_USE_MODELSCOPE'] = 'True'
+from vllm import LLM, SamplingParams
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+llm = LLM(model="qwen/Qwen-1_8B", trust_remote_code=True)
+outputs = llm.generate(prompts, sampling_params)
+
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
 
-Then you can call the service:
+Note that, as of this writing, VLLM's inference support for Chat models (chat templates and stop tokens) has known issues; for actual deployments, please consider using SWIFT or FastChat instead.
 
-```shell
-curl http://localhost:8000/v1/completions \
-    -H "Content-Type: application/json" \
-    -d '{
-        "model": "qwen/Qwen-1_8B-Chat",
-        "prompt": "San Francisco is a",
-        "max_tokens": 7,
-        "temperature": 0
-    }'
+> The generate method of LLM also accepts pre-assembled token ids directly (the prompt_token_ids argument; do not pass the prompts argument in that case), so callers can apply their own template outside VLLM and pass the resulting tokens in. This is the approach SWIFT uses.
+
+In the quantization chapter we covered [AWQ quantization](https://docs.vllm.ai/en/latest/quantization/auto_awq.html); VLLM supports loading a quantized model for inference directly:
+
+```python
+from vllm import LLM, SamplingParams
+import os
+import torch
+os.environ['VLLM_USE_MODELSCOPE'] = 'True'
+
+# Sample prompts.
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+# Create a sampling params object.
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+# Create an LLM.
+llm = LLM(model="ticoAg/Qwen-1_8B-Chat-Int4-awq", quantization="AWQ", dtype=torch.float16, trust_remote_code=True)
+# Generate texts from the prompts. The output is a list of RequestOutput objects
+# that contain the prompt, generated text, and other information.
+outputs = llm.generate(prompts, sampling_params)
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
 
+The official VLLM documentation can be found [here](https://docs.vllm.ai/en/latest/getting_started/quickstart.html).
+
 # SWIFT
 
 In SWIFT, we have added support for VLLM-based inference acceleration.
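
As a supplement to the prompt_token_ids note added in the diff above, here is a minimal sketch of that path. It assumes a vLLM release contemporary with this document, in which LLM.generate still accepts a prompt_token_ids keyword and LLM.get_tokenizer() returns the underlying tokenizer; the "User:"/"Assistant:" template is purely illustrative and is neither Qwen's real chat template nor the exact code SWIFT uses.

```python
import os
os.environ['VLLM_USE_MODELSCOPE'] = 'True'

from vllm import LLM, SamplingParams

llm = LLM(model="qwen/Qwen-1_8B", trust_remote_code=True)
# Assumption: get_tokenizer() exposes the model's tokenizer in this vLLM version.
tokenizer = llm.get_tokenizer()

# Apply your own (here: made-up) template outside VLLM, then tokenize the result.
queries = ["Hello, my name is", "The capital of France is"]
prompt_token_ids = [tokenizer.encode(f"User: {q}\nAssistant:") for q in queries]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Pass prompt_token_ids instead of prompts, as the note describes.
outputs = llm.generate(sampling_params=sampling_params,
                       prompt_token_ids=prompt_token_ids)
for output in outputs:
    print(output.outputs[0].text)
```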
