pip install vllm
```

After installation, inference can be run directly from Python:

```python
import os
# Download models from ModelScope instead of the Hugging Face Hub.
os.environ['VLLM_USE_MODELSCOPE'] = 'True'
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="qwen/Qwen-1_8B", trust_remote_code=True)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Note that, as of this writing, VLLM's inference support for chat models (chat templates and stop tokens) is still problematic; for real deployments, consider serving through SWIFT or FastChat instead.

> The generate method of LLM also accepts already-assembled tokens as input (the prompt_token_ids argument; do not pass prompts at the same time), so callers can apply their own template outside VLLM, tokenize the prompt themselves, and pass the token ids in. This is the approach SWIFT uses.

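To make the `prompt_token_ids` path concrete, here is a minimal sketch: it assembles a ChatML-style prompt (Qwen's chat format) by hand, tokenizes it, and passes the token ids to VLLM. The prompt string, the tokenizer call, and the model id are illustrative assumptions, and newer VLLM releases expose the same mechanism through `TokensPrompt` rather than the `prompt_token_ids` keyword.

```python
import os
os.environ['VLLM_USE_MODELSCOPE'] = 'True'  # set before importing vllm

from modelscope import AutoTokenizer
from vllm import LLM, SamplingParams

# Assemble a ChatML-style prompt by hand (Qwen's chat format) and tokenize it
# ourselves instead of handing a raw string to VLLM. The template and model id
# below are assumptions for illustration; use whatever your model expects.
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-1_8B-Chat", trust_remote_code=True)
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
prompt_ids = tokenizer.encode(prompt)

llm = LLM(model="qwen/Qwen-1_8B-Chat", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# Pass the pre-tokenized prompt; `prompts` must not be passed at the same time.
outputs = llm.generate(prompt_token_ids=[prompt_ids], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
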
In the quantization chapter we covered [AWQ quantization](https://docs.vllm.ai/en/latest/quantization/auto_awq.html); VLLM can directly load a quantized model for inference:

```python
import os
# Download models from ModelScope; set this before importing vllm, since some
# versions read the environment variable at import time.
os.environ['VLLM_USE_MODELSCOPE'] = 'True'

import torch
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM from the AWQ-quantized checkpoint (AWQ kernels run in float16).
llm = LLM(model="ticoAg/Qwen-1_8B-Chat-Int4-awq", quantization="AWQ", dtype=torch.float16, trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

The official VLLM documentation is available [here](https://docs.vllm.ai/en/latest/getting_started/quickstart.html).

# SWIFT

In SWIFT, we have added support for VLLM-based inference acceleration.
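
A rough sketch of what this looks like follows; the helper names below (`get_vllm_engine`, `get_default_template_type`, `get_template`, `inference_vllm`) and the `ModelType` constant are assumptions about the ms-swift Python API and should be checked against the installed version.

```python
# Sketch of VLLM-accelerated inference via SWIFT. The imports and helper
# signatures below are assumed and may differ between ms-swift versions.
from swift.llm import (
    ModelType, get_default_template_type, get_template,
    get_vllm_engine, inference_vllm,
)

model_type = ModelType.qwen_1_8b_chat      # assumed constant name
llm_engine = get_vllm_engine(model_type)   # builds a VLLM engine under the hood

# SWIFT applies the chat template itself and feeds token ids to VLLM,
# which works around the chat-template issue mentioned above.
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)

request_list = [{'query': 'Hello!'}, {'query': 'What is the capital of France?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
```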