Commit 585a22f: support LLM & lmdeploy (#1272)
support LLM & lmdeploy (#1272)
1 parent 2d07f9f commit 585a22f

14 files changed: +769 −28 lines
docs/source/LLM/LmDeploy推理加速与部署.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# LmDeploy Inference Acceleration and Deployment

## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Inference Acceleration](#inference-acceleration)
- [Deployment](#deployment)
- [Multimodal](#multimodal)

## Environment Preparation
GPU devices: A10, 3090, V100, and A100 are all supported.
```bash
# Set the pip global mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

pip install lmdeploy
```
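Before moving on, you can sanity-check the installation; a minimal sketch, assuming both packages expose `__version__`:

```python
# Hedged check: confirm that swift and lmdeploy import cleanly.
import lmdeploy
import swift

print(f'lmdeploy: {lmdeploy.__version__}')
print(f'swift: {swift.__version__}')
```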

## Inference Acceleration

### Using Python

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# An interface similar to `transformers.GenerationConfig`
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: 你好!
response: 你好!有什么我能帮助你的吗?
query: 浙江的省会在哪?
response: 浙江省会是杭州市。
{'num_prompt_tokens': 46, 'num_generated_tokens': 13, 'num_samples': 2, 'runtime': 0.2037766759749502, 'samples/s': 9.81466593480922, 'tokens/s': 63.79532857625993}
query: 这有什么好吃的
response: 杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油炸臭豆腐等,都是当地非常有名的传统名菜。此外,当地的点心也非常有特色,比如桂花糕、马蹄酥、绿豆糕等。
history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油炸臭豆腐等,都是当地非常有名的传统名菜。此外,当地的点心也非常有特色,比如桂花糕、马蹄酥、绿豆糕等。']]
{'num_prompt_tokens': 44, 'num_generated_tokens': 53, 'num_samples': 1, 'runtime': 0.6306625790311955, 'samples/s': 1.5856339558566632, 'tokens/s': 84.03859966040315}
"""
```
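Because `generation_config` mirrors `transformers.GenerationConfig` (see the comment above), other sampling knobs can be set the same way; the field names below are assumptions borrowed from `transformers.GenerationConfig` and may differ across swift/lmdeploy versions:

```python
# Hedged sketch: tune sampling via the GenerationConfig-style interface.
# temperature/top_p/repetition_penalty are assumed field names here.
lmdeploy_engine.generation_config.temperature = 0.7
lmdeploy_engine.generation_config.top_p = 0.9
lmdeploy_engine.generation_config.repetition_penalty = 1.05
```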

**TP:**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type, tp=2)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# An interface similar to `transformers.GenerationConfig`
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: 你好!
response: 你好!有什么我能帮助你的吗?
query: 浙江的省会在哪?
response: 浙江省会是杭州市。
{'num_prompt_tokens': 46, 'num_generated_tokens': 13, 'num_samples': 2, 'runtime': 0.2080078640137799, 'samples/s': 9.61502109298861, 'tokens/s': 62.497637104425955}
query: 这有什么好吃的
response: 杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油焖笋等等。杭州的特色小吃也很有风味,比如桂花糕、叫花鸡、油爆虾等。此外,杭州还有许多美味的甜品,如月饼、麻薯、绿豆糕等。
history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油焖笋等等。杭州的特色小吃也很有风味,比如桂花糕、叫花鸡、油爆虾等。此外,杭州还有许多美味的甜品,如月饼、麻薯、绿豆糕等。']]
{'num_prompt_tokens': 44, 'num_generated_tokens': 64, 'num_samples': 1, 'runtime': 0.5715192809584551, 'samples/s': 1.7497222461558426, 'tokens/s': 111.98222375397393}
"""
```
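`generation_info` is filled in on every call, so it can feed quick throughput accounting; a minimal sketch that assumes only the keys visible in the sample output above:

```python
def summarize(info: dict) -> str:
    # Keys match the generation_info dicts printed in the examples above.
    return (f"{info['num_samples']} sample(s), "
            f"{info['num_generated_tokens']} tokens in {info['runtime']:.2f}s "
            f"({info['tokens/s']:.1f} tokens/s)")

print(summarize(generation_info))
```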

### Using CLI
Coming soon...

## Deployment
Coming soon...

## Multimodal
Coming soon...

docs/source/LLM/VLLM推理加速与部署.md

Lines changed: 58 additions & 2 deletions
@@ -27,9 +27,9 @@ pip install -r requirements/llm.txt -U
 ```

 ## Inference Acceleration
-vllm does not support bnb-quantized models. The models supported by vllm can be found in [Supported Models](支持的模型和数据集.md#模型).
+The models supported by vllm can be found in [Supported Models](支持的模型和数据集.md#模型).

-### qwen-7b-chat
+### Using Python
 ```python
 import os
 os.environ['CUDA_VISIBLE_DEVICES'] = '0'
@@ -86,6 +86,62 @@ history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有
 """
 ```

+**TP:**
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
+from swift.llm import (
+    ModelType, get_vllm_engine, get_default_template_type,
+    get_template, inference_vllm, inference_stream_vllm
+)
+if __name__ == '__main__':
+    model_type = ModelType.qwen_7b_chat
+    llm_engine = get_vllm_engine(model_type, tensor_parallel_size=2)
+    template_type = get_default_template_type(model_type)
+    template = get_template(template_type, llm_engine.hf_tokenizer)
+    # An interface similar to `transformers.GenerationConfig`
+    llm_engine.generation_config.max_new_tokens = 256
+    generation_info = {}
+
+    request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
+    resp_list = inference_vllm(llm_engine, template, request_list, generation_info=generation_info)
+    for request, resp in zip(request_list, resp_list):
+        print(f"query: {request['query']}")
+        print(f"response: {resp['response']}")
+    print(generation_info)
+
+    # stream
+    history1 = resp_list[1]['history']
+    request_list = [{'query': '这有什么好吃的', 'history': history1}]
+    gen = inference_stream_vllm(llm_engine, template, request_list, generation_info=generation_info)
+    query = request_list[0]['query']
+    print_idx = 0
+    print(f'query: {query}\nresponse: ', end='')
+    for resp_list in gen:
+        resp = resp_list[0]
+        response = resp['response']
+        delta = response[print_idx:]
+        print(delta, end='', flush=True)
+        print_idx = len(response)
+    print()
+
+    history = resp_list[0]['history']
+    print(f'history: {history}')
+    print(generation_info)
+"""Out[0]
+query: 你好!
+response: 你好!很高兴为你服务。有什么我可以帮助你的吗?
+query: 浙江的省会在哪?
+response: 浙江省会是杭州市。
+{'num_prompt_tokens': 46, 'num_generated_tokens': 19, 'num_samples': 2, 'runtime': 0.18170836701756343, 'samples/s': 11.006647810591383, 'tokens/s': 104.56315420061814}
+query: 这有什么好吃的
+response: 杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。
+history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。']]
+{'num_prompt_tokens': 44, 'num_generated_tokens': 46, 'num_samples': 1, 'runtime': 0.47030443901894614, 'samples/s': 2.1262822908624837, 'tokens/s': 97.80898537967424}
+"""
+```
+
+
 ### Using CLI
 ```bash
 # qwen
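One detail worth noting in the added TP example: the engine is built under an `if __name__ == '__main__':` guard. vllm's tensor parallelism launches worker processes, and with a spawn-style start method the entry module is re-imported, so engine construction must not run at import time. A minimal sketch of the pattern:

```python
from swift.llm import ModelType, get_vllm_engine

def main() -> None:
    # Construct the TP engine only in the real entry process.
    llm_engine = get_vllm_engine(ModelType.qwen_7b_chat, tensor_parallel_size=2)

if __name__ == '__main__':
    # Spawned workers re-import this module; the guard keeps them
    # from recursively building engines of their own.
    main()
```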

docs/source/LLM/index.md

Lines changed: 7 additions & 5 deletions
@@ -4,15 +4,17 @@

 1. [LLM推理文档](LLM推理文档.md)
 2. [LLM微调文档](LLM微调文档.md)
-3. [DPO训练文档](DPO训练文档.md)
+3. [人类偏好对齐训练文档](人类偏好对齐训练文档.md)
 4. [界面训练与推理](../GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md)
 5. [LLM评测文档](LLM评测文档.md)
 6. [LLM量化文档](LLM量化文档.md)
 7. [VLLM推理加速与部署](VLLM推理加速与部署.md)
-8. [LLM实验文档](LLM实验文档.md)
-9. [ORPO最佳实践](ORPO算法最佳实践.md)
-10. [SimPO最佳实践](SimPO算法最佳实践.md)
-11. [人类偏好对齐训练文档](人类偏好对齐训练文档.md)
+8. [LmDeploy推理加速与部署](LmDeploy推理加速与部署.md)
+9. [LLM实验文档](LLM实验文档.md)
+10. [DPO训练文档](DPO训练文档.md)
+11. [ORPO最佳实践](ORPO算法最佳实践.md)
+12. [SimPO最佳实践](SimPO算法最佳实践.md)
+13. [人类偏好对齐训练文档](人类偏好对齐训练文档.md)

 ### ⭐️最佳实践系列

docs/source/index.rst

Lines changed: 5 additions & 3 deletions
@@ -26,6 +26,7 @@ Swift DOCUMENTATION
    LLM/LLM评测文档.md
    LLM/LLM量化文档.md
    LLM/VLLM推理加速与部署.md
+   LLM/LmDeploy推理加速与部署.md
    LLM/LLM实验文档.md
    LLM/命令行参数.md
    LLM/支持的模型和数据集.md
@@ -52,19 +53,20 @@ Swift DOCUMENTATION
    Multi-Modal/internlm-xcomposer2最佳实践.md
    Multi-Modal/phi3-vision最佳实践.md
    Multi-Modal/llava最佳实践.md
+   Multi-Modal/llava-video最佳实践.md
    Multi-Modal/yi-vl最佳实践.md
    Multi-Modal/mplug-owl2最佳实践.md
+   Multi-Modal/florence最佳实践.md
    Multi-Modal/cogvlm最佳实践.md
    Multi-Modal/cogvlm2最佳实践.md
-   Multi-Modal/cogvlm2-video最佳实践.md
-   Multi-Modal/florence最佳实践.md
-   Multi-Modal/mplug-owl2最佳实践.md
    Multi-Modal/glm4v最佳实践.md
+   Multi-Modal/cogvlm2-video最佳实践.md
    Multi-Modal/minicpm-v最佳实践.md
    Multi-Modal/minicpm-v-2最佳实践.md
    Multi-Modal/minicpm-v-2.5最佳实践.md
    Multi-Modal/internvl最佳实践.md
    Multi-Modal/MLLM部署文档.md
+   Multi-Modal/vLLM推理加速文档.md

 .. toctree::
    :maxdepth: 2
docs/source_en/LLM/LmDeploy-inference-acceleration-and-deployment.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# LmDeploy Inference Acceleration and Deployment

## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Inference Acceleration](#inference-acceleration)
- [Deployment](#deployment)
- [Multimodal](#multimodal)

## Environment Preparation
GPU devices: A10, 3090, V100, and A100 are all supported.
```bash
# Set the pip global mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

pip install lmdeploy
```
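A quick one-line check that lmdeploy is importable (assuming the package exposes `__version__`):

```bash
python -c "import lmdeploy; print(lmdeploy.__version__)"
```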

## Inference Acceleration

### Using Python

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# Similar to the `transformers.GenerationConfig` interface
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': 'Hello!'}, {'query': 'Where is the capital of Zhejiang?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': 'Is there anything tasty here?', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: Hello!
response: Hello there! How can I help you today?
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang is Hangzhou. It is located in southeastern China, along the lower reaches of the Qiantang River (also known as the West Lake), and is one of the most prosperous cities in the country. Hangzhou is famous for its natural beauty, cultural heritage, and economic development, with a rich history dating back over 2,000 years. The city is home to many historic landmarks and attractions, including the West Lake, Lingyin Temple, and the Longjing Tea Plantations. Additionally, Hangzhou is a major center for technology, finance, and transportation in China.
{'num_prompt_tokens': 49, 'num_generated_tokens': 135, 'num_samples': 2, 'runtime': 1.5066149180056527, 'samples/s': 1.3274792225258558, 'tokens/s': 89.60484752049527}
query: Is there anything tasty here?
response: Yes, Hangzhou is known for its delicious cuisine! The city has a long history of culinary arts and is considered to be one of the birthplaces of Chinese cuisine. Some of the most popular dishes from Hangzhou include:

 * Dongpo Pork: A dish made with pork belly that has been braised in a soy sauce-based broth until it is tender and flavorful.
 * West Lake Fish in Vinegar Gravy: A dish made with freshwater fish that has been simmered in a tangy vinegar sauce.
 * Longjing Tea Soup: A soup made with Dragon Well tea leaves and chicken or pork, often served as a light meal or appetizer.
 * Xiao Long Bao: Small steamed dumplings filled with meat or vegetables and served with a savory broth.

In addition to these classic dishes, Hangzhou also has a thriving street food scene, with vendors selling everything from steamed buns to grilled meats and seafood. So if you're a foodie, you'll definitely want to try some of the local specialties while you're in Hangzhou!
history: [['Where is the capital of Zhejiang?', 'The capital of Zhejiang is Hangzhou. It is located in southeastern China, along the lower reaches of the Qiantang River (also known as the West Lake), and is one of the most prosperous cities in the country. Hangzhou is famous for its natural beauty, cultural heritage, and economic development, with a rich history dating back over 2,000 years. The city is home to many historic landmarks and attractions, including the West Lake, Lingyin Temple, and the Longjing Tea Plantations. Additionally, Hangzhou is a major center for technology, finance, and transportation in China.'], ['Is there anything tasty here?', "Yes, Hangzhou is known for its delicious cuisine! The city has a long history of culinary arts and is considered to be one of the birthplaces of Chinese cuisine. Some of the most popular dishes from Hangzhou include:\n\n * Dongpo Pork: A dish made with pork belly that has been braised in a soy sauce-based broth until it is tender and flavorful.\n * West Lake Fish in Vinegar Gravy: A dish made with freshwater fish that has been simmered in a tangy vinegar sauce.\n * Longjing Tea Soup: A soup made with Dragon Well tea leaves and chicken or pork, often served as a light meal or appetizer.\n * Xiao Long Bao: Small steamed dumplings filled with meat or vegetables and served with a savory broth.\n\nIn addition to these classic dishes, Hangzhou also has a thriving street food scene, with vendors selling everything from steamed buns to grilled meats and seafood. So if you're a foodie, you'll definitely want to try some of the local specialties while you're in Hangzhou!"]]
{'num_prompt_tokens': 169, 'num_generated_tokens': 216, 'num_samples': 1, 'runtime': 2.4760487159946933, 'samples/s': 0.4038692750834161, 'tokens/s': 87.23576341801788}
"""
```
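For one-off queries it can be convenient to wrap the batched API shown above; a minimal hedged helper (`chat` is not part of swift, and it assumes `generation_info` may be omitted since it is passed as a keyword argument above):

```python
def chat(query: str, history=None) -> str:
    # inference_lmdeploy consumes a list of requests and returns a list of
    # responses, so a single query is just a batch of one.
    request = {'query': query}
    if history is not None:
        request['history'] = history
    return inference_lmdeploy(lmdeploy_engine, template, [request])[0]['response']

print(chat('Hello!'))
```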

### Using CLI
Coming soon...

## Deployment
Coming soon...

## Multimodal
Coming soon...

docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md

Lines changed: 2 additions & 2 deletions
@@ -24,9 +24,9 @@ pip install -r requirements/llm.txt -U
 ```

 ## Inference Acceleration
-vllm does not support bnb quantized models. The models supported by vllm can be found in [Supported Models](Supported-models-datasets.md#Models).
+The models supported by vllm can be found in [Supported Models](Supported-models-datasets.md#Models).

-### qwen-7b-chat
+### Using Python
 ```python
 import os
 os.environ['CUDA_VISIBLE_DEVICES'] = '0'
