Commit 585a22f: support LLM & lmdeploy (#1272)
support LLM & lmdeploy (#1272)
1 parent 2d07f9f commit 585a22f

14 files changed: +769 −28 lines
docs/source/LLM/LmDeploy推理加速与部署.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# LmDeploy Inference Acceleration and Deployment

## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Inference Acceleration](#inference-acceleration)
- [Deployment](#deployment)
- [Multimodal](#multimodal)

## Environment Preparation
GPU devices: A10, 3090, V100, and A100 are all supported.
```bash
# Set the pip global mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

pip install lmdeploy
```
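Before moving on, you can sanity-check the installation; a minimal sketch, assuming both packages expose `__version__`:

```python
# Hedged check: confirm that swift and lmdeploy import cleanly.
import lmdeploy
import swift

print(f'lmdeploy: {lmdeploy.__version__}')
print(f'swift: {swift.__version__}')
```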

## Inference Acceleration

### Using Python

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# An interface similar to `transformers.GenerationConfig`
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: 你好!
response: 你好!有什么我能帮助你的吗?
query: 浙江的省会在哪?
response: 浙江省会是杭州市。
{'num_prompt_tokens': 46, 'num_generated_tokens': 13, 'num_samples': 2, 'runtime': 0.2037766759749502, 'samples/s': 9.81466593480922, 'tokens/s': 63.79532857625993}
query: 这有什么好吃的
response: 杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油炸臭豆腐等,都是当地非常有名的传统名菜。此外,当地的点心也非常有特色,比如桂花糕、马蹄酥、绿豆糕等。
history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油炸臭豆腐等,都是当地非常有名的传统名菜。此外,当地的点心也非常有特色,比如桂花糕、马蹄酥、绿豆糕等。']]
{'num_prompt_tokens': 44, 'num_generated_tokens': 53, 'num_samples': 1, 'runtime': 0.6306625790311955, 'samples/s': 1.5856339558566632, 'tokens/s': 84.03859966040315}
"""
```
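Because `generation_config` mirrors `transformers.GenerationConfig` (see the comment above), other sampling knobs can be set the same way; the field names below are assumptions borrowed from `transformers.GenerationConfig` and may differ across swift/lmdeploy versions:

```python
# Hedged sketch: tune sampling via the GenerationConfig-style interface.
# temperature/top_p/repetition_penalty are assumed field names here.
lmdeploy_engine.generation_config.temperature = 0.7
lmdeploy_engine.generation_config.top_p = 0.9
lmdeploy_engine.generation_config.repetition_penalty = 1.05
```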

**TP:**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type, tp=2)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# An interface similar to `transformers.GenerationConfig`
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: 你好!
response: 你好!有什么我能帮助你的吗?
query: 浙江的省会在哪?
response: 浙江省会是杭州市。
{'num_prompt_tokens': 46, 'num_generated_tokens': 13, 'num_samples': 2, 'runtime': 0.2080078640137799, 'samples/s': 9.61502109298861, 'tokens/s': 62.497637104425955}
query: 这有什么好吃的
response: 杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油焖笋等等。杭州的特色小吃也很有风味,比如桂花糕、叫花鸡、油爆虾等。此外,杭州还有许多美味的甜品,如月饼、麻薯、绿豆糕等。
history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油焖笋等等。杭州的特色小吃也很有风味,比如桂花糕、叫花鸡、油爆虾等。此外,杭州还有许多美味的甜品,如月饼、麻薯、绿豆糕等。']]
{'num_prompt_tokens': 44, 'num_generated_tokens': 64, 'num_samples': 1, 'runtime': 0.5715192809584551, 'samples/s': 1.7497222461558426, 'tokens/s': 111.98222375397393}
"""
```
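`generation_info` is filled in on every call, so it can feed quick throughput accounting; a minimal sketch that assumes only the keys visible in the sample output above:

```python
def summarize(info: dict) -> str:
    # Keys match the generation_info dicts printed in the examples above.
    return (f"{info['num_samples']} sample(s), "
            f"{info['num_generated_tokens']} tokens in {info['runtime']:.2f}s "
            f"({info['tokens/s']:.1f} tokens/s)")

print(summarize(generation_info))
```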

### Using CLI
Coming soon...

## Deployment
Coming soon...

## Multimodal
Coming soon...

docs/source/LLM/VLLM推理加速与部署.md

Lines changed: 58 additions & 2 deletions
@@ -27,9 +27,9 @@ pip install -r requirements/llm.txt -U
 ```

 ## Inference Acceleration
-vllm does not support bnb-quantized models. The models supported by vllm can be found in [Supported Models](支持的模型和数据集.md#模型).
+The models supported by vllm can be found in [Supported Models](支持的模型和数据集.md#模型).

-### qwen-7b-chat
+### Using Python
 ```python
 import os
 os.environ['CUDA_VISIBLE_DEVICES'] = '0'
@@ -86,6 +86,62 @@ history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有
 """
 ```

+**TP:**
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
+from swift.llm import (
+    ModelType, get_vllm_engine, get_default_template_type,
+    get_template, inference_vllm, inference_stream_vllm
+)
+if __name__ == '__main__':
+    model_type = ModelType.qwen_7b_chat
+    llm_engine = get_vllm_engine(model_type, tensor_parallel_size=2)
+    template_type = get_default_template_type(model_type)
+    template = get_template(template_type, llm_engine.hf_tokenizer)
+    # An interface similar to `transformers.GenerationConfig`
+    llm_engine.generation_config.max_new_tokens = 256
+    generation_info = {}
+
+    request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
+    resp_list = inference_vllm(llm_engine, template, request_list, generation_info=generation_info)
+    for request, resp in zip(request_list, resp_list):
+        print(f"query: {request['query']}")
+        print(f"response: {resp['response']}")
+    print(generation_info)
+
+    # stream
+    history1 = resp_list[1]['history']
+    request_list = [{'query': '这有什么好吃的', 'history': history1}]
+    gen = inference_stream_vllm(llm_engine, template, request_list, generation_info=generation_info)
+    query = request_list[0]['query']
+    print_idx = 0
+    print(f'query: {query}\nresponse: ', end='')
+    for resp_list in gen:
+        resp = resp_list[0]
+        response = resp['response']
+        delta = response[print_idx:]
+        print(delta, end='', flush=True)
+        print_idx = len(response)
+    print()
+
+    history = resp_list[0]['history']
+    print(f'history: {history}')
+    print(generation_info)
+"""Out[0]
+query: 你好!
+response: 你好!很高兴为你服务。有什么我可以帮助你的吗?
+query: 浙江的省会在哪?
+response: 浙江省会是杭州市。
+{'num_prompt_tokens': 46, 'num_generated_tokens': 19, 'num_samples': 2, 'runtime': 0.18170836701756343, 'samples/s': 11.006647810591383, 'tokens/s': 104.56315420061814}
+query: 这有什么好吃的
+response: 杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。
+history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。']]
+{'num_prompt_tokens': 44, 'num_generated_tokens': 46, 'num_samples': 1, 'runtime': 0.47030443901894614, 'samples/s': 2.1262822908624837, 'tokens/s': 97.80898537967424}
+"""
+```
+
+
 ### Using CLI
 ```bash
 # qwen
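One detail worth noting in the added TP example: the engine is built under an `if __name__ == '__main__':` guard. vllm's tensor parallelism launches worker processes, and with a spawn-style start method the entry module is re-imported, so engine construction must not run at import time. A minimal sketch of the pattern:

```python
from swift.llm import ModelType, get_vllm_engine

def main() -> None:
    # Construct the TP engine only in the real entry process.
    llm_engine = get_vllm_engine(ModelType.qwen_7b_chat, tensor_parallel_size=2)

if __name__ == '__main__':
    # Spawned workers re-import this module; the guard keeps them
    # from recursively building engines of their own.
    main()
```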

docs/source/LLM/index.md

Lines changed: 7 additions & 5 deletions
@@ -4,15 +4,17 @@

 1. [LLM推理文档](LLM推理文档.md)
 2. [LLM微调文档](LLM微调文档.md)
-3. [DPO训练文档](DPO训练文档.md)
+3. [人类偏好对齐训练文档](人类偏好对齐训练文档.md)
 4. [界面训练与推理](../GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md)
 5. [LLM评测文档](LLM评测文档.md)
 6. [LLM量化文档](LLM量化文档.md)
 7. [VLLM推理加速与部署](VLLM推理加速与部署.md)
-8. [LLM实验文档](LLM实验文档.md)
-9. [ORPO最佳实践](ORPO算法最佳实践.md)
-10. [SimPO最佳实践](SimPO算法最佳实践.md)
-11. [人类偏好对齐训练文档](人类偏好对齐训练文档.md)
+8. [LmDeploy推理加速与部署](LmDeploy推理加速与部署.md)
+9. [LLM实验文档](LLM实验文档.md)
+10. [DPO训练文档](DPO训练文档.md)
+11. [ORPO最佳实践](ORPO算法最佳实践.md)
+12. [SimPO最佳实践](SimPO算法最佳实践.md)
+13. [人类偏好对齐训练文档](人类偏好对齐训练文档.md)

 ### ⭐️最佳实践系列

docs/source/index.rst

Lines changed: 5 additions & 3 deletions
@@ -26,6 +26,7 @@ Swift DOCUMENTATION
    LLM/LLM评测文档.md
    LLM/LLM量化文档.md
    LLM/VLLM推理加速与部署.md
+   LLM/LmDeploy推理加速与部署.md
    LLM/LLM实验文档.md
    LLM/命令行参数.md
    LLM/支持的模型和数据集.md
@@ -52,19 +53,20 @@ Swift DOCUMENTATION
    Multi-Modal/internlm-xcomposer2最佳实践.md
    Multi-Modal/phi3-vision最佳实践.md
    Multi-Modal/llava最佳实践.md
+   Multi-Modal/llava-video最佳实践.md
    Multi-Modal/yi-vl最佳实践.md
    Multi-Modal/mplug-owl2最佳实践.md
+   Multi-Modal/florence最佳实践.md
    Multi-Modal/cogvlm最佳实践.md
    Multi-Modal/cogvlm2最佳实践.md
-   Multi-Modal/cogvlm2-video最佳实践.md
-   Multi-Modal/florence最佳实践.md
-   Multi-Modal/mplug-owl2最佳实践.md
    Multi-Modal/glm4v最佳实践.md
+   Multi-Modal/cogvlm2-video最佳实践.md
    Multi-Modal/minicpm-v最佳实践.md
    Multi-Modal/minicpm-v-2最佳实践.md
    Multi-Modal/minicpm-v-2.5最佳实践.md
    Multi-Modal/internvl最佳实践.md
    Multi-Modal/MLLM部署文档.md
+   Multi-Modal/vLLM推理加速文档.md

 .. toctree::
    :maxdepth: 2
docs/source_en/LLM/LmDeploy-inference-acceleration-and-deployment.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# LmDeploy Inference Acceleration and Deployment

## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Inference Acceleration](#inference-acceleration)
- [Deployment](#deployment)
- [Multimodal](#multimodal)

## Environment Preparation
GPU devices: A10, 3090, V100, and A100 are all supported.
```bash
# Set the pip global mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

pip install lmdeploy
```
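A quick one-line check that lmdeploy is importable (assuming the package exposes `__version__`):

```bash
python -c "import lmdeploy; print(lmdeploy.__version__)"
```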

## Inference Acceleration

### Using Python

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# Similar to the `transformers.GenerationConfig` interface
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': 'Hello!'}, {'query': 'Where is the capital of Zhejiang?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': 'Is there anything tasty here?', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: Hello!
response: Hello there! How can I help you today?
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang is Hangzhou. It is located in southeastern China, along the lower reaches of the Qiantang River (also known as the West Lake), and is one of the most prosperous cities in the country. Hangzhou is famous for its natural beauty, cultural heritage, and economic development, with a rich history dating back over 2,000 years. The city is home to many historic landmarks and attractions, including the West Lake, Lingyin Temple, and the Longjing Tea Plantations. Additionally, Hangzhou is a major center for technology, finance, and transportation in China.
{'num_prompt_tokens': 49, 'num_generated_tokens': 135, 'num_samples': 2, 'runtime': 1.5066149180056527, 'samples/s': 1.3274792225258558, 'tokens/s': 89.60484752049527}
query: Is there anything tasty here?
response: Yes, Hangzhou is known for its delicious cuisine! The city has a long history of culinary arts and is considered to be one of the birthplaces of Chinese cuisine. Some of the most popular dishes from Hangzhou include:

 * Dongpo Pork: A dish made with pork belly that has been braised in a soy sauce-based broth until it is tender and flavorful.
 * West Lake Fish in Vinegar Gravy: A dish made with freshwater fish that has been simmered in a tangy vinegar sauce.
 * Longjing Tea Soup: A soup made with Dragon Well tea leaves and chicken or pork, often served as a light meal or appetizer.
 * Xiao Long Bao: Small steamed dumplings filled with meat or vegetables and served with a savory broth.

In addition to these classic dishes, Hangzhou also has a thriving street food scene, with vendors selling everything from steamed buns to grilled meats and seafood. So if you're a foodie, you'll definitely want to try some of the local specialties while you're in Hangzhou!
history: [['Where is the capital of Zhejiang?', 'The capital of Zhejiang is Hangzhou. It is located in southeastern China, along the lower reaches of the Qiantang River (also known as the West Lake), and is one of the most prosperous cities in the country. Hangzhou is famous for its natural beauty, cultural heritage, and economic development, with a rich history dating back over 2,000 years. The city is home to many historic landmarks and attractions, including the West Lake, Lingyin Temple, and the Longjing Tea Plantations. Additionally, Hangzhou is a major center for technology, finance, and transportation in China.'], ['Is there anything tasty here?', "Yes, Hangzhou is known for its delicious cuisine! The city has a long history of culinary arts and is considered to be one of the birthplaces of Chinese cuisine. Some of the most popular dishes from Hangzhou include:\n\n * Dongpo Pork: A dish made with pork belly that has been braised in a soy sauce-based broth until it is tender and flavorful.\n * West Lake Fish in Vinegar Gravy: A dish made with freshwater fish that has been simmered in a tangy vinegar sauce.\n * Longjing Tea Soup: A soup made with Dragon Well tea leaves and chicken or pork, often served as a light meal or appetizer.\n * Xiao Long Bao: Small steamed dumplings filled with meat or vegetables and served with a savory broth.\n\nIn addition to these classic dishes, Hangzhou also has a thriving street food scene, with vendors selling everything from steamed buns to grilled meats and seafood. So if you're a foodie, you'll definitely want to try some of the local specialties while you're in Hangzhou!"]]
{'num_prompt_tokens': 169, 'num_generated_tokens': 216, 'num_samples': 1, 'runtime': 2.4760487159946933, 'samples/s': 0.4038692750834161, 'tokens/s': 87.23576341801788}
"""
```
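For one-off queries it can be convenient to wrap the batched API shown above; a minimal hedged helper (`chat` is not part of swift, and it assumes `generation_info` may be omitted since it is passed as a keyword argument above):

```python
def chat(query: str, history=None) -> str:
    # inference_lmdeploy consumes a list of requests and returns a list of
    # responses, so a single query is just a batch of one.
    request = {'query': query}
    if history is not None:
        request['history'] = history
    return inference_lmdeploy(lmdeploy_engine, template, [request])[0]['response']

print(chat('Hello!'))
```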

### Using CLI
Coming soon...

## Deployment
Coming soon...

## Multimodal
Coming soon...

docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md

Lines changed: 2 additions & 2 deletions
@@ -24,9 +24,9 @@ pip install -r requirements/llm.txt -U
 ```

 ## Inference Acceleration
-vllm does not support bnb quantized models. The models supported by vllm can be found in [Supported Models](Supported-models-datasets.md#Models).
+The models supported by vllm can be found in [Supported Models](Supported-models-datasets.md#Models).

-### qwen-7b-chat
+### Using Python
 ```python
 import os
 os.environ['CUDA_VISIBLE_DEVICES'] = '0'
