Commit 285ac51

support qwen2-audio (#1633)
1 parent d55fbb1 commit 285ac51

17 files changed (+132, -20 lines)

README.md

Lines changed: 2 additions & 1 deletion
@@ -55,6 +55,7 @@ You can contact us and communicate with us by adding our group:
 <img src="asset/discord_qr.jpg" width="200" height="200"> | <img src="asset/wechat.png" width="200" height="200">

 ## 🎉 News
+- 🔥2024.08.09: Support for inference and fine-tuning of the qwen2-audio model. Best practice can be found [here](https://github.com/modelscope/ms-swift/issues/1653).
 - 🔥2024.08.08: Supports the qwen2-math series models: 1.5B, 7B, 72B. Use `swift infer --model_type qwen2-math-1_5b-instruct` for an experience.
 - 🔥2024.08.07: Support for using vLLM for accelerating inference and deployment of multimodal large models such as the llava series and phi3-vision models. You can refer to the [Multimodal & vLLM Inference Acceleration Documentation](docs/source_en/Multi-Modal/vllm-inference-acceleration.md) for more information.
 - 2024.08.06: Support for minicpm-v-v2_6-chat is available. You can use `swift infer --model_type minicpm-v-v2_6-chat` for inference experience. Best practices can be found [here](https://github.com/modelscope/swift/issues/1613).
@@ -609,7 +610,7 @@ The complete list of supported models and datasets can be found at [Supported Mo
 | Model Type | Model Introduction | Language | Model Size | Model Type |
 |---------------------------------------------------------|----------------------------------------------------------------------------------------|--------------------|---------------------------------------|--------------------------|
 | Qwen-VL | [Tongyi Qwen vision model](https://github.com/QwenLM) | Chinese<br>English | 7B<br>including quantized versions | base model<br>chat model |
-| Qwen-Audio | [Tongyi Qwen speech model](https://github.com/QwenLM) | Chinese<br>English | 7B | base model<br>chat model |
+| Qwen-Audio<br>Qwen2-Audio | [Tongyi Qwen speech model](https://github.com/QwenLM) | Chinese<br>English | 7B | base model<br>chat model |
 | YI-VL | [01AI's YI series vision models](https://github.com/01-ai) | Chinese<br>English | 6B-34B | chat model |
 | XComposer2<br>XComposer2.5 | [Pujiang AI Lab InternLM vision model](https://github.com/InternLM/InternLM-XComposer) | Chinese<br>English | 7B | chat model |
 | DeepSeek-VL | [DeepSeek series vision models](https://github.com/deepseek-ai) | Chinese<br>English | 1.3B-7B | chat model |
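The 2024.08.09 news entry above announces inference and fine-tuning support for qwen2-audio; the model types this commit registers (see the supported-models table changes further down) are `qwen2-audio-7b` and `qwen2-audio-7b-instruct`, with `librosa` as an extra requirement. A minimal sketch of how one might try the instruct model with the swift CLI, assuming the CLI is installed; the `swift infer --model_type ...` pattern is taken from the news entries in this diff, while `swift sft`, the `--dataset` flag, and the dataset name are assumptions based on the project's usual CLI and are not part of this commit:

```shell
# Extra requirement listed for the qwen2-audio rows in the supported-models table.
pip install librosa

# Inference, following the `swift infer --model_type ...` pattern shown in the README news entries.
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-audio-7b-instruct

# Hypothetical fine-tuning call: `swift sft` and --dataset are assumed from the
# project's usual CLI; the dataset name is a placeholder, not from this commit.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen2-audio-7b-instruct \
    --dataset <your-audio-dataset>
```

The best-practice issue linked in the news entry (https://github.com/modelscope/ms-swift/issues/1653) covers the inference and fine-tuning workflow.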

README_CN.md

Lines changed: 2 additions & 1 deletion
@@ -56,6 +56,7 @@ SWIFT具有丰富全面的文档,请查看我们的文档网站:


 ## 🎉 新闻
+- 🔥2024.08.09: 支持qwen2-audio模型的推理与微调. 最佳实践可以查看[这里](https://github.com/modelscope/ms-swift/issues/1653).
 - 🔥2024.08.08: 支持qwen2-math系列模型, 1.5B, 7B, 72B. 使用`swift infer --model_type qwen2-math-1_5b-instruct`进行体验.
 - 🔥2024.08.07: 支持使用vllm对多模态大模型: llava系列, internvl2系列, phi3-vision, minicpm-v2.5进行推理加速和部署. 可以查看[多模态&vLLM推理加速文档](docs/source/Multi-Modal/vLLM推理加速文档.md)获取更多信息.
 - 2024.08.06: 支持minicpm-v-v2_6-chat, 使用`swift infer --model_type minicpm-v-v2_6-chat`进行推理体验, 最佳实践可以查看[这里](https://github.com/modelscope/swift/issues/1613).
@@ -603,7 +604,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
 | 模型类型 | 模型介绍 | 语言 | 模型大小 | 模型类型 |
 |---------------------------------------------------------|----------------------------------------------------------------------------|----------|------------------|------------------|
 | Qwen-VL | [通义千问视觉模型](https://github.com/QwenLM) | 中文<br>英文 | 7B<br>包含量化版本 | base模型<br>chat模型 |
-| Qwen-Audio | [通义千问语音模型](https://github.com/QwenLM) | 中文<br>英文 | 7B | base模型<br>chat模型 |
+| Qwen-Audio<br>Qwen2-Audio | [通义千问语音模型](https://github.com/QwenLM) | 中文<br>英文 | 7B | base模型<br>chat模型 |
 | YI-VL | [01AI的YI系列视觉模型](https://github.com/01-ai) | 中文<br>英文 | 6B-34B | chat模型 |
 | XComposer2<br>XComposer2.5 | [浦江实验室书生浦语视觉模型](https://github.com/InternLM/InternLM-XComposer) | 中文<br>英文 | 7B | chat模型 |
 | DeepSeek-VL | [幻方系列视觉模型](https://github.com/deepseek-ai) | 中文<br>英文 | 1.3B-7B | chat模型 |

docs/source/LLM/VLLM推理加速与部署.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 # VLLM推理加速与部署
 vllm支持的模型可以查看[支持的模型](支持的模型和数据集.md#模型).

+llama3.1 405b推理加速与部署最佳实践可以查看[这里](https://github.com/modelscope/ms-swift/issues/1484)
+
 ## 目录
 - [环境准备](#环境准备)
 - [推理加速](#推理加速)

docs/source/LLM/支持的模型和数据集.md

Lines changed: 5 additions & 3 deletions
@@ -360,6 +360,8 @@
 |qwen-vl-chat-int4|[qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary)|c_attn|qwen|&#x2714;|&#x2718;|&#x2718;|&#x2718;|auto_gptq>=0.5|vision|[Qwen/Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)|
 |qwen-audio|[qwen/Qwen-Audio](https://modelscope.cn/models/qwen/Qwen-Audio/summary)|^(transformer.h)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|qwen-audio-generation|&#x2714;|&#x2718;|&#x2718;|&#x2718;||audio|[Qwen/Qwen-Audio](https://huggingface.co/Qwen/Qwen-Audio)|
 |qwen-audio-chat|[qwen/Qwen-Audio-Chat](https://modelscope.cn/models/qwen/Qwen-Audio-Chat/summary)|^(transformer.h)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|qwen-audio|&#x2714;|&#x2718;|&#x2718;|&#x2718;||audio|[Qwen/Qwen-Audio-Chat](https://huggingface.co/Qwen/Qwen-Audio-Chat)|
+|qwen2-audio-7b|[qwen/Qwen2-Audio-7B](https://modelscope.cn/models/qwen/Qwen2-Audio-7B/summary)|^(language_model\|multi_modal_projector)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|qwen2-audio-generation|&#x2714;|&#x2718;|&#x2718;|&#x2718;|librosa|audio|[Qwen/Qwen2-Audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B)|
+|qwen2-audio-7b-instruct|[qwen/Qwen2-Audio-7B-Instruct](https://modelscope.cn/models/qwen/Qwen2-Audio-7B-Instruct/summary)|^(language_model\|multi_modal_projector)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|qwen2-audio|&#x2714;|&#x2718;|&#x2718;|&#x2718;|librosa|audio|[Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct)|
 |glm4v-9b-chat|[ZhipuAI/glm-4v-9b](https://modelscope.cn/models/ZhipuAI/glm-4v-9b/summary)|^(transformer.encoder)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|glm4v|&#x2718;|&#x2718;|&#x2718;|&#x2718;|transformers>=4.42|vision|[THUDM/glm-4v-9b](https://huggingface.co/THUDM/glm-4v-9b)|
 |llava1_5-7b-instruct|[swift/llava-1.5-7b-hf](https://modelscope.cn/models/swift/llava-1.5-7b-hf/summary)|^(language_model\|multi_modal_projector)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|llava1_5|&#x2714;|&#x2714;|&#x2714;|&#x2718;|transformers>=4.36|vision|[llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)|
 |llava1_5-13b-instruct|[swift/llava-1.5-13b-hf](https://modelscope.cn/models/swift/llava-1.5-13b-hf/summary)|^(language_model\|multi_modal_projector)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|llava1_5|&#x2714;|&#x2714;|&#x2714;|&#x2718;|transformers>=4.36|vision|[llava-hf/llava-1.5-13b-hf](https://huggingface.co/llava-hf/llava-1.5-13b-hf)|
@@ -377,9 +379,9 @@
 |yi-vl-6b-chat|[01ai/Yi-VL-6B](https://modelscope.cn/models/01ai/Yi-VL-6B/summary)|^(model.layers\|model.mm_projector)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|yi-vl|&#x2714;|&#x2718;|&#x2718;|&#x2718;|transformers>=4.34|vision|[01-ai/Yi-VL-6B](https://huggingface.co/01-ai/Yi-VL-6B)|
 |yi-vl-34b-chat|[01ai/Yi-VL-34B](https://modelscope.cn/models/01ai/Yi-VL-34B/summary)|^(model.layers\|model.mm_projector)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|yi-vl|&#x2714;|&#x2718;|&#x2718;|&#x2718;|transformers>=4.34|vision|[01-ai/Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)|
 |llava-llama-3-8b-v1_1|[AI-ModelScope/llava-llama-3-8b-v1_1-transformers](https://modelscope.cn/models/AI-ModelScope/llava-llama-3-8b-v1_1-transformers/summary)|^(language_model\|multi_modal_projector)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|llava-llama-instruct|&#x2714;|&#x2718;|&#x2714;|&#x2718;|transformers>=4.36|vision|[xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers)|
-|internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|^(model\|vision_proj)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|internlm-xcomposer2|&#x2718;|&#x2718;|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)|
-|internlm-xcomposer2-4khd-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-4khd-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-4khd-7b/summary)|^(model\|vision_proj)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|internlm-xcomposer2-4khd|&#x2718;|&#x2718;|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-4khd-7b](https://huggingface.co/internlm/internlm-xcomposer2-4khd-7b)|
-|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|^(model\|vision_proj)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|internlm-xcomposer2_5|&#x2714;|&#x2718;|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)|
+|internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|attention.wqkv, attention.wo, feed_forward.w1, feed_forward.w2, feed_forward.w3|internlm-xcomposer2|&#x2718;|&#x2718;|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)|
+|internlm-xcomposer2-4khd-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-4khd-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-4khd-7b/summary)|attention.wqkv, attention.wo, feed_forward.w1, feed_forward.w2, feed_forward.w3|internlm-xcomposer2-4khd|&#x2718;|&#x2718;|&#x2714;|&#x2718;||vision|[internlm/internlm-xcomposer2-4khd-7b](https://huggingface.co/internlm/internlm-xcomposer2-4khd-7b)|
+|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|attention.wqkv, attention.wo, feed_forward.w1, feed_forward.w2, feed_forward.w3|internlm-xcomposer2_5|&#x2714;|&#x2718;|&#x2714;|&#x2718;|decord|vision|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)|
 |internvl-chat-v1_5|[AI-ModelScope/InternVL-Chat-V1-5](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)|^(language_model\|mlp1)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|internvl|&#x2714;|&#x2718;|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)|
 |internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|^(language_model\|mlp1)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|internvl|&#x2714;|&#x2718;|&#x2718;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)|
 |mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|^(language_model\|mlp1)(?!.\*(lm_head\|output\|emb\|wte\|shared)).\*|internvl|&#x2714;|&#x2718;|&#x2714;|&#x2718;|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)|
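The qwen2-audio rows added in the first hunk above use the LoRA target-module pattern `^(language_model|multi_modal_projector)(?!.*(lm_head|output|emb|wte|shared)).*` (the backslashes in the table are only markdown escaping), the same convention the existing llava entries follow. A quick sketch of how that pattern filters module names; it needs a PCRE-capable grep (`grep -P`) for the negative lookahead, and the module names below are hypothetical, chosen only to illustrate the match behaviour:

```shell
pattern='^(language_model|multi_modal_projector)(?!.*(lm_head|output|emb|wte|shared)).*'

# Only the first two hypothetical names pass the filter: lm_head is rejected by
# the negative lookahead, and audio_tower falls outside the two allowed prefixes.
printf '%s\n' \
  'language_model.model.layers.0.self_attn.q_proj' \
  'multi_modal_projector.linear_1' \
  'language_model.lm_head' \
  'audio_tower.layers.0.self_attn.k_proj' \
| grep -P "$pattern"
```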

docs/source/Multi-Modal/index.md

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@

 一轮对话可以包含多张图片(或不含图片):
 1. [Qwen-VL最佳实践](qwen-vl最佳实践.md)
-2. [Qwen-Audio最佳实践](qwen-audio最佳实践.md)
+2. [Qwen-Audio最佳实践](qwen-audio最佳实践.md), [Qwen2-Audio最佳实践](https://github.com/modelscope/ms-swift/issues/1653)
 3. [Llava最佳实践](llava最佳实践.md), [LLava Video最佳实践](llava-video最佳实践.md)
 4. [InternVL系列最佳实践](internvl最佳实践.md)
 5. [Deepseek-VL最佳实践](deepseek-vl最佳实践.md)
@@ -26,4 +26,4 @@

 整个对话围绕一张图片(可能可以不含图片):
 1. [CogVLM最佳实践](cogvlm最佳实践.md), [CogVLM2最佳实践](cogvlm2最佳实践.md), [glm4v最佳实践](glm4v最佳实践.md), [CogVLM2-Video最佳实践](cogvlm2-video最佳实践.md)
-2. [MiniCPM-V最佳实践](minicpm-v最佳实践.md), [MiniCPM-V-2最佳实践](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5最佳实践](minicpm-v-2.5最佳实践.md)
+2. [MiniCPM-V最佳实践](minicpm-v最佳实践.md), [MiniCPM-V-2最佳实践](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5最佳实践](minicpm-v-2.5最佳实践.md), [MiniCPM-V-2.6最佳实践](https://github.com/modelscope/ms-swift/issues/1613)

docs/source/Multi-Modal/minicpm-v-2.5最佳实践.md

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,8 @@

 # MiniCPM-V-2.5 最佳实践

+MiniCPM-V-2.6 最佳实践: [https://github.com/modelscope/ms-swift/issues/1613](https://github.com/modelscope/ms-swift/issues/1613)
+
 ## 目录
 - [环境准备](#环境准备)
 - [推理](#推理)

docs/source/Multi-Modal/qwen-audio最佳实践.md

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 # Qwen-Audio 最佳实践

+Qwen2-Audio的最佳实践可以查看: [https://github.com/modelscope/ms-swift/issues/1653](https://github.com/modelscope/ms-swift/issues/1653)
+
 ## 目录
 - [环境准备](#环境准备)
 - [推理](#推理)
