Skip to content

Commit 81e48a5

Browse files
authored
support llava video (#1307)
1 parent 2443d0a commit 81e48a5

26 files changed

+549
-54
lines changed

README.md

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,7 +36,7 @@
3636
- [WeChat Group](#-Wechat-Group)
3737

3838
## 📝 Introduction
39-
SWIFT supports training, inference, evaluation and deployment of **300+ LLMs and 40+ MLLMs** (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition to supporting the lightweight training solutions provided by [PEFT](https://github.com/huggingface/peft), we also provide a complete **Adapters library** to support the latest training techniques such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used directly in your own custom workflow without our training scripts.
39+
SWIFT supports training, inference, evaluation and deployment of **300+ LLMs and 50+ MLLMs** (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition to supporting the lightweight training solutions provided by [PEFT](https://github.com/huggingface/peft), we also provide a complete **Adapters library** to support the latest training techniques such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used directly in your own custom workflow without our training scripts.
4040

4141
To facilitate use by users unfamiliar with deep learning, we provide a Gradio web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners.
4242

@@ -47,6 +47,7 @@ SWIFT has rich documentations for users, please check [here](https://github.com/
4747
SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary), please feel free to try!
4848

4949
## 🎉 News
50+
- 2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-dpo-instruct. You can refer to [llava-video best practice](docs/source_en/Multi-Modal/llava-video-best-practice.md) for more information.
5051
- 2024.07.06: Support internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
5152
- 2024.07.06: Support codegeex4-9b-chat.
5253
- 2024.07.04: Support internlm2_5-7b series: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m.
@@ -63,6 +64,8 @@ SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spa
6364
- 🔥2024.06.01: Supports **SimPO** training! See [document](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/SimPO.md) to start training!
6465
- 🔥2024.06.01: Support for deploying large multimodal models, please refer to the [Multimodal Deployment Documentation](docs/source_en/Multi-Modal/mutlimodal-deployment.md) for more information.
6566
- 2024.05.31: Supports Mini-Internvl model, Use model_type `mini-internvl-chat-2b-v1_5` and `mini-internvl-chat-4b-v1_5`to train.
67+
<details><summary>More</summary>
68+
6669
- 2024.05.24: Supports Phi3-vision model, Use model_type `phi3-vision-128k-instruct` to train.
6770
- 2024.05.22: Supports DeepSeek-V2-Lite series models, model_type are `deepseek-v2-lite` and `deepseek-v2-lite-chat`
6871
- 2024.05.22: Supports TeleChat-12B-v2 model with quantized version, model_type are `telechat-12b-v2` and `telechat-12b-v2-gptq-int4`
@@ -79,8 +82,6 @@ SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spa
7982
- 2024.04.29: Supports inference and fine-tuning of InternVL-Chat-V1.5 model. For best practice, you can refer to [here](https://github.com/modelscope/swift/tree/main/docs/source_en/Multi-Modal/internvl-best-practice.md).
8083
- 🔥2024.04.26: Support **LISA** and **unsloth** training! Specify `--lisa_activated_layers=2` to use LISA(to reduce the memory cost to 30 percent!), specify `--tuner_backend unsloth` to use unsloth to train a huge model(full or lora) with lesser memory(30 percent or lesser) and faster speed(5x)!
8184
- 🔥2024.04.26: Support the fine-tuning and inference of Qwen1.5-110B and Qwen1.5-110B-Chat model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_110b_chat/lora_ddp_ds/sft.sh) to start training!
82-
<details><summary>More</summary>
83-
8485
- 2024.04.24: Support for inference and fine-tuning of Phi3 series models. Including: [phi3-4b-4k-instruct](examples/pytorch/llm/scripts/phi3_4b_4k_instruct/lora), phi3-4b-128k-instruct.
8586
- 2024.04.22: Support for inference, fine-tuning, and deployment of **chinese-llama-alpaca-2** series models. This includes:chinese-llama-2-1.3b, chinese-llama-2-7b, chinese-llama-2-13b, chinese-alpaca-2-1.3b, chinese-alpaca-2-7b and chinese-alpaca-2-13b along with their corresponding 16k and 64k long text versions.
8687
- 2024.04.22: Support for inference and fine-tuning of Llama3 GPTQ-Int4, GPTQ-Int8, and AWQ series models. Support for inference and fine-tuning of chatglm3-6b-128k, Openbuddy-Llama3.
@@ -558,7 +559,7 @@ The complete list of supported models and datasets can be found at [Supported Mo
558559
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM vision model](https://github.com/OpenBMB/MiniCPM) | Chinese<br>English | 3B-9B | chat model |
559560
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
560561
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
561-
| Llava-Next | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 8B-110B | chat model |
562+
| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
562563
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
563564
| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-25.5B<br>including quantized version | chat model |
564565
| Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model |

README_CN.md

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,7 @@
3737
- [微信用户群](#-微信用户群)
3838

3939
## 📝 简介
40-
SWIFT支持**300+ LLM和40+ MLLM**(多模态大模型)的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中,实现模型训练评测到应用的完整链路。我们除支持了[PEFT](https://github.com/huggingface/peft)提供的轻量训练方案外,也提供了一个完整的**Adapters库**以支持最新的训练技术,如NEFTune、LoRA+、LLaMA-PRO等,这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。
40+
SWIFT支持**300+ LLM和50+ MLLM**(多模态大模型)的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中,实现模型训练评测到应用的完整链路。我们除支持了[PEFT](https://github.com/huggingface/peft)提供的轻量训练方案外,也提供了一个完整的**Adapters库**以支持最新的训练技术,如NEFTune、LoRA+、LLaMA-PRO等,这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。
4141

4242
为方便不熟悉深度学习的用户使用,我们提供了一个Gradio的web-ui用于控制训练和推理,并提供了配套的深度学习课程和最佳实践供新手入门。
4343

@@ -48,6 +48,7 @@ SWIFT具有丰富的文档体系,如有使用问题请请查看[这里](https:
4848
可以在[Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift)[ModelScope创空间](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary) 中体验SWIFT web-ui功能了。
4949

5050
## 🎉 新闻
51+
- 2024.07.06: 支持llava-next-video系列模型: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-dpo-instruct. 可以查看[llava-video最佳实践](docs/source/Multi-Modal/llava-video最佳实践.md)了解更多.
5152
- 2024.07.06: 支持internvl-2系列: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b.
5253
- 2024.07.06: 支持codegeex4-9b-chat.
5354
- 2024.07.04: 支持internlm2_5-7b系列: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m.
@@ -64,6 +65,8 @@ SWIFT具有丰富的文档体系,如有使用问题请请查看[这里](https:
6465
- 🔥2024.06.01: 支持**SimPO**训练,使用`swift simpo`来开始训练,最佳实践可以查看[这里](https://github.com/modelscope/swift/tree/main/docs/source/LLM/SimPO算法最佳实践.md)
6566
- 🔥2024.06.01: 支持多模态大模型部署, 可以查看[多模态部署文档](docs/source/Multi-Modal/MLLM部署文档.md).
6667
- 2024.05.31: 支持Mini-Internvl多模态模型, 使用model_type `mini-internvl-chat-2b-v1_5``mini-internvl-chat-4b-v1_5`来训练.
68+
<details><summary>更多</summary>
69+
6770
- 2024.05.24: 支持Phi3多模态模型, 使用model_type `phi3-vision-128k-instruct`来训练.
6871
- 2024.05.22: 支持DeepSeek-V2-lite系列模型, model_type为 `deepseek-v2-lite``deekseek-v2-lite-chat`
6972
- 2024.05.22: 支持TeleChat-12b-v2模型和量化版本, model_type为 `telechat-12b-v2``telechat-12b-v2-gptq-int4`
@@ -80,8 +83,6 @@ SWIFT具有丰富的文档体系,如有使用问题请请查看[这里](https:
8083
- 2024.04.29: 支持InternVL-Chat-V1.5的推理与微调, 最佳实践可以查看[这里](https://github.com/modelscope/swift/tree/main/docs/source/Multi-Modal/internvl最佳实践.md).
8184
- 🔥2024.04.26: 支持**LISA****unsloth**训练!指定 `--lisa_activated_layers=2` 来开启LISA(显存使用降低至全参训练的30%),指定 `--tuner_backend unsloth` 来使用unsloth,用更少的显存(30%或更少)更快的速度(5x)训练一个超大模型!
8285
- 🔥2024.04.26: 支持Qwen1.5-110B和Qwen1.5-110B-Chat模型的推理与微调, 使用[这个脚本](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_110b_chat/lora_ddp_ds/sft.sh)来开始训练!
83-
<details><summary>更多</summary>
84-
8586
- 2024.04.24: 支持Phi3系列模型的推理与微调. 包括: [phi3-4b-4k-instruct](examples/pytorch/llm/scripts/phi3_4b_4k_instruct/lora), phi3-4b-128k-instruct.
8687
- 2024.04.22: 支持**chinese-llama-alpaca-2**系列模型的推理与微调和部署等. 包括:chinese-llama-2-1.3b, chinese-llama-2-7b, chinese-llama-2-13b, chinese-alpaca-2-1.3b, chinese-alpaca-2-7b和chinese-alpaca-2-13b以及对应的16k和64k长文本模型.
8788
- 2024.04.22: 支持Llama3 GPTQ-Int4, GPTQ-Int8, AWQ系列模型的推理与微调. 支持chatglm3-6b-128k, Openbuddy-llama3的推理与微调.
@@ -555,7 +556,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
555556
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM视觉模型](https://github.com/OpenBMB/MiniCPM) | 中文<br>英文 | 3B-9B | chat模型 |
556557
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [智谱ChatGLM视觉问答和Agent模型](https://github.com/THUDM/) | 中文<br>英文 | 9B-19B | chat模型 |
557558
| Llava1.5<br>Llava1.6 | [Llava系列模型](https://github.com/haotian-liu/LLaVA) | 英文 | 7B-34B | chat模型 |
558-
| Llava-Next | [Llava-Next系列模型](https://github.com/LLaVA-VL/LLaVA-NeXT) | 中文<br>英文 | 8B-110B | chat模型 |
559+
| Llava-Next<br>Llava-Next-Video | [Llava-Next系列模型](https://github.com/LLaVA-VL/LLaVA-NeXT) | 中文<br>英文 | 7B-110B | chat模型 |
559560
| mPLUG-Owl | [mPLUG-Owl系列模型](https://github.com/X-PLUG/mPLUG-Owl) | 英文 | 11B | chat模型 |
560561
| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | 中文<br>英文 | 2B-25.5B<br>包含量化版本 | chat模型 |
561562
| Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | 英文 | 8B | chat model |

docs/source/LLM/支持的模型和数据集.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -340,6 +340,10 @@
340340
|llama3-llava-next-8b|[AI-Modelscope/llama3-llava-next-8b](https://modelscope.cn/models/AI-Modelscope/llama3-llava-next-8b/summary)|q_proj, k_proj, v_proj|llama-llava-next|&#x2714;|&#x2718;||vision|[lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)|
341341
|llava-next-72b|[AI-Modelscope/llava-next-72b](https://modelscope.cn/models/AI-Modelscope/llava-next-72b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|&#x2714;|&#x2718;||vision|[lmms-lab/llava-next-72b](https://huggingface.co/lmms-lab/llava-next-72b)|
342342
|llava-next-110b|[AI-Modelscope/llava-next-110b](https://modelscope.cn/models/AI-Modelscope/llava-next-110b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|&#x2714;|&#x2718;||vision|[lmms-lab/llava-next-110b](https://huggingface.co/lmms-lab/llava-next-110b)|
343+
|llava-next-video-7b-instruct|[huangjintao/LLaVA-NeXT-Video-7B-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|&#x2714;|&#x2718;|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf)|
344+
|llava-next-video-7b-32k-instruct|[huangjintao/LLaVA-NeXT-Video-7B-32K-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-32K-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|&#x2714;|&#x2718;|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-32K-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-32K-hf)|
345+
|llava-next-video-7b-dpo-instruct|[huangjintao/LLaVA-NeXT-Video-7B-DPO-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-DPO-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|&#x2714;|&#x2718;|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-DPO-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-DPO-hf)|
346+
|llava-next-video-34b-dpo-instruct|[huangjintao/LLaVA-NeXT-Video-34B-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-34B-hf/summary)|q_proj, k_proj, v_proj|llava-next-video-yi|&#x2714;|&#x2718;|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-34B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-34B-hf)|
343347
|yi-vl-6b-chat|[01ai/Yi-VL-6B](https://modelscope.cn/models/01ai/Yi-VL-6B/summary)|q_proj, k_proj, v_proj|yi-vl|&#x2714;|&#x2718;|transformers>=4.34|vision|[01-ai/Yi-VL-6B](https://huggingface.co/01-ai/Yi-VL-6B)|
344348
|yi-vl-34b-chat|[01ai/Yi-VL-34B](https://modelscope.cn/models/01ai/Yi-VL-34B/summary)|q_proj, k_proj, v_proj|yi-vl|&#x2714;|&#x2718;|transformers>=4.34|vision|[01-ai/Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)|
345349
|llava-llama-3-8b-v1_1|[AI-ModelScope/llava-llama-3-8b-v1_1-transformers](https://modelscope.cn/models/AI-ModelScope/llava-llama-3-8b-v1_1-transformers/summary)|q_proj, k_proj, v_proj|llava-llama-instruct|&#x2714;|&#x2718;|transformers>=4.36|vision|[xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers)|

docs/source/Multi-Modal/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@
1616

1717

1818
一轮对话只能包含一张图片(可能可以不含图片):
19-
1. [Llava最佳实践](llava最佳实践.md)
19+
1. [Llava最佳实践](llava最佳实践.md), [LLava Video最佳实践](llava-video最佳实践.md)
2020
2. [Yi-VL最佳实践.md](yi-vl最佳实践.md)
2121
3. [mPLUG-Owl2最佳实践](mplug-owl2最佳实践.md)
2222
4. [florence最佳实践](florence最佳实践.md)

0 commit comments

Comments
 (0)