
Commit 569d916

HIT-cwhLZHgrla and LZHgrla authored
[Doc] Add dataset pipeline doc & Improve Doc (#38)
* add dataset pipeline doc
* add dataset pipeline doc
* fix bugs
* fix bugs
* refine doc
* fix bugs
* Update README.md
* Update README.md
* update docs (#1)
* Update README.md
* fix pre-commit
* rename xTuner to XTuner
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* fix pre-commit
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update chat.md
* Update chat.md
* Update chat.md
* Update chat.md
* Update chat.md
* Update chat.md
* Update chat.md
* Update chat.md
* Update chat.md
* Update finetune.md
* Update finetune.md
* Update chat.md
* fix pre-commit
* add zh_cn chat and finetune doc
* Update chat.md
* Update README.md
* del tool_usage
* Update README.md
* Update chat.md
* Update chat.md
* Update README.md
* Update README.md
* Update README_zh-CN.md
* Update README.md
* Update README_zh-CN.md
* fix pre-commit
* Update README_zh-CN.md
* Update README.md
* Update README_zh-CN.md
* Update README_zh-CN.md
* Update README_zh-CN.md
* Update README_zh-CN.md
* refactor data pipeline doc
* add colorist llama2
* fix incremental pretraining doc

---------

Co-authored-by: LZHgrla <[email protected]>
Co-authored-by: LZHgrla <[email protected]>
1 parent 830ad06 commit 569d916

File tree

13 files changed: +1616, -112 lines


.github/CONTRIBUTING.md

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
 ## Contributing to InternLM

-Welcome to the xTuner community! All kinds of contributions are welcomed, including but not limited to
+Welcome to the XTuner community! All kinds of contributions are welcomed, including but not limited to

 **Fix bug**

@@ -27,7 +27,7 @@ If you're not familiar with Pull Request, don't worry! The following guidance wi

 #### 1. Fork and clone

-If you are posting a pull request for the first time, you should fork the xTuner repository by clicking the **Fork** button in the top right corner of the GitHub page, and the forked repository will appear under your GitHub profile.
+If you are posting a pull request for the first time, you should fork the XTuner repository by clicking the **Fork** button in the top right corner of the GitHub page, and the forked repository will appear under your GitHub profile.

 <img src="https://user-images.githubusercontent.com/57566630/167305749-43c7f4e9-449b-4e98-ade5-0c9276d5c9ce.png" width="1200">

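(Not part of this diff: for readers following along, a minimal sketch of the clone-and-remote setup that typically follows the fork step above. The HTTPS URLs and the `username` placeholder are illustrative, not quoted from the repository.)

```shell
# Clone your fork and enter the repository
git clone https://github.com/username/xtuner.git
cd xtuner
# Track the original repository as "upstream" so you can sync later
git remote add upstream https://github.com/InternLM/xtuner.git
git remote -v
```
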
@@ -56,7 +56,7 @@ upstream [email protected]:InternLM/xtuner.git (push)

 #### 2. Configure pre-commit

-You should configure [pre-commit](https://pre-commit.com/#intro) in the local development environment to make sure the code style matches that of InternLM. **Note**: The following code should be executed under the xTuner directory.
+You should configure [pre-commit](https://pre-commit.com/#intro) in the local development environment to make sure the code style matches that of InternLM. **Note**: The following code should be executed under the XTuner directory.

 ```shell
 pip install -U pre-commit
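(Not part of this diff: a hedged reminder of the standard pre-commit workflow this section relies on. `pre-commit install` and `pre-commit run --all-files` are generic pre-commit commands, not quoted from the repository.)

```shell
# Install the git hook so the configured checks run on every `git commit`
pre-commit install
# Optionally run every hook against the whole working tree once
pre-commit run --all-files
```
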
@@ -101,7 +101,7 @@ git pull upstream master

 #### 4. Commit the code and pass the unit test

-- xTuner introduces mypy to do static type checking to increase the robustness of the code. Therefore, we need to add Type Hints to our code and pass the mypy check. If you are not familiar with Type Hints, you can refer to [this tutorial](https://docs.python.org/3/library/typing.html).
+- XTuner introduces mypy to do static type checking to increase the robustness of the code. Therefore, we need to add Type Hints to our code and pass the mypy check. If you are not familiar with Type Hints, you can refer to [this tutorial](https://docs.python.org/3/library/typing.html).

 - The committed code should pass through the unit test

@@ -151,7 +151,7 @@ Find more details about Pull Request description in [pull request guidelines](#p

 <img src="https://user-images.githubusercontent.com/57566630/167307490-f9ebf9fa-63c0-4d83-8ba1-081ea169eb3a.png" width="1200">

-xTuner will run unit test for the posted Pull Request on different platforms (Linux, Window, Mac), based on different versions of Python, PyTorch, CUDA to make sure the code is correct. We can see the specific test information by clicking `Details` in the above image so that we can modify the code.
+XTuner will run unit test for the posted Pull Request on different platforms (Linux, Window, Mac), based on different versions of Python, PyTorch, CUDA to make sure the code is correct. We can see the specific test information by clicking `Details` in the above image so that we can modify the code.

 (3) If the Pull Request passes the CI, then you can wait for the review from other developers. You'll modify the code based on the reviewer's comments, and repeat the steps [4](#4-commit-the-code-and-pass-the-unit-test)-[5](#5-push-the-code-to-remote) until all reviewers approve it. Then, we will merge it ASAP.


README.md

Lines changed: 53 additions & 51 deletions
@@ -1,34 +1,34 @@
 <div align="center">
+<img src="https://github.com/InternLM/lmdeploy/assets/36994684/0cf8d00f-e86b-40ba-9b54-dc8f1bc6c8d8" width="600"/>
+<br /><br />

-[![docs](https://readthedocs.org/projects/xtuner/badge)](https://xtuner.readthedocs.io/en)
-[![license](https://img.shields.io/github/license/InternLM/xtuner.svg)](https://github.com/InternLM/xtuner/blob/main/LICENSE)
-[![PyPI](https://badge.fury.io/py/opencompass.svg)](https://pypi.org/project/opencompass/)
-
-[📘 Documentation](https://xtuner.readthedocs.io/en/latest/) |
-[🤔 Reporting Issues](https://github.com/InternLM/xtuner/issues/new/choose) |
-[⚙️ Model Zoo](<>)
+[![license](https://img.shields.io/github/license/InternLM/xtuner.svg)](https://github.com/InternLM/xtuner/LICENSE)
+[![PyPI](https://badge.fury.io/py/xtuner.svg)](https://pypi.org/project/xtuner/)
+[![Generic badge](https://img.shields.io/badge/🤗%20Huggingface-xtuner-yellow.svg)](https://huggingface.co/xtuner)

 English | [简体中文](README_zh-CN.md)

+👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
+
 </div>

-## 📣 News
+## 🎉 News

-- **\[2023.08.xx\]** We release xTuner, with multiple fine-tuned adapters.
+- **\[2023.08.xx\]** XTuner is released, with multiple fine-tuned adapters on [HuggingFace](https://huggingface.co/xtuner).

 ## 📖 Introduction

-xTuner is a toolkit for efficiently fine-tuning LLM, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
+XTuner is a toolkit for efficiently fine-tuning LLM, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.

-- **Efficiency**: Support LLM fine-tuning on consumer-grade GPUs. The minimum GPU memory required for 7B LLM fine-tuning is only 15GB, indicating that users can leverage the free resource, *e.g.*, Colab, to fine-tune their custom LLM models.
-- **Versatile**: Support various **LLMs** ([InternLM](https://github.com/InternLM/InternLM), [Llama2](https://github.com/facebookresearch/llama), [Qwen](https://github.com/QwenLM/Qwen-7B), [Baichuan](https://github.com/baichuan-inc)), **datasets** ([MOSS_003_SFT](https://huggingface.co/datasets/fnlp/moss-003-sft-data), [Arxiv GenTitle](https://github.com/WangRongsheng/ChatGenTitle), [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [oasst1](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), [Chinese Medical Dialogue](https://github.com/Toyhom/Chinese-medical-dialogue-data/)) and **algorithms** ([QLoRA](http://arxiv.org/abs/2305.14314), [LoRA](http://arxiv.org/abs/2106.09685)), allowing users to choose the most suitable solution for their requirements.
-- **Compatibility**: Compatible with [DeepSpeed](https://github.com/microsoft/DeepSpeed) and the [HuggingFace](https://huggingface.co) training pipeline, enabling effortless integration and utilization.
+- **Efficiency**: Support LLM fine-tuning on consumer-grade GPUs. The minimum GPU memory required for 7B LLM fine-tuning is only **8GB**, indicating that users can use nearly any GPU (even the free resource, *e.g.*, Colab) to fine-tune custom LLMs.
+- **Versatile**: Support various **LLMs** ([InternLM](https://github.com/InternLM/InternLM), [Llama2](https://github.com/facebookresearch/llama), [Qwen](https://github.com/QwenLM/Qwen-7B), [Baichuan](https://github.com/baichuan-inc), ...), **datasets** ([MOSS_003_SFT](https://huggingface.co/datasets/fnlp/moss-003-sft-data), [Colorist](https://huggingface.co/datasets/burkelibbey/colors), [Code Alpaca](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K), [Arxiv GenTitle](https://github.com/WangRongsheng/ChatGenTitle), [Chinese Law](https://github.com/LiuHC0428/LAW-GPT), [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus), ...) and **algorithms** ([QLoRA](http://arxiv.org/abs/2305.14314), [LoRA](http://arxiv.org/abs/2106.09685)), allowing users to choose the most suitable solution for their requirements.
+- **Compatibility**: Compatible with [DeepSpeed](https://github.com/microsoft/DeepSpeed) 🚀 and [HuggingFace](https://huggingface.co) 🤗 training pipeline, enabling effortless integration and utilization.

 ## 🌟 Demos

-- QLoRA fine-tune for InternLM-7B [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1yzGeYXayLomNQjLD4vC6wgUHvei3ezt4?usp=sharing)
-- Chat with Llama2-7B-Plugins [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](<>)
-- Integrate xTuner into HuggingFace's pipeline [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eBI9yiOkX-t7P-0-t9vS8y1x5KmWrkoU?usp=sharing)
+- QLoRA Fine-tune [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QAEZVBfQ7LZURkMUtaq0b-5nEQII9G9Z?usp=sharing)
+- Plugin-based Chat [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/144OuTVyT_GvFyDMtlSlTzcxYIfnRsklq?usp=sharing)
+- Ready-to-use models and datasets from XTuner API [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eBI9yiOkX-t7P-0-t9vS8y1x5KmWrkoU?usp=sharing)

 ## 🔥 Supports


@@ -42,7 +42,7 @@ xTuner is a toolkit for efficiently fine-tuning LLM, developed by the [MMRazor](
 <b>SFT Datasets</b>
 </td>
 <td>
-<b>Parallel Strategies</b>
+<b>Data Pipelines</b>
 </td>
 <td>
 <b>Algorithms</b>
@@ -51,42 +51,46 @@ xTuner is a toolkit for efficiently fine-tuning LLM, developed by the [MMRazor](
 <tr valign="top">
 <td align="left" valign="top">
 <ul>
-<li><a href="configs/internlm/internlm_7b">InternLM</a></li>
-<li><a href="configs/internlm/internlm_chat_7b">InternLM-Chat</a></li>
-<li><a href="configs/llama/llama_7b">Llama</a></li>
-<li><a href="configs/llama/llama2_7b">Llama2</a></li>
-<li><a href="configs/llama/llama2_7b_chat">Llama2-Chat</a></li>
-<li><a href="configs/qwen/qwen_7b">Qwen</a></li>
-<li><a href="configs/qwen/qwen_7b_chat">Qwen-Chat</a></li>
-<li><a href="configs/baichuan/baichuan_7b">Baichuan-7B</a></li>
-<li><a href="configs/baichuan/baichuan_13b_base">Baichuan-13B-Base</a></li>
-<li><a href="configs/baichuan/baichuan_13b_chat">Baichuan-13B-Chat</a></li>
+<li><a href="https://github.com/InternLM/InternLM">InternLM</a></li>
+<li><a href="https://github.com/InternLM/InternLM">InternLM-Chat</a></li>
+<li><a href="https://github.com/facebookresearch/llama">Llama</a></li>
+<li><a href="https://github.com/facebookresearch/llama">Llama2</a></li>
+<li><a href="https://github.com/facebookresearch/llama">Llama2-Chat</a></li>
+<li><a href="https://github.com/QwenLM/Qwen-7B">Qwen</a></li>
+<li><a href="https://github.com/QwenLM/Qwen-7B">Qwen-Chat</a></li>
+<li><a href="https://github.com/baichuan-inc/Baichuan-7B">Baichuan-7B</a></li>
+<li><a href="https://github.com/baichuan-inc/Baichuan-13B">Baichuan-13B-Base</a></li>
+<li><a href="https://github.com/baichuan-inc/Baichuan-13B">Baichuan-13B-Chat</a></li>
 <li>...</li>
 </ul>
 </td>
 <td>
 <ul>
-<li><a href="configs/_base_/datasets/moss_003_sft_all.py">MOSS-003-SFT</a></li>
-<li><a href="configs/_base_/datasets/arxiv.py">Arxiv GenTitle</a></li>
-<li><a href="configs/_base_/datasets/open_orca.py">OpenOrca</a></li>
-<li><a href="configs/_base_/datasets/alpaca.py">Alpaca en</a> / <a href="configs/_base_/datasets/alpaca_zh.py">zh</a></li>
-<li><a href="configs/_base_/datasets/oasst1.py">oasst1</a></li>
-<li><a href="configs/_base_/datasets/cmd.py">Chinese Medical Dialogue</a></li>
+<li><a href="https://huggingface.co/datasets/fnlp/moss-003-sft-data">MOSS-003-SFT</a> 🔧</li>
+<li><a href="https://huggingface.co/datasets/burkelibbey/colors">Colorist</a> 🎨</li>
+<li><a href="https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K">Code Alpaca</a></li>
+<li><a href="https://github.com/WangRongsheng/ChatGenTitle">Arxiv GenTitle</a></li>
+<li><a href="https://github.com/LiuHC0428/LAW-GPT">Chinese Law</a></li>
+<li><a href="https://huggingface.co/datasets/Open-Orca/OpenOrca">OpenOrca</a></li>
+<li><a href="https://huggingface.co/datasets/tatsu-lab/alpaca">Alpaca en</a> / <a href="https://huggingface.co/datasets/silk-road/alpaca-data-gpt4-chinese">zh</a></li>
+<li><a href="https://huggingface.co/datasets/timdettmers/openassistant-guanaco">oasst1</a></li>
+<li><a href="https://huggingface.co/datasets/shibing624/medical">Medical Dialogue</a></li>
+<li><a href="https://huggingface.co/datasets/garage-bAInd/Open-Platypus">Open-Platypus</a></li>
 <li>...</li>
 </ul>
 </td>
 <td>
 <ul>
-<li>(Distributed) Data Parallel</li>
-<li><a href="examples">DeepSpeed</a> 🚀</li>
+<li><a href="docs/zh_cn/dataset/incremental_pretraining.md">Incremental Pre-training</a> </li>
+<li><a href="docs/zh_cn/dataset/single_turn_conversation.md">Single-turn Conversation SFT</a> </li>
+<li><a href="docs/zh_cn/dataset/multi_turn_conversation.md">Multi-turn Conversation SFT</a> </li>
 </ul>
 </td>
 <td>
 <ul>
 <li><a href="http://arxiv.org/abs/2305.14314">QLoRA</a></li>
 <li><a href="http://arxiv.org/abs/2106.09685">LoRA</a></li>
 <li>Full parameter fine-tune</li>
-<li>...</li>
 </ul>
 </td>
 </tr>
@@ -97,7 +101,7 @@ xTuner is a toolkit for efficiently fine-tuning LLM, developed by the [MMRazor](

 ### Installation

-Install xTuner with pip
+Install XTuner with pip

 ```shell
 pip install xtuner
@@ -111,7 +115,7 @@ cd xtuner
 pip install -e .
 ```

-### Chat [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](<>)
+### Chat [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/144OuTVyT_GvFyDMtlSlTzcxYIfnRsklq?usp=sharing)

 <table>
 <tr>
@@ -130,7 +134,7 @@ pip install -e .
 </tr>
 </table>

-xTuner provides the tools to chat with pretrained / fine-tuned LLMs.
+XTuner provides tools to chat with pretrained / fine-tuned LLMs.

 - For example, we can start the chat with Llama2-7B-Plugins by

@@ -140,17 +144,17 @@ xTuner provides the tools to chat with pretrained / fine-tuned LLMs.

 For more usages, please see [chat.md](./docs/en/chat.md).

-### Fine-tune [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1yzGeYXayLomNQjLD4vC6wgUHvei3ezt4?usp=sharing)
+### Fine-tune [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QAEZVBfQ7LZURkMUtaq0b-5nEQII9G9Z?usp=sharing)

-xTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs.
+XTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs.

-- **Step 0**, prepare the config. xTuner provides many ready-to-use configs and we can view all configs by
+- **Step 0**, prepare the config. XTuner provides many ready-to-use configs and we can view all configs by

 ```shell
 xtuner list-cfg
 ```

-Or, if the provided configs cannot meet the requirements, we can copy the provided config to the specified directory and make modifications by
+Or, if the provided configs cannot meet the requirements, please copy the provided config to the specified directory and make specific modifications by

 ```shell
 xtuner copy-cfg ${CONFIG_NAME} ${SAVE_DIR}
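(Illustrative only: a possible `copy-cfg` invocation using the config name that appears later in this diff; the target directory is a hypothetical placeholder.)

```shell
# Copy a built-in config into a local folder so it can be edited
xtuner copy-cfg internlm_7b_qlora_oasst1_e3 ./my_configs
```
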
@@ -160,9 +164,9 @@ xTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs.

 ```shell
 # On a single GPU
-xtuner train internlm_7b_qlora_oasst1
+xtuner train internlm_7b_qlora_oasst1_e3
 # On multiple GPUs
-xtuner dist_train internlm_7b_qlora_oasst1 ${GPU_NUM}
+NPROC_PER_NODE=${GPU_NUM} xtuner train internlm_7b_qlora_oasst1_e3
 ```

 For more usages, please see [finetune.md](./docs/en/finetune.md).
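(Illustrative only: how the updated training commands read end to end. The config name comes from the hunk above; the GPU count of 8 is just a placeholder.)

```shell
# Single GPU
xtuner train internlm_7b_qlora_oasst1_e3
# Multiple GPUs: the process count is now passed through NPROC_PER_NODE
NPROC_PER_NODE=8 xtuner train internlm_7b_qlora_oasst1_e3
```
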
@@ -172,13 +176,13 @@ xTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs.
 - **Step 0**, convert the pth adapter to HuggingFace adapter, by

 ```shell
-xtuner convert adapter_pth_2_hf \
+xtuner convert adapter_pth2hf \
 ${CONFIG} \
 ${PATH_TO_PTH_ADAPTER} \
 ${SAVE_PATH_TO_HF_ADAPTER}
 ```

-or, directly merge pth adapter to pretrained LLM, by
+or, directly merge the pth adapter to pretrained LLM, by

 ```shell
 xtuner convert merge_adapter \
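(Illustrative only: a hypothetical invocation of the renamed conversion subcommand. Only `adapter_pth2hf` and the argument order come from the hunk above; the concrete config and paths are placeholders.)

```shell
# Convert the trained .pth adapter into a HuggingFace-format adapter
xtuner convert adapter_pth2hf \
    internlm_7b_qlora_oasst1_e3.py \
    ./work_dirs/internlm_7b_qlora_oasst1_e3/epoch_3.pth \
    ./hf_adapter
```
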
@@ -203,13 +207,11 @@ xTuner supports the efficient fine-tune (*e.g.*, QLoRA) for LLMs.

 ### Evaluation

-- We recommend using [OpenCompass](https://github.com/InternLM/opencompass), a comprehensive and systematic LLM evaluation library, which currently supports 50+ datasets with about 300,000 questions.
-
-## 🔜 Roadmap
+- We recommend using [OpenCompass](https://github.com/InternLM/opencompass), a comprehensive and systematic LLM evaluation library, which currently supports 50+ datasets with about 300,000 questions.

 ## 🤝 Contributing

-We appreciate all contributions to xTuner. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
+We appreciate all contributions to XTuner. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

 ## 🎖️ Acknowledgement

