
Commit 97d7cd9

Update doc (#125)
1 parent b0e0fb0 commit 97d7cd9

9 files changed: +147 -41 lines changed

README.md

Lines changed: 20 additions & 18 deletions
@@ -19,11 +19,14 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible fra
Currently supported approches (and counting):

1. LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/abs/2106.09685)
-2. Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
-3. Prompt Tuning: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
-4. Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
-5. ResTuning-Bypass
-7. All tuners offered on [PEFT](https://github.com/huggingface/peft)
+2. QA-LoRA: [Quantization-Aware Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2309.14717)
+3. LongLoRA: [Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307)
+4. Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
+5. Prompt Tuning: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
+6. Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
+7. ResTuning-Bypass
+8. ROME: [Rank-One Editing of Encoder-Decoder Models](https://arxiv.org/abs/2211.13317)
+9. All tuners offered on [PEFT](https://github.com/huggingface/peft)

Key features:

@@ -33,6 +36,18 @@ Key features:

Users can check the [documentation of Swift](docs/source/GetStarted/Introduction.md) to get detail tutorials.

+### 🎉News
+
+- 🔥 2023.10.30: Support QA-LoRA and LongLoRA to decrease memory usage in training.
+- 🔥 2023.10.30: Support ROME (Rank-One Model Editing) to add or modify knowledge in a model; no training is needed!
+- 🔥 2023.10.27: Support for chatglm3 series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding shell script can be found at `scripts/chatglm3_6b_32k`.
+- 🔥 2023.10.17: Supported int4, int8 models: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. The corresponding shell scripts can be found at `scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`.
+- 2023.10.15: Supported ziya2-13b model series: ziya2-13b, ziya2-13b-chat. The corresponding shell script can be found at `scripts/ziya2_13b_chat`.
+- 2023.10.12: Supported mistral-7b model series: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. The corresponding shell scripts can be found at `scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`.
+- 🔥 2023.10.7: Supported DeepSpeed ZeRO-2, enabling LoRA (not just QLoRA) to run DDP on 2*A10. The corresponding shell script can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`.
+- 🔥 2023.9.25: Supported qwen-14b model series: qwen-14b, qwen-14b-chat. The corresponding shell scripts can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`.
+- 2023.9.12: Supported training with MP+DDP to accelerate full-parameter fine-tuning speed. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`.
+
## LLM SFT Example
Press [this link](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm) to view the detail documentation of these examples.

@@ -70,19 +85,6 @@ Press [this link](https://github.com/modelscope/swift/tree/main/examples/pytorch
- Chat: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse

-### News
-- 🔥 2023.10.27: Support for chatglm3 series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding shell script can be found in `scripts/chatglm3_6b_32k`.
-- 🔥 2023.10.24: Use the registration mechanism to add models, datasets, and chat templates. To customize models, datasets, and chat templates, refer to the "User Guide" section. The corresponding Python file can be found in `custom.py`, and the corresponding shell script can be found in `scripts/custom/tigerbot_13b_chat`.
-- 🔥 2023.10.17: Supported int4, int8 models: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. The corresponding shell script can be found at `scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`.
-- 2023.10.15: Supported ziya2-13b model series: ziya2-13b, ziya2-13b-chat. The corresponding shell script can be found at `scripts/ziya2_13b_chat`.
-- 2023.10.12: Supported mistral-7b model series: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. The corresponding shell script can be found at `scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`.
-- 🔥 2023.10.7: Supported DeepSpeed ZeRO-2, enabling LoRA (not just QLoRA) to run DDP on 2*A10. The corresponding shell script can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`.
-- 2023.10.4: Supported datasets in the fields of mathematics, law, SQL, and coding: blossom-math-zh, school-math-zh, text2sql-en, sql-create-context-en, lawyer-llama-zh, tigerbot-law-zh, leetcode-python-en.
-- 🔥 2023.9.25: Supported qwen-14b model series: qwen-14b, qwen-14b-chat. The corresponding shell script can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`.
-- 2023.9.18: Supported internlm-20b model series: internlm-20b, internlm-20b-chat. The corresponding shell script can be found at `scripts/internlm_20b`, `scripts/internlm_20b_chat`.
-- 2023.9.12: Supported training with MP+DDP to accelerate full-parameter fine-tuning speed. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`.
-
# Installation

SWIFT is running in Python environment. Please make sure your python version is higher than 3.8.

README_CN.md

Lines changed: 20 additions & 18 deletions
@@ -18,11 +18,14 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible
Currently supported methods:

1. LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/abs/2106.09685)
-2. Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
-3. Prompt: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
-4. Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
-5. ResTuning-Bypass
-6. All tuners provided by [PEFT](https://github.com/huggingface/peft)
+2. QA-LoRA: [Quantization-Aware Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2309.14717)
+3. LongLoRA: [Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307)
+4. Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
+5. Prompt: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
+6. Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
+7. ResTuning-Bypass
+8. ROME: [Rank-One Editing of Encoder-Decoder Models](https://arxiv.org/abs/2211.13317)
+9. All tuners provided by [PEFT](https://github.com/huggingface/peft)

Key features:
1. SWIFT and PEFT tuners can load models from the ModelScope Hub by model-id

@@ -31,6 +34,18 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible

Users can check the [official Swift documentation](docs/source/GetStarted/Introduction.md) for details.

+## News
+
+- 🔥 2023.10.30: Support the two new tuners QA-LoRA and LongLoRA.
+- 🔥 2023.10.30: Support model editing with ROME (Rank-One Model Editing), which injects new knowledge into a model without any training!
+- 🔥 2023.10.27: Support chatglm3 series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding shell script can be found at `scripts/chatglm3_6b_32k`.
+- 🔥 2023.10.17: Support SFT of int4, int8 models: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. The corresponding shell scripts can be found at `scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`.
+- 2023.10.15: Support ziya2-13b series models: ziya2-13b, ziya2-13b-chat. The corresponding shell script can be found at `scripts/ziya2_13b_chat`.
+- 2023.10.12: Support mistral-7b series models: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. The corresponding shell scripts can be found at `scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`.
+- 🔥 2023.10.7: Support DeepSpeed ZeRO-2, enabling LoRA (not just QLoRA) to run DDP on two A10 GPUs. The corresponding shell script can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`.
+- 🔥 2023.9.25: Support **qwen-14b** series models: qwen-14b, qwen-14b-chat. The corresponding shell scripts can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`.
+- 2023.9.12: Support MP+DDP training to speed up full-parameter fine-tuning. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`.
+
## LLM fine-tuning examples
The documentation for LLM fine-tuning is available [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm).

@@ -68,19 +83,6 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible
- Chat: chatml(qwen), baichuan, chatglm2, chatglm3, llama, openbuddy-llama, default, internlm, xverse

-## News
-- 🔥 2023.10.27: Support chatglm3 series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding shell script can be found at `scripts/chatglm3_6b_32k`.
-- 🔥 2023.10.24: Use the registration mechanism to add models, datasets and chat templates. See the documentation section for how to customize them; the corresponding Python file is `custom.py` and the corresponding shell script is `scripts/custom/tigerbot_13b_chat`.
-- 🔥 2023.10.17: Support SFT of int4, int8 models: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-vl-chat-int4, baichuan2-7b-chat-int4, baichuan2-13b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8. The corresponding shell scripts can be found at `scripts/qwen_7b_chat_int4`, `scripts/qwen_14b_chat_int4`, `scripts/qwen_vl_chat_int4`, `scripts/qwen_7b_chat_int8`, `scripts/qwen_14b_chat_int8`.
-- 2023.10.15: Support ziya2-13b series models: ziya2-13b, ziya2-13b-chat. The corresponding shell script can be found at `scripts/ziya2_13b_chat`.
-- 2023.10.12: Support mistral-7b series models: openbuddy-mistral-7b-chat, mistral-7b, mistral-7b-chat. The corresponding shell scripts can be found at `scripts/openbuddy_mistral_7b_chat`, `scripts/mistral_7b_chat`.
-- 🔥 2023.10.7: Support DeepSpeed ZeRO-2, enabling LoRA (not just QLoRA) to run DDP on two A10 GPUs. The corresponding shell script can be found at `scripts/qwen_7b_chat/lora_ddp_ds/sft.sh`.
-- 2023.10.4: Support more datasets in the fields of mathematics, law, SQL and coding: blossom-math-zh, school-math-zh, text2sql-en, sql-create-context-en, lawyer-llama-zh, tigerbot-law-zh, leetcode-python-en.
-- 🔥 2023.9.25: Support **qwen-14b** series models: qwen-14b, qwen-14b-chat. The corresponding shell scripts can be found at `scripts/qwen_14b`, `scripts/qwen_14b_chat`.
-- 2023.9.18: Support internlm-20b series models: internlm-20b, internlm-20b-chat. The corresponding shell scripts can be found at `scripts/internlm_20b`, `scripts/internlm_20b_chat`.
-- 2023.9.12: Support MP+DDP training to speed up full-parameter fine-tuning. The corresponding shell script can be found at `scripts/qwen_7b_chat/full_mp_ddp/sft.sh`.
-
# Installation

SWIFT runs in a Python environment. Please make sure your Python version is higher than 3.8.

docs/source/GetStarted/Deployment.md

Lines changed: 67 additions & 0 deletions

@@ -0,0 +1,67 @@
# Deployment

Trained models can be deployed with various open-source inference frameworks. This document describes how SWIFT connects to these frameworks for deployment.

## VLLM

[VLLM](https://github.com/vllm-project/vllm) is an inference acceleration framework for transformer architectures. Techniques such as Paged Attention and Continuous Batching effectively improve inference efficiency and reduce GPU memory usage.

The prerequisites for using VLLM are:

1. The model was trained with full-parameter fine-tuning or LoRA fine-tuning
2. The model type is one of the model types supported by VLLM

The model families currently supported by VLLM are:

> - Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
> - Baichuan (`baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Chat`, etc.)
> - BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
> - Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
> - GPT-2 (`gpt2`, `gpt2-xl`, etc.)
> - GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
> - GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
> - GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
> - InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
> - LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
> - Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
> - MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
> - OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
> - Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)

First, install vllm:

```shell
pip install vllm
```

For a fully fine-tuned model, you can use vllm to start the API service directly:

```shell
python -m vllm.entrypoints.openai.api_server --model /dir/to/your/trained/model --trust-remote-code
```

For a LoRA fine-tuned model, first run the following script to merge the LoRA weights into the original model:

```shell
python merge_lora_weights_to_model.py --model_id_or_path /dir/to/your/base/model --model_revision master --ckpt_dir /dir/to/your/lora/model
```

The merged model is written to the `{ckpt_dir}-merged` folder; pass that folder to the vllm command above to start the service.

Call the service:

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/dir/to/your/trained/model",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'

# Response:
{"id":"cmpl-90329ab1eba24d02934b38f2edbb26a8","object":"text_completion","created":11506341,"model":"/dir/to/your/trained/model","choices":[{"index":0,"text":" city in the United States of America","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7}}
```

vllm also supports loading and calling the model from Python code; see the [vllm official documentation](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) for details.
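For reference, a minimal offline-inference sketch with vLLM's Python API (`LLM` and `SamplingParams`); the model path is a placeholder for your merged or fully fine-tuned checkpoint:

```python
from vllm import LLM, SamplingParams

# Placeholder path: point this at the merged ({ckpt_dir}-merged) or fully fine-tuned checkpoint.
llm = LLM(model='/dir/to/your/trained/model', trust_remote_code=True)
sampling_params = SamplingParams(temperature=0, max_tokens=7)

# Mirrors the curl example above.
outputs = llm.generate(['San Francisco is a'], sampling_params)
print(outputs[0].outputs[0].text)
```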

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ Swift DOCUMENTATION
GetStarted/Use in train and infer.md
GetStarted/Examples.md
GetStarted/Work with Peft.md
+GetStarted/Deployment.md

.. toctree::
   :maxdepth: 2

merge_lora_weights_to_model.py

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
from swift.llm import InferArguments
from swift.llm.infer import merge_lora
from swift.utils import parse_args

if __name__ == '__main__':
    args, remaining_argv = parse_args(InferArguments, None)
    args.init_argument()
    merge_lora(args, replace_if_exists=True)

swift/llm/infer.py

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@
logger = get_logger()


-def merge_lora(args: InferArguments) -> None:
+def merge_lora(args: InferArguments, replace_if_exists=False) -> None:
    assert args.sft_type == 'lora'
    assert not args.model_type.endswith('int4'), 'int4 model is not supported'
    assert not args.model_type.endswith('int8'), 'int8 model is not supported'

@@ -39,7 +39,7 @@ def merge_lora(args: InferArguments) -> None:
    args.sft_type = 'full'
    args.ckpt_dir = merged_lora_path

-    if not os.path.exists(args.ckpt_dir):
+    if not os.path.exists(args.ckpt_dir) or replace_if_exists:
        logger.info('Saving merged weights...')
        model.save_pretrained(args.ckpt_dir)
        tokenizer.save_pretrained(args.ckpt_dir)

swift/tuners/longlora/longlora.py

Lines changed: 12 additions & 0 deletions
@@ -16,6 +16,18 @@ class LongLoRAModelType:


@dataclass
class LongLoRAConfig(LoRAConfig):
+    """
+    The config for the LongLoRA adapter.
+    LongLoRA: [Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307)
+    This adapter uses S2-attention to shorten the attention window in long-context training scenarios.
+    Args:
+        embedder_and_normalizer: LongLoRA allows the embedder and normalizer to be trainable; this parameter
+            specifies the names of the embedder and normalizer modules.
+        model_type: The model type; currently only llama is supported.
+        use_flash_attn: Whether to use the flash-attention version of the forward pass.
+        group_size_ratio: The group window size as a ratio of the sequence length.
+            Note: the sequence is split into smaller groups according to this ratio.
+    """

    embedder_and_normalizer: Union[str, List[str], Tuple[str]] = field(
        default=('embed', 'norm'),
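To make the new docstring concrete, here is a hedged sketch of wiring a `LongLoRAConfig` into a model. The `Swift.prepare_model` call and the checkpoint path are assumptions not shown in this diff; only the fields documented above are taken from the commit.

```python
from transformers import AutoModelForCausalLM
from swift import Swift  # assumed public entry point for applying tuners
from swift.tuners.longlora.longlora import LongLoRAConfig

# Placeholder checkpoint; LongLoRA currently supports llama-family models only.
model = AutoModelForCausalLM.from_pretrained('/dir/to/your/llama/model')

config = LongLoRAConfig(
    embedder_and_normalizer=('embed', 'norm'),  # names of the trainable embedder/normalizer modules
    model_type='llama',                         # only llama is supported per the docstring
    use_flash_attn=False,                       # flash-attention variant of the forward pass
    group_size_ratio=0.25,                      # S2-attention group size as a ratio of the sequence length
)
model = Swift.prepare_model(model, config)      # wrap the model with the LongLoRA adapter
```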

swift/tuners/lora.py

Lines changed: 3 additions & 0 deletions
@@ -166,6 +166,9 @@ class LoRAConfig(SwiftConfig):
        enable_lora(List[bool]): The modules need to be turned on when using the merged linear layer
        fan_in_fan_out(bool): Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        bias(str): Bias type. Values ca be "none", "all" or "lora_only"
+        use_qa_lora(bool): Use
+            QA-LoRA: [Quantization-Aware Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2309.14717)
+            instead of LoRA. QA-LoRA only supports AutoGPTQ quantized models.
    """

    r: int = field(default=6, metadata={'help': 'The rank of the LoRA module'})
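A hedged sketch of enabling the new flag. The quantized checkpoint path and the `Swift.prepare_model` call are assumptions for illustration; per the docstring, QA-LoRA only works on AutoGPTQ-quantized base models.

```python
from transformers import AutoModelForCausalLM
from swift import Swift  # assumed public entry point for applying tuners
from swift.tuners.lora import LoRAConfig

# Placeholder: an AutoGPTQ-quantized checkpoint, e.g. a qwen-7b-chat-int4 build.
model = AutoModelForCausalLM.from_pretrained('/dir/to/your/gptq/quantized/model')

config = LoRAConfig(
    r=6,               # rank of the LoRA module (default shown in the diff)
    use_qa_lora=True,  # use QA-LoRA instead of plain LoRA
)
model = Swift.prepare_model(model, config)
```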

swift/tuners/rome/rome.py

Lines changed: 14 additions & 3 deletions
@@ -24,10 +24,21 @@
@dataclass
class RomeConfig(SwiftConfig):
    """
-    The configuration class for the loRA module.
-
+    The configuration class for the ROME module.
+    This adapter can be used to inject or modify knowledge in a model, without any training.
+    ROME: [Rank-One Editing of Encoder-Decoder Models](https://arxiv.org/abs/2211.13317)
    Args:
-
+        model_type(`str`): The model type; currently llama-7b/llama-13b are supported.
+        tokenizer(`AutoTokenizer`): The tokenizer.
+        knowledge(`List[Dict]`): The knowledge to be injected into the model, in the format:
+            >>> [
+            >>>     {
+            >>>         "prompt": "{} was the founder of",
+            >>>         "subject": "Steve Jobs",
+            >>>         "target": "Microsoft"
+            >>>     }
+            >>> ]
    """
    model_type: str = field(default=None, metadata={'help': 'The model type'})
