
Commit c7b9651

update readme (#137)
(cherry picked from commit e65f96c)
1 parent 5d70fc2 commit c7b9651

File tree

6 files changed: +105 -3 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ Users can check the [documentation of Swift](docs/source/GetStarted/Introduction

### 🎉 News

- 🔥 2023.11.07: Support fine-tuning of the yi-6b model. Scripts can be found at `scripts/yi_6b`.
- 🔥 2023.10.30: Support QA-LoRA and LongLoRA to decrease memory usage in training.
- 🔥 2023.10.30: Support ROME (Rank One Model Editing) to add or modify knowledge without any training!
- 🔥 2023.10.27: Support for the chatglm3 series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding shell script can be found in `scripts/chatglm3_6b_32k`.

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is an extensible

## 🎉 News

- 🔥 2023.11.07: Support the training and inference pipeline of the yi-6b model. Scripts are in `scripts/yi_6b`.
- 🔥 2023.10.30: Support two new tuners: QA-LoRA and LongLoRA.
- 🔥 2023.10.30: Support editing models with ROME (Rank One Model Editing) to inject new knowledge into a model without any training!
- 🔥 2023.10.27: Support the chatglm3 series models: chatglm3-6b-base, chatglm3-6b, chatglm3-6b-32k. The corresponding sh scripts can be found in `scripts/chatglm3_6b_32k`.

docs/source/GetStarted/Deployment.md

Lines changed: 97 additions & 0 deletions
@@ -66,3 +66,100 @@ curl http://localhost:8000/v1/completions \
```

vllm also supports launching the model and calling it from Python code; for details, see the [official vllm documentation](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html).

## chatglm.cpp

This inference-optimization framework supports:

- ChatGLM series models
- BaiChuan series models
- CodeGeeX series models

The GitHub address of chatglm.cpp is: https://github.com/li-plus/chatglm.cpp

First, initialize the corresponding repo:

```shell
git clone --recursive https://github.com/li-plus/chatglm.cpp.git && cd chatglm.cpp
python3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece
cmake -B build
cmake --build build -j --config Release
```

If the model trained with SWIFT is a LoRA model, the LoRA weights need to be merged into the original model first:

```shell
# First cd into the swift root directory
python tools/merge_lora_weights_to_model.py --model_id_or_path /dir/to/your/base/model --model_revision master --ckpt_dir /dir/to/your/lora/model
```

The merged model is written to the `{ckpt_dir}-merged` folder.
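Conceptually, the merge step folds the low-rank LoRA update back into the base weights: W' = W + (alpha / r) * B @ A. A minimal plain-Python sketch of that arithmetic (the function names and tiny matrices are illustrative only, not SWIFT's actual merge implementation):

```python
# Sketch of LoRA weight merging: W' = W + (alpha / r) * B @ A.
# Plain-Python matrices; the real script operates on model checkpoints.

def matmul(B, A):
    # (d x r) @ (r x k) -> (d x k)
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))]
            for i in range(len(B))]

def merge_lora(W, A, B, alpha, r):
    """Fold the scaled low-rank update B @ A into the base weight W."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight, 2x2
B = [[1.0], [0.0]]             # d x r, with rank r = 1
A = [[0.0, 2.0]]               # r x k
merged = merge_lora(W, A, B, alpha=4, r=1)
print(merged)  # [[1.0, 8.0], [0.0, 1.0]]
```

After merging, the adapter is no longer needed at inference time, which is why the cpp conversion below operates on the merged checkpoint.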

Next, convert the merged `{ckpt_dir}-merged` model weights into a bin file supported by cpp:

```shell
# First cd into the chatglm.cpp root directory
python3 chatglm_cpp/convert.py -i {ckpt_dir}-merged -t q4_0 -o chatglm-ggml.bin
```

chatglm.cpp supports converting models at various precisions; for details, refer to: https://github.com/li-plus/chatglm.cpp#getting-started
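The `-t q4_0` target stores weights as 4-bit integers in fixed-size blocks, with one floating-point scale per block. A simplified sketch of the idea (the real ggml `q4_0` format packs two codes per byte and uses a block size of 32; the helper names here are hypothetical):

```python
# Simplified block-wise 4-bit quantization in the spirit of ggml's q4_0:
# each block of weights shares one scale; values become small integers.

def quantize_block(values):
    """Quantize a block of floats to 4-bit codes in [-8, 7] plus one scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 7 if amax else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * v for v in q]

block = [0.1, -0.4, 0.7, 0.2]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
print(q)         # small integer codes
print(restored)  # close approximation of the original block
```

Lower-bit targets trade accuracy for memory in exactly this way: fewer code levels per block mean larger rounding error but a smaller bin file.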

The model can then be launched for inference:

```shell
./build/bin/main -m chatglm-ggml.bin -i
# The dialogue below shows the model after training on an agent dataset
# Prompt > how are you?
# ChatGLM3 > <|startofthink|>```JSON
# {"api_name": "greeting", "apimongo_instance": "ddb1e34-0406-42a3-a547a220a2", "parameters": {"text": "how are you?"}}}
# ```<|endofthink|>
#
# I'm an AI assistant and I can only respond to text input. I don't have the ability to respond to audio or video input.
```
121+
## XInference
122+
123+
XInference是XOrbits开源的推理框架,支持大多数LLM模型的python格式和cpp格式高效推理。github链接在:https://github.com/xorbitsai/inference,在使用chatglm.cpp转换成ggml格式之后就可以使用XInference进行推理。
124+
125+
首先安装依赖:
126+
127+
```shell
128+
pip install git+https://github.com/li-plus/chatglm.cpp.git@main
129+
pip install xinference -U
130+
```
131+
132+
之后启动xinference:
133+
134+
```shell
135+
xinference -p 9997
136+
```
137+
138+
在浏览器界面上选择Register Model选项卡,添加chatglm.cpp章节中转换成功的ggml模型:
139+
140+
![image.png](../resources/xinference.jpg)
141+
142+
注意:
143+
144+
- 模型能力选择Chat
145+
146+
之后再Launch Model中搜索刚刚创建的模型名称,点击火箭标识运行即可使用。
147+
148+
调用可以使用如下代码:
149+
150+
```python
151+
from xinference.client import Client
152+
153+
client = Client("http://localhost:9997")
154+
model_uid = client.launch_model(model_name="custom-chatglm")
155+
model = client.get_model(model_uid)
156+
157+
chat_history = []
158+
prompt = "What is the largest animal?"
159+
model.chat(
160+
prompt,
161+
chat_history,
162+
generate_config={"max_tokens": 1024}
163+
)
164+
# {'id': 'chatcmpl-df3c2c28-f8bc-4e79-9c99-2ae3950fd459', 'object': 'chat.completion', 'created': 1699367362, 'model': '021c2b74-7d7a-11ee-b1aa-ead073d837c1', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "According to records kept by the Guinness World Records, the largest animal in the world is the Blue Whale, specifically, the Right and Left Whales, which were both caught off the coast of Newfoundland. The two whales measured a length of 105.63 meters, or approximately 346 feet long, and had a corresponding body weight of 203,980 pounds, or approximately 101 tons. It's important to note that this was an extremely rare event and the whales that size don't commonly occur."}, 'finish_reason': None}], 'usage': {'prompt_tokens': -1, 'completion_tokens': -1, 'total_tokens': -1}}
165+
```
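For multi-turn conversations, the caller keeps appending messages to `chat_history` between calls. A sketch of that pattern using a stand-in stub instead of a live server (the `StubModel` class and `chat_turn` helper are illustrative only, not part of the XInference API):

```python
# Sketch of the multi-turn pattern behind model.chat(prompt, chat_history, ...):
# each turn appends the user prompt and assistant reply to chat_history.

class StubModel:
    """Stand-in for the XInference model handle (illustration only)."""
    def chat(self, prompt, chat_history, generate_config=None):
        reply = f"echo: {prompt}"
        # a real client returns an OpenAI-style chat.completion dict
        return {"choices": [{"message": {"role": "assistant", "content": reply}}]}

def chat_turn(model, prompt, chat_history):
    resp = model.chat(prompt, chat_history, generate_config={"max_tokens": 1024})
    reply = resp["choices"][0]["message"]["content"]
    chat_history.append({"role": "user", "content": prompt})
    chat_history.append({"role": "assistant", "content": reply})
    return reply

model = StubModel()
history = []
chat_turn(model, "What is the largest animal?", history)
chat_turn(model, "How heavy is it?", history)
print(len(history))  # 4: two user turns and two assistant replies
```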

examples/pytorch/llm/README.md

Lines changed: 3 additions & 2 deletions
@@ -83,8 +83,9 @@ cd examples/pytorch/llm
 pip install deepspeed -U

 # If you want to use qlora training based on auto_gptq (recommended, better performance than bnb):
-# Models using auto_gptq: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8
-pip install auto_gptq optimum -U
+# auto_gptq has a version mapping with CUDA versions, please refer to https://github.com/PanQiWei/AutoGPTQ#quick-installation
+pip install auto_gptq
+pip install optimum -U

 # If you want to use qlora training based on bnb:
 pip install bitsandbytes -U

examples/pytorch/llm/README_CN.md

Lines changed: 3 additions & 1 deletion
@@ -84,7 +84,9 @@ pip install deepspeed -U

 # If you want to use qlora training based on auto_gptq. (Recommended, works better than bnb)
 # Models using auto_gptq: qwen-7b-chat-int4, qwen-14b-chat-int4, qwen-7b-chat-int8, qwen-14b-chat-int8
-pip install auto_gptq optimum -U
+# auto_gptq has a version mapping with CUDA versions, please choose a version according to https://github.com/PanQiWei/AutoGPTQ#quick-installation
+pip install auto_gptq
+pip install optimum -U

 # If you want to use qlora training based on bnb.
 pip install bitsandbytes -U
