🔥 MFTCoder-accelerate supports finetuning most mainstream open-source base models: codellama, llama2, llama, starcoder, codegeex2, chatglm2, and qwen.
## 2. Data Format
### 2.1 Training Data Format
The training data must be in a uniform JSONL format, where each line is a single JSON object in the following "chatML"-style format. The "chat_rounds" field is required; other fields can be added or removed based on specific needs.

We chose the "chatML" style as our training and inference data format because it is compatible with both "conversation" and "instruction/response" scenarios.
For the role keys in "chat_rounds", you can use either the "system/human/bot" tuple or the "system/user/assistant" tuple.
```json
{
    "chat_rounds": [
        {
            "role": "system",
            "content": "You are an expert in coding and help answer code questions"
        },
        {
            "role": "human",
            "content": "Write a python function of quick sort"
        },
        {
            "role": "bot",
            "content": "Below is the function of quick sort: ..."
        },
        {
            "role": "human",
            "content": "Explain the code"
        },
        {
            "role": "bot",
            "content": "OK, this code ..."
        }
    ]
}
```
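For reference, below is a minimal Python sketch (not part of MFTCoder itself; the file name and sample contents are made up for illustration) showing how samples in this format can be written to a JSONL file, one JSON object per line:

```python
import json

# Hypothetical example: two training samples in the chatML-style format.
samples = [
    {
        "chat_rounds": [
            {"role": "system", "content": "You are an expert in coding and help answer code questions"},
            {"role": "human", "content": "Write a python function of quick sort"},
            {"role": "bot", "content": "Below is the function of quick sort: ..."},
        ]
    },
    {
        "chat_rounds": [
            {"role": "user", "content": "Explain what a decorator is"},
            {"role": "assistant", "content": "A decorator is ..."},
        ]
    },
]

# Each line of the JSONL file holds exactly one JSON object.
with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```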
### 2.2 Default Inference Data Format
The inference data format is the actual string format consumed by tokenizers and then by LLMs. It is also the string format to which the training data is converted before tokenization.

The default inference data format consists of strings concatenated from the conversation data (system, human, and bot contents) in the training data format.
It is the data "seen" (before tokenization) by the model during training.
It is also used as input during the inference process.
Here is an example format of the inference string:

```
"""
<s>system
System instruction
...
"""
```
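To illustrate how training samples map onto such a string, here is a small Python sketch. The exact template (role markers, newlines, and any end-of-turn/EOS tokens) is defined by MFTCoder's tokenization code, so treat the format below as an assumption based on the example above, not as the authoritative template:

```python
# Hypothetical helper: concatenate chat rounds into an inference-style string.
# The "<s>{role}" prefix mirrors the example above; the real MFTCoder template
# (including EOS/end-of-turn handling) may differ.
def build_inference_string(chat_rounds):
    parts = []
    for chat_round in chat_rounds:
        parts.append(f"<s>{chat_round['role']}\n{chat_round['content']}\n")
    return "".join(parts)

chat_rounds = [
    {"role": "system", "content": "You are an expert in coding and help answer code questions"},
    {"role": "human", "content": "Write a python function of quick sort"},
]
print(build_inference_string(chat_rounds))
```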
Frequently used arguments are provided in ```configs/***_train_config``` and explained below:
- **pretrained_model_path**: Local/shared disk path or model name on HuggingFace for the pre-trained model.

- **weighted_loss_mode**: Loss weighting method for multitask training. "case3" is recommended at present; "self-paced" is supported but needs hyperparameter tuning.
- **padding_mode**: How tokenized data is arranged into sequences. "padding" pads each sample up to seq_length; "pack" packs as many samples as possible into each seq_length-long sequence (see the sketch below).
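Below is a small, framework-agnostic Python sketch of the difference between the two modes; the token IDs, pad ID, and seq_length are made up for illustration, and MFTCoder's actual data pipeline may differ in details:

```python
# Illustrative only: three tokenized samples and a target sequence length.
seq_length = 8
pad_id = 0
samples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

# "padding": each sample becomes its own sequence, padded to seq_length.
padded = [s + [pad_id] * (seq_length - len(s)) for s in samples]
# -> [[11, 12, 13, 0, 0, 0, 0, 0], [21, 22, 0, ...], [31, 32, 33, 34, 0, ...]]

# "pack": consecutive samples are concatenated so each sequence holds as
# many samples as fit, then padded to seq_length.
packed, current = [], []
for s in samples:
    if len(current) + len(s) > seq_length:
        packed.append(current + [pad_id] * (seq_length - len(current)))
        current = []
    current += s
if current:
    packed.append(current + [pad_id] * (seq_length - len(current)))
# -> [[11, 12, 13, 21, 22, 0, 0, 0], [31, 32, 33, 34, 0, 0, 0, 0]]
```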