
Commit d54642a

[application] add lora sft example (#6192)
* [application] add lora sft example
* update requirements
* update readme
* update comment
* update ci
1 parent d20c8ff commit d54642a

File tree

5 files changed: +565 -3 lines changed


.github/workflows/run_chatgpt_examples.yml

Lines changed: 1 addition & 2 deletions
```diff
@@ -31,13 +31,12 @@ jobs:
 
       - name: Install Colossal-AI
         run: |
-          BUILD_EXT=1 pip install --no-cache-dir -v -e .
+          pip install --no-cache-dir -v -e .
 
       - name: Install ChatGPT
         run: |
           cd applications/ColossalChat
           pip install --no-cache-dir -v .
-          export BUILD_EXT=1
           pip install --no-cache-dir -r examples/requirements.txt
 
       - name: Install Transformers
```

applications/ColossalChat/README.md

Lines changed: 33 additions & 1 deletion
```diff
@@ -29,6 +29,7 @@
 - [Alternative Option For RLHF: KTO](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto)
 - [O1 Journey](#o1-journey)
   - [Inference with Self-refined MCTS](#inference-with-self-refined-mcts)
+- [SFT for DeepSeek V3/R1](#sft-for-deepseek-v3)
 - [FAQ](#faq)
   - [How to save/load checkpoint](#faq)
   - [How to train with limited resources](#faq)
```

```diff
@@ -389,6 +390,37 @@ You can find more examples in this [repo](https://github.com/XueFuzhao/Instructi
 - Cannot abide by OpenAI's policy: When generating prompts from OpenAI API, it always abides by its policy. So no violation case is in the datasets.
 </details>
```

The rest of this hunk adds the new README section below:
## SFT for DeepSeek V3

We add a script to supervised fine-tune the DeepSeek V3/R1 model with LoRA. The script is located at `examples/training_scripts/lora_finetune.py`. It is similar to the SFT script for Coati7B, but with a few differences. This script is compatible with PEFT.

### Dataset preparation

This script takes a JSONL file as the input dataset. Each line of the dataset should be a list of chat messages. E.g.

```json
[{"role": "user", "content": "Hello, how are you?"}, {"role": "assistant", "content": "I'm doing great. How can I help you today?"}]
```

```json
[{"role": "user", "content": "火烧赤壁 曹操为何不拨打119求救?"}, {"role": "assistant", "content": "因为在三国时期,还没有电话和现代的消防系统,所以曹操无法拨打119求救。"}]
```

The dialogues can have multiple turns and can contain a system prompt. For more details, see the [chat_templating](https://huggingface.co/docs/transformers/main/chat_templating) documentation.
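As an illustration only (not part of this commit), a dataset file in this format can be written with the `jsonlines` package, which the loader itself uses; the file name and sample chat below are placeholders:

```python
import jsonlines

# Each entry is one training sample: a full chat that may start with a system prompt
# and may span several user/assistant turns.
samples = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
        {"role": "user", "content": "Can you tell me a joke?"},
        {"role": "assistant", "content": "Why did the scarecrow win an award? Because he was outstanding in his field."},
    ],
]

# Write one JSON-encoded chat per line, which is the JSONL layout the SFT script expects.
with jsonlines.open("sft_dataset.jsonl", mode="w") as writer:
    for chat in samples:
        writer.write(chat)
```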
### Model weights preparation

We use bf16 weights for fine-tuning. If you downloaded fp8 DeepSeek V3/R1 weights, you can use this [script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) to convert the weights to bf16 on GPU. For Ascend NPU, you can use this [script](https://gitee.com/ascend/ModelZoo-PyTorch/blob/master/MindIE/LLM/DeepSeek/DeepSeek-V2/NPU_inference/fp8_cast_bf16.py).

### Usage

After preparing the dataset and model weights, you can run the script with the following command:

```bash
colossalai run --hostfile path-to-host-file --nproc_per_node 8 lora_finetune.py --pretrained path-to-DeepSeek-R1-bf16 --dataset path-to-dataset.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 3 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora
```

For more details on each argument, you can run `python lora_finetune.py --help`.
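The introduction above notes that the script is compatible with PEFT; as an illustration only (not how `lora_finetune.py` itself is implemented), `--lora_rank 8 --lora_alpha 16` presumably correspond to the usual PEFT-style LoRA hyperparameters. The tiny `gpt2` model below is just a stand-in for demonstration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stand-in base model; the actual script fine-tunes DeepSeek V3/R1 with ColossalAI's MoE plugin.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# r and lora_alpha mirror --lora_rank 8 and --lora_alpha 16 from the sample command.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.0, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```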
The sample command does not use CPU offload, in order to get better throughput. The minimum hardware requirement for the sample command is 32 Ascend 910B NPUs (with `ep=8,pp=4`) or 24 H100/H800 GPUs (with `ep=8,pp=3`), i.e. `ep × pp` devices in total. If you enable CPU offload with `--zero_cpu_offload`, the hardware requirement can be further reduced.

## FAQ

<details><summary><b>How to save/load checkpoint</b></summary>
```diff
@@ -501,7 +533,7 @@ Thanks so much to all of our amazing contributors!
 - Keep in a sufficiently high running speed
 
 | Model Pair | Alpaca-7B ⚔ Coati-7B | Coati-7B ⚔ Alpaca-7B |
-| :-----------: | :------------------: | :------------------: |
+|:-------------:|:--------------------:|:--------------------:|
 | Better Cases | 38 ⚔ **41** | **45** ⚔ 33 |
 | Win Rate | 48% ⚔ **52%** | **58%** ⚔ 42% |
 | Average Score | 7.06 ⚔ **7.13** | **7.31** ⚔ 6.82 |
```

applications/ColossalChat/coati/dataset/loader.py

Lines changed: 75 additions & 0 deletions
```diff
@@ -8,6 +8,7 @@
 from dataclasses import dataclass
 from typing import Dict, Iterator, List, Optional, Sequence, Union
 
+import jsonlines
 import torch
 import torch.nn.functional as F
 from coati.dataset.utils import chuncate_sequence, pad_to_max_len
```
```diff
@@ -345,3 +346,77 @@ def __len__(self) -> int:
 
     def set_start_index(self, start_index: int) -> None:
         self.start_index = start_index
+
+
+def apply_chat_template_and_mask(
+    tokenizer: PreTrainedTokenizer,
+    chat: List[Dict[str, str]],
+    max_length: Optional[int] = None,
+    padding: bool = True,
+    truncation: bool = True,
+    ignore_idx: int = -100,
+) -> Dict[str, torch.Tensor]:
+    tokens = []
+    assistant_mask = []
+    for i, msg in enumerate(chat):
+        msg_tokens = tokenizer.apply_chat_template([msg], tokenize=True)
+        # remove unexpected bos token
+        if i > 0 and msg_tokens[0] == tokenizer.bos_token_id:
+            msg_tokens = msg_tokens[1:]
+        tokens.extend(msg_tokens)
+        if msg["role"] == "assistant":
+            assistant_mask.extend([True] * len(msg_tokens))
+        else:
+            assistant_mask.extend([False] * len(msg_tokens))
+    attention_mask = [1] * len(tokens)
+    if max_length is not None:
+        if padding and len(tokens) < max_length:
+            to_pad = max_length - len(tokens)
+            if tokenizer.padding_side == "right":
+                tokens.extend([tokenizer.pad_token_id] * to_pad)
+                assistant_mask.extend([False] * to_pad)
+                attention_mask.extend([0] * to_pad)
+            else:
+                tokens = [tokenizer.pad_token_id] * to_pad + tokens
+                assistant_mask = [False] * to_pad + assistant_mask
+                attention_mask = [0] * to_pad + attention_mask
+        if truncation and len(tokens) > max_length:
+            tokens = tokens[:max_length]
+            assistant_mask = assistant_mask[:max_length]
+            attention_mask = attention_mask[:max_length]
+    input_ids = torch.tensor(tokens, dtype=torch.long)
+    attention_mask = torch.tensor(attention_mask, dtype=torch.long)
+    labels = input_ids.clone()
+    labels[~torch.tensor(assistant_mask, dtype=torch.bool)] = ignore_idx
+
+    return {
+        "input_ids": input_ids,
+        "attention_mask": attention_mask,
+        "labels": labels,
+    }
+
+
+class RawConversationDataset(Dataset):
+    """
+    Raw conversation dataset.
+    Each instance is a dictionary with fields `system`, `roles`, `messages`, `offset`, `sep_style`, `seps`.
+    """
+
+    def __init__(self, tokenizer: PreTrainedTokenizer, input_file: str, max_length: int) -> None:
+        self.tokenizer = tokenizer
+        self.raw_texts = []
+        with jsonlines.open(input_file) as f:
+            for line in f:
+                self.raw_texts.append(line)
+        self.tokenized_texts = [None] * len(self.raw_texts)
+        self.max_length = max_length
+
+    def __len__(self) -> int:
+        return len(self.raw_texts)
+
+    def __getitem__(self, index: int):
+        if self.tokenized_texts[index] is None:
+            message = self.raw_texts[index]
+            tokens = apply_chat_template_and_mask(self.tokenizer, message, self.max_length)
+            self.tokenized_texts[index] = dict(tokens)
+        return self.tokenized_texts[index]
```
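A rough usage sketch of the new dataset class (not part of the commit), assuming a Hugging Face tokenizer that defines a chat template and a pad token; the checkpoint name and file path below are placeholders:

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

from coati.dataset.loader import RawConversationDataset

# Placeholder tokenizer; any tokenizer with a chat template and a pad token should work.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")

dataset = RawConversationDataset(tokenizer, "path-to-dataset.jsonl", max_length=256)
sample = dataset[0]
# Each sample contains `input_ids`, `attention_mask`, and `labels`;
# non-assistant tokens in `labels` are set to -100 so the loss ignores them.
print(sample["input_ids"].shape, (sample["labels"] != -100).sum())

# Samples are already padded/truncated to `max_length`, so default collation works.
loader = DataLoader(dataset, batch_size=4, shuffle=True)
```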
