
Commit 78298c6

refactor, support FSDP as experimental feature
1 parent df5a730 commit 78298c6

File tree

8 files changed: +168 additions, -109 deletions


mftcoder_accelerate/README.md

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-# MFTCoder-accelerate: Training Framework with accelerate and deepspeed
+# MFTCoder-accelerate: Training Framework with Accelerate and DeepSpeed/FSDP
 [![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai)
 <a href="https://github.com/codefuse-ai/MFTCoder/blob/main/LICENSE">
     <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue">
@@ -160,7 +160,7 @@ Frequently used arguments are provided in ```configs/***_train_config``` and exp

 - **attn_implementation**: "flash_attention_2" or "eager" or "sdpa", worked when model is supported by transformers officially

-- **peft_type**: either "lora" or "qlora".
+- **peft_type**: null, "lora", or "qlora"; null means full-parameter training.

 - **lora_rank**: Rank value for Lora.

@@ -170,11 +170,11 @@ Frequently used arguments are provided in ```configs/***_train_config``` and exp

 - **target_modules**: List of target modules in lora, we have default values if None

-- **quantization**: Whether to use quantization."4bit" or "8bit", or null. For QLoRA, it is recommended to use 4-bit quantization.
+- **quantization**: "4bit" for QLoRA; null for LoRA and full-parameter training.

 - **pretrained_model_path**: Local/Shared disk path or model name on HuggingFace for the pre-trained model.

-- **weighted_loss_mode**: Loss weighting method for multitask training. "case3" is recommended at present.
+- **weighted_loss_mode**: Loss weighting method for multitask training. "case3" is recommended at present; "self-paced" is supported but needs hyper-parameter tuning.

 - **padding_mode**: The way tokenized data is set. "padding" means padding for each sample to seq_length, "pack" means putting samples into seq_length as many as possible.

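For reference, a minimal, hypothetical excerpt of a `configs/*_train_config.json` combining the arguments above for full-parameter training; the field names come from this README, and the values are only illustrative, not the repo's defaults:

```json
{
  "model_type": "llama",
  "attn_implementation": "flash_attention_2",
  "peft_type": null,
  "quantization": null,
  "weighted_loss_mode": "case3",
  "padding_mode": "pack",
  "pretrained_model_path": "path/to/pretrained_model"
}
```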
mftcoder_accelerate/README_cn.md

Lines changed: 11 additions & 5 deletions
@@ -1,4 +1,4 @@
-# MFTCoder: Accelerate + DeepSpeed Framework
+# MFTCoder: Accelerate + DeepSpeed/FSDP Framework
 [![Generic badge](https://img.shields.io/badge/🤗-Huggingface%20Repo-green.svg)](https://huggingface.co/codefuse-ai)
 <a href="https://github.com/codefuse-ai/MFTCoder/blob/main/LICENSE">
     <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/transformers.svg?color=blue">
@@ -135,11 +135,11 @@ _**The arguments needed for training are configured in ```configs/*_train_config```; the main arguments

 - load_raw_dataset: must stay true; other data formats will be supported later, currently only jsonl input is supported
 - data_paths: "[path1,path2,path3]", the input data paths as a string wrapped in [] and separated by ```,```; each path is a directory whose last-level directory name is used as the task name, and it contains one or more jsonl files
-- output_dir: training output directory, storing checkpoints, lora_adaptor, etc.
+- output_dir: training output directory, storing checkpoints (for full-parameter training), lora_adaptor (for LoRA or QLoRA), etc.
 - tb_dir: directory for tensorboard logs, etc.
-- model_type: "llama|starcoder|chatglm2|qwen|gpt_nex"
+- model_type: "mixtral|mistral|deepseek|llama|starcoder|chatglm2|qwen|gpt_neox"
 - attn_implementation: "flash_attention_2" or "eager"
-- peft_type: lora or qlora
+- peft_type: lora or qlora or null (full-parameter fine-tuning)
 - lora_rank: lora rank
 - lora_alpha: lora alpha
 - lora_dropout: lora dropout
@@ -234,7 +234,13 @@ CUDA_VISIBLE_DEVICES=0,1 accelerate launch --config_file pefts/accelerate_ds_con

 If you can set up your own environment with torch>=2.1.1, you can try setting "attn_implementation" to "sdpa". This will try to use the transformers-compatible torch.nn.functional.scaled_dot_product_attention; not all models are supported.

-#### Q5: What are the differences among the currently supported models?
+#### Q5: In FSDP mode, what should be noted when using LoRA + Flash Attention?
+In FSDP mode, because of dtype-unification issues, Flash Attention requires the query, key, and value projections to be added to target_modules together; adapting to this does not affect the final result.
+
+In FSDP mode, QLoRA is not supported, because support for int dtypes is still incomplete.
+
+
+#### Q6: What are the differences among the currently supported models?
 Domestic LLMs such as chatglm2, chatglm3, baichuan2, qwen, aquila2, etc. use the modeling_xxx.py released together with the model.
 Other LLMs officially supported by transformers have been upgraded to support flash attention, etc., so training has fully switched to the official modeling; the previous custom modeling will be deprecated.

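For the FSDP + LoRA + Flash Attention case in Q5 above, the projection-module names depend on the model. A hypothetical train-config fragment for a LLaMA-style model might look like the following; the module names q_proj/k_proj/v_proj are an assumption about the architecture, not values taken from this commit:

```json
{
  "peft_type": "lora",
  "quantization": null,
  "target_modules": ["q_proj", "k_proj", "v_proj"]
}
```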
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+compute_environment: LOCAL_MACHINE
+deepspeed_config: {}
+distributed_type: FSDP
+downcast_bf16: 'no'
+dynamo_backend: 'NO'
+fsdp_config:
+  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+  fsdp_backward_prefetch_policy: BACKWARD_PRE
+  fsdp_offload_params: false
+  fsdp_sharding_strategy: 1
+  fsdp_state_dict_type: FULL_STATE_DICT
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
+machine_rank: 0
+main_training_function: main
+megatron_lm_config: {}
+mixed_precision: bf16
+num_machines: 1
+num_processes: 2
+rdzv_backend: static
+same_network: true
+use_cpu: false

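A minimal sketch of how such an accelerate config file could be used to launch training. The new YAML's file name is not shown in this commit view, so the path below is an assumption; the entry point and the --distributed_type flag follow the launch scripts later in this commit:

```bash
# assumed path/name for the FSDP accelerate config shown above
accelerate launch --config_file pefts/accelerate_fsdp_config.yaml \
    pefts/mft_accelerate.py --train_config configs/"xxx_train_config.json" \
    --distributed_type "FSDP"
```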
mftcoder_accelerate/src/data/multi_task_dataset.py

Lines changed: 26 additions & 25 deletions
@@ -146,23 +146,22 @@ def __getitem__(self, idx):


 def shuffle_arrays(arrays, set_seed=-1):
-    """Shuffles arrays in-place, in the same order, along axis=0
+    """Shuffles arrays in-place, in the same order, along axis=0

-    Parameters:
-    -----------
-    arrays : List of NumPy arrays.
-    set_seed : Seed value if int >= 0, else seed is random.
-    """
-    assert all(len(arr) == len(arrays[0]) for arr in arrays)
-    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed
+    Parameters:
+    -----------
+    arrays : List of NumPy arrays.
+    set_seed : Seed value if int >= 0, else seed is random.
+    """
+    assert all(len(arr) == len(arrays[0]) for arr in arrays)
+    seed = np.random.randint(0, 2 ** (32 - 1) - 1) if set_seed < 0 else set_seed

-    for arr in arrays:
-        rstate = np.random.RandomState(seed)
-        rstate.shuffle(arr)
+    for arr in arrays:
+        rstate = np.random.RandomState(seed)
+        rstate.shuffle(arr)


 def load_dataset_from_jsonl(args, shard_data=False, world_size=1, global_rank=0, local_rank=0):
-
     # tokenization encoder
     encoder = UniformEncoder(args, args.tokenize_mode)
     encoder.initializer()
@@ -213,13 +212,13 @@ def load_dataset_from_jsonl(args, shard_data=False, world_size=1, global_rank=0,
             if shard_data and i % world_size != global_rank:
                 continue
             data = json.loads(line.rstrip('\n\r'))
-            features, length = encoder.encode(data, verbose=(i<1))
+            features, length = encoder.encode(data, verbose=(i < 1))
             # features, length = encoder.encode(data)
             # may have more samples
             for idx in range(len(features['input_ids'])):
                 cur_dataset_input_ids.append(features['input_ids'][idx])
                 cur_dataset_loss_mask.append(features['loss_mask'][idx])
-
+
         fin.close()
     else:
         i = 0
@@ -236,31 +235,33 @@ def load_dataset_from_jsonl(args, shard_data=False, world_size=1, global_rank=0,
                 cur_dataset_input_ids.append(features['input_ids'][idx])
                 cur_dataset_loss_mask.append(features['loss_mask'][idx])
         fin.close()
-
+
         cur_dataset_input_ids = np.array(cur_dataset_input_ids, dtype=np.float32)
         cur_dataset_loss_mask = np.array(cur_dataset_loss_mask, dtype=np.float32)
         cur_dataset_num_tokens = np.sum(cur_dataset_loss_mask, dtype=np.int32)
         cur_dataset_sample_num = len(cur_dataset_input_ids)
         num_tokens.append(cur_dataset_num_tokens)
         total_sample_cnt.append(cur_dataset_sample_num)
         effective_token_rate.append(cur_dataset_num_tokens / (cur_dataset_sample_num * args.seq_length))
-
+
         # shuffle before split
         shuffle_arrays([cur_dataset_input_ids, cur_dataset_loss_mask], args.seed)
         train_ratio = splits[0] / 100.0
         train_num = int(math.ceil(train_ratio * cur_dataset_sample_num))
         # split train/valid
-        cur_train_input_ids, cur_valid_input_ids = cur_dataset_input_ids[: train_num], cur_dataset_input_ids[train_num: ]
-        cur_train_loss_mask, cur_valid_loss_mask = cur_dataset_loss_mask[: train_num], cur_dataset_loss_mask[train_num: ]
+        cur_train_input_ids, cur_valid_input_ids = cur_dataset_input_ids[: train_num], cur_dataset_input_ids[train_num:]
+        cur_train_loss_mask, cur_valid_loss_mask = cur_dataset_loss_mask[: train_num], cur_dataset_loss_mask[train_num:]
         local_train_num += train_num
         local_valid_num += (cur_dataset_sample_num - train_num)

-        cur_train_dataset = {'input_ids': cur_train_input_ids,
-                             'loss_mask': cur_train_loss_mask
-                             }
-        cur_valid_dataset = {'input_ids': cur_valid_input_ids,
-                             'loss_mask': cur_valid_loss_mask
-                             }
+        cur_train_dataset = {
+            'input_ids': cur_train_input_ids,
+            'loss_mask': cur_train_loss_mask
+        }
+        cur_valid_dataset = {
+            'input_ids': cur_valid_input_ids,
+            'loss_mask': cur_valid_loss_mask
+        }
         print(f"[Global Rank {global_rank}]shape of cur train dataset: {cur_train_dataset['input_ids'].shape}")
         print(f"[Global Rank {global_rank}]shape of cur valid dataset: {cur_valid_dataset['input_ids'].shape}")

@@ -339,7 +340,7 @@ def load_dataset_from_jsonl(args, shard_data=False, world_size=1, global_rank=0,
         all_train_datasets[i].update_ds_weight(train_loss_weights[i] / factor)
         print(f'loss weight of train dataset {i} after update in rank {global_rank}: {all_train_datasets[i].ds_weight}')
     blending_train_dataset = GPT2BlendableDataset(all_train_datasets, train_sample_weights, global_train_num, local_train_num)
-
+
     for i in range(len(all_train_datasets)):
         print(f'loss weight of valid dataset {i} before update in rank {global_rank}: {all_train_datasets[i].ds_weight}')
     blending_valid_dataset = None
mftcoder_accelerate/src/ds_single_launch.sh

Lines changed: 4 additions & 4 deletions
@@ -4,15 +4,14 @@ N_GPU_PER_NODE=8
 # envs used inside training
 export OMP_NUM_THREADS=4
 export TOKENIZERS_PARALLELISM=False
-MYHOME=path/to/your/log
+
 TODAY=$(date +%Y-%m%d-%H%M)

 # accelerate launch --config_file accelerate_ds_config.yaml \
 accelerate launch \
     --num_machines 1 \
-    --num_processes $(($N_GPU_PER_NODE)) \
+    --num_processes $N_GPU_PER_NODE \
     --use_deepspeed \
-    --deepspeed_multinode_launcher 'standard' \
     --zero_stage 2 \
     --offload_optimizer_device 'cpu' \
     --offload_param_device 'none' \
@@ -27,4 +26,5 @@ accelerate launch \
     --machine_rank 0 \
     --rdzv_backend 'static' \
     pefts/mft_accelerate.py --train_config configs/"xxx_train_config.json" \
-    > $MYHOME/logs/MFTCoder-training-$TODAY.log 2>&1 &
+    --distributed_type "DeepSpeed" \
+    > MFTCoder-training-"$TODAY".log 2>&1 &
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Launch script on single node
+N_GPU_PER_NODE=8
+
+# envs used inside training
+export OMP_NUM_THREADS=4
+export TOKENIZERS_PARALLELISM=False
+
+TODAY=$(date +%Y-%m%d-%H%M)
+
+accelerate launch \
+    --use_fsdp \
+    --num_machines=1 \
+    --num_processes=2 \
+    --fsdp_sharding_strategy=1 \
+    --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
+    --fsdp_state_dict_type=FULL_STATE_DICT \
+    --fsdp_backward_prefetch_policy=BACKWARD_PRE \
+    --fsdp_transformer_layer_cls_to_wrap=LlamaDecoderLayer \
+    --fsdp_offload_params=false \
+    --main_training_function=main \
+    --mixed_precision=bf16 \
+    --dynamo_backend=no \
+    --same_network \
+    --machine_rank=0 \
+    --rdzv_backend=static \
+    pefts/mft_accelerate.py --train_config configs/"xxx_train_config.json" \
+    --distributed_type "FSDP" \
+    > MFTCoder-training-"$TODAY".log 2>&1 &
+

mftcoder_accelerate/src/pefts/arguments.py

Lines changed: 1 addition & 0 deletions
@@ -154,6 +154,7 @@ class TrainArgs:
     # role_markers: {"system": "### System:\n", "user": "### Instruction:\n", "assistant": "### Response:\n"}
     role_markers: Union[None, dict] = None

+    distributed_type: Union[None, str] = "deepspeed"
     # legacy, leave them
     use_xformers: bool = True
     trust_remote_code: bool = True

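A minimal sketch of how a training entry point might branch on the new TrainArgs.distributed_type field; this is illustrative glue code under assumed semantics ("deepspeed" as the default, "FSDP" for the experimental path), not the actual logic in pefts/mft_accelerate.py:

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TrainArgs:
    # mirrors the new field added in this commit; other fields omitted
    distributed_type: Union[None, str] = "deepspeed"


def select_backend(args: TrainArgs) -> str:
    """Normalize distributed_type and pick the training backend."""
    if args.distributed_type and args.distributed_type.lower() == "fsdp":
        # experimental FSDP path; per the README FAQ, QLoRA is not supported here
        return "fsdp"
    # default path: DeepSpeed ZeRO via accelerate
    return "deepspeed"


print(select_backend(TrainArgs(distributed_type="FSDP")))   # -> fsdp
print(select_backend(TrainArgs()))                          # -> deepspeed
```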
0 commit comments
