Commit 987dbd7

update resume from checkpoint & update timeout (#4774)
1 parent ec5c7f6 commit 987dbd7

File tree

13 files changed

+123
-66
lines changed


README.md

Lines changed: 4 additions & 3 deletions
@@ -60,11 +60,12 @@ You can contact us and communicate with us by adding our group:
 - 🍎 **Model Types**: Supports 500+ pure text large models, **200+ multi-modal large models**, as well as All-to-All multi-modal models, sequence classification models, and embedding models, **covering the entire process from training to deployment**.
 - **Dataset Types**: Comes with 150+ pre-training, fine-tuning, human alignment, multi-modal datasets, and supports custom datasets.
 - **Hardware Support**: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
-- 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
-- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
+- **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
+- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, Megatron, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
+- 🍊 **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
+- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
 - 🍉 **Toolbox Capabilities**: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.

README_CN.md

Lines changed: 4 additions & 3 deletions
@@ -57,11 +57,12 @@
 - 🍎 **Model Types**: Supports 500+ pure text large models, **200+ multi-modal large models**, as well as All-to-All multi-modal models, sequence classification models, and embedding models, **covering the entire process from training to deployment**.
 - **Dataset Types**: Comes with 150+ built-in datasets for pre-training, fine-tuning, human alignment, and multi-modal tasks, and supports custom datasets.
 - **Hardware Support**: CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
-- 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
-- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
+- **Lightweight Training**: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
+- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, Megatron, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
+- 🍊 **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training models on different modalities such as images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
+- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - **Interface Training**: Provides training, inference, evaluation, and quantization capabilities through an interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
 - 🍉 **Toolbox Capabilities**: In addition to training support for large models and multi-modal large models, it also covers the entire process of inference, evaluation, quantization, and deployment.

docs/source/GetStarted/快速开始.md

Lines changed: 4 additions & 3 deletions
@@ -5,11 +5,12 @@ ms-swift is the large model and multi-modal large model training and deployment framework provided by the ModelScope community
 - 🍎 Model Types: Supports 500+ pure text large models, 200+ multi-modal large models, as well as All-to-All multi-modal models, sequence classification models, and embedding models, covering the entire process from training to deployment.
 - Dataset Types: Comes with 150+ built-in datasets for pre-training, fine-tuning, human alignment, and multi-modal tasks, and supports custom datasets.
 - Hardware Support: CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
-- 🍊 Lightweight Training: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
-- Distributed Training: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
+- Lightweight Training: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
+- Distributed Training: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, Megatron, and other distributed training techniques.
 - Quantization Training: Supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
+- 🍊 RLHF Training: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
 - 🍓 Multi-Modal Training: Supports training models on different modalities such as images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
+- 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - Interface Training: Provides training, inference, evaluation, and quantization capabilities through an interface, completing the whole large model pipeline.
 - Plugin and Extension: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
 - 🍉 Toolbox Capabilities: In addition to training support for large models and multi-modal large models, it also supports the entire process of inference, evaluation, quantization, and deployment.

docs/source_en/GetStarted/Quick-start.md

Lines changed: 4 additions & 3 deletions
@@ -5,11 +5,12 @@ ms-swift is a comprehensive training and deployment framework for large language
 - 🍎 Model Types: Supports 500+ pure text large models, 200+ multi-modal large models, as well as All-to-All multi-modal models, sequence classification models, and embedding models, covering the entire process from training to deployment.
 - Dataset Types: Comes with more than 150 pre-built datasets for pre-training, fine-tuning, human alignment, multimodal, and supports custom datasets.
 - Hardware Support: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS and others.
-- 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
-- Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, and other distributed training technologies.
+- Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
+- Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, Megatron, and other distributed training technologies.
 - Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
+- 🍊 RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
 - 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
+- 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
 - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc.
 - 🍉 Toolbox Capabilities: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.

swift/llm/data_loader.py

Lines changed: 8 additions & 1 deletion
@@ -77,9 +77,10 @@ def __iter__(self):


 class DataLoaderDispatcher:

-    def __init__(self, base_dataloader, device=None):
+    def __init__(self, base_dataloader, device=None, skip_batches: int = 0):
         self.base_dataloader = base_dataloader
         self.device = device
+        self.skip_batches = skip_batches

     @property
     def rank(self):
@@ -101,8 +102,14 @@ def _scatter_object_list(self, inputs):
         dist.scatter_object_list(outputs, inputs, global_src_rank, group=self.group)
         return outputs[0]

+    def _skip_batches(self, base_iter):
+        if self.rank == 0 and self.skip_batches > 0:
+            for _ in range(self.skip_batches):
+                [next(base_iter) for _ in range(self.world_size)]
+
     def __iter__(self):
         base_iter = iter(self.base_dataloader)
+        self._skip_batches(base_iter)
         while True:
             if self.rank == 0:
                 try:

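The resume-from-checkpoint change above can be sketched in isolation. Below is a hypothetical single-process stand-in (`ToyDispatcher`, `world_size`, and `skip_batches` are illustrative names; the real class coordinates ranks via `torch.distributed`) that discards the batches already consumed before the checkpoint:

```python
class ToyDispatcher:
    """Minimal sketch, not the real DataLoaderDispatcher."""

    def __init__(self, base_dataloader, world_size=1, skip_batches=0):
        self.base_dataloader = base_dataloader
        self.world_size = world_size
        self.skip_batches = skip_batches  # dispatch steps to skip on resume

    def __iter__(self):
        base_iter = iter(self.base_dataloader)
        # Each dispatch step pulls one micro-batch per rank, so skipping
        # `skip_batches` steps consumes skip_batches * world_size items.
        for _ in range(self.skip_batches):
            for _ in range(self.world_size):
                next(base_iter, None)
        yield from base_iter

batches = list(range(10))
resumed = list(ToyDispatcher(batches, world_size=2, skip_batches=3))
print(resumed)  # -> [6, 7, 8, 9]  (3 steps * 2 ranks = 6 batches skipped)
```

In the actual commit, only rank 0 performs the skipping, since rank 0 is the rank that reads from the base dataloader and scatters batches to the other ranks.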
swift/llm/dataset/media.py

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ def _safe_download(media_type: Union[str, List[str]],
                     'you can manually download the resources and extracting to the local dir.')
         logger.info('Now begin.')
         download_config = DownloadConfig(cache_dir=MediaResource.cache_dir)
-        download_config.storage_options = {'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=3600)}}
+        download_config.storage_options = {'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=86400)}}
         if file_type == 'file':
             filename = media_type.split('/')[-1]
             final_path = os.path.join(final_folder, filename)

swift/llm/dataset/utils.py

Lines changed: 11 additions & 16 deletions
@@ -395,29 +395,23 @@ def __init__(
     def _processor(self):
         while True:
             i, data = self._in_queue.get()
-            if data is None:
-                encoded_data = None
-            else:
-                encoded_data = self._encode_data(data)
+            encoded_data = self._encode_data(data)
             self._out_queue.put((i, encoded_data))

-    def _put_data_in_queue(self, iterator):
+    def _put_data_in_queue(self, iterator) -> int:
         for i in range(self.packing_interval):
             try:
                 data = next(iterator)
             except StopIteration:
-                self._in_queue.put((i, None))
-                return True
+                return i
             self._in_queue.put((i, data))
-        return False
+        return i + 1

-    def _fetch_data_out_queue(self, last_res):
-        res = [None] * self.packing_interval
-        for _ in range(self.packing_interval):
+    def _fetch_data_out_queue(self, last_res, num_samples):
+        res = [None] * num_samples
+        for _ in range(num_samples):
             i, data = self._out_queue.get()
-            if data is None:
-                break
-            elif not data:
+            if not data:
                 continue
             res[i] = (data, len(data['input_ids']))
         res = [data for data in res if data]
@@ -442,8 +436,9 @@ def __iter__(self):
         iterator = iter(self.dataset)
         data = []
         while True:
-            finished = self._put_data_in_queue(iterator)
-            data = self._fetch_data_out_queue(data)
+            num_samples = self._put_data_in_queue(iterator)
+            finished = num_samples != self.packing_interval
+            data = self._fetch_data_out_queue(data, num_samples)
             res, data = self.calculate_matched_group(self.template, data, is_finished=finished)
             yield from res
             if finished:

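The control-flow change above can be illustrated with a minimal single-threaded sketch (`put_data_in_queue` here is a hypothetical stand-alone function, not the real method): the producer now reports how many samples it actually enqueued, and the caller derives `finished` by comparing that count against the packing interval, instead of threading `None` sentinels through the worker queues.

```python
from queue import Queue

def put_data_in_queue(iterator, in_queue, interval):
    # Mirrors the new return contract: number of samples actually enqueued.
    for i in range(interval):
        try:
            data = next(iterator)
        except StopIteration:
            return i  # fewer than `interval`: the dataset is exhausted
        in_queue.put((i, data))
    return i + 1  # a full interval was enqueued

in_queue = Queue()
num_samples = put_data_in_queue(iter(['a', 'b', 'c']), in_queue, interval=5)
finished = num_samples != 5  # exhausted before filling the interval
print(num_samples, finished)  # -> 3 True
```

This also lets the consumer side drain exactly `num_samples` items, so the processor thread no longer needs a special case for `None` payloads.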
swift/llm/infer/infer_engine/grpo_vllm_engine.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 try:
     # After setting the environment variables, import vllm. This way of writing allows lint to pass.
     os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
-    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '3600'
+    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '86400'

 except Exception:
     raise

swift/llm/infer/infer_engine/infer_client.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ def __init__(self,
                  api_key: str = 'EMPTY',
                  *,
                  base_url: Optional[str] = None,
-                 timeout: Optional[int] = 3600) -> None:
+                 timeout: Optional[int] = 86400) -> None:
         """
         Initialize the InferClient.


swift/llm/infer/infer_engine/vllm_engine.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
 try:
     # After setting the environment variables, import vllm. This way of writing allows lint to pass.
     os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
-    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '3600'
+    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '86400'
     import vllm
     from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams, EngineArgs, LLMEngine
 except Exception:

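All the timeout changes in this commit move from 3600 s (1 h) to 86400 s (24 h); the vLLM-related files set the value via environment variable before `import vllm`, since vLLM reads it at import / engine-construction time. A minimal sketch of that ordering, without importing vllm itself:

```python
import os

# Must be in os.environ before `import vllm` for the engine to pick it up.
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '86400'  # was '3600' (1 h)

hours = int(os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S']) // 3600
print(hours)  # -> 24
```

The longer window avoids spurious engine-iteration and download timeouts during long generation or large media downloads.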