Commit 987dbd7

update resume from checkpoint & update timeout (#4774)
1 parent ec5c7f6 commit 987dbd7

File tree

13 files changed

+123
-66
lines changed


README.md

Lines changed: 4 additions & 3 deletions
@@ -60,11 +60,12 @@ You can contact us and communicate with us by adding our group:
 - 🍎 **Model Types**: Supports 500+ pure text large models, **200+ multi-modal large models**, as well as All-to-All multi-modal models, sequence classification models, and embedding models, **covering the entire process from training to deployment**.
 - **Dataset Types**: Comes with 150+ pre-training, fine-tuning, human alignment, multi-modal datasets, and supports custom datasets.
 - **Hardware Support**: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
-- 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
-- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
+- **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
+- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, Megatron, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
+- 🍊 **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
+- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
 - 🍉 **Toolbox Capabilities**: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.

README_CN.md

Lines changed: 4 additions & 3 deletions
@@ -57,11 +57,12 @@
 - 🍎 **Model Types**: Supports 500+ pure text large models, **200+ multi-modal large models**, as well as All-to-All multi-modal models, sequence classification models, and embedding models, **covering the entire process from training to deployment**.
 - **Dataset Types**: Comes with 150+ built-in datasets for pre-training, fine-tuning, human alignment, and multi-modal tasks, and supports custom datasets.
 - **Hardware Support**: CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
-- 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
-- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
+- **Lightweight Training**: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
+- **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, Megatron, and other distributed training techniques.
 - **Quantization Training**: Supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
+- 🍊 **RLHF Training**: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
 - 🍓 **Multi-Modal Training**: Supports training models on different modalities such as images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
+- 🥥 **Megatron Parallelism**: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - **Interface Training**: Provides training, inference, evaluation, and quantization capabilities through an interface, completing the whole large model pipeline.
 - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
 - 🍉 **Toolbox Capabilities**: In addition to training support for large models and multi-modal large models, it also covers the entire process of inference, evaluation, quantization, and deployment.

docs/source/GetStarted/快速开始.md

Lines changed: 4 additions & 3 deletions
@@ -5,11 +5,12 @@ ms-swift is the large model and multi-modal large model training and deployment framework provided by the ModelScope community
 - 🍎 Model Types: Supports 500+ pure text large models, 200+ multi-modal large models, as well as All-to-All multi-modal models, sequence classification models, and embedding models, covering the entire process from training to deployment.
 - Dataset Types: Comes with 150+ built-in datasets for pre-training, fine-tuning, human alignment, and multi-modal tasks, and supports custom datasets.
 - Hardware Support: CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS, etc.
-- 🍊 Lightweight Training: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
-- Distributed Training: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques.
+- Lightweight Training: Supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel.
+- Distributed Training: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, Megatron, and other distributed training techniques.
 - Quantization Training: Supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
+- 🍊 RLHF Training: Supports human alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both pure text and multi-modal large models.
 - 🍓 Multi-Modal Training: Supports training models on different modalities such as images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding.
+- 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - Interface Training: Provides training, inference, evaluation, and quantization capabilities through an interface, completing the whole large model pipeline.
 - Plugin and Extension: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
 - 🍉 Toolbox Capabilities: In addition to training support for large models and multi-modal large models, it also supports the entire process of inference, evaluation, quantization, and deployment.

docs/source_en/GetStarted/Quick-start.md

Lines changed: 4 additions & 3 deletions
@@ -5,11 +5,12 @@ ms-swift is a comprehensive training and deployment framework for large language
 - 🍎 Model Types: Supports 500+ pure text large models, 200+ multi-modal large models, as well as All-to-All multi-modal models, sequence classification models, and embedding models, covering the entire process from training to deployment.
 - Dataset Types: Comes with more than 150 pre-built datasets for pre-training, fine-tuning, human alignment, multimodal, and supports custom datasets.
 - Hardware Support: Compatible with CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, MPS and others.
-- 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
-- Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, and other distributed training technologies.
+- Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more.
+- Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, Megatron, and other distributed training technologies.
 - Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
-- RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
+- 🍊 RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
 - 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
+- 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
 - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
 - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc.
 - 🍉 Toolbox Capabilities: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.

swift/llm/data_loader.py

Lines changed: 8 additions & 1 deletion
@@ -77,9 +77,10 @@ def __iter__(self):


 class DataLoaderDispatcher:

-    def __init__(self, base_dataloader, device=None):
+    def __init__(self, base_dataloader, device=None, skip_batches: int = 0):
         self.base_dataloader = base_dataloader
         self.device = device
+        self.skip_batches = skip_batches

     @property
     def rank(self):
@@ -101,8 +102,14 @@ def _scatter_object_list(self, inputs):
         dist.scatter_object_list(outputs, inputs, global_src_rank, group=self.group)
         return outputs[0]

+    def _skip_batches(self, base_iter):
+        if self.rank == 0 and self.skip_batches > 0:
+            for _ in range(self.skip_batches):
+                [next(base_iter) for _ in range(self.world_size)]
+
     def __iter__(self):
         base_iter = iter(self.base_dataloader)
+        self._skip_batches(base_iter)
         while True:
             if self.rank == 0:
                 try:

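The resume-from-checkpoint change above can be sketched in isolation. Below is a hypothetical single-process stand-in (`ToyDispatcher`, `world_size`, and `skip_batches` are illustrative names; the real class coordinates ranks via `torch.distributed`) that discards the batches already consumed before the checkpoint:

```python
class ToyDispatcher:
    """Minimal sketch, not the real DataLoaderDispatcher."""

    def __init__(self, base_dataloader, world_size=1, skip_batches=0):
        self.base_dataloader = base_dataloader
        self.world_size = world_size
        self.skip_batches = skip_batches  # dispatch steps to skip on resume

    def __iter__(self):
        base_iter = iter(self.base_dataloader)
        # Each dispatch step pulls one micro-batch per rank, so skipping
        # `skip_batches` steps consumes skip_batches * world_size items.
        for _ in range(self.skip_batches):
            for _ in range(self.world_size):
                next(base_iter, None)
        yield from base_iter

batches = list(range(10))
resumed = list(ToyDispatcher(batches, world_size=2, skip_batches=3))
print(resumed)  # -> [6, 7, 8, 9]  (3 steps * 2 ranks = 6 batches skipped)
```

In the actual commit, only rank 0 performs the skipping, since rank 0 is the rank that reads from the base dataloader and scatters batches to the other ranks.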
swift/llm/dataset/media.py

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ def _safe_download(media_type: Union[str, List[str]],
                     'you can manually download the resources and extracting to the local dir.')
         logger.info('Now begin.')
         download_config = DownloadConfig(cache_dir=MediaResource.cache_dir)
-        download_config.storage_options = {'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=3600)}}
+        download_config.storage_options = {'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=86400)}}
         if file_type == 'file':
             filename = media_type.split('/')[-1]
             final_path = os.path.join(final_folder, filename)

swift/llm/dataset/utils.py

Lines changed: 11 additions & 16 deletions
@@ -395,29 +395,23 @@ def __init__(
     def _processor(self):
         while True:
             i, data = self._in_queue.get()
-            if data is None:
-                encoded_data = None
-            else:
-                encoded_data = self._encode_data(data)
+            encoded_data = self._encode_data(data)
             self._out_queue.put((i, encoded_data))

-    def _put_data_in_queue(self, iterator):
+    def _put_data_in_queue(self, iterator) -> int:
         for i in range(self.packing_interval):
             try:
                 data = next(iterator)
             except StopIteration:
-                self._in_queue.put((i, None))
-                return True
+                return i
             self._in_queue.put((i, data))
-        return False
+        return i + 1

-    def _fetch_data_out_queue(self, last_res):
-        res = [None] * self.packing_interval
-        for _ in range(self.packing_interval):
+    def _fetch_data_out_queue(self, last_res, num_samples):
+        res = [None] * num_samples
+        for _ in range(num_samples):
             i, data = self._out_queue.get()
-            if data is None:
-                break
-            elif not data:
+            if not data:
                 continue
             res[i] = (data, len(data['input_ids']))
         res = [data for data in res if data]
@@ -442,8 +436,9 @@ def __iter__(self):
         iterator = iter(self.dataset)
         data = []
         while True:
-            finished = self._put_data_in_queue(iterator)
-            data = self._fetch_data_out_queue(data)
+            num_samples = self._put_data_in_queue(iterator)
+            finished = num_samples != self.packing_interval
+            data = self._fetch_data_out_queue(data, num_samples)
             res, data = self.calculate_matched_group(self.template, data, is_finished=finished)
             yield from res
             if finished:

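The control-flow change above can be illustrated with a minimal single-threaded sketch (`put_data_in_queue` here is a hypothetical stand-alone function, not the real method): the producer now reports how many samples it actually enqueued, and the caller derives `finished` by comparing that count against the packing interval, instead of threading `None` sentinels through the worker queues.

```python
from queue import Queue

def put_data_in_queue(iterator, in_queue, interval):
    # Mirrors the new return contract: number of samples actually enqueued.
    for i in range(interval):
        try:
            data = next(iterator)
        except StopIteration:
            return i  # fewer than `interval`: the dataset is exhausted
        in_queue.put((i, data))
    return i + 1  # a full interval was enqueued

in_queue = Queue()
num_samples = put_data_in_queue(iter(['a', 'b', 'c']), in_queue, interval=5)
finished = num_samples != 5  # exhausted before filling the interval
print(num_samples, finished)  # -> 3 True
```

This also lets the consumer side drain exactly `num_samples` items, so the processor thread no longer needs a special case for `None` payloads.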
swift/llm/infer/infer_engine/grpo_vllm_engine.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 try:
     # After setting the environment variables, import vllm. This way of writing allows lint to pass.
     os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
-    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '3600'
+    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '86400'

 except Exception:
     raise

swift/llm/infer/infer_engine/infer_client.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ def __init__(self,
                  api_key: str = 'EMPTY',
                  *,
                  base_url: Optional[str] = None,
-                 timeout: Optional[int] = 3600) -> None:
+                 timeout: Optional[int] = 86400) -> None:
         """
         Initialize the InferClient.


swift/llm/infer/infer_engine/vllm_engine.py

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
 try:
     # After setting the environment variables, import vllm. This way of writing allows lint to pass.
     os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
-    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '3600'
+    os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '86400'
     import vllm
     from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams, EngineArgs, LLMEngine
 except Exception:

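All the timeout changes in this commit move from 3600 s (1 h) to 86400 s (24 h); the vLLM-related files set the value via environment variable before `import vllm`, since vLLM reads it at import / engine-construction time. A minimal sketch of that ordering, without importing vllm itself:

```python
import os

# Must be in os.environ before `import vllm` for the engine to pick it up.
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'
os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S'] = '86400'  # was '3600' (1 h)

hours = int(os.environ['VLLM_ENGINE_ITERATION_TIMEOUT_S']) // 3600
print(hours)  # -> 24
```

The longer window avoids spurious engine-iteration and download timeouts during long generation or large media downloads.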