
Commit 85076f9

Merge branch 'main' into release/2.6
2 parents 1175563 + 4f7054f

5 files changed: +212 -7 lines changed

docs/source/Instruction/常见问题整理.md

Lines changed: 94 additions & 1 deletion
@@ -175,6 +175,43 @@ swift does not yet support vocabulary expansion.
### Q47: Multi-machine training is slow. When using the swift framework for LLM training, DeepSpeed ZeRO-3 causes a severe drop in training speed.
See this [issue](https://github.com/modelscope/ms-swift/issues/1825) for details.

### Q48: Does swift support multi-stage pre-training for qwen2-vl? The official best practices seem to train the ViT and LLM together during SFT; is it possible to fine-tune them separately?
See this [issue](https://github.com/modelscope/ms-swift/issues/2222) for details.

### Q49: Does qwen2-vl not support mixing in pure-text data?
It supports both image-text and pure-text data.

### Q50: Can loss curves be plotted per dataset during fine-tuning?
No, this is not supported; the datasets are mixed during training.

### Q51: After training, the model repeats a lot of content in its responses
See the [LLM fine-tuning documentation](https://swift.readthedocs.io/zh-cn/latest/Instruction/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.html). If repetition appears during training, train for a few more epochs, clean the data, use full-parameter training, or use RLHF to mitigate it.

### Q52: Does swift currently support prompt tuning or prefix tuning?
No. These two methods suffer from severe knowledge forgetting and are not recommended at the moment.

### Q53: Training on two A10 GPUs fails with the following error:
```text
[rank0]: torch.distributed.DistBackendError: NCCL error in:../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details),NCCL version 2.20.5
[rank0]:ncclSystemError: System call (e.g. socket,malloc) or external library call failed or device error.
```
Check whether the shared memory is too small; NCCL requires shared memory.

### Q54: During DDP fine-tuning, how do I fix the error caused by frozen layers whose parameters do not participate in gradient back-propagation?
Configure the parameter `--ddp_find_unused_parameters true`.

### Q55: Does swift have a dataset quality-inspection tool?
[data-juicer](https://github.com/modelscope/data-juicer)

### Q56: Where do I enable model parallelism in the web UI? I only found the checkbox for data parallelism.
Just specify the visible GPUs.

### Q57: How do I turn off automatic shuffling?
Currently you can only modify the transformers [code](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).

### Q58: What is the 'num_items_in_batch' parameter? I can't find it anywhere.
Upgrade to `ms-swift==2.5.2` or downgrade to `transformers<4.46`.

## Inference

### Q1: Is there documentation for swift inference?
@@ -224,6 +261,40 @@ Models trained with qlora do not support merge-lora; it is recommended to merge-lora after lora fine-tuning and then quantize.
### Q14: Has anyone encountered this problem? RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'
Upgrade torch; this version of torch has not implemented this operator.

### Q15: Does qwen2-audio support streaming inference?
Yes. See this [issue](https://github.com/modelscope/ms-swift/issues/1653) for details.

### Q16: Where do I set do_sample when running multimodal inference with the inference client?
Set temperature=0.

### Q17: Does ms-swift support batch processing for large models?
Yes. For Python-script inference, the request_list described in the documentation can contain multiple queries; for deployment, the server batches requests automatically. See [VLLM Inference Acceleration and Deployment](https://swift.readthedocs.io/zh-cn/latest/LLM/VLLM%E6%8E%A8%E7%90%86%E5%8A%A0%E9%80%9F%E4%B8%8E%E9%83%A8%E7%BD%B2.html#id3) for details.
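A minimal sketch of the Python-script route, assuming the ms-swift 2.x vLLM helpers and a hypothetical `qwen2-7b-instruct` model type; the linked document is the authoritative reference:
```python
# Sketch only: assumes the ms-swift 2.x vLLM helper API; see the linked document for the exact usage.
from swift.llm import (get_default_template_type, get_template, get_vllm_engine,
                       inference_vllm)

model_type = 'qwen2-7b-instruct'  # hypothetical choice of model_type
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)

# request_list can hold many queries; they are batched in a single call.
request_list = [{'query': 'Hello!'}, {'query': 'What is ModelScope?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(request['query'], '->', resp['response'])
```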

### Q18: When quantizing a model with ms-swift it reports insufficient memory. Can quantization use fewer resources, even if it is slower?
Try setting `--quant_device_map cpu`.

### Q19: Does swift support quantizing multimodal models?
Yes.

### Q20: I get the following error when using GPTQ. What is the reason?
```text
if llm_config['architectures'][0] == 'LlamaForCausalLM':
KeyError: 'architectures'
```
Try transformers version 4.44.*.

### Q21: How can swift infer save the evaluation results to a specified file? I never know where they are saved.
Set `--result_dir your_path`. See [InferArguments](https://github.com/modelscope/ms-swift/blob/main/swift/llm/utils/argument.py) for details.

### Q22: AWQ quantization of yi-vl-6b fails with:
```text
TypeError: swift.llm.utils.model.get_model_tokenizer_with_flash_attn() got multiple values for keyword argument 'automodel_class'.
```
Please use gptq quantization instead.

### Q23: I used swift export to run gptq int4 quantization on the qwen2.5 72B model, with the default max model length=32768 and a calibration dataset of 128 samples, but quantization fails with: "factorization could not be completed because the input is not positive-definite (the leading minor of order 18145 is not positive-definite)". What is the reason?
This is the Hessian matrix not being positive-definite; try a different calibration dataset.

## Deployment

### Q1: How do I deploy a trained model?
@@ -256,6 +327,16 @@ The base model can use client.chat.completions.create, but this is compatibility behavior.
### Q10: When deploying the qwen2vl model locally with vllm as the inference backend, how do I pass in a local video? Can base64 be used? How do I load a video with a curl call?
See the [multimodal LLM deployment documentation](https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/MLLM%E9%83%A8%E7%BD%B2%E6%96%87%E6%A1%A3.html). URL, base64, and local paths all work; local paths are limited to testing on the same machine.

### Q11: Deploying qwen2-vl fails with the error below. Is the vllm version wrong?
```text
Unrecognized keys in `rope_scaling`for 'rope_type'='default': {'mrope_section'} Unrecognized keys in `rope_scaling`for 'rope_type'='default': {'mrope_section'}
```
See this [issue](https://github.com/QwenLM/Qwen2-VL/issues/209) for details.

### Q12: Can swift inference output prediction probabilities? How do I set this up for deployment?
For Python-script inference, set `model.generation_config.output_logits = True, model.generation_config.return_dict_in_generate = True`.
For deployment, pass the parameters from the client: `logprobs=True, top_logprobs=5`.
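A minimal client-side sketch for the deployment case, assuming an OpenAI-compatible endpoint started with `swift deploy` on the default local port (host, port, and model name are assumptions):
```python
# Sketch only: assumes an OpenAI-compatible server (e.g. started by `swift deploy`) at localhost:8000.
from openai import OpenAI

client = OpenAI(api_key='EMPTY', base_url='http://127.0.0.1:8000/v1')
model = client.models.list().data[0].id  # pick the first deployed model

resp = client.chat.completions.create(
    model=model,
    messages=[{'role': 'user', 'content': 'Hello!'}],
    logprobs=True,     # return per-token log-probabilities
    top_logprobs=5,    # and the 5 most likely alternatives per position
)
print(resp.choices[0].message.content)
print(resp.choices[0].logprobs)
```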

## Evaluation

### Q1: What evaluation datasets does swift support?
@@ -305,5 +386,17 @@ swift eval --model_type 'qwen2_5-1_5b-instruct' --eval_dataset no --custom_eval_
```
This depends on the nltk package, and nltk's tokenizer needs to download a punkt_tab zip file, which is unstable or fails outright in some network environments in China. The code has been changed to provide a fallback and work around this problem; see this [issue](https://github.com/nltk/nltk/issues/3293).

### Q6: When evaluating a fine-tuned model, it always stops at a fixed percentage, while the vllm service appears to keep running normally. The larger the model, the earlier it disconnects.
Set the `TIMEOUT` environment variable to -1.

### Q7: Does evalscope support multi-model comparison?
See the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html) for details.

### Q8: Is there custom evaluation for multimodal datasets?
For multimodal custom evaluation, see the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/custom_dataset.html#vlm).

### Q9: Does ms-swift have a way to test QPS, latency, and tokens/s?
You can try evalscope's [model stress-testing tool](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test.html#id1).

### Q10: Can the number of dataset entries be controlled during evaluation? Evaluating MMLU takes over an hour, which is too slow.
Configure the parameter `--eval_limit`. It limits the number of entries per subset; for example, MMLU has 50+ subsets, so a limit of 10 per subset gives 500+ entries in total.

docs/source_en/Instruction/Common-QA.md

Lines changed: 94 additions & 1 deletion
@@ -176,6 +176,43 @@ When fine-tuning on V100, it saves in fp32 format.
### Q47: Multi-machine training speed is slow. When using the Swift framework for LLM training, we found that using DeepSpeed ZeRO-3 for training results in a severe speed decrease.
See the details in this [issue](https://github.com/modelscope/ms-swift/issues/1825).

### Q48: Does swift currently support multi-stage pre-training for qwen2-vl? I noticed in the official best practices that SFT seems to train the ViT and LLM together. Is it possible to finetune them separately?
See [issue](https://github.com/modelscope/ms-swift/issues/2222) for details.

### Q49: Does qwen2-vl not support mixing pure text data?
It supports both image-text and pure text data.

### Q50: Can we plot loss curves for different datasets during fine-tuning?
No, it's not supported. Datasets are trained in a mixed manner.

### Q51: After model training, the responses contain a lot of repetitive content
Refer to the [LLM Fine-tuning Documentation](https://swift.readthedocs.io/en/latest/Instruction/LLM-fine-tuning.html). If repetition occurs during training, try training for more epochs, clean the data, use full parameter training, or adopt RLHF methods to mitigate the issue.

### Q52: Does swift currently support prompt tuning or prefix tuning?
Not supported. These two methods suffer from severe knowledge forgetting, so they are not recommended at present.

### Q53: Training on two A10 GPUs reports the following error:
```text
[rank0]: torch.distributed.DistBackendError: NCCL error in:../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details),NCCL version 2.20.5
[rank0]:ncclSystemError: System call (e.g. socket,malloc) or external library call failed or device error.
```
Please check whether the shared memory is too small, as NCCL requires shared memory.

### Q54: How do I solve the problem of some parameters not participating in gradient backpropagation when freezing certain layers during DDP fine-tuning?
Configure the parameter `--ddp_find_unused_parameters true`.
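If you are scripting with the Hugging Face trainer rather than the swift CLI, the flag is assumed to correspond to the field of the same name on `TrainingArguments`; a minimal sketch:
```python
# Sketch only: the swift CLI flag is assumed to map to this TrainingArguments field.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='output',
    # Let DDP tolerate parameters that receive no gradient in a step,
    # e.g. because some layers are frozen.
    ddp_find_unused_parameters=True,
)
```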

### Q55: Does swift have a dataset quality inspection tool?
[data-juicer](https://github.com/modelscope/data-juicer)

### Q56: Where can I enable model parallelism in the web UI? I only found the checkbox for data parallelism.
Just specify the visible GPUs.

### Q57: How do I turn off automatic shuffling?
Currently, you can only modify the transformers [code](https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py).
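A minimal sketch of the kind of change involved, assuming you can swap in a custom trainer class; `_get_train_sampler` is a private transformers method whose name and signature may change between versions:
```python
# Sketch only: disable shuffling by overriding the sampler the Trainer builds
# for its training DataLoader (a private API, so pin your transformers version).
from torch.utils.data import SequentialSampler
from transformers import Seq2SeqTrainer


class NoShuffleTrainer(Seq2SeqTrainer):

    def _get_train_sampler(self):
        # The default implementation returns a RandomSampler; returning a
        # SequentialSampler keeps the original dataset order.
        return SequentialSampler(self.train_dataset)
```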

### Q58: What is the 'num_items_in_batch' parameter? I can't find it anywhere.
Upgrade to `ms-swift==2.5.2` or downgrade to `transformers<4.46`.

## Inference

### Q1: Is there documentation for Swift inference?
@@ -224,6 +261,40 @@ Modify `generation_config.output_logits`. Set `model.generation_config.output_lo
### Q14: Has anyone encountered this problem? RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'
Upgrade torch; this version of torch hasn't implemented this operator.

### Q15: Does qwen2-audio support streaming inference?
Yes, it does. For details, see this [issue](https://github.com/modelscope/ms-swift/issues/1653).

### Q16: Where do I set do_sample for multimodal model inference in the inference client?
Set temperature=0.

### Q17: Does ms-swift support batch processing for large models?
Yes, it does. When running inference with Python scripts, the request_list in the documentation can contain multiple queries. During deployment, the server will automatically handle batch processing. See [VLLM Inference Acceleration and Deployment](https://swift.readthedocs.io/en/latest/LLM/VLLM-inference-acceleration-and-deployment.html) for details.

### Q18: When quantizing models in ms-swift, it shows insufficient memory. Is it possible to use fewer resources during quantization, even if it is slower?
Try setting `--quant_device_map cpu`.

### Q19: Does swift support quantization for multimodal models?
Yes, it does.

### Q20: I'm getting the following error when using GPTQ. What's the reason?
```text
if llm_config['architectures'][0] == 'LlamaForCausalLM':
KeyError: 'architectures'
```
Try using transformers version 4.44.*.

### Q21: How can I save the evaluation results to a specified file in swift infer? I never know where they are being saved.
Set `--result_dir your_path`. See [InferArguments](https://github.com/modelscope/ms-swift/blob/main/swift/llm/utils/argument.py) for details.

### Q22: I'm getting the following error when AWQ quantizing yi-vl-6b:
```text
TypeError: swift.llm.utils.model.get_model_tokenizer_with_flash_attn() got multiple values for keyword argument 'automodel_class'.
```
Please use gptq quantization instead.

### Q23: I'm trying to use swift export for gptq int4 quantization of the qwen2.5 72B model, with max model length=32768 as the default value and a calibration dataset of 128 samples. However, I'm getting an error during quantization. The error log says: "factorization could not be completed because the input is not positive-definite (the leading minor of order 18145 is not positive-definite)". What's the reason?
This is an issue with the Hessian matrix not being positive-definite. Try using a different calibration dataset.

## Deployment

### Q1: How to deploy the trained model?
@@ -256,6 +327,16 @@ Inference settings can only be set before startup. For deployment, default setti
### Q10: When deploying the qwen2vl model locally with vllm as the inference backend, how can we input local videos? Can we use base64 encoding? How do we load videos when using curl?
You can refer to the [Multimodal LLM Deployment documentation](https://swift.readthedocs.io/en/latest/Multi-Modal/mutlimodal-deployment.html). URL, base64, and local file paths are all acceptable. Local file paths are only for testing on the same machine.

### Q11: When deploying qwen2-vl, the following error occurs. Is it due to an incorrect version of vllm?
```text
Unrecognized keys in `rope_scaling`for 'rope_type'='default': {'mrope_section'} Unrecognized keys in `rope_scaling`for 'rope_type'='default': {'mrope_section'}
```
See this [issue](https://github.com/QwenLM/Qwen2-VL/issues/209) for details.

### Q12: Can swift inference output prediction probabilities? How do I set this up during deployment?
For Python script inference, use `model.generation_config.output_logits = True, model.generation_config.return_dict_in_generate = True`.
During deployment, pass parameters from the client: `logprobs=True, top_logprobs=5`.
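A minimal sketch of the Python-script side, assuming a Hugging Face-style causal LM (the model id below is only an example) and a transformers release that supports `output_logits`:
```python
# Sketch only: reading per-step logits from a Hugging Face generate() call.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'Qwen/Qwen2-0.5B-Instruct'  # example model id; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map='auto')

model.generation_config.output_logits = True
model.generation_config.return_dict_in_generate = True

inputs = tokenizer('Hello!', return_tensors='pt').to(model.device)
out = model.generate(**inputs, max_new_tokens=16)

# out.sequences holds the generated token ids; out.logits is a tuple with one
# [batch_size, vocab_size] tensor per generated step.
step_probs = [step_logits.softmax(dim=-1) for step_logits in out.logits]
print(out.sequences.shape, len(step_probs))
```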

## Evaluation

### Q1: What evaluation datasets does Swift support?
@@ -305,5 +386,17 @@ swift eval --model_type 'qwen2_5-1_5b-instruct' --eval_dataset no --custom_eval_
```
This relies on the nltk package, and the nltk tokenizer needs to download a punkt_tab zip file, which can be unstable or fail directly in some environments in China. We have tried to modify the code to work around this issue; refer to this [issue](https://github.com/nltk/nltk/issues/3293).

### Q6: When evaluating a fine-tuned model, it always stops at a fixed percentage, but the vllm service seems to be running normally. The larger the model, the earlier it disconnects.
Set the `TIMEOUT` environment variable to -1.

### Q7: Does evalscope support multi-model comparison?
Please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html) for details.

### Q8: Is there custom evaluation for multimodal datasets?
For multimodal custom evaluation, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/custom_dataset.html#vlm).

### Q9: Does ms-swift have methods to test QPS, latency, and tokens/s?
You can try using evalscope's [model stress testing tool](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test.html#id1).

### Q10: Is it possible to control the number of dataset entries during evaluation? Evaluating one MMLU takes over an hour, which is too slow.
Configure the parameter `--eval_limit`. Here, `--eval_limit` controls the number of entries for each subset. For example, if MMLU has over 50 subsets and each is limited to 10 entries, it would be over 500 entries in total.

swift/llm/export.py

Lines changed: 19 additions & 1 deletion
@@ -81,6 +81,22 @@ def _get_dataset(*args, **kwargs):
     return res


+@contextmanager
+def _patch_move_embed(awq_model):
+    _origin_move_embed = awq_model.move_embed
+
+    def _move_embed(model, device: str):
+        # Skip the move when an accelerate hook already manages device
+        # placement and the target device is not the CPU.
+        if hasattr(model, '_hf_hook') and device != 'cpu':
+            return
+        _origin_move_embed(model, device)
+
+    awq_model.move_embed = _move_embed
+    try:
+        yield
+    finally:
+        awq_model.move_embed = _origin_move_embed
+
+
 def awq_model_quantize(awq_model, tokenizer, batch_size) -> None:

     from awq.quantize import quantizer
@@ -93,7 +109,8 @@ def awq_model_quantize(awq_model, tokenizer, batch_size) -> None:
     group_size = 128
     quant_config = {'zero_point': True, 'q_group_size': group_size, 'w_bit': _args.quant_bits, 'version': 'GEMM'}
     logger.info('Start quantizing the model...')
-    awq_model.quantize(tokenizer, quant_config=quant_config, n_parallel_calib_samples=batch_size)
+    with _patch_move_embed(awq_model):
+        awq_model.quantize(tokenizer, quant_config=quant_config, n_parallel_calib_samples=batch_size)
     quantizer.get_calib_dataset = _origin_get_calib_dataset  # recover
     awq_model.model.config.quantization_config = AwqConfig(
         bits=_args.quant_bits, group_size=group_size, zero_point=True, version='GEMM')
@@ -260,6 +277,7 @@ def llm_export(args: ExportArguments) -> None:
         from awq import AutoAWQForCausalLM
         model, template = prepare_model_template(
             args, device_map=args.quant_device_map, task='export', automodel_class=AutoAWQForCausalLM)
+        template.model = model.model
         awq_model_quantize(model, template.tokenizer, args.quant_batch_size)
         model.save_quantized(args.quant_output_dir)
     elif args.quant_method == 'gptq':

swift/llm/utils/utils.py

Lines changed: 4 additions & 3 deletions
@@ -317,7 +317,7 @@ def _single_map(d: Dict[str, Any], map_func: MapFunc) -> Optional[Dict[str, Any]

 def _map_mp_single(shard: HfDataset, map_func: MapFunc, queue: Queue, rank: int):
     batch_size = 64
-    pre_i = 0
+    pre_i = -1
     result = []
     for i, d in enumerate(shard):
         output = map_func(d)
@@ -336,7 +336,7 @@ def _map_mp_i(dataset: HfDataset, map_func: MapFunc, num_proc: int) -> Iterator[
         os.environ = pre_environ
         queue = manager.Queue()
         async_results = []
-        shard_list = [dataset.shard(num_proc, i) for i in range(num_proc)]
+        shard_list = [dataset.shard(num_proc, i, contiguous=True) for i in range(num_proc)]
         for i in range(num_proc):
             async_results.append(pool.apply_async(_map_mp_single, args=(shard_list[i], map_func, queue, i)))
         while True:
@@ -350,11 +350,12 @@ def _map_mp_i(dataset: HfDataset, map_func: MapFunc, num_proc: int) -> Iterator[
 def _map_mp(dataset: HfDataset, map_func: MapFunc, num_proc: int) -> List[Dict[str, Any]]:
     # Solving the unordered problem
     num_proc = min(num_proc, len(dataset))
-    data_list = [[]] * num_proc
+    data_list = [[] for _ in range(num_proc)]
     prog_bar = tqdm(total=len(dataset), desc=f'Map (num_proc={num_proc})', dynamic_ncols=True)
     for d in _map_mp_i(dataset, map_func, num_proc):
         data_list[d[0]] += d[1]
         prog_bar.update(d[2])
+    prog_bar.close()
     res = []
     for data in data_list:
         res += data
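
The `data_list` change above fixes a classic Python aliasing bug: `[[]] * n` repeats one list object n times, so results from every shard were written into the same list. A minimal demonstration, independent of swift:
```python
# `[[]] * n` creates n references to a single list; mutating one slot mutates all.
shared = [[]] * 3
shared[0].append('x')
print(shared)    # [['x'], ['x'], ['x']] -- every slot aliases the same list

# A comprehension builds a distinct list per slot, which is what the fix uses.
separate = [[] for _ in range(3)]
separate[0].append('x')
print(separate)  # [['x'], [], []]
```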

swift/trainers/rlhf_trainer/kto_trainer.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ def _add_kl_dataset(dataset: LLMDataset, total_batch_size: int, seed: Optional[i
             'labels': data['labels'],
             'KL_input_ids': kl_input_ids,
             'KL_labels': kl_labels,
-            'label': kl_data['label']
+            'label': data['label']
         })
         raw_dataset[i:i + total_batch_size] = new_dataset_group
         i += total_batch_size
