
Commit 0c2294a

update FAQ (#2165)
1 parent 9bae58c commit 0c2294a

File tree

2 files changed: +150 -0 lines changed


docs/source/Instruction/常见问题整理.md

Lines changed: 75 additions & 0 deletions
@@ -100,6 +100,54 @@ swift sft \
### Q22: During training, if two datasets are simply appended together into the training set, does the model shuffle them internally during training, or does it take the data in order?
Shuffling happens in the trainer.

### Q23: With the model on two GPUs and data parallelism not enabled, DeepSpeed reports an error. How should this be handled?
`deepspeed` and `device_map` are incompatible; only one of the two can be used.

### Q24: Why does a dataset that was already downloaded during online training need to be downloaded again for offline retraining?
The data files contain URLs, so offline training is not supported.

### Q25: How can GPU memory usage be reduced when training a VLM model?
Set `--freeze_vit true`.

### Q26: Why does the WEB-UI support fewer models than the documentation?
Upgrade ms-swift.

### Q27: For a model that has no adapted model_type, can special_tokens and chat_template be customized for SFT?
Yes. Refer to the PRs for integrating models and to the custom model and dataset documentation.

### Q28: Can DPO be used to train qwen2-vl from a Python script?
Yes. Import `rlhf_main` and `RLHFArguments` from `swift.llm`.
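
For illustration only (not part of the original answer), a minimal DPO script might look like the sketch below; the argument names assume swift 2.x and the model/dataset values are placeholders:

```python
# Minimal sketch of DPO training for qwen2-vl from a Python script.
# `rlhf_main` and `RLHFArguments` come from `swift.llm` as noted above;
# the remaining argument names/values are placeholders and may differ by version.
from swift.llm import rlhf_main, RLHFArguments

if __name__ == '__main__':
    rlhf_main(RLHFArguments(
        rlhf_type='dpo',                      # assumed selector for DPO training
        model_type='qwen2-vl-7b-instruct',    # placeholder model
        dataset=['path/to/dpo_pairs.jsonl'],  # placeholder preference dataset
    ))
```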

### Q29: When training an MLLM, can pure-text pre-training be done first, followed by fine-tuning on a VQA dataset?
Yes. The two can also be mixed and trained together.

### Q30: When doing DPO training on a qwen2-based SFT model with a V100 machine, why is everything NaN during training?
V100 machines must use fp32 to train qwen2.

### Q31: Does swift support distillation?
No. Quantization is recommended instead; it works better.

### Q32: Has anyone run into this issue: cannot import name 'ftp_head' from 'datasets.utils.file_utils'?
`pip install datasets==2.*`

### Q33: Training currently keeps at most two checkpoints by default. How can more be kept?
Use `--save_total_limit`; see [Command Line Arguments](https://swift.readthedocs.io/zh-cn/latest/Instruction/%E5%91%BD%E4%BB%A4%E8%A1%8C%E5%8F%82%E6%95%B0.html) for details.

### Q34: In grounding tasks, does the general data format support multiple instances of one category?
Yes, one object with multiple bboxes is currently supported; see the documentation [InternVL Best Practice](https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/internvl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.html).

### Q35: Why does this error appear here: numpy.object not found?
Try `numpy==1.26.3`.

### Q36: Does the swift framework support sequence parallelism now?
Yes. It is implemented by introducing `xtuner`.

### Q37: When fine-tuning qwen2-1.5B on a V100, I get 'loss': 0.0, 'acc': 0.0, 'grad_norm': nan. What is the problem?
Try fp32.

### Q38: Can GPTQ-quantized models be fully fine-tuned?
No. The int-type parameters of a GPTQ model cannot take part in gradient computation; only attached structures such as LoRA can be updated.

## Inference

### Q1: Is there documentation for swift inference?
@@ -124,6 +172,18 @@ ValueError: Input length of input_ids is 35, but `max_length` is set to 20. This
### Q6: qwen2-vl inference runs out of GPU memory
Set the environment variables `SIZE_FACTOR=8 MAX_PIXELS=602112`; see the documentation [Qwen2-VL Best Practice](https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.html).
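
As a rough sketch (not part of the original answer), the same settings can also be applied from Python before the model is loaded; the `infer_main`/`InferArguments` names assume swift 2.x:

```python
# Sketch: cap Qwen2-VL image resolution to reduce GPU memory during inference.
# The environment variables must be set before the model/template is loaded.
import os

os.environ['SIZE_FACTOR'] = '8'
os.environ['MAX_PIXELS'] = '602112'

from swift.llm import infer_main, InferArguments  # assumed swift 2.x entry points

if __name__ == '__main__':
    infer_main(InferArguments(model_type='qwen2-vl-7b-instruct'))
```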

### Q7: On a V100 GPU, in a Python virtual environment, after preparing the environment following https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md, running the test inference command CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type qwen2-vl-7b-instruct reports: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Try running inference on an A10 or 3090 machine.

### Q8: After running the following command, where are the prediction results? CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir output/glm4v-9b-chat/vx-xxx/checkpoint-xxx-merged --load_dataset_config true
The path is printed in the logs.

### Q9: During inference, when calling inference, how can the output logits be obtained?
See https://github.com/modelscope/ms-swift/blob/main/tests/custom/test_logprobs.py.

### Q10: With the latest swift, when loading the qwen2-32b-instruct-awq quantized model and its LoRA with vllm, it prompts me to add merge lora true; when I add it, it errors out. If I drop vllm acceleration, inference works but is very slow.
Models trained with QLoRA do not support merge-lora. It is recommended to LoRA fine-tune first, then merge-lora, and then quantize.

## Deployment

### Q1: How do I deploy a trained model?
@@ -141,6 +201,15 @@ Base models can use client.chat.completions.create, but this is a compatibility
### Q5: After starting the server on two GPUs with swift deploy, exiting with Ctrl+C leaves a Python process that keeps occupying one GPU's memory. Is this normal?
It needs to be killed manually; this is a vllm issue.

### Q6: Where can I check whether a model supports lmdeploy or vllm acceleration?
See the documentation [Supported Models and Datasets](https://swift.readthedocs.io/zh-cn/latest/Instruction/%E6%94%AF%E6%8C%81%E7%9A%84%E6%A8%A1%E5%9E%8B%E5%92%8C%E6%95%B0%E6%8D%AE%E9%9B%86.html).

### Q7: Qwen2.5-Math-7B-Instruct occasionally keeps returning garbled output. What is the problem? Deployed with vllm, fp16.
Try bf16.

### Q8: After deploying a LoRA fine-tuned model and using swift's inference method, it reports requests.exceptions.HTTPError: Multimodal model only support `default-lora`
Set `model_type` to `default-lora` in this case.

## Evaluation

### Q1: Which evaluation sets does swift support?
@@ -177,3 +246,9 @@ Base models can use client.chat.completions.create, but this is a compatibility

### Q2: How do I use a custom evaluation set?
Custom evaluation sets, whether plain-text or multimodal, must keep the same data format (pattern) as one of the official evaluation sets; see the documentation [LLM Evaluation Documentation](https://swift.readthedocs.io/zh-cn/latest/Instruction/LLM%E8%AF%84%E6%B5%8B%E6%96%87%E6%A1%A3.html).

### Q3: In a Python 3.11 environment, mmengine reports an error during evaluation
Try a Python 3.10 environment. Alternatively, first install the full dependencies with `pip3 install evalscope[all]`, then apply the patch: `pip3 install https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/package/evalscope-0.5.3.post1-py3-none-any.whl`.

### Q4: After manually downloading an officially supported evaluation dataset, can swift eval be configured to evaluate from a local path?
First download the evaluation dataset [eval.zip](https://modelscope.cn/datasets/swift/evalscope_resource/files), unzip it, and place its contents under the `~/.cache/modelscope/media_resources/evalscope/data` folder; then run the swift eval command to use the local data.

docs/source_en/Instruction/Common-QA.md

Lines changed: 75 additions & 0 deletions
@@ -101,6 +101,54 @@ Use lazy_tokenize, see [Command Line Arguments](https://swift.readthedocs.io/en/
### Q22: During training, if two datasets are directly appended together in the training set, does the model have an internal shuffling process during training? Or does it take data in order for training?
Randomization occurs in the trainer.

### Q23: If the model uses two GPUs but data parallelism is not enabled, DeepSpeed will throw an error. How can this be addressed?
`deepspeed` and `device_map` are incompatible; you can only choose one of them.

### Q24: Why do we need to download the dataset again for offline retraining when it has already been downloaded during online training?
The data file contains URLs, so offline training is not supported.

### Q25: How can memory usage be reduced when training VLM (Vision-Language) models?
Configure `--freeze_vit true`.

### Q26: Why are there fewer models supported on the WEB-UI interface compared to those in the documentation?
Please upgrade ms-swift.

### Q27: For models without an adapted model_type, can we customize special_tokens and chat_template during SFT?
Yes. Refer to the PRs for integrating models and the custom model and dataset documentation.

### Q28: Is it possible to train Qwen2-VL using DPO (Direct Preference Optimization) in a Python script?
Yes. Import `rlhf_main` and `RLHFArguments` from `swift.llm`.

### Q29: When training an MLLM, is it possible to first conduct pre-training with pure text, and then fine-tune using a VQA dataset?
Yes, it's possible. You can also train them together.

### Q30: When performing DPO training on an SFT model based on Qwen2 using a V100 machine, why are all the results NaN?
V100 machines should use fp32 for training Qwen2.

### Q31: Does Swift support distillation?
No, it's not supported. Quantization is recommended instead, as it gives better results.

### Q32: Has anyone encountered this issue: cannot import name 'ftp_head' from 'datasets.utils.file_utils'?
`pip install datasets==2.*`

### Q33: Currently, a maximum of two checkpoints are saved by default after training. How can I modify it to save more?
Use `--save_total_limit`; see [Command Line Arguments](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html) for details.
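
As an illustrative sketch (not part of the original answer), the same option can also be passed from Python; the `sft_main`/`SftArguments` names assume swift 2.x, and the model/dataset values are placeholders:

```python
# Sketch: keep up to five checkpoints instead of the default two.
from swift.llm import sft_main, SftArguments  # assumed swift 2.x entry points

if __name__ == '__main__':
    sft_main(SftArguments(
        model_type='qwen2-7b-instruct',    # placeholder model
        dataset=['path/to/train.jsonl'],   # placeholder dataset
        save_total_limit=5,                # same effect as --save_total_limit 5
    ))
```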

### Q34: In Grounding tasks, does the general data format support multiple instances for one category?
Currently, multiple bboxes for one object are supported. Refer to the documentation [InternVL Best Practice](https://swift.readthedocs.io/en/latest/Multi-Modal/internvl-best-practice.html).

### Q35: Why does this error appear here: numpy.object not found?
Try `numpy==1.26.3`.

### Q36: Does the Swift framework support sequence parallelism now?
Yes, it does. It is implemented by introducing `xtuner`.

### Q37: When fine-tuning Qwen2-1.5B on a V100, I get 'loss': 0.0, 'acc': 0.0, 'grad_norm': nan. What's the problem?
Try using fp32.

### Q38: Can GPTQ quantized models be fully fine-tuned?
No, they can't. The int-type parameters in GPTQ models cannot participate in gradient computation; only additional structures like LoRA can be attached for updates.

## Inference

### Q1: Is there documentation for Swift inference?
@@ -125,6 +173,18 @@ Set `model.generation_config.max_new_tokens`.
### Q6: Qwen2-VL inference causes an out-of-memory error
Set the environment variables `SIZE_FACTOR=8 MAX_PIXELS=602112`; see the documentation [Qwen2-VL Best Practice](https://swift.readthedocs.io/en/latest/Multi-Modal/qwen2-vl-best-practice.html).

### Q7: With a V100 GPU, in a Python virtual environment, after completing environment preparation following https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md, testing the inference command CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type qwen2-vl-7b-instruct reports: RuntimeError: probability tensor contains either inf, nan or element < 0.
Try using an A10 or 3090 machine for inference.

### Q8: After running the following command, where are the prediction results? CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir output/glm4v-9b-chat/vx-xxx/checkpoint-xxx-merged --load_dataset_config true
The path will be printed in the logs.

### Q9: During inference, when calling inference, how can I get the output logits?
Refer to https://github.com/modelscope/ms-swift/blob/main/tests/custom/test_logprobs.py.

### Q10: In the latest version of Swift, when loading the qwen2-32b-instruct-awq quantized model and its LoRA with vllm, it prompts me to add "merge lora true". When I add it, I get an error. If I remove vllm acceleration, inference works normally, but the speed is very slow.
Models trained with QLoRA do not support merge-lora. It is recommended to perform LoRA fine-tuning first, then merge-lora, and finally quantize.

## Deployment

### Q1: How to deploy the trained model?
@@ -142,6 +202,15 @@ Base models can use client.chat.completions.create, but this is a compatibility
### Q5: After starting the server with swift deploy using two GPUs, when exiting with Ctrl+C, there's always a Python process that keeps occupying the memory of one GPU. Is this normal?
It needs to be killed manually; this is a vllm issue.

### Q6: Where can I check if the model supports lmdeploy or vllm acceleration?
Please check the documentation, [Supported models and datasets](https://swift.readthedocs.io/en/latest/Instruction/Supported-models-datasets.html).

### Q7: Qwen2.5-Math-7B-Instruct occasionally keeps returning garbled text. What's the problem? Using vllm deployment, fp16.
Try bf16.

### Q8: After LoRA fine-tuning and deployment, using Swift's inference method, it reports an error: requests.exceptions.HTTPError: Multimodal model only support `default-lora`
Set `model_type` to `default-lora` in this case.
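
For illustration (not part of the original answer), a request against the OpenAI-compatible endpoint exposed by `swift deploy` might look like the sketch below; the host, port, and prompt are placeholders:

```python
# Sketch: query the deployed LoRA adapter by passing `default-lora` as the model name.
from openai import OpenAI

client = OpenAI(api_key='EMPTY', base_url='http://127.0.0.1:8000/v1')  # placeholder address
resp = client.chat.completions.create(
    model='default-lora',  # select the LoRA adapter rather than the base model
    messages=[{'role': 'user', 'content': 'Hello!'}],  # placeholder prompt
)
print(resp.choices[0].message.content)
```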

## Evaluation

### Q1: What evaluation datasets does Swift support?
@@ -178,3 +247,9 @@ See documentation [LLM Evaluation Documentation](https://swift.readthedocs.io/en

### Q2: How to use custom evaluation datasets?
Custom evaluation datasets, whether plain-text or multimodal, must follow the data format (pattern) of an official evaluation dataset; see the documentation [LLM Evaluation Documentation](https://swift.readthedocs.io/en/latest/Instruction/LLM-eval.html).

### Q3: In a Python 3.11 environment, mmengine reports an error during evaluation
Try using a Python 3.10 environment. Or first install all dependencies: `pip3 install evalscope[all]`, then apply the patch: `pip3 install https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/package/evalscope-0.5.3.post1-py3-none-any.whl`.

### Q4: Can swift eval be configured to evaluate using local paths after manually downloading the officially supported evaluation datasets?
First download the evaluation dataset [eval.zip](https://modelscope.cn/datasets/swift/evalscope_resource/files), unzip it, and place its contents in the `~/.cache/modelscope/media_resources/evalscope/data` folder; then execute the swift eval command to use the local data.
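
A small sketch of the manual step above (not part of the original answer); it assumes the archive can simply be extracted into the evalscope data directory, so verify the resulting folder layout after unpacking:

```python
# Sketch: unpack the downloaded eval.zip into the evalscope data directory,
# then run `swift eval` as usual to pick up the local data.
import os
import zipfile

target = os.path.expanduser('~/.cache/modelscope/media_resources/evalscope/data')
os.makedirs(target, exist_ok=True)
with zipfile.ZipFile('eval.zip') as zf:  # path to the manually downloaded archive
    zf.extractall(target)
```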
