[template] add_retry (#6138)

Jintao-Huang · Jintao-Huang · commit 4bacd3fbbfd0 · 2025-10-24T16:17:02.000+08:00
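
The `load_file` hunk in `swift/llm/template/vision_utils.py` below replaces a bare `requests.get` with a retrying session and reads the timeout from `SWIFT_TIMEOUT` (in seconds, default 20) instead of `TIMEOUT` (default 300). The following is a minimal standalone sketch of that pattern, not the code in the patch itself: the helper names `resolve_timeout`, `build_session`, and `fetch_bytes` are hypothetical and exist only for illustration.

```python
import os
from io import BytesIO
from typing import Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def resolve_timeout(default: str = '20') -> Optional[float]:
    """Read SWIFT_TIMEOUT in seconds; a non-positive value disables the timeout."""
    timeout = float(os.getenv('SWIFT_TIMEOUT', default))
    return timeout if timeout > 0 else None


def build_session() -> requests.Session:
    """Hypothetical helper: a Session whose GET requests are retried up to 3 times."""
    retries = Retry(total=3, backoff_factor=1, allowed_methods=['GET'])
    session = requests.Session()
    # Mount the retrying adapter for both schemes, as the patch does.
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))
    return session


def fetch_bytes(url: str) -> BytesIO:
    """Hypothetical helper mirroring the URL branch of load_file."""
    with build_session() as session:
        # timeout=None tells requests to wait indefinitely.
        response = session.get(url, timeout=resolve_timeout())
        # Raise on 4xx/5xx instead of handing an error page to the image decoder.
        response.raise_for_status()
        return BytesIO(response.content)
```

Because no `status_forcelist` is passed to `Retry`, an ordinary error status such as 404 is not retried; it surfaces through `raise_for_status()`. Setting `SWIFT_TIMEOUT=-1` keeps the behaviour the FAQ entries describe: the request is sent without a timeout.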
diff --git a/docs/source/Instruction/命令行参数.md b/docs/source/Instruction/命令行参数.md
@@ -810,4 +810,5 @@ qwen2_5_omni除了包含qwen2_5_vl和qwen2_audio的模型特定参数外，还
 - LOG_LEVEL: 日志的level，默认为'INFO'，你可以设置为'WARNING', 'ERROR'等。
 - SWIFT_DEBUG: 在`engine.infer(...)`时，若设置为'1'，PtEngine将会打印input_ids和generate_ids的内容方便进行调试与对齐。
 - VLLM_USE_V1: 用于切换vLLM使用V0/V1版本。
+- SWIFT_TIMEOUT: (ms-swift>=3.10) 若多模态数据集中存在图像URL，该参数用于控制获取图片的timeout，默认为20s。
 - ROOT_IMAGE_DIR: (ms-swift>=3.8) 图像（多模态）资源的根目录。通过设置该参数，可以在数据集中使用相对于 `ROOT_IMAGE_DIR` 的相对路径。默认情况下，是相对于运行目录的相对路径。
diff --git a/docs/source/Instruction/常见问题整理.md b/docs/source/Instruction/常见问题整理.md
@@ -759,8 +759,8 @@ RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=1 nohup swift deploy --ckpt
 ```
 需要客户端传参数，`request_config = RequestConfig(..., logprobs=True, top_logprobs=2)`。
 
-### Q12: wift3.0 部署推理，可以设置请求的超时时间么？如果图片url非法，会等在那里
-设置环境变量`TIMEOUT`,默认是300秒。或者`InferClient`中可以传参数。
+### Q12: swift3.0 部署推理，可以设置请求的超时时间么？如果图片url非法，会等在那里
+设置环境变量`SWIFT_TIMEOUT`。或者`InferClient`中可以传参数。
 
 ### Q13: swift部署的模型怎么没法流式生成啊？服务端的stream设为True了，客户端的stream也设为True了，但它就是没法流式生成
 客户端控制的，查看[examples/deploy/client](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/client)。
@@ -837,7 +837,7 @@ swift eval --model_type 'qwen2_5-1_5b-instruct' --eval_dataset no --custom_eval_
 这是依赖了nltk的包，然后nltk的tokenizer需要下载一个punkt_tab的zip文件，国内有些环境下载不太稳定或者直接失败。已尝试改了代码做兜底，规避这个问题；参考[issue](https://github.com/nltk/nltk/issues/3293)。
 
 ### Q6: eval微调后的模型，总是会在固定的百分比停掉，但是vllm服务看着一直是有在正常运行的。模型越大，断开的越早。
-`TIMEOUT`环境变量设置为-1。
+`SWIFT_TIMEOUT`环境变量设置为-1。
 
 ### Q7: evalscope 支持多模型对比吗？
 详见[文档](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/arena.html)。
diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
@@ -834,4 +834,5 @@ The meanings of the following parameters can be found in the example code [here]
 - LOG_LEVEL: The log level, default is 'INFO'. You can set it to 'WARNING', 'ERROR', etc.
 - SWIFT_DEBUG: When set to `'1'` during `engine.infer(...)`, PtEngine will print the contents of `input_ids` and `generate_ids` to facilitate debugging and alignment.
 - VLLM_USE_V1: Used to switch between V0 and V1 versions of vLLM.
+- SWIFT_TIMEOUT: (ms-swift>=3.10) If the multimodal dataset contains image URLs, this parameter controls the timeout for fetching images. The default is 20 seconds.
 - ROOT_IMAGE_DIR: (ms-swift>=3.8) The root directory for image (multimodal) resources. By setting this parameter, relative paths in the dataset can be interpreted relative to `ROOT_IMAGE_DIR`. By default, paths are relative to the current working directory.
diff --git a/docs/source_en/Instruction/Frequently-asked-questions.md b/docs/source_en/Instruction/Frequently-asked-questions.md
@@ -760,7 +760,7 @@ RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=1 nohup swift deploy --ckpt
 Parameters need to be passed from the client side, `request_config = RequestConfig(..., logprobs=True, top_logprobs=2)`.
 
 ### Q12: Can we set request timeout time for Swift3.0 deployment inference? What happens if the image URL is invalid?
-You can set the `TIMEOUT` environment variable, which defaults to 300 seconds. Alternatively, you can pass parameters in `InferClient`.
+You can set the `SWIFT_TIMEOUT` environment variable. Alternatively, you can pass parameters in `InferClient`.
 
 ### Q13: Why can't I get streaming generation with Swift deployed models? I've set stream to True on both server and client side, but it's still not streaming
 It's controlled by the client side. Please check [examples/deploy/client](https://github.com/modelscope/ms-swift/tree/main/examples/deploy/client).
@@ -840,7 +840,7 @@ swift eval --model_type 'qwen2_5-1_5b-instruct' --eval_dataset no --custom_eval_
 This relies on the nltk package, which needs to download a punkt_tab zip file. Some environments in China have unstable or failed downloads. The code has been modified to handle this issue; reference [issue](https://github.com/nltk/nltk/issues/3293).
 
 ### Q6: The model after eval fine-tuning keeps stopping at a fixed percentage, but the vllm service seems to be running normally. The larger the model, the sooner it disconnects.
-Set the `TIMEOUT` environment variable to -1.
+Set the `SWIFT_TIMEOUT` environment variable to -1.
 
 ### Q7: Does evalscope support multi-model comparison?
 See the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
diff --git a/swift/llm/template/vision_utils.py b/swift/llm/template/vision_utils.py
@@ -10,6 +10,8 @@
 import requests
 import torch
 from PIL import Image
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
 
 from swift.utils import get_env_args
 
@@ -105,12 +107,19 @@ def load_file(path: Union[str, bytes, _T]) -> Union[BytesIO, _T]:
     if isinstance(path, str):
         path = path.strip()
         if path.startswith('http'):
-            request_kwargs = {}
-            timeout = float(os.getenv('TIMEOUT', '300'))
-            if timeout > 0:
-                request_kwargs['timeout'] = timeout
-            content = requests.get(path, **request_kwargs).content
-            res = BytesIO(content)
+            retries = Retry(total=3, backoff_factor=1, allowed_methods=['GET'])
+            with requests.Session() as session:
+                session.mount('http://', HTTPAdapter(max_retries=retries))
+                session.mount('https://', HTTPAdapter(max_retries=retries))
+
+                timeout = float(os.getenv('SWIFT_TIMEOUT', '20'))
+                request_kwargs = {'timeout': timeout} if timeout > 0 else {}
+
+                response = session.get(path, **request_kwargs)
+                response.raise_for_status()
+                content = response.content
+                res = BytesIO(content)
+
         elif os.path.exists(path) or (not path.startswith('data:') and len(path) <= 200):
             ROOT_IMAGE_DIR = get_env_args('ROOT_IMAGE_DIR', str, None)
             if ROOT_IMAGE_DIR is not None and not os.path.exists(path):