Reply: Version 3.0.4 has added this parameter.
At the moment, service startup only initializes the basic network services (router, gradio, and fastapi) and does not preload the VLM inference service. As a result, in cold-start scenarios the first request still has to wait for the VLM model to load and initialize, which leads to excessive latency on the first response.
It would be helpful to add an optional VLM preload flag, disabled by default, for users who only use the pipeline backend.
When this flag is enabled, starting the router / gradio / fastapi services would automatically trigger a warm-up of the VLM model. This can significantly reduce the latency of the first request after a cold start, saving an estimated close to one minute of model initialization time.
Since the feature would be disabled by default, it would not affect the current startup behavior, while still providing a much better experience for latency-sensitive use cases.
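The eager-vs-lazy switch described above could be wired roughly like the following minimal sketch. All names here (`start_services`, `preload_vlm`, `DummyVLM`, `get_vlm`) are illustrative assumptions, not the project's actual API; the real implementation would load the model weights where the placeholder `load()` is.

```python
class DummyVLM:
    """Stand-in for the real VLM backend; loading the actual model
    is what costs close to a minute on a cold start."""

    def __init__(self):
        self.loaded = False

    def load(self):
        # Placeholder for weight loading / device initialization.
        self.loaded = True


_vlm = DummyVLM()


def get_vlm():
    # Lazy path (current behavior): the first request pays the load cost.
    if not _vlm.loaded:
        _vlm.load()
    return _vlm


def start_services(preload_vlm: bool = False):
    # router / gradio / fastapi setup would happen here.
    if preload_vlm:
        # Eager path (proposed flag, default off): pay the load cost
        # at startup so the first request is served immediately.
        _vlm.load()
    return _vlm
```

With `preload_vlm=False` nothing changes for existing users; with `preload_vlm=True` the warm-up runs once at startup, before any request arrives.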