Local MinerU project on Windows reports an error when calling the latest vLLM, deployed standalone on a server, over HTTP #4721
Replies: 1 comment
This 502 Bad Gateway error is most likely caused by a **model name mismatch**: your vLLM instance serves the model under the name `mineru3` (see `served_model_name: ['mineru3']` in the log below), while the MinerU client requests a different name. Fix: restart vLLM with a served name that matches the client, e.g.

```
vllm serve /models/MinerU25 --served-model-name MinerU2.5-2509-1.2B --port 8080
```

and invoke MinerU with the same name:

```
mineru -p "D:\....pdf" -o C:\Users\...\mineru-ocr -b vlm-http-client -u http://***:8080 --model_name MinerU2.5-2509-1.2B
```

The `--model_name` passed to MinerU must match the server's `--served-model-name`. Other troubleshooting suggestions:
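As a quick check, you can query the server's `/v1/models` route (one of the routes listed in the vLLM log) and compare the returned `id` fields with the `--model_name` you pass to the client. A minimal sketch, not part of MinerU; the helper names and the sample response below are illustrative:

```python
import json
from urllib.request import urlopen


def served_model_ids(base_url: str) -> list[str]:
    """Fetch the model names a vLLM server exposes via its OpenAI-compatible /v1/models route."""
    with urlopen(f"{base_url}/v1/models") as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


def name_matches(requested: str, served: list[str]) -> bool:
    """The client's --model_name must exactly match one of the served ids."""
    return requested in served


# Offline illustration with a response shaped like the OpenAI-compatible API;
# for the server in this thread, /v1/models would list 'mineru3'.
sample = {"object": "list", "data": [{"id": "mineru3", "object": "model"}]}
ids = [m["id"] for m in sample["data"]]
print(name_matches("MinerU2.5-2509-1.2B", ids))  # False: the names do not match
print(name_matches("mineru3", ids))  # True
```

Equivalently, `curl http://<server>:8080/v1/models` shows the same list. Either side can be changed, the server's `--served-model-name` or the client's `--model_name`, as long as the two agree.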
2026-04-01 09:57:19.704 | INFO | mineru.cli.client:run_orchestrated_cli:867 - Started local mineru-api at http://127.0.0.1:52694
2026-04-01 09:57:22.748 | INFO | main:create_app:212 - Request concurrency limited to 3
Start MinerU FastAPI Service: http://127.0.0.1:52694
API documentation: http://127.0.0.1:52694/docs
INFO: Started server process [23216]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:52694 (Press CTRL+C to quit)
Error: Timed out waiting for local mineru-api to become healthy. Failed to query MinerU API health from http://127.0.0.1:52694: 502 Bad Gateway
(base) drdp@drdpservice:~/model$ docker logs -f ff7b0185db5a
WARNING 04-01 01:48:01 [argparse_utils.py:193] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297]
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.2rc1.dev153+g39474513f
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] █▄█▀ █ █ █ █ model /models/MinerU25
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297]
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:233] non-default args: {'model_tag': '/models/MinerU25', 'port': 8080, 'model': '/models/MinerU25', 'allowed_local_media_path': '/data', 'max_model_len': 16384, 'served_model_name': ['mineru3'], 'gpu_memory_utilization': 0.85}
(APIServer pid=1) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section'}
(APIServer pid=1) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section'}
(APIServer pid=1) INFO 04-01 01:48:09 [model.py:533] Resolved architecture: Qwen2VLForConditionalGeneration
(APIServer pid=1) INFO 04-01 01:48:09 [model.py:1582] Using max model len 16384
(APIServer pid=1) INFO 04-01 01:48:09 [vllm.py:750] Asynchronous scheduling is enabled.
(APIServer pid=1) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
(EngineCore pid=80) INFO 04-01 01:48:28 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev153+g39474513f) with config: model='/models/MinerU25', speculative_config=None, tokenizer='/models/MinerU25', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=mineru3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 
'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=80) INFO 04-01 01:48:30 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.21.0.2:49989 backend=nccl
(EngineCore pid=80) INFO 04-01 01:48:30 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=80) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
(EngineCore pid=80) INFO 04-01 01:48:39 [gpu_model_runner.py:4516] Starting to load model /models/MinerU25...
(EngineCore pid=80) INFO 04-01 01:48:39 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=80) INFO 04-01 01:48:39 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=80) INFO 04-01 01:48:39 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=80) INFO 04-01 01:48:40 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=80) INFO 04-01 01:48:40 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=80) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=80) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.93it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.93it/s]
(EngineCore pid=80)
(EngineCore pid=80) INFO 04-01 01:48:41 [default_loader.py:384] Loading weights took 0.64 seconds
(EngineCore pid=80) INFO 04-01 01:48:41 [gpu_model_runner.py:4601] Model loading took 2.16 GiB memory and 1.396784 seconds
(EngineCore pid=80) INFO 04-01 01:48:42 [gpu_model_runner.py:5526] Encoder cache will be initialized with a budget of 14336 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=80) INFO 04-01 01:48:51 [backends.py:1046] Using cache directory: /root/.cache/vllm/torch_compile_cache/19584346ff/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=80) INFO 04-01 01:48:51 [backends.py:1106] Dynamo bytecode transform time: 4.99 s
(EngineCore pid=80) INFO 04-01 01:48:56 [backends.py:371] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=80) INFO 04-01 01:49:00 [backends.py:389] Compiling a graph for compile range (1, 2048) takes 8.53 s
(EngineCore pid=80) INFO 04-01 01:49:02 [decorators.py:638] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/73a970f4275bc32f615c2e93edc7f543228b79e5429bd1784b34c755a8a78d99/rank_0_0/model
(EngineCore pid=80) INFO 04-01 01:49:02 [monitor.py:48] torch.compile took 16.21 s in total
(EngineCore pid=80) INFO 04-01 01:49:02 [monitor.py:76] Initial profiling/warmup run took 0.09 s
(EngineCore pid=80) INFO 04-01 01:49:11 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=80) INFO 04-01 01:49:11 [gpu_model_runner.py:5646] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=80) INFO 04-01 01:49:13 [gpu_model_runner.py:5725] Estimated CUDA graph memory: 0.28 GiB total
(EngineCore pid=80) INFO 04-01 01:49:14 [gpu_worker.py:456] Available KV cache memory: 15.78 GiB
(EngineCore pid=80) INFO 04-01 01:49:14 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8500 to 0.8617 to maintain the same effective KV cache size.
(EngineCore pid=80) INFO 04-01 01:49:14 [kv_cache_utils.py:1319] GPU KV cache size: 1,379,136 tokens
(EngineCore pid=80) INFO 04-01 01:49:14 [kv_cache_utils.py:1324] Maximum concurrency for 16,384 tokens per request: 84.18x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:01<00:00, 29.20it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 30.70it/s]
(EngineCore pid=80) INFO 04-01 01:49:18 [gpu_model_runner.py:5785] Graph capturing finished in 4 secs, took 0.38 GiB
(EngineCore pid=80) INFO 04-01 01:49:18 [gpu_worker.py:617] CUDA graph pool memory: 0.38 GiB (actual), 0.28 GiB (estimated), difference: 0.1 GiB (27.0%).
(EngineCore pid=80) INFO 04-01 01:49:18 [core.py:281] init engine (profile, create kv cache, warmup model) took 36.40 seconds
(APIServer pid=1) INFO 04-01 01:49:19 [api_server.py:586] Supported tasks: ['generate']
(APIServer pid=1) WARNING 04-01 01:49:19 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: {'repetition_penalty': 1.0, 'temperature': 0.01, 'top_k': 1, 'top_p': 0.001}. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-01 01:49:21 [hf.py:320] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 04-01 01:49:27 [base.py:216] Multi-modal warmup completed in 6.696s
(APIServer pid=1) INFO 04-01 01:49:27 [api_server.py:590] Starting vLLM server on http://0.0.0.0:8080
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: [local Windows IP]:53539 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: [local Windows IP]:63463 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: [local Windows IP]:63463 - "GET /favicon.ico HTTP/1.1" 404 Not Found
(APIServer pid=1) INFO: [local Windows IP]:60923 - "GET /health HTTP/1.1" 200 OK