Local MinerU project on Windows reports an error when calling the latest vLLM, deployed standalone on a server, over HTTP #4721
Replies: 1 comment
This 502 Bad Gateway error is most likely caused by a **model name mismatch**: your vLLM instance serves the model under the name `mineru3` (see `served_model_name: ['mineru3']` in the log below), while the MinerU client requests a different name. Fix: restart vLLM with a served name that matches the client, e.g.

```
vllm serve /models/MinerU25 --served-model-name MinerU2.5-2509-1.2B --port 8080
```

and invoke MinerU with the same name:

```
mineru -p "D:\....pdf" -o C:\Users\...\mineru-ocr -b vlm-http-client -u http://***:8080 --model_name MinerU2.5-2509-1.2B
```

The `--model_name` passed to MinerU must match the server's `--served-model-name`. Other troubleshooting suggestions:
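As a quick check, you can query the server's `/v1/models` route (one of the routes listed in the vLLM log) and compare the returned `id` fields with the `--model_name` you pass to the client. A minimal sketch, not part of MinerU; the helper names and the sample response below are illustrative:

```python
import json
from urllib.request import urlopen


def served_model_ids(base_url: str) -> list[str]:
    """Fetch the model names a vLLM server exposes via its OpenAI-compatible /v1/models route."""
    with urlopen(f"{base_url}/v1/models") as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


def name_matches(requested: str, served: list[str]) -> bool:
    """The client's --model_name must exactly match one of the served ids."""
    return requested in served


# Offline illustration with a response shaped like the OpenAI-compatible API;
# for the server in this thread, /v1/models would list 'mineru3'.
sample = {"object": "list", "data": [{"id": "mineru3", "object": "model"}]}
ids = [m["id"] for m in sample["data"]]
print(name_matches("MinerU2.5-2509-1.2B", ids))  # False: the names do not match
print(name_matches("mineru3", ids))  # True
```

Equivalently, `curl http://<server>:8080/v1/models` shows the same list. Either side can be changed, the server's `--served-model-name` or the client's `--model_name`, as long as the two agree.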
2026-04-01 09:57:19.704 | INFO | mineru.cli.client:run_orchestrated_cli:867 - Started local mineru-api at http://127.0.0.1:52694
2026-04-01 09:57:22.748 | INFO | main:create_app:212 - Request concurrency limited to 3
Start MinerU FastAPI Service: http://127.0.0.1:52694
API documentation: http://127.0.0.1:52694/docs
INFO: Started server process [23216]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:52694 (Press CTRL+C to quit)
Error: Timed out waiting for local mineru-api to become healthy. Failed to query MinerU API health from http://127.0.0.1:52694: 502 Bad Gateway
(base) drdp@drdpservice:~/model$ docker logs -f ff7b0185db5a
WARNING 04-01 01:48:01 [argparse_utils.py:193] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297]
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.2rc1.dev153+g39474513f
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] █▄█▀ █ █ █ █ model /models/MinerU25
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:297]
(APIServer pid=1) INFO 04-01 01:48:01 [utils.py:233] non-default args: {'model_tag': '/models/MinerU25', 'port': 8080, 'model': '/models/MinerU25', 'allowed_local_media_path': '/data', 'max_model_len': 16384, 'served_model_name': ['mineru3'], 'gpu_memory_utilization': 0.85}
(APIServer pid=1) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section'}
(APIServer pid=1) Unrecognized keys in `rope_parameters` for 'rope_type'='default': {'mrope_section'}
(APIServer pid=1) INFO 04-01 01:48:09 [model.py:533] Resolved architecture: Qwen2VLForConditionalGeneration
(APIServer pid=1) INFO 04-01 01:48:09 [model.py:1582] Using max model len 16384
(APIServer pid=1) INFO 04-01 01:48:09 [vllm.py:750] Asynchronous scheduling is enabled.
(APIServer pid=1) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
(EngineCore pid=80) INFO 04-01 01:48:28 [core.py:103] Initializing a V1 LLM engine (v0.17.2rc1.dev153+g39474513f) with config: model='/models/MinerU25', speculative_config=None, tokenizer='/models/MinerU25', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=mineru3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 
'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=80) INFO 04-01 01:48:30 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.21.0.2:49989 backend=nccl
(EngineCore pid=80) INFO 04-01 01:48:30 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=80) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`.
(EngineCore pid=80) INFO 04-01 01:48:39 [gpu_model_runner.py:4516] Starting to load model /models/MinerU25...
(EngineCore pid=80) INFO 04-01 01:48:39 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=80) INFO 04-01 01:48:39 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=80) INFO 04-01 01:48:39 [vllm.py:750] Asynchronous scheduling is enabled.
(EngineCore pid=80) INFO 04-01 01:48:40 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=80) INFO 04-01 01:48:40 [flash_attn.py:598] Using FlashAttention version 2
(EngineCore pid=80) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=80) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.93it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.93it/s]
(EngineCore pid=80)
(EngineCore pid=80) INFO 04-01 01:48:41 [default_loader.py:384] Loading weights took 0.64 seconds
(EngineCore pid=80) INFO 04-01 01:48:41 [gpu_model_runner.py:4601] Model loading took 2.16 GiB memory and 1.396784 seconds
(EngineCore pid=80) INFO 04-01 01:48:42 [gpu_model_runner.py:5526] Encoder cache will be initialized with a budget of 14336 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=80) INFO 04-01 01:48:51 [backends.py:1046] Using cache directory: /root/.cache/vllm/torch_compile_cache/19584346ff/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=80) INFO 04-01 01:48:51 [backends.py:1106] Dynamo bytecode transform time: 4.99 s
(EngineCore pid=80) INFO 04-01 01:48:56 [backends.py:371] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=80) INFO 04-01 01:49:00 [backends.py:389] Compiling a graph for compile range (1, 2048) takes 8.53 s
(EngineCore pid=80) INFO 04-01 01:49:02 [decorators.py:638] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/73a970f4275bc32f615c2e93edc7f543228b79e5429bd1784b34c755a8a78d99/rank_0_0/model
(EngineCore pid=80) INFO 04-01 01:49:02 [monitor.py:48] torch.compile took 16.21 s in total
(EngineCore pid=80) INFO 04-01 01:49:02 [monitor.py:76] Initial profiling/warmup run took 0.09 s
(EngineCore pid=80) INFO 04-01 01:49:11 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=80) INFO 04-01 01:49:11 [gpu_model_runner.py:5646] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=80) INFO 04-01 01:49:13 [gpu_model_runner.py:5725] Estimated CUDA graph memory: 0.28 GiB total
(EngineCore pid=80) INFO 04-01 01:49:14 [gpu_worker.py:456] Available KV cache memory: 15.78 GiB
(EngineCore pid=80) INFO 04-01 01:49:14 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8500 to 0.8617 to maintain the same effective KV cache size.
(EngineCore pid=80) INFO 04-01 01:49:14 [kv_cache_utils.py:1319] GPU KV cache size: 1,379,136 tokens
(EngineCore pid=80) INFO 04-01 01:49:14 [kv_cache_utils.py:1324] Maximum concurrency for 16,384 tokens per request: 84.18x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:01<00:00, 29.20it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 30.70it/s]
(EngineCore pid=80) INFO 04-01 01:49:18 [gpu_model_runner.py:5785] Graph capturing finished in 4 secs, took 0.38 GiB
(EngineCore pid=80) INFO 04-01 01:49:18 [gpu_worker.py:617] CUDA graph pool memory: 0.38 GiB (actual), 0.28 GiB (estimated), difference: 0.1 GiB (27.0%).
(EngineCore pid=80) INFO 04-01 01:49:18 [core.py:281] init engine (profile, create kv cache, warmup model) took 36.40 seconds
(APIServer pid=1) INFO 04-01 01:49:19 [api_server.py:586] Supported tasks: ['generate']
(APIServer pid=1) WARNING 04-01 01:49:19 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: {'repetition_penalty': 1.0, 'temperature': 0.01, 'top_k': 1, 'top_p': 0.001}. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-01 01:49:21 [hf.py:320] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 04-01 01:49:27 [base.py:216] Multi-modal warmup completed in 6.696s
(APIServer pid=1) INFO 04-01 01:49:27 [api_server.py:590] Starting vLLM server on http://0.0.0.0:8080
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-01 01:49:27 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: [local Windows IP]:53539 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: [local Windows IP]:63463 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: [local Windows IP]:63463 - "GET /favicon.ico HTTP/1.1" 404 Not Found
(APIServer pid=1) INFO: [local Windows IP]:60923 - "GET /health HTTP/1.1" 200 OK