Closed
Labels: bug (Something isn't working)
Description
Your current environment
The output of python collect_env.py
INFO 05-11 09:47:46 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
/opt/conda/lib/python3.11/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.35
Python version: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-25-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
Nvidia driver version: 550.90.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 6
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 5200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 3 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 80 MiB (64 instances)
L3 cache: 96 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flake8==7.2.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] onnxruntime==1.21.1
[pip3] optree==0.13.0
[pip3] pytorch-lightning==2.5.1.post0
[pip3] pytorch-wpe==0.0.1
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0
[pip3] torch-complex==0.4.4
[pip3] torchaudio==2.6.0
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==1.7.1
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] optree 0.13.0 pypi_0 pypi
[conda] pytorch-lightning 2.5.1.post0 pypi_0 pypi
[conda] pytorch-wpe 0.0.1 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.6.0 pypi_0 pypi
[conda] torch-complex 0.4.4 pypi_0 pypi
[conda] torchaudio 2.6.0 pypi_0 pypi
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchmetrics 1.7.1 pypi_0 pypi
[conda] torchvision 0.21.0 pypi_0 pypi
[conda] transformers 4.51.3 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity     NUMA Affinity  GPU NUMA ID
GPU0  X     NV8   NODE  NODE  SYS   SYS   NODE  PXB   SYS   SYS   PXB   1-5,21-23,65-69  0              N/A
GPU1  NV8   X     SYS   SYS   NODE  NODE  SYS   SYS   NODE  PXB   SYS   39-45,103-109    1              N/A
NIC0  NODE  SYS   X     PIX   SYS   SYS   PIX   NODE  SYS   SYS   NODE
NIC1  NODE  SYS   PIX   X     SYS   SYS   PIX   NODE  SYS   SYS   NODE
NIC2  SYS   NODE  SYS   SYS   X     PIX   SYS   SYS   PIX   NODE  SYS
NIC3  SYS   NODE  SYS   SYS   PIX   X     SYS   SYS   PIX   NODE  SYS
NIC4  NODE  SYS   PIX   PIX   SYS   SYS   X     NODE  SYS   SYS   NODE
NIC5  PXB   SYS   NODE  NODE  SYS   SYS   NODE  X     SYS   SYS   PIX
NIC6  SYS   NODE  SYS   SYS   PIX   PIX   SYS   SYS   X     NODE  SYS
NIC7  SYS   PXB   SYS   SYS   NODE  NODE  SYS   SYS   NODE  X     SYS
NIC8  PXB   SYS   NODE  NODE  SYS   SYS   NODE  PIX   SYS   SYS   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_10
NIC5: mlx5_11
NIC6: mlx5_12
NIC7: mlx5_13
NIC8: mlx5_20
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Run the following command on a node with two A800 (80 GB) GPUs:
vllm serve Kimi-VL-A3B-Instruct --trust-remote-code --max-model-len=8192 --limit-mm-per-prompt image=2
Result:
INFO 05-11 09:41:02 [__init__.py:239] Automatically detected platform cuda.
INFO 05-11 09:41:06 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-11 09:41:06 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='Kimi-VL-A3B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Kimi-VL-A3B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=8192, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, 
tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={'image': 2}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f140990bec0>)
INFO 05-11 09:41:13 [config.py:717] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 05-11 09:41:13 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 05-11 09:41:14 [tokenizer.py:251] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 05-11 09:41:17 [__init__.py:239] Automatically detected platform cuda.
INFO 05-11 09:41:20 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='Kimi-VL-A3B-Instruct', speculative_config=None, tokenizer='Kimi-VL-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Kimi-VL-A3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-11 09:41:20 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ff55f8d8210>
INFO 05-11 09:41:25 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-11 09:41:25 [cuda.py:184] Using Triton MLA backend on V1 engine.
WARNING 05-11 09:41:25 [tokenizer.py:251] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 05-11 09:41:27 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-11 09:41:27 [gpu_model_runner.py:1329] Starting to load model Kimi-VL-A3B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:01<00:10, 1.77s/it]
Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:03<00:09, 1.82s/it]
Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:05<00:07, 1.84s/it]
Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:06<00:04, 1.54s/it]
Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:08<00:03, 1.64s/it]
Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:10<00:01, 1.70s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12<00:00, 1.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12<00:00, 1.72s/it]
INFO 05-11 09:41:40 [loader.py:458] Loading weights took 12.13 seconds
INFO 05-11 09:41:40 [gpu_model_runner.py:1347] Model loading took 30.6140 GiB and 12.419668 seconds
INFO 05-11 09:41:40 [gpu_model_runner.py:1620] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 8 image items of the maximum feature size.
ERROR 05-11 09:41:40 [core.py:396] EngineCore failed to start.
ERROR 05-11 09:41:40 [core.py:396] Traceback (most recent call last):
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-11 09:41:40 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-11 09:41:40 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 05-11 09:41:40 [core.py:396] self._initialize_kv_caches(vllm_config)
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 129, in _initialize_kv_caches
ERROR 05-11 09:41:40 [core.py:396] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
ERROR 05-11 09:41:40 [core.py:396] output = self.collective_rpc("determine_available_memory")
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-11 09:41:40 [core.py:396] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-11 09:41:40 [core.py:396] return func(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-11 09:41:40 [core.py:396] return func(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 183, in determine_available_memory
ERROR 05-11 09:41:40 [core.py:396] self.model_runner.profile_run()
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1640, in profile_run
ERROR 05-11 09:41:40 [core.py:396] dummy_encoder_outputs = self.model.get_multimodal_embeddings(
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 383, in get_multimodal_embeddings
ERROR 05-11 09:41:40 [core.py:396] vision_embeddings = self._process_image_input(image_input)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 366, in _process_image_input
ERROR 05-11 09:41:40 [core.py:396] image_features = self._process_image_pixels(image_input)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-11 09:41:40 [core.py:396] return func(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 361, in _process_image_pixels
ERROR 05-11 09:41:40 [core.py:396] return self.vision_tower(pixel_values, image_grid_hws)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 05-11 09:41:40 [core.py:396] return self._call_impl(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 05-11 09:41:40 [core.py:396] return forward_call(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 624, in forward
ERROR 05-11 09:41:40 [core.py:396] hidden_states = self.encoder(hidden_states, grid_hw)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 05-11 09:41:40 [core.py:396] return self._call_impl(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 05-11 09:41:40 [core.py:396] return forward_call(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 516, in forward
ERROR 05-11 09:41:40 [core.py:396] hidden_states = block(hidden_states,
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 05-11 09:41:40 [core.py:396] return self._call_impl(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 05-11 09:41:40 [core.py:396] return forward_call(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 477, in forward
ERROR 05-11 09:41:40 [core.py:396] attn_out = self.attention_qkvpacked(hidden_states,
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 453, in attention_qkvpacked
ERROR 05-11 09:41:40 [core.py:396] attn_out = attn_func(xq,
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 139, in sdpa_attention
ERROR 05-11 09:41:40 [core.py:396] attn_output = F.scaled_dot_product_attention(q,
ERROR 05-11 09:41:40 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.38 GiB. GPU 0 has a total capacity of 79.32 GiB of which 43.16 GiB is free. Including non-PyTorch memory, this process has 36.16 GiB memory in use. Of the allocated memory 35.59 GiB is allocated by PyTorch, and 72.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Process EngineCore_0:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
self._initialize_kv_caches(vllm_config)
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 129, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 183, in determine_available_memory
self.model_runner.profile_run()
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1640, in profile_run
dummy_encoder_outputs = self.model.get_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 383, in get_multimodal_embeddings
vision_embeddings = self._process_image_input(image_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 366, in _process_image_input
image_features = self._process_image_pixels(image_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 361, in _process_image_pixels
return self.vision_tower(pixel_values, image_grid_hws)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 624, in forward
hidden_states = self.encoder(hidden_states, grid_hw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 516, in forward
hidden_states = block(hidden_states,
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 477, in forward
attn_out = self.attention_qkvpacked(hidden_states,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 453, in attention_qkvpacked
attn_out = attn_func(xq,
^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 139, in sdpa_attention
attn_output = F.scaled_dot_product_attention(q,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.38 GiB. GPU 0 has a total capacity of 79.32 GiB of which 43.16 GiB is free. Including non-PyTorch memory, this process has 36.16 GiB memory in use. Of the allocated memory 35.59 GiB is allocated by PyTorch, and 72.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W511 09:41:41.780741646 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/opt/conda/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
args.dispatch_function(args)
File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/opt/conda/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/conda/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/opt/conda/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
return cls(
^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
self.engine_core = core_client_class(
^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 642, in __init__
super().__init__(
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
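For a rough sense of why a single `F.scaled_dot_product_attention` call can request tens of GiB, here is a minimal back-of-the-envelope sketch. It assumes the attention scores are materialized as one dense `(heads, tokens, tokens)` tensor. The token count, head count, and element size are illustrative assumptions, not values reported in this issue; only the 43.16 GiB free figure comes from the OOM message above.

```python
GIB = 2**30


def dense_attention_scores_gib(num_tokens: int, num_heads: int,
                               bytes_per_element: int) -> float:
    """Size in GiB of a dense (num_heads, num_tokens, num_tokens) score tensor."""
    return num_heads * num_tokens * num_tokens * bytes_per_element / GIB


# Assumed, illustrative values (NOT taken from the issue):
num_tokens = 8 * 66 * 66   # 8 dummy images with a hypothetical 66 x 66 patch grid each
num_heads = 16             # hypothetical vision-encoder head count
bytes_per_element = 4      # scores held in float32

estimate = dense_attention_scores_gib(num_tokens, num_heads, bytes_per_element)
free_gib = 43.16           # "43.16 GiB is free" in the OOM message

print(f"estimated score tensor: {estimate:.2f} GiB")
print(f"fits in free memory: {estimate <= free_gib}")
```

Under these assumptions the estimate is about 72.38 GiB, the same order as the failed allocation, though the log does not confirm this breakdown. The point is only that the cost grows quadratically with the total number of image tokens packed into one profiling batch, so fewer or smaller dummy images shrink it quickly.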
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.