[Bug]: Kimi-VL-A3B-Instruct OOM with 80G*2 VRAM #17952

@gitlawr

Your current environment

The output of python collect_env.py
INFO 05-11 09:47:46 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
/opt/conda/lib/python3.11/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.35

Python version: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-25-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB

Nvidia driver version: 550.90.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   52 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          128
On-line CPU(s) list:             0-127
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
Stepping:                        6
CPU max MHz:                     3400.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5200.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization:                  VT-x
L1d cache:                       3 MiB (64 instances)
L1i cache:                       2 MiB (64 instances)
L2 cache:                        80 MiB (64 instances)
L3 cache:                        96 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-31,64-95
NUMA node1 CPU(s):               32-63,96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] flake8==7.2.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] onnxruntime==1.21.1
[pip3] optree==0.13.0
[pip3] pytorch-lightning==2.5.1.post0
[pip3] pytorch-wpe==0.0.1
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0
[pip3] torch-complex==0.4.4
[pip3] torchaudio==2.6.0
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==1.7.1
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] optree                    0.13.0                   pypi_0    pypi
[conda] pytorch-lightning         2.5.1.post0              pypi_0    pypi
[conda] pytorch-wpe               0.0.1                    pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torch-complex             0.4.4                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchmetrics              1.7.1                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.51.3                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
	GPU0	GPU1	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV8	NODE	NODE	SYS	SYS	NODE	PXB	SYS	SYS	PXB	1-5,21-23,65-69	0		N/A
GPU1	NV8	 X 	SYS	SYS	NODE	NODE	SYS	SYS	NODE	PXB	SYS	39-45,103-109	1		N/A
NIC0	NODE	SYS	 X 	PIX	SYS	SYS	PIX	NODE	SYS	SYS	NODE
NIC1	NODE	SYS	PIX	 X 	SYS	SYS	PIX	NODE	SYS	SYS	NODE
NIC2	SYS	NODE	SYS	SYS	 X 	PIX	SYS	SYS	PIX	NODE	SYS
NIC3	SYS	NODE	SYS	SYS	PIX	 X 	SYS	SYS	PIX	NODE	SYS
NIC4	NODE	SYS	PIX	PIX	SYS	SYS	 X 	NODE	SYS	SYS	NODE
NIC5	PXB	SYS	NODE	NODE	SYS	SYS	NODE	 X 	SYS	SYS	PIX
NIC6	SYS	NODE	SYS	SYS	PIX	PIX	SYS	SYS	 X 	NODE	SYS
NIC7	SYS	PXB	SYS	SYS	NODE	NODE	SYS	SYS	NODE	 X 	SYS
NIC8	PXB	SYS	NODE	NODE	SYS	SYS	NODE	PIX	SYS	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_10
  NIC5: mlx5_11
  NIC6: mlx5_12
  NIC7: mlx5_13
  NIC8: mlx5_20

NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Run the following command on a node with two A800 (80 GB) GPUs:

vllm serve Kimi-VL-A3B-Instruct --trust-remote-code --max-model-len=8192 --limit-mm-per-prompt image=2

Result:

INFO 05-11 09:41:02 [__init__.py:239] Automatically detected platform cuda.
INFO 05-11 09:41:06 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-11 09:41:06 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='Kimi-VL-A3B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Kimi-VL-A3B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=8192, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, 
tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={'image': 2}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f140990bec0>)
INFO 05-11 09:41:13 [config.py:717] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 05-11 09:41:13 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 05-11 09:41:14 [tokenizer.py:251] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 05-11 09:41:17 [__init__.py:239] Automatically detected platform cuda.
INFO 05-11 09:41:20 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='Kimi-VL-A3B-Instruct', speculative_config=None, tokenizer='Kimi-VL-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Kimi-VL-A3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-11 09:41:20 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ff55f8d8210>
INFO 05-11 09:41:25 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-11 09:41:25 [cuda.py:184] Using Triton MLA backend on V1 engine.
WARNING 05-11 09:41:25 [tokenizer.py:251] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 05-11 09:41:27 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 05-11 09:41:27 [gpu_model_runner.py:1329] Starting to load model Kimi-VL-A3B-Instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [00:01<00:10,  1.77s/it]
Loading safetensors checkpoint shards:  29% Completed | 2/7 [00:03<00:09,  1.82s/it]
Loading safetensors checkpoint shards:  43% Completed | 3/7 [00:05<00:07,  1.84s/it]
Loading safetensors checkpoint shards:  57% Completed | 4/7 [00:06<00:04,  1.54s/it]
Loading safetensors checkpoint shards:  71% Completed | 5/7 [00:08<00:03,  1.64s/it]
Loading safetensors checkpoint shards:  86% Completed | 6/7 [00:10<00:01,  1.70s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12<00:00,  1.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:12<00:00,  1.72s/it]

INFO 05-11 09:41:40 [loader.py:458] Loading weights took 12.13 seconds
INFO 05-11 09:41:40 [gpu_model_runner.py:1347] Model loading took 30.6140 GiB and 12.419668 seconds
INFO 05-11 09:41:40 [gpu_model_runner.py:1620] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 8 image items of the maximum feature size.
ERROR 05-11 09:41:40 [core.py:396] EngineCore failed to start.
ERROR 05-11 09:41:40 [core.py:396] Traceback (most recent call last):
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-11 09:41:40 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-11 09:41:40 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 05-11 09:41:40 [core.py:396]     self._initialize_kv_caches(vllm_config)
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 129, in _initialize_kv_caches
ERROR 05-11 09:41:40 [core.py:396]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 05-11 09:41:40 [core.py:396]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
ERROR 05-11 09:41:40 [core.py:396]     output = self.collective_rpc("determine_available_memory")
ERROR 05-11 09:41:40 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-11 09:41:40 [core.py:396]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-11 09:41:40 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
ERROR 05-11 09:41:40 [core.py:396]     return func(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-11 09:41:40 [core.py:396]     return func(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 183, in determine_available_memory
ERROR 05-11 09:41:40 [core.py:396]     self.model_runner.profile_run()
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1640, in profile_run
ERROR 05-11 09:41:40 [core.py:396]     dummy_encoder_outputs = self.model.get_multimodal_embeddings(
ERROR 05-11 09:41:40 [core.py:396]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 383, in get_multimodal_embeddings
ERROR 05-11 09:41:40 [core.py:396]     vision_embeddings = self._process_image_input(image_input)
ERROR 05-11 09:41:40 [core.py:396]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 366, in _process_image_input
ERROR 05-11 09:41:40 [core.py:396]     image_features = self._process_image_pixels(image_input)
ERROR 05-11 09:41:40 [core.py:396]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-11 09:41:40 [core.py:396]     return func(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 361, in _process_image_pixels
ERROR 05-11 09:41:40 [core.py:396]     return self.vision_tower(pixel_values, image_grid_hws)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 05-11 09:41:40 [core.py:396]     return self._call_impl(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 05-11 09:41:40 [core.py:396]     return forward_call(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 624, in forward
ERROR 05-11 09:41:40 [core.py:396]     hidden_states = self.encoder(hidden_states, grid_hw)
ERROR 05-11 09:41:40 [core.py:396]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 05-11 09:41:40 [core.py:396]     return self._call_impl(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 05-11 09:41:40 [core.py:396]     return forward_call(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 516, in forward
ERROR 05-11 09:41:40 [core.py:396]     hidden_states = block(hidden_states,
ERROR 05-11 09:41:40 [core.py:396]                     ^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 05-11 09:41:40 [core.py:396]     return self._call_impl(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 05-11 09:41:40 [core.py:396]     return forward_call(*args, **kwargs)
ERROR 05-11 09:41:40 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 477, in forward
ERROR 05-11 09:41:40 [core.py:396]     attn_out = self.attention_qkvpacked(hidden_states,
ERROR 05-11 09:41:40 [core.py:396]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 453, in attention_qkvpacked
ERROR 05-11 09:41:40 [core.py:396]     attn_out = attn_func(xq,
ERROR 05-11 09:41:40 [core.py:396]                ^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396]   File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 139, in sdpa_attention
ERROR 05-11 09:41:40 [core.py:396]     attn_output = F.scaled_dot_product_attention(q,
ERROR 05-11 09:41:40 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-11 09:41:40 [core.py:396] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.38 GiB. GPU 0 has a total capacity of 79.32 GiB of which 43.16 GiB is free. Including non-PyTorch memory, this process has 36.16 GiB memory in use. Of the allocated memory 35.59 GiB is allocated by PyTorch, and 72.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Process EngineCore_0:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 71, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 129, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/executor/abstract.py", line 75, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/utils.py", line 2456, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 183, in determine_available_memory
    self.model_runner.profile_run()
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1640, in profile_run
    dummy_encoder_outputs = self.model.get_multimodal_embeddings(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 383, in get_multimodal_embeddings
    vision_embeddings = self._process_image_input(image_input)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 366, in _process_image_input
    image_features = self._process_image_pixels(image_input)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/kimi_vl.py", line 361, in _process_image_pixels
    return self.vision_tower(pixel_values, image_grid_hws)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 624, in forward
    hidden_states = self.encoder(hidden_states, grid_hw)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 516, in forward
    hidden_states = block(hidden_states,
                    ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 477, in forward
    attn_out = self.attention_qkvpacked(hidden_states,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 453, in attention_qkvpacked
    attn_out = attn_func(xq,
               ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/model_executor/models/moonvit.py", line 139, in sdpa_attention
    attn_output = F.scaled_dot_product_attention(q,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 72.38 GiB. GPU 0 has a total capacity of 79.32 GiB of which 43.16 GiB is free. Including non-PyTorch memory, this process has 36.16 GiB memory in use. Of the allocated memory 35.59 GiB is allocated by PyTorch, and 72.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]:[W511 09:41:41.780741646 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/opt/conda/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/opt/conda/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/opt/conda/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/opt/conda/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
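
As a rough reading of the numbers in the OOM message above, the following sketch parses an abridged copy of that message and computes how far the single requested allocation exceeds the free memory on GPU 0. The helper names (`parse_oom`, `shortfall_gib`) are hypothetical and only for illustration; they are not part of vLLM or PyTorch.

```python
import re

# Abridged copy of the OOM message from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 72.38 GiB. "
    "GPU 0 has a total capacity of 79.32 GiB of which 43.16 GiB is free. "
    "Including non-PyTorch memory, this process has 36.16 GiB memory in use."
)

# Hypothetical helper: pull the GiB figures out of the message text.
def parse_oom(message: str) -> dict[str, float]:
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"total capacity of ([\d.]+) GiB",
        "free": r"of which ([\d.]+) GiB is free",
        "in_use": r"this process has ([\d.]+) GiB memory in use",
    }
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in patterns.items()
    }

# Hypothetical helper: how much the request exceeds the free memory.
def shortfall_gib(stats: dict[str, float]) -> float:
    return round(stats["requested"] - stats["free"], 2)

stats = parse_oom(OOM_MESSAGE)
print(f"requested: {stats['requested']} GiB, free: {stats['free']} GiB")
print(f"shortfall: {shortfall_gib(stats)} GiB")
print(f"in use + free: {round(stats['in_use'] + stats['free'], 2)} GiB")
```

On these figures, the one allocation requested inside `F.scaled_dot_product_attention` during the profile run exceeds the free memory on GPU 0 by about 29 GiB, even though the model weights account for only 30.6140 GiB. The engine config in the log reports `tensor_parallel_size=1`, so this accounting covers GPU 0 only.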

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
