Merge 'rhds/main' into 'rhds/rhoai-2.23' #172
Merged
vaibhavjainwiz merged 5 commits into rhoai-2.23 on Jul 22, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
An issue was reported with a Mistral model on Blackwell and B200 hardware,
failing with the error below:
<details>
<summary>Error log from pod</summary>
```
INFO 07-15 15:17:43 [__init__.py:244] Automatically detected platform cuda.
INFO 07-15 15:17:45 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-15 15:17:45 [cli_args.py:325] non-default args: {'uvicorn_log_level': 'debug', 'model': 'RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', 'trust_remote_code': True, 'max_model_len': 10000, 'limit_mm_per_prompt': {'image': 5, 'video': 5}, 'enable_chunked_prefill': True}
INFO 07-15 15:17:50 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-15 15:17:50 [config.py:1472] Using max model len 10000
INFO 07-15 15:17:50 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-15 15:17:52 [core.py:526] Waiting for init message from front-end.
INFO 07-15 15:17:52 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', speculative_config=None, tokenizer='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-15 15:17:53 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 07-15 15:17:56 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-15 15:17:56 [gpu_model_runner.py:1770] Starting to load model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16...
INFO 07-15 15:17:56 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-15 15:17:56 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 07-15 15:17:57 [cuda.py:284] Using Flash Attention backend on V1 engine.
INFO 07-15 15:17:57 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 4.68it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:01, 1.87it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:01<00:00, 1.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.67it/s]
INFO 07-15 15:18:00 [default_loader.py:272] Loading weights took 2.45 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:1801] Model loading took 14.0460 GiB and 6.938856 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
CUDA error (/mnt/work-dir/xformers-0.0.30/xformers-0.0.30/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device
Traceback (most recent call last):
File "/opt/app-root/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
args.dispatch_function(args)
File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
uvloop.run(run_server(args))
File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
return cls(
^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 666, in __init__
super().__init__(
File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 403, in __init__
with launch_core_engines(vllm_config, executor_class,
File "/usr/lib64/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
wait_for_engine_startup(
File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```
</details>
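For context, the CUDA error in the log ("no kernel image is available for execution on the device") generally means the binary contains no code built for the GPU's compute capability. The failing file sits under a `hopper/` directory, which suggests (an assumption, not confirmed in this PR) that the kernel was built only for Hopper, while B200 reports compute capability 10.0. Below is a minimal sketch of that compatibility rule. The helper names `parse_arch` and `has_kernel_image` are hypothetical and not part of vLLM or xformers, and the rule is simplified (family-specific `f` targets are not modelled).

```python
# Hypothetical helpers, not part of vLLM or xformers: a simplified model of
# when a CUDA binary built for a list of arch tags can run on a given GPU.

def parse_arch(arch: str) -> tuple[int, int]:
    """Turn an arch tag such as 'sm_90', 'sm_90a' or 'compute_100' into (major, minor)."""
    suffix = arch.split("_", 1)[1]
    digits = "".join(ch for ch in suffix if ch.isdigit())
    # The last digit is the minor version, everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def has_kernel_image(device_cc: tuple[int, int], arch_list: list[str]) -> bool:
    """Return True if any entry in arch_list can execute on a device with capability device_cc."""
    for arch in arch_list:
        kind, _, suffix = arch.partition("_")
        built = parse_arch(arch)
        if suffix.endswith("a"):
            # Architecture-specific builds (e.g. sm_90a) run only on that exact capability.
            if built == device_cc:
                return True
        elif kind == "sm":
            # Native code runs on the same major version with an equal or newer minor.
            if built[0] == device_cc[0] and built[1] <= device_cc[1]:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any equal or newer capability.
            if built <= device_cc:
                return True
    return False


# Assumed example: a Hopper-only build checked against a B200 (capability 10.0).
print(has_kernel_image((10, 0), ["sm_80", "sm_90a"]))
```

On a live system the device capability can be read with `torch.cuda.get_device_capability()`. The arch list has to come from the build of the extension that actually fails (here the flash-attention code bundled with xformers), not necessarily from PyTorch itself.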
Image built with this PR: quay.io/vllm/automation-vllm:cuda-16300391547
Manual test on Blackwell was successful. For details see comments in:
https://issues.redhat.com/browse/INFERENG-1126
A100 ocp-test validation is green (see
https://github.com/neuralmagic/nm-cicd/actions/runs/16303950486).
Accept-sync:
CUDA: https://github.com/neuralmagic/nm-cicd/actions/runs/16304501784
ROCM: https://github.com/neuralmagic/nm-cicd/actions/runs/16304505641
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
…at completions] (vllm-project#19126)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>