
Commit a699a11

Authored by: Alexei-V-Ivanov-AMD, hmellor, markmc, njhill, mgoin
Merging in the latest merge from vllm-project to ROCm (#472)
* Fix `head_dim` not existing in all model configs (Transformers backend) (vllm-project#14141) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V0][Metrics] Remove unimplemented `vllm:tokens_total` (vllm-project#14134) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [V0][Metrics] Deprecate some KV/prefix cache metrics (vllm-project#14136) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [V1] Simplify stats logging (vllm-project#14082) Signed-off-by: Nick Hill <nhill@redhat.com>
* [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (vllm-project#14055) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (vllm-project#14100) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Kernel] Optimize moe intermediate_cache usage (vllm-project#13625) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Docs] Add GPTQModel (vllm-project#14056) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>
* [v1] Add comments to the new ragged paged attention Pallas kernel (vllm-project#14155) Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>
* [Model] Add support for GraniteMoeShared models (vllm-project#13313) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
* [core] moe fp8 block quant tuning support (vllm-project#14068) Signed-off-by: Divakar Verma <divakar.verma@amd.com>
* [Misc] Remove lru_cache in NvmlCudaPlatform (vllm-project#14156) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
* [core] Pass all driver env vars to ray workers unless excluded (vllm-project#14099) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
* Use math.prod instead of np.prod for trivial ops (vllm-project#14142)
* Fix benchmark_moe.py tuning for CUDA devices (vllm-project#14164)
* [platform] add debug logging during inferring the device type (vllm-project#14195) Signed-off-by: youkaichao <youkaichao@gmail.com>
* [sleep mode] error out with expandable_segments (vllm-project#14189) Signed-off-by: youkaichao <youkaichao@gmail.com>
* [doc] add "Failed to infer device type" to faq (vllm-project#14200) Signed-off-by: youkaichao <youkaichao@gmail.com>
* [Bugfix] Restrict MacOS CPU detection (vllm-project#14210) Signed-off-by: mgoin <mgoin64@gmail.com>
* [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (vllm-project#13869) Signed-off-by: Nick Hill <nhill@redhat.com>
* [V0][Metrics] Deprecate some questionable request time metrics (vllm-project#14135) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (vllm-project#14161)
* add cutlass support for blackwell fp8 gemm (vllm-project#13798)
* [TPU][Profiler] Support start_profile/stop_profile in TPU worker (vllm-project#13988) Signed-off-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: mgoin <mgoin64@gmail.com>
* Fix performance when `--generation-config` is not `None` (vllm-project#14223) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Frontend] Do `prompt_logprobs` clamping for chat as well as completions (vllm-project#14225) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Docs] Update Dockerfile dependency image (vllm-project#14215) Signed-off-by: mgoin <mgoin64@gmail.com>
* [v1][Metrics] Add design doc (vllm-project#12745) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
* [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (vllm-project#14228) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
* Clean up unused padding_idx variables across many model definitions (vllm-project#13240) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [ROCm] Disable a few more kernel tests that are broken on ROCm (vllm-project#14145) Signed-off-by: Sage Moore <sage@neuralmagic.com>
* [V1][TPU] TPU multimodal model support for ragged attention (vllm-project#14158) Signed-off-by: Michael Goin <mgoin64@gmail.com>
* [misc] announce china meetup (vllm-project#14248) Signed-off-by: youkaichao <youkaichao@gmail.com>
* Moved numba from common requirements to cuda/rocm specific requirements (vllm-project#14199) Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
* Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (vllm-project#14157) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Bugfix] Fix gptq_marlin for deepseek-v3 (vllm-project#13750) Signed-off-by: dangshunya <dangshunya@baichuan-inc.com> Co-authored-by: dangshunya <dangshunya@baichuan-inc.com>
* [V1][Bugfix] Do not reset prefix caching metrics (vllm-project#14235)
* [Model] New model support for Phi-4-multimodal-instruct (vllm-project#14119)
* [V1] EP/TP MoE + DP Attention (vllm-project#13931)
* [platforms] improve rocm debugging info (vllm-project#14257)
* Temporarily disable test_awq_gemm_opcheck (vllm-project#14251) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (vllm-project#14066) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
* [Misc][V1] Avoid using `envs.VLLM_USE_V1` in mm processing (vllm-project#14256) Signed-off-by: Roger Wang <ywang@roblox.com>
* [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (vllm-project#14169) Signed-off-by: Lu Fang <lufang@fb.com>
* [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (vllm-project#14217) Signed-off-by: Iacopo Poli <iacopo@lighton.ai>
* [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (vllm-project#14278) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* Small update for external_launcher backend docs (vllm-project#14288)
* [V1][Frontend] Add Testing For V1 Runtime Parameters (vllm-project#14159) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
* [LoRA] Remove linear hack outside transformers backend (vllm-project#14177) Signed-off-by: Isotr0py <2037008807@qq.com>
* [Misc] Add Qwen2MoeForCausalLM moe tuning support (vllm-project#14276) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* prefix_caching.md: Fixed typo (vllm-project#14293) Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai>
* [Bugfix] Fix broken vision language example (vllm-project#14292) Signed-off-by: Isotr0py <2037008807@qq.com>
* [Docs] Add Meta Slides (vllm-project#14297) Signed-off-by: simon-mo <simon.mo@hey.com>
* [V1][Minor] Remove obsolete FIXME comment (vllm-project#14304) Signed-off-by: Nick Hill <nhill@redhat.com>
* Deprecate `best_of` Sampling Parameter in anticipation for vLLM V1 (vllm-project#13997) Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com> Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [V1][BugFix] Fix for mixed top_k batch (vllm-project#14301) Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com>
* [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (vllm-project#14267)
* [V1][Easy] Add empty allowed_token_ids in the v1 sampler test (vllm-project#14308) Signed-off-by: Lu Fang <lufang@fb.com>
* init Signed-off-by: Sage Moore <sage@neuralmagic.com>
* [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (vllm-project#14237) Signed-off-by: pyc96 <pychen96@gmail.com>
* [Bugfix] Remove num_tokens_across_dp (vllm-project#14302) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [BugFix] Fix prefix caching V0 MLA (vllm-project#14255) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com>
* [CI/Build] Use spawn multiprocessing mode for V1 test pipeline (vllm-project#14243) Signed-off-by: Russell Bryant <rbryant@redhat.com>
* Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (vllm-project#13917) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (vllm-project#13850) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
* [BugFix] MLA + V1, illegal memory access and accuracy issues (vllm-project#14253) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* [misc] Mention `ray list nodes` command to troubleshoot ray issues (vllm-project#14318) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
* [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (vllm-project#14114)
* [V1] LoRA - Enable more V1 tests (vllm-project#14315) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
* [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (vllm-project#11301)
* [Hardware] Update the flash attn tag to support Blackwell (vllm-project#14244)
* [Model] Update Paligemma multimodal processing with PromptUpdate (vllm-project#14015) Signed-off-by: Kyle Huang <kylhuang@nvidia.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
* [V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (vllm-project#14275) Signed-off-by: Linkun Chen <github@lkchen.net>
* [Core] Optimizing cross-attention `QKVParallelLinear` computation (vllm-project#12325) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal> Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
* [Frontend][Docs] Transcription API streaming (vllm-project#13301) Signed-off-by: NickLucche <nlucches@redhat.com>
* [Doc] Update reasoning with stream example to use OpenAI library (vllm-project#14077) Signed-off-by: liuyanyi <wolfsonliu@163.com>
* [Doc] Correct beam_search usage in generative_models.md (vllm-project#14363)
* [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (vllm-project#14152)
* [Bugfix][Core] fix abort_seq_group and memory leak when n>1 (vllm-project#14326) Signed-off-by: courage17340 <courage17340@163.com>
* [Core] Don't use cache during multi-modal profiling (vllm-project#14336)
* [Doc] Fix date typo in README.md (vllm-project#14366) Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl>
* [RLHF] use worker_extension_cls for compatibility with V0 and V1 (vllm-project#14185) Signed-off-by: youkaichao <youkaichao@gmail.com>
* Reinstate `best_of` for V0 (vllm-project#14356) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Adding cpu inference with VXE ISA for s390x architecture (vllm-project#12613) Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com> Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com> Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com>
* Add authors to license header. (vllm-project#14371) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com> Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
* Fix mla prefill context performance (vllm-project#13897) Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com>
* [V1] Do not detokenize if sampling param detokenize is False (vllm-project#14224) Signed-off-by: Himanshu Jaju <hj@mistral.ai> Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com>
* [Distributed] Add enable_expert_parallel arg (vllm-project#14305) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (vllm-project#13569) Signed-off-by: mgoin <mgoin64@gmail.com>
* [CI] Disable spawn when running V1 Test (vllm-project#14345) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
* [Kernel] Add needs_fixed_stride_order tag to most GEMMs (vllm-project#14306) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [Bugfix] Fix use_direct_call condition in FusedMoE layer for (vllm-project#14382) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [Bug] Fix Attention when ignored in by quant_method (vllm-project#14313) Signed-off-by: mgoin <mgoin64@gmail.com>
* [V1][Bugfix] Standardize quantized kv cache rejection for attention backends (vllm-project#14221) Signed-off-by: mgoin <mgoin64@gmail.com>
* [Docs] Add nsight guide to profiling docs (vllm-project#14298) Signed-off-by: mgoin <mgoin64@gmail.com>
* cleanup boolean logic Signed-off-by: Sage Moore <sage@neuralmagic.com>
* [Hardware][TPU] Enable ragged paged attention kernel and resolve recompilation issue (vllm-project#14310) Signed-off-by: Chengji Yao <chengjiyao@google.com>
* [Doc] Fix a typo (vllm-project#14385)
* [Bugfix] Correctly call `cudaProfilerStop` in benchmarks script (vllm-project#14183) Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
* [Perf] Reduce MLA CPU overheads in V1 (vllm-project#14384) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object (vllm-project#14390) Signed-off-by: luka <luka@neuralmagic.com>
* [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs (vllm-project#14396)
* [Bugfix] Fix JambaForCausalLM LoRA (vllm-project#14370) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [Build] Add nightly wheel fallback when latest commit wheel unavailable (vllm-project#14358) Signed-off-by: Isotr0py <2037008807@qq.com>
* OpenVINO: added CPU-like conditions (vllm-project#14338) Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
* [GH] Auto-apply multi-modality label to relevant PRs (vllm-project#14402) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* correct wrong markdown syntax (vllm-project#14414) Signed-off-by: vincent-pli <justdoit.pli@gmail.com>
* [Bugfix] Further clean up LoRA test (vllm-project#14422) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [Bugfix] Clean up multi-modal processors (vllm-project#14417) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [Misc] Set default value of seed to None (vllm-project#14274) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
* [BUGFIX] Skip tokenization support for throughput benchmark (vllm-project#12712) Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
* Fix missing `kv_caches` and `attn_metadata` in `OpenVINOCausalLM` (vllm-project#14271) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* Use the optimized block sizes after tuning the kernel. (vllm-project#14329)
* [V1][Core] Support for Structured Outputs (vllm-project#12388) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com>
* [Doc] Update prefix_caching.md to match the example image (vllm-project#14420)
* [Benchmarks] Make detokenization optional in benchmark scripts (vllm-project#11697) Signed-off-by: Jeremy Arnold <Jeremy.Arnold@amd.com>
* comments Signed-off-by: Sage Moore <sage@neuralmagic.com>
* [Kernel] optimize performance of gptq marlin kernel when n is small (vllm-project#14138) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
* [Misc] Add Phi4-MM example (vllm-project#14343) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* [v1] torch.compile integration explanation (vllm-project#14437) Signed-off-by: youkaichao <youkaichao@gmail.com>
* [V1] Eagerly remove finished requests from the batch (vllm-project#14388) Signed-off-by: Nick Hill <nhill@redhat.com>
* [V1][Metrics] Fix traceback with preemptions+LoRA (vllm-project#14220) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
* [Bugfix] Fix torch_xla which can't handle None seed introduced in vllm-project#14274 (vllm-project#14459) Signed-off-by: Yarong Mu <ymu@google.com>
* [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC (vllm-project#13949)
* [Bugfix][V1] Handle MLA in kv_cache_interface (vllm-project#14462) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Revert "[Perf] Reduce MLA CPU overheads in V1 (vllm-project#14384)" (vllm-project#14471)
* [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache (vllm-project#14369) Signed-off-by: Mathis Felardos <mathis@mistral.ai>
* [MISC][V1] Register process killing handler only in the main thread (vllm-project#14380) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
* [core] add `extra_args` to `SamplingParams` (vllm-project#13300) Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
* [CI/Build] refactor: set timezone of container to UTC (vllm-project#12888) Signed-off-by: Roger Meier <r.meier@siemens.com>
* Default to `generation_config` from model (vllm-project#12622) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Doc] add doc for Qwen models tool calling (vllm-project#14478) Signed-off-by: WangErXiao <863579016@qq.com>
* [Doc] Added QwQ-32B to the supported models list in the reasoning out… (vllm-project#14479) Signed-off-by: WangErXiao <863579016@qq.com>
* [Bugfix] Make the device profiler include LoRA memory. (vllm-project#14469) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
* Add training doc signposting to TRL (vllm-project#14439) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Build/BugFix] Fix hopper 12.8 build (vllm-project#14354) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
* Add RLHF document (vllm-project#14482) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [CI/Build] Use a fixed seed to avoid flaky tests (vllm-project#14480) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [V1] TPU - Add tensor parallel support via Ray (vllm-project#13618) Signed-off-by: Alexander Matveev <amatveev@redhat.com>
* [VLM] Add TP support for Phi-4-MM (vllm-project#14453) Signed-off-by: Isotr0py <2037008807@qq.com>
* [Misc] add `use_tqdm_on_load` to reduce logs (vllm-project#14407) Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
* [V1][Core] Fix memory issue with logits & sampling (vllm-project#13776) Signed-off-by: Roger Wang <ywang@roblox.com>
* [benchmarks] Add option to use unique jsonschema for each request (vllm-project#14457) Signed-off-by: Russell Bryant <rbryant@redhat.com>
* [Misc] Don't run ruff at all on 3rd party libs (vllm-project#14493) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* Move requirements into their own directory (vllm-project#12547) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Bugfix] DeepSeek Accuracy (vllm-project#14476) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
* [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling (vllm-project#14361) Signed-off-by: Isotr0py <2037008807@qq.com>
* Update CODEOWNERS for structured output (vllm-project#14496) Signed-off-by: Russell Bryant <rbryant@redhat.com>
* [Misc] Upgrade to Python 3.9 typing for additional directories (vllm-project#14492) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* [V1] Support bad_words in sampler (vllm-project#13376) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Co-authored-by: Nick Hill <nhill@redhat.com>
* Revert "[V1][Core] Fix memory issue with logits & sampling" (vllm-project#14504) Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Roger Wang <ywang@roblox.com>
* [Attention] Default to FlashMLA backend for MLA (vllm-project#14451) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
* [V1][TPU] Remove unnecessary padding for running on TPU. (vllm-project#14467)
* [Feat] Support chunked prefill for LMCache connector (vllm-project#14505) Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
* [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 (vllm-project#12428) Signed-off-by: Yuchen Yan <740987012@qq.com>
* [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work (vllm-project#14498) Signed-off-by: Isotr0py <2037008807@qq.com>
* [Hardware][TPU] Fix the recompiling issue in logits processor after warmup (vllm-project#14510) Signed-off-by: Chengji Yao <chengjiyao@google.com>
* [Misc] Ensure out-of-tree quantization method is recognized by cli args (vllm-project#14328) Signed-off-by: liuyanyi <wolfsonliu@163.com>
* [Bugfix] Wrong requirements path - rocm (vllm-project#14527) Signed-off-by: Martin Hoyer <mhoyer@redhat.com>
* [Feature] Consolidate performance benchmark datasets (vllm-project#14036) Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com>
* [Misc] Add log information for handle_process_request. (vllm-project#14130) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [Docs] Mention `model_impl` arg when explaining Transformers fallback (vllm-project#14552) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* [Frontend] support image embeds (vllm-project#13955) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
* [Kernel] Add more dtype support for GGUF kernels (vllm-project#14043) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com> Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
* [Doc] Update PaliGemma note to a warning (vllm-project#14565) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
* V1 rocm support (#469)
  * Initial commit for V1 successful compilation
  * Small improvement for linear
  * Small improvement for linear
  * making use of forward_cuda for all except ROPE in llama
  Co-authored-by: maleksan85 <maleksan@amd.com>
* nightly_fixed_aiter_integration_final_20250305 README update (#470)
  * nightly_fixed_aiter_integration_final_20250305 README update (perf results only)
  * Update Docker Manifest git hash
  * Update Docker Manifest and added nightly_fixed_aiter_integration_final_20250305
  * some more updates
  * Update AITER section with example
  * Updated AITER command with larger batch size and model name
  * Fixing typo
  * Removed --max-model-len in AITER command
  * Updating AITER instructions
  * typo
  * Another typo
  * Whitespace
  * modifying whats new section
  * Another typo
  Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
  Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: dangshunya <dangshunya@baichuan-inc.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Iacopo Poli <iacopo@lighton.ai>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: pyc96 <pychen96@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Signed-off-by: Linkun Chen <github@lkchen.net>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Signed-off-by: courage17340 <courage17340@163.com>
Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl>
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com>
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Signed-off-by: vincent-pli <justdoit.pli@gmail.com>
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Jeremy Arnold <Jeremy.Arnold@amd.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Yarong Mu <ymu@google.com>
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
Signed-off-by: Roger Meier <r.meier@siemens.com>
Signed-off-by: WangErXiao <863579016@qq.com>
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Signed-off-by: Yuchen Yan <740987012@qq.com>
Signed-off-by: Martin Hoyer <mhoyer@redhat.com>
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Zhanwen Chen <phil.zhanwen.chen@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: lkchen <github@lkchen.net>
Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Kuntai Du <kuntai@uchicago.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: rainkert <93575312+rainkert@users.noreply.github.com>
Co-authored-by: dangshunya <dangshunya@baichuan-inc.com>
Co-authored-by: Congcong Chen <congcongchen@microsoft.com>
Co-authored-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Iacopo Poli <iacopo@lighton.ai>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Zhe Zhang <zhz@apache.org>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: DaividFrank <49250948+DaividFrank@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Vincent <vincentzhongy+githubvincent4@gmail.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com>
Co-authored-by: Serena <yangsijia.614@bytedance.com>
Co-authored-by: pyc96 <pychen96@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Ce Gao <cegao@tensorchord.ai>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: kYLe <kylhuang@nvidia.com>
Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
Co-authored-by: Yanyi Liu <wolfsonliu@163.com>
Co-authored-by: Irina Yuryeva <76484191+upayuryeva@users.noreply.github.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: courage17340 <courage17340@users.noreply.github.com>
Co-authored-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl>
Co-authored-by: Dilip Gowda Bhagavan <110233170+dilipgb@users.noreply.github.com>
Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
Co-authored-by: Himanshu Jaju <hj@mistral.ai>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Daniel Li <dyli@google.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
Co-authored-by: Peng Li <justdoit.pli@gmail.com>
Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: York-RDWang <103811994+York-RDWang@users.noreply.github.com>
Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: yarongmu-google <150371854+yarongmu-google@users.noreply.github.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Mathis Felardos <mathis@mistral.ai>
Co-authored-by: Aviv Keshet <akeshet@scaledcognition.com>
Co-authored-by: Roger Meier <r.meier@siemens.com>
Co-authored-by: Robin <863579016@qq.com>
Co-authored-by: Alexander Matveev <59768536+alexm-redhat@users.noreply.github.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Jiayi Yao <82156730+YaoJiayi@users.noreply.github.com>
Co-authored-by: Yuchen Yan <50619811+yanyc428@users.noreply.github.com>
Co-authored-by: Martin Hoyer <mhoyer@redhat.com>
Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com>
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
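One of the merged changes above ("Use math.prod instead of np.prod for trivial ops", vllm-project#14142) swaps numpy out for the standard library when computing products of small tuples such as tensor shapes. A minimal sketch of the idea (the shape value here is illustrative, not taken from the change itself): math.prod needs no numpy import and returns a plain Python int.

```python
import math

# Product of a shape tuple, e.g. the number of elements in a (2, 3, 4) tensor.
# math.prod avoids the numpy dependency and array-conversion overhead that
# np.prod would incur for a trivial input like this.
shape = (2, 3, 4)
numel = math.prod(shape)
print(numel)  # → 24
```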
1 parent 5e31d5c · commit a699a11

File tree

379 files changed (+19,127 / -4,444 lines)

.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh

Lines changed: 1 addition & 1 deletion

@@ -426,7 +426,7 @@ main() {
 
   pip install -U transformers
 
-  pip install -r requirements-dev.txt
+  pip install -r requirements/dev.txt
   which genai-perf
 
   # check storage

.buildkite/run-amd-test.sh

Lines changed: 6 additions & 1 deletion

@@ -93,7 +93,12 @@ if [[ $commands == *" kernels "* ]]; then
   --ignore=kernels/test_rand.py \
   --ignore=kernels/test_sampler.py \
   --ignore=kernels/test_cascade_flash_attn.py \
-  --ignore=kernels/test_mamba_mixer2.py"
+  --ignore=kernels/test_mamba_mixer2.py \
+  --ignore=kernels/test_aqlm.py \
+  --ignore=kernels/test_machete_mm.py \
+  --ignore=kernels/test_mha_attn.py \
+  --ignore=kernels/test_block_fp8.py \
+  --ignore=kernels/test_permute_cols.py"
 fi
 
 #ignore certain Entrypoints tests

.buildkite/run-cpu-test.sh

Lines changed: 1 addition & 1 deletion

@@ -35,7 +35,7 @@ function cpu_tests() {
 # Run basic model test
 docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
   set -e
-  pip install -r vllm/requirements-test.txt
+  pip install -r vllm/requirements/test.txt
   pytest -v -s tests/models/decoder_only/language -m cpu_model
   pytest -v -s tests/models/embedding/language -m cpu_model
   pytest -v -s tests/models/encoder_decoder/language -m cpu_model

.buildkite/test-pipeline.yaml

Lines changed: 8 additions & 3 deletions

@@ -35,7 +35,7 @@ steps:
   fast_check: true
   no_gpu: True
   commands:
-  - pip install -r requirements-docs.txt
+  - pip install -r ../../requirements/docs.txt
   - SPHINXOPTS=\"-W\" make html
   # Check API reference (if it fails, you may have missing mock imports)
   - grep \"sig sig-object py\" build/html/api/inference_params.html
@@ -78,6 +78,7 @@ steps:
   - tests/basic_correctness/test_preemption
   - tests/basic_correctness/test_cumem.py
   commands:
+  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -v -s basic_correctness/test_cumem.py
   - pytest -v -s basic_correctness/test_basic_correctness.py
   - pytest -v -s basic_correctness/test_cpu_offload.py
@@ -115,6 +116,7 @@ steps:
   - tests/entrypoints/test_chat_utils
   - tests/entrypoints/offline_mode
   commands:
+  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
   - pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
   - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
@@ -146,8 +148,10 @@ steps:
   - pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
   # TODO: create a dedicated test section for multi-GPU example tests
   # when we have multiple distributed example tests
-  - python3 ../examples/offline_inference/rlhf.py
-  - RAY_DEDUP_LOGS=0 python3 ../examples/offline_inference/rlhf_colocate.py
+  - pushd ../examples/offline_inference
+  - python3 rlhf.py
+  - RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
+  - popd
 
 - label: Metrics, Tracing Test # 10min
   num_gpus: 2
@@ -204,6 +208,7 @@ steps:
   - VLLM_USE_V1=1 pytest -v -s v1/engine
   - VLLM_USE_V1=1 pytest -v -s v1/sample
   - VLLM_USE_V1=1 pytest -v -s v1/worker
+  - VLLM_USE_V1=1 pytest -v -s v1/structured_output
   - VLLM_USE_V1=1 pytest -v -s v1/test_stats.py
   - VLLM_USE_V1=1 pytest -v -s v1/test_utils.py
   # TODO: accuracy does not match, whether setting
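The `export VLLM_WORKER_MULTIPROC_METHOD=spawn` lines added above make the test workers start with the `spawn` method, which launches each worker in a fresh interpreter instead of forking, so no state such as an initialized CUDA context is inherited from the parent. A minimal standalone sketch of the mechanism (plain `multiprocessing`, not vLLM's actual worker code):

```python
import multiprocessing as mp

def worker(queue):
    # Runs in a freshly spawned interpreter: nothing from the parent's
    # memory (e.g. a CUDA context) is inherited, unlike with "fork".
    queue.put("ready")

if __name__ == "__main__":
    # Selecting the start method explicitly, as the env var does for vLLM.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```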

.github/mergify.yml

Lines changed: 15 additions & 0 deletions

@@ -36,6 +36,21 @@ pull_request_rules:
       add:
         - frontend
 
+- name: label-multi-modality
+  description: Automatically apply multi-modality label
+  conditions:
+    - or:
+      - files~=^vllm/multimodal/
+      - files~=^tests/multimodal/
+      - files~=^tests/models/multimodal/
+      - files~=^tests/models/*/audio_language/
+      - files~=^tests/models/*/vision_language/
+      - files=tests/models/test_vision.py
+  actions:
+    label:
+      add:
+        - multi-modality
+
 - name: label-structured-output
   description: Automatically apply structured-output label
   conditions:
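The new `label-multi-modality` rule relies on Mergify's `files~=` operator, a regular-expression match against each changed file's path. A rough illustration of which paths would pick up the label, using plain Python `re` (only the unambiguous prefix patterns are modeled here, not Mergify itself or the `files=` exact-match condition):

```python
import re

# Regex conditions copied from the new label-multi-modality rule.
PATTERNS = [
    r"^vllm/multimodal/",
    r"^tests/multimodal/",
    r"^tests/models/multimodal/",
]

def would_label(path: str) -> bool:
    """True if any rule pattern matches the changed-file path."""
    return any(re.search(p, path) for p in PATTERNS)

print(would_label("vllm/multimodal/processor.py"))    # True
print(would_label("vllm/attention/backends/mla.py"))  # False
```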

.github/workflows/scripts/build.sh

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@ python_executable=python3
 
 # Update paths
 # Install requirements
-$python_executable -m pip install -r requirements-rocm.txt
+$python_executable -m pip install -r requirements/rocm.txt
 
 # Limit the number of parallel jobs to avoid OOM
 export MAX_JOBS=1

.gitignore

Lines changed: 1 addition & 1 deletion

@@ -197,7 +197,7 @@ _build/
 hip_compat.h
 
 # Benchmark dataset
-benchmarks/*.json
+benchmarks/**/*.json
 
 # Linting
 actionlint
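Widening `benchmarks/*.json` to `benchmarks/**/*.json` makes git ignore JSON files in nested benchmark directories, not only those directly under `benchmarks/`. Python's `pathlib` glob follows the same `*` vs `**` distinction, so the effect can be sketched with a throwaway directory tree:

```python
import tempfile
from pathlib import Path

# Throwaway tree: benchmarks/top.json and benchmarks/nested/deep.json
root = Path(tempfile.mkdtemp())
(root / "benchmarks" / "nested").mkdir(parents=True)
(root / "benchmarks" / "top.json").write_text("{}")
(root / "benchmarks" / "nested" / "deep.json").write_text("{}")

# "*" matches one path component; "**" recurses into subdirectories.
shallow = sorted(p.relative_to(root).as_posix()
                 for p in root.glob("benchmarks/*.json"))
deep = sorted(p.relative_to(root).as_posix()
              for p in root.glob("benchmarks/**/*.json"))
print(shallow)  # ['benchmarks/top.json']
print(deep)     # ['benchmarks/nested/deep.json', 'benchmarks/top.json']
```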

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions

@@ -44,8 +44,8 @@ repos:
   rev: 0.6.2
   hooks:
   - id: pip-compile
-    args: [requirements-test.in, -o, requirements-test.txt]
-    files: ^requirements-test\.(in|txt)$
+    args: [requirements/test.in, -o, requirements/test.txt]
+    files: ^requirements/test\.(in|txt)$
 - repo: local
   hooks:
   - id: mypy-local

.readthedocs.yaml

Lines changed: 1 addition & 1 deletion

@@ -18,4 +18,4 @@ formats: []
 # Optionally declare the Python requirements required to build your docs
 python:
   install:
-    - requirements: docs/requirements-docs.txt
+    - requirements: requirements/docs.txt

CMakeLists.txt

Lines changed: 54 additions & 26 deletions

@@ -31,7 +31,7 @@ set(ignoreMe "${VLLM_PYTHON_PATH}")
 set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11" "3.12")
 
 # Supported NVIDIA architectures.
-set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0")
+set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0")
 
 # Supported AMD GPU architectures.
 set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")
@@ -312,7 +312,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   # Only build Marlin kernels if we are building for at least some compatible archs.
   # Keep building Marlin for 9.0 as there are some group sizes and shapes that
   # are not supported by Machete yet.
-  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
+  cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
   if (MARLIN_ARCHS)
     set(MARLIN_SRCS
       "csrc/quantization/fp8/fp8_marlin.cu"
@@ -334,7 +334,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 
   # Only build AllSpark kernels if we are building for at least some compatible archs.
   cuda_archs_loose_intersection(ALLSPARK_ARCHS "8.0;8.6;8.7;8.9" "${CUDA_ARCHS}")
-  if (ALLSPARK_ARCHS)
+  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND ALLSPARK_ARCHS)
     set(ALLSPARK_SRCS
       "csrc/quantization/gptq_allspark/allspark_repack.cu"
       "csrc/quantization/gptq_allspark/allspark_qgemm_w8a16.cu")
@@ -345,46 +345,74 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
     message(STATUS "Building AllSpark kernels for archs: ${ALLSPARK_ARCHS}")
   else()
     message(STATUS "Not building AllSpark kernels as no compatible archs found"
-            " in CUDA target architectures")
+            " in CUDA target architectures, or CUDA not >= 12.0")
   endif()
 
+
+  set(SCALED_MM_3X_ARCHS)
   # The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
-  # CUDA 12.0 or later (and only work on Hopper, 9.0a for now).
-  cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a" "${CUDA_ARCHS}")
-  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
+  # CUDA 12.0 or later
+  cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a;" "${CUDA_ARCHS}")
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_ARCHS)
     set(SRCS
-      "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu"
+      "csrc/quantization/cutlass_w8a8/scaled_mm_c3x_sm90.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_fp8.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_int8.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_azp_sm90_int8.cu"
       "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
-      CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
+      CUDA_ARCHS "${SCALED_MM_ARCHS}")
     list(APPEND VLLM_EXT_SRC "${SRCS}")
-    list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_C3X=1")
-    message(STATUS "Building scaled_mm_c3x for archs: ${SCALED_MM_3X_ARCHS}")
+    list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_SM90=1")
+    # Let scaled_mm_c2x know it doesn't need to build these arches
+    list(APPEND SCALED_MM_3X_ARCHS "${SCALED_MM_ARCHS}")
+    message(STATUS "Building scaled_mm_c3x_sm90 for archs: ${SCALED_MM_ARCHS}")
   else()
-    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
-      message(STATUS "Not building scaled_mm_c3x as CUDA Compiler version is "
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_ARCHS)
+      message(STATUS "Not building scaled_mm_c3x_sm90 as CUDA Compiler version is "
              "not >= 12.0, we recommend upgrading to CUDA 12.0 or "
             "later if you intend on running FP8 quantized models on "
             "Hopper.")
    else()
-      message(STATUS "Not building scaled_mm_c3x as no compatible archs found "
+      message(STATUS "Not building scaled_mm_c3x_sm90 as no compatible archs found "
             "in CUDA target architectures")
    endif()
+  endif()
 
-  # clear SCALED_MM_3X_ARCHS so the scaled_mm_c2x kernels know we didn't
-  # build any 3x kernels
-  set(SCALED_MM_3X_ARCHS)
+  # The cutlass_scaled_mm kernels for Blackwell (c3x, i.e. CUTLASS 3.x) require
+  # CUDA 12.8 or later
+  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;12.0a" "${CUDA_ARCHS}")
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND SCALED_MM_ARCHS)
+    set(SRCS
+      "csrc/quantization/cutlass_w8a8/scaled_mm_c3x_sm100.cu"
+      "csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm100_fp8.cu"
+    )
+    set_gencode_flags_for_srcs(
+      SRCS "${SRCS}"
+      CUDA_ARCHS "${SCALED_MM_ARCHS}")
+    list(APPEND VLLM_EXT_SRC "${SRCS}")
+    list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_SM100=1")
+    # Let scaled_mm_c2x know it doesn't need to build these arches
+    list(APPEND SCALED_MM_3X_ARCHS "${SCALED_MM_ARCHS}")
+    message(STATUS "Building scaled_mm_c3x_sm100 for archs: ${SCALED_MM_ARCHS}")
+  else()
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND SCALED_MM_ARCHS)
+      message(STATUS "Not building scaled_mm_c3x_sm100 as CUDA Compiler version is "
+              "not >= 12.8, we recommend upgrading to CUDA 12.8 or "
+              "later if you intend on running FP8 quantized models on "
+              "Blackwell.")
+    else()
+      message(STATUS "Not building scaled_mm_c3x_100 as no compatible archs found "
+              "in CUDA target architectures")
+    endif()
   endif()
 
   #
   # For the cutlass_scaled_mm kernels we want to build the c2x (CUTLASS 2.x)
   # kernels for the remaining archs that are not already built for 3x.
   cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS
-    "7.5;8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
+    "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
   # subtract out the archs that are already built for 3x
   list(REMOVE_ITEM SCALED_MM_2X_ARCHS ${SCALED_MM_3X_ARCHS})
   if (SCALED_MM_2X_ARCHS)
@@ -409,17 +437,17 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   # 2:4 Sparse Kernels
 
   # The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
-  # require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
-  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
+  # require CUDA 12.2 or later (and only work on Hopper and Blackwell).
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_ARCHS)
     set(SRCS "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
-      CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
+      CUDA_ARCHS "${SCALED_MM_ARCHS}")
     list(APPEND VLLM_EXT_SRC "${SRCS}")
     list(APPEND VLLM_GPU_FLAGS "-DENABLE_SPARSE_SCALED_MM_C3X=1")
-    message(STATUS "Building sparse_scaled_mm_c3x for archs: ${SCALED_MM_3X_ARCHS}")
+    message(STATUS "Building sparse_scaled_mm_c3x for archs: ${SCALED_MM_ARCHS}")
   else()
-    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_ARCHS)
       message(STATUS "Not building sparse_scaled_mm_c3x kernels as CUDA Compiler version is "
             "not >= 12.2, we recommend upgrading to CUDA 12.2 or later "
             "if you intend on running FP8 sparse quantized models on Hopper.")
@@ -434,8 +462,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND FP4_ARCHS)
     set(SRCS
       "csrc/quantization/fp4/nvfp4_quant_kernels.cu"
-      "csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu"
-    )
+      "csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu")
     set_gencode_flags_for_srcs(
       SRCS "${SRCS}"
       CUDA_ARCHS "${FP4_ARCHS}")
@@ -534,6 +561,7 @@ define_gpu_extension_target(
   COMPILE_FLAGS ${VLLM_GPU_FLAGS}
   ARCHITECTURES ${VLLM_GPU_ARCHES}
   INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR}
+  INCLUDE_DIRECTORIES ${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
   USE_SABI 3
   WITH_SOABI)
 
@@ -557,7 +585,7 @@ set_gencode_flags_for_srcs(
   CUDA_ARCHS "${CUDA_ARCHS}")
 
 if(VLLM_GPU_LANG STREQUAL "CUDA")
-  cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
+  cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
   if (MARLIN_MOE_ARCHS)
     set(MARLIN_MOE_SRC
       "csrc/moe/marlin_kernels/marlin_moe_kernel.h"
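Several hunks above extend the architecture lists passed to `cuda_archs_loose_intersection`, the CMake helper that keeps only those user-requested `CUDA_ARCHS` a given kernel supports. A simplified Python model of that selection (an assumption-laden sketch: the real CMake function also understands suffixed arch names such as `9.0a`, which this ignores):

```python
def loose_intersection(supported: str, requested: str) -> str:
    """Keep requested arches that the kernel's supported list contains.

    Simplified model of CMake's cuda_archs_loose_intersection; the real
    helper also handles arch suffixes like "9.0a", not covered here.
    """
    supported_set = set(supported.split(";"))
    return ";".join(a for a in requested.split(";") if a in supported_set)

# Marlin's list after this commit includes Blackwell (10.0/10.1/12.0),
# so requesting a Blackwell arch now selects it instead of dropping it:
marlin = loose_intersection("8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0", "8.9;9.0;12.0")
print(marlin)  # 8.9;9.0;12.0
```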
