Closed
Changes from all commits (100 commits)
948151a  WIP: Refactor chunked local attention for FULL CG support (LucasWilkinson, Jan 3, 2026)
fdcbacd  Remove CPU<>GPU syncs and debug prints from chunked local attention (LucasWilkinson, Jan 3, 2026)
f31be3a  Add comprehensive tests for chunked local attention Triton kernel (LucasWilkinson, Jan 4, 2026)
12e40bf  Fix query_start_loc_cpu to use proper CPU tensor (LucasWilkinson, Jan 4, 2026)
a1b28ce  make lazy (LucasWilkinson, Jan 5, 2026)
7e052c9  Add Multimodal Processor Benchmark (#29105) (reaganjlee, Jan 2, 2026)
cb78269  [Bugfix] Replace BaseException with specific exceptions in FLA utils … (c0de128, Jan 2, 2026)
534812f  [ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by r… (AndreasKaratzas, Jan 2, 2026)
99c200c  feat: support LoRA for DeepSeek-OCR(Language Model part) (#31569) (zhima771, Jan 2, 2026)
0d52752  [Bugfix] Fix block size used in EAGLE slot mapping (#31540) (benchislett, Jan 2, 2026)
89d072d  [Model] Enable LoRA support for tower and connector in LLaVA (#31513) (Jan 2, 2026)
12dae45  [ROCm][CI] Fix ModernBERT token classification test (#31612) (AndreasKaratzas, Jan 2, 2026)
7f614e8  [Bugfix] Fix activation quantization for compressed-tensors W4A16 (#3… (Tmn07, Jan 2, 2026)
7058d8d  Remove unused `use_marlin` variable in `Mxfp4MoEMethod` (#31549) (vsourirajan, Jan 2, 2026)
fd80f78  [Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA de… (c0de128, Jan 2, 2026)
da898a5  [Bugfix] Fix weight_loader v1 block scale (#31103) (kyuyeunk, Jan 2, 2026)
1a60e5f  Add multimodal input method in the documentation (#31601) (labAxiaoming, Jan 2, 2026)
298a223  CustomOp: test forward dispatch for grouped_topk (#31530) (xinyu-intel, Jan 2, 2026)
d12c3c6  [BugFix] Support online dense model DP without overhead (#30739) (njhill, Jan 2, 2026)
89f9ef4  [MoE] Fix output_shape calculation in Attention layer to handle 3D qu… (AndreasKaratzas, Jan 2, 2026)
16d28bb  [MoE Refactor] Split `invoke_fused_moe_kernel` (#31050) (zyongye, Jan 2, 2026)
c18d2d4  [MoE Refactor] Explicit construct mk for flashinfer bf16 kernel (#31504) (zyongye, Jan 2, 2026)
706c452  [Benchmark] Fix OOM during MoE kernel tuning for large models (#31604) (massif-01, Jan 2, 2026)
b94a63d  [Core] Parse vLLM engine required fields from hf_config to model_arch… (charlotte12l, Jan 2, 2026)
f193b01  Improve HF qwen3_omni: preserve audio_sample_rate in kwargs restructu… (jeremyteboul, Jan 3, 2026)
ac248d0  [CI][Bugfix] Fix token counting in chunked prefill compl test (#31630) (AndreasKaratzas, Jan 3, 2026)
2088b9b  [MoE Refactor][13/N] Convert FI to Use PFNoEP (#31533) (robertgshaw2-redhat, Jan 3, 2026)
7905690  [Docs] Fix argparse include path for mm-processor benchmark (#31654) (reaganjlee, Jan 4, 2026)
c2316d1  fix no think of GLM-4.5 / GLM-4.7 (#31449) (zRzRzRzRzRzRzR, Jan 4, 2026)
441dbea  [misc] Sort uvicorn log level description according to verbosity (#31… (andyxning, Jan 4, 2026)
08b3492  [BugFix] Async scheduling: handle model forward errors more cleanly (… (njhill, Jan 4, 2026)
69b8f42  [CI] Skip Phi-MoE test due to old API util (#31632) (AndreasKaratzas, Jan 5, 2026)
1be2204  [ROCm][CI] Fix language generation test accuracy by disabling HF flas… (AndreasKaratzas, Jan 5, 2026)
99fe46d  [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_si… (jeejeelee, Jan 5, 2026)
3b42f4e  [Minor] Small pooler output processing optimization (#31667) (njhill, Jan 5, 2026)
959881a  [CI/Build] Revive skipped reward models e2e test (#31665) (Isotr0py, Jan 5, 2026)
c87b648  [Platform] Deprecate seed_everything (#31659) (wangxiyuan, Jan 5, 2026)
bab887e  [Misc] Various code simplifications (#31666) (njhill, Jan 5, 2026)
63de4b3  [CI Failure] Fix NomicBert max_model_len validation (#31662) (noooop, Jan 5, 2026)
eaf6681  Add chat prefix completion feature to DeepSeek v3.2 (#31147) (PHOEBEMOON0802, Jan 5, 2026)
a99dfec  [log] enable max_log_len trim only when needed (#31482) (andyxning, Jan 5, 2026)
8c94e01  [Bugfix] Fix EPLB state logging error (#31455) (tlrmchlsmth, Jan 5, 2026)
6459a94  [Frontend] [Bugfix] respect server-level default chat template kwargs… (cjackal, Jan 5, 2026)
712e9b4  [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 (#31664) (noooop, Jan 5, 2026)
a10b3a5  [ROCM] Reorder arguments and rename parameters for rope_cached_thd_po… (tpopp, Jan 5, 2026)
55dea0b  [Model] Enable LoRA support for BLIP2 (#31620) (ppppqp, Jan 5, 2026)
842dd7b  [LoRA] LoRA PDL improvement (#31660) (jeejeelee, Jan 5, 2026)
a288d7a  [KVconnector][LMCache] remove the import of legacy LMCache code (#31704) (ApostaC, Jan 5, 2026)
c6b0298  [platform] Support additional forward context for OOT (#31674) (zzzzwwjj, Jan 5, 2026)
7dcc47e  [Model] Let more models to support the score template. (#31335) (noooop, Jan 5, 2026)
4519f59  [v1] Add encoder-only/cross attention support to Triton Attention bac… (Isotr0py, Jan 5, 2026)
4b3a894  [Frontend] [Doc] Exclude log deltas feature (#30322) (Catacomba, Jan 5, 2026)
489b336  [CI Failure] Disable B200 tests while runner is broken (#31732) (mgoin, Jan 5, 2026)
75a3919  [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --e… (ricky-chaoju, Jan 5, 2026)
139b946  [Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… (kzwrime, Jan 5, 2026)
86f5357  Triton Attention: Support cross-layers blocks (#30687) (orozery, Jan 5, 2026)
bcf0e58  [Misc] Enable Paligemma's PrefixLM attention mask computation (#31725) (Isotr0py, Jan 5, 2026)
c5f580d  Fix GLM-4.6v flash tool calling in transformers 5.x (#31622) (baonudesifeizhai, Jan 5, 2026)
86625fe  [Misc][Model][Refactor] Pass the prefix into Linear layers (#31669) (kunpengW-code, Jan 5, 2026)
54a2b3a  [BugFix] Fix architecture flags to prevent issues on SM103 (#31150) (LopezCastroRoberto, Jan 5, 2026)
a0e89e8  pin lora_b moe weights on cpu (#31317) (gnovack, Jan 5, 2026)
34dedaa  [docker] install cuda13 version of lmcache and nixl (#30913) (soodoshll, Jan 5, 2026)
1a64930  [Model] Nemotron Parse 1.1 Support (#30864) (amitz-nv, Jan 5, 2026)
7489f3b  [Cleanup] Remove deprecated fields from CachedRequestData class (#31734) (njhill, Jan 5, 2026)
6d3ed5b  [CI][DeepSeek] Add nightly DeepSeek R1 `lm_eval` tests on H200 (#30356) (MatthewBonanni, Jan 5, 2026)
af6cd0c  [Bug] Revert torch warning fix (#31585) (yewentao256, Jan 5, 2026)
46b90c7  [MoE Refactor] Aiter Experts for BF16 MoE (#31542) (zyongye, Jan 5, 2026)
c04c662  [Bugfix] Fix Broken ModelOpt NVFP4 MoE (#31742) (robertgshaw2-redhat, Jan 5, 2026)
1129074  [Bugfix] Properly apply v_scale for mimo_v2_flash (#31175) (mgoin, Jan 5, 2026)
6ff70ae  [CI/Build] Allow user to configure NVSHMEM version via ENV or command… (eicherseiji, Jan 5, 2026)
7213e1e  [Bugfix] vLLM produces invalid UTF-8 tokens and “�” (#28874) (johncalesp, Jan 6, 2026)
f467648  Revert "[CI Failure] Disable B200 tests while runner is broken" (#31750) (mgoin, Jan 6, 2026)
57b047a  [Docs] Improve malformed exception caused by backslash line continuat… (maang-h, Jan 6, 2026)
ce802dd  [Perf] Optimize additional `fill(0)` in cutlass moe, 2.9% E2E through… (yewentao256, Jan 6, 2026)
dcff285  [Cleanup] Remove redundant `decoder_layer_type` assignment in `Qwen2`… (maang-h, Jan 6, 2026)
f1871bf  [UX] Add `-ep` shorthand for `--enable-expert-parallel` (#30890) (mgoin, Jan 6, 2026)
4d9042b  [Bugfix] Add init_workspace_manager to moe kernel benchmarks (#31042) (mgoin, Jan 6, 2026)
4cb1891  [CI] Fix CPU MM PRocessor Test (#31764) (robertgshaw2-redhat, Jan 6, 2026)
26db352  [Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check (#… (c0de128, Jan 6, 2026)
d1af5ee  [Doc] Show that `use_audio_in_video` is supported in docs (#30837) (DarkLight1337, Jan 6, 2026)
f1714b5  [Bugfix][ROCm] Fix Unsupported attention metadata type for speculativ… (vllmellm, Jan 6, 2026)
83e4ed2  [Models]: Use `MMEncoderAttention` for MoonViT (#31738) (Isotr0py, Jan 6, 2026)
724d094  [Bugfix][CI/Build] Fix failing pooling models test due to Triton kern… (Isotr0py, Jan 6, 2026)
cc6f175  [Chore] Remove more V0 dead code from `sequence.py` (#31783) (DarkLight1337, Jan 6, 2026)
7d4459f  [cpu][bench] Add CPU paged attention benchmarks (#31720) (fadara01, Jan 6, 2026)
818c25b  [Misc] Use `deprecated` for `seed_everything` (#31780) (DarkLight1337, Jan 6, 2026)
fd37ec9  [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. (#31797) (noooop, Jan 6, 2026)
3bf3790  [Doc] Fix format of multimodal_inputs.md (#31800) (BlankRH, Jan 6, 2026)
4a3c93c  [Chore] Cleanup `mem_utils.py` (#31793) (DarkLight1337, Jan 6, 2026)
af45517  [Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_c… (LucasWilkinson, Jan 6, 2026)
b901934  [Bugfix] Fix torch.compile error for DP + MoE on CPU Backend (#31650) (kzwrime, Jan 6, 2026)
da29be6  [Misc] Implement `TokenizerLike.convert_tokens_to_ids` (#31796) (DarkLight1337, Jan 6, 2026)
1124e8c  [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) (#31790) (Jzz1943, Jan 6, 2026)
5045991  [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false (#31… (chaunceyjiang, Jan 6, 2026)
6b549ce  [Model] rename use_pad_token to use_sep_token (#31784) (noooop, Jan 6, 2026)
240b6ed  [LoRA]Disable linear LoRA kernel PDL (#31777) (jeejeelee, Jan 6, 2026)
06c8fe1  [Bugfix]: Fix cross attention backend selection for Turing GPU (#31806) (Isotr0py, Jan 6, 2026)
73c5f5d  [MoE Refactor] Add Temporary Integration Tests - H100/B200 (#31759) (robertgshaw2-redhat, Jan 6, 2026)
10f2ac4  [MoE Refactor][14/N] Clean Up FI Quant Config Smuggling (#31593) (robertgshaw2-redhat, Jan 6, 2026)
05fb361  [NemotronH] Use ReplicatedLinear for fc1_latent_proj (#31807) (roikoren755, Jan 6, 2026)
7 changes: 4 additions & 3 deletions .buildkite/test-amd.yaml
@@ -859,7 +859,7 @@ steps:
- label: Language Models Tests (Extra Standard) %N
timeout_in_minutes: 45
mirror_hardwares: [amdexperimental]
agent_pool: mi325_8
agent_pool: mi325_2
# grade: Blocking
torch_nightly: true
source_file_dependencies:
@@ -871,6 +871,7 @@ steps:
# Shard slow subset of standard language models tests. Only run when model
# source is modified, or when specified test files are modified
- pip freeze | grep -E 'torch'
- export TORCH_NCCL_BLOCKING_WAIT=1
- pytest -v -s models/language -m 'core_model and slow_test' \
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
--shard-id=$$BUILDKITE_PARALLEL_JOB
@@ -888,7 +889,7 @@ steps:
commands:
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
- uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/[email protected]'
# Shard hybrid language model tests
- pytest -v -s models/language/generation \
@@ -909,7 +910,7 @@ steps:
commands:
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
- uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/[email protected]'
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'

25 changes: 24 additions & 1 deletion .buildkite/test-pipeline.yaml
@@ -943,7 +943,6 @@ steps:
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
gpu: b200
# optional: true
source_file_dependencies:
- csrc/quantization/fp4/
- csrc/attention/mla/
@@ -1348,6 +1347,14 @@ steps:
- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
- pytest -v -s tests/v1/distributed/test_dbo.py

- label: LM Eval Large Models (H200) # optional
timeout_in_minutes: 60
gpu: h200
optional: true
num_gpus: 8
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt

##### B200 test #####
- label: Distributed Tests (B200) # optional
gpu: b200
@@ -1399,3 +1406,19 @@ steps:
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1

##### MoE Refactor (Temporary) Tests #####

- label: MoE Refactor Integration Test (H100 - TEMPORARY) # optional
gpu: h100
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt

- label: MoE Refactor Integration Test (B200 - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt
5 changes: 2 additions & 3 deletions benchmarks/kernels/benchmark_activation.py
@@ -8,10 +8,9 @@

import vllm.model_executor.layers.activation # noqa F401
from vllm.model_executor.custom_op import CustomOp
from vllm.platforms import current_platform
from vllm.triton_utils import triton
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed

batch_size_range = [1, 16, 128]
seq_len_range = [1, 16, 64, 1024, 4096]
@@ -30,7 +29,7 @@ def benchmark_activation(
device = "cuda"
num_tokens = batch_size * seq_len
dim = intermediate_size
current_platform.seed_everything(42)
set_random_seed(42)
torch.set_default_device(device)

if func_name == "gelu_and_mul":
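The same seed-helper migration recurs in most of the benchmark files below. As a minimal sketch of the pattern (assuming vllm.utils.torch_utils.set_random_seed takes a single integer seed, as the call sites in this diff suggest):

# New helper used throughout this PR:
from vllm.utils.torch_utils import set_random_seed

# Old form, removed in these files (deprecated in commit c87b648):
# from vllm.platforms import current_platform
# current_platform.seed_everything(42)

set_random_seed(42)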
5 changes: 5 additions & 0 deletions benchmarks/kernels/benchmark_cutlass_moe_fp8.py
@@ -15,6 +15,7 @@
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.worker.workspace import init_workspace_manager

# Weight shapes for different models: [num_experts, topk, hidden_size,
# intermediate_size]
@@ -297,6 +298,10 @@ def bench_cuda_graph(graph, num_warmup=5, num_iters=100):


def main(args):
# Initialize workspace manager (required for CUTLASS MoE kernels)
device = torch.device("cuda:0")
init_workspace_manager(device)

print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
@@ -21,6 +21,7 @@
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
from vllm.scalar_type import scalar_types
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.worker.workspace import init_workspace_manager

WEIGHT_SHAPES_MOE = {
"nvidia/DeepSeek-R1-FP4": [
@@ -441,6 +442,10 @@ def replay_graph(graph, num_repeats):


def main(args):
# Initialize workspace manager (required for CUTLASS MoE kernels)
device = torch.device("cuda:0")
init_workspace_manager(device)

print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
5 changes: 5 additions & 0 deletions benchmarks/kernels/benchmark_grouped_gemm_cutlass.py
@@ -14,6 +14,7 @@
fused_topk,
)
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.worker.workspace import init_workspace_manager

DEFAULT_MODELS = [
"mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -364,6 +365,10 @@ def replay_graph(graph, num_repeats):


def main(args):
# Initialize workspace manager (required for CUTLASS MoE kernels)
device = torch.device("cuda:0")
init_workspace_manager(device)

print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
5 changes: 2 additions & 3 deletions benchmarks/kernels/benchmark_layernorm.py
@@ -6,9 +6,8 @@
import torch

from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed


@torch.inference_mode()
@@ -22,7 +21,7 @@ def main(
num_warmup_iters: int = 5,
num_iters: int = 100,
) -> None:
current_platform.seed_everything(seed)
set_random_seed(seed)
torch.set_default_device("cuda")

layer = RMSNorm(hidden_size).to(dtype=dtype)
61 changes: 58 additions & 3 deletions benchmarks/kernels/benchmark_moe.py
@@ -2,6 +2,7 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import argparse
import gc
import json
import os
import time
@@ -23,9 +24,50 @@
from vllm.transformers_utils.config import get_config
from vllm.triton_utils import triton
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import set_random_seed

FP8_DTYPE = current_platform.fp8_dtype()

# Default interval for clearing Triton JIT cache during tuning
# Set to 0 to disable automatic cache clearing
_CACHE_CLEAR_INTERVAL_ENV = "VLLM_MOE_TUNE_CACHE_CLEAR_INTERVAL"
TRITON_CACHE_CLEAR_INTERVAL = int(os.environ.get(_CACHE_CLEAR_INTERVAL_ENV, "50"))


def clear_triton_cache():
"""Clear Triton JIT compilation cache and Python/CUDA memory.

This helps prevent OOM during tuning with large models (many experts).
"""
# Force Python garbage collection
gc.collect()

# Clear CUDA memory cache
if torch.cuda.is_available():
torch.cuda.empty_cache()

# Try to clear Triton's runtime cache
try:
import triton

if (
hasattr(triton, "runtime")
and hasattr(triton.runtime, "cache")
and hasattr(triton.runtime.cache, "clear")
):
triton.runtime.cache.clear()
except ImportError:
# Triton not installed, skip cache clearing
pass
except AttributeError:
# Triton version doesn't have expected cache API
pass
except Exception as e:
print(f"Warning: Failed to clear Triton cache: {e}")

# Additional garbage collection after clearing caches
gc.collect()


def ensure_divisibility(numerator, denominator, text):
"""Ensure that numerator is divisible by the denominator."""
@@ -390,7 +432,7 @@ def merge_unique_dicts(list1, list2):
class BenchmarkWorker:
def __init__(self, seed: int) -> None:
torch.set_default_device("cuda")
current_platform.seed_everything(seed)
set_random_seed(seed)
self.seed = seed
# Get the device ID to allocate tensors and kernels
# on the respective GPU. This is required for Ray to work
@@ -410,7 +452,7 @@ def benchmark(
block_quant_shape: list[int] = None,
use_deep_gemm: bool = False,
) -> tuple[dict[str, int], float]:
current_platform.seed_everything(self.seed)
set_random_seed(self.seed)
dtype_str = _get_config_dtype_str(
dtype, use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8
)
@@ -483,7 +525,7 @@ def tune(
need_device_guard = True

with torch.cuda.device(self.device_id) if need_device_guard else nullcontext():
for config in tqdm(search_space):
for idx, config in enumerate(tqdm(search_space)):
try:
kernel_time = benchmark_config(
config,
@@ -506,6 +548,19 @@ def tune(
if kernel_time < best_time:
best_time = kernel_time
best_config = config

# Periodically clear Triton JIT cache to prevent OOM
# This is especially important for large models with many experts
if (
TRITON_CACHE_CLEAR_INTERVAL > 0
and idx > 0
and idx % TRITON_CACHE_CLEAR_INTERVAL == 0
):
clear_triton_cache()

# Final cleanup after tuning completes
clear_triton_cache()

now = datetime.now()
print(f"{now.ctime()}] Completed tuning for batch_size={num_tokens}")
assert best_config is not None
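For reference, the periodic cleanup added above is governed by the new VLLM_MOE_TUNE_CACHE_CLEAR_INTERVAL environment variable (default 50 configurations; 0 disables it), which the script reads once at import time. A minimal sketch of overriding it for a tuning run follows; the --tune flag and any model arguments are assumptions about the benchmark's CLI, not something this diff shows.

import os
import subprocess

# Clear the Triton JIT cache every 25 configs instead of the default 50.
# The variable must be set before benchmark_moe.py is imported, since it
# is read at module load.
env = dict(os.environ, VLLM_MOE_TUNE_CACHE_CLEAR_INTERVAL="25")

subprocess.run(
    ["python", "benchmarks/kernels/benchmark_moe.py", "--tune"],  # assumed flags
    env=env,
    check=True,
)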
5 changes: 3 additions & 2 deletions benchmarks/kernels/benchmark_moe_permute_unpermute.py
@@ -18,6 +18,7 @@
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import set_random_seed

FP8_DTYPE = current_platform.fp8_dtype()

@@ -261,7 +262,7 @@ def run(input: tuple):
class BenchmarkWorker:
def __init__(self, seed: int) -> None:
torch.set_default_device("cuda")
current_platform.seed_everything(seed)
set_random_seed(seed)
self.seed = seed
# Get the device ID to allocate tensors and kernels
# on the respective GPU. This is required for Ray to work
@@ -279,7 +280,7 @@ def benchmark(
use_int8_w8a16: bool,
use_customized_permute: bool = False,
) -> tuple[dict[str, int], float]:
current_platform.seed_everything(self.seed)
set_random_seed(self.seed)

permute_time = benchmark_permute(
num_tokens,
4 changes: 2 additions & 2 deletions benchmarks/kernels/benchmark_mrope.py
@@ -37,9 +37,9 @@
import torch

from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.platforms import current_platform
from vllm.transformers_utils.config import get_config
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import set_random_seed

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

@@ -94,7 +94,7 @@ def benchmark_mrope(
benchmark_iter: int = 100,
csv_writer=None,
):
current_platform.seed_everything(seed)
set_random_seed(seed)
torch.set_default_device(device)
# the parameters to compute the q k v size based on tp_size
mrope_helper_class = get_rope(
3 changes: 2 additions & 1 deletion benchmarks/kernels/benchmark_paged_attention.py
@@ -13,6 +13,7 @@
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
create_kv_caches_with_random,
set_random_seed,
)

logger = init_logger(__name__)
@@ -38,7 +39,7 @@ def main(
device: str = "cuda",
kv_cache_dtype: str | None = None,
) -> None:
current_platform.seed_everything(seed)
set_random_seed(seed)

scale = float(1.0 / (head_size**0.5))
query = torch.empty(
5 changes: 2 additions & 3 deletions benchmarks/kernels/benchmark_quant.py
@@ -6,9 +6,8 @@
import torch

from vllm import _custom_ops as ops
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed


@torch.inference_mode()
@@ -23,7 +22,7 @@ def main(
num_warmup_iters: int = 5,
num_iters: int = 100,
) -> None:
current_platform.seed_everything(seed)
set_random_seed(seed)
torch.set_default_device("cuda")

x = torch.randn(num_tokens, hidden_size, dtype=dtype)
4 changes: 2 additions & 2 deletions benchmarks/kernels/benchmark_reshape_and_cache.py
@@ -8,11 +8,11 @@

from vllm import _custom_ops as ops
from vllm.logger import init_logger
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
create_kv_caches_with_random,
set_random_seed,
)

logger = init_logger(__name__)
@@ -36,7 +36,7 @@ def run_benchmark(
if kv_cache_dtype == "fp8" and head_size % 16:
raise ValueError("fp8 kv-cache requires head_size to be a multiple of 16.")

current_platform.seed_everything(42)
set_random_seed(42)
torch.set_default_device(device)

# create random key / value tensors [T, H, D].