
Commit b4930c2

Manikandan-Thangaraj-ZS0321, ElizaWszola, dsikka, lewtun, and njhill authored
Updating Branch (#26)
* [Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032) Co-authored-by: Dipika <[email protected]>
* [Frontend] Expose revision arg in OpenAI server (vllm-project#8501)
* [BugFix] Fix clean shutdown issues (vllm-project#8492)
* [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506)
* [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270)
* [doc] update doc on testing and debugging (vllm-project#8514)
* [Bugfix] Bind api server port before starting engine (vllm-project#8491)
* [perf bench] set timeout to debug hanging (vllm-project#8516)
* [misc] small qol fixes for release process (vllm-project#8517)
* [Bugfix] Fix 3.12 builds on main (vllm-project#8510) Signed-off-by: Joe Runde <[email protected]>
* [refactor] remove triton based sampler (vllm-project#8524)
* [Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525) Signed-off-by: Alex-Brooks <[email protected]>
* [Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521)
* [torch.compile] register allreduce operations as custom ops (vllm-project#8526)
* [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509) Signed-off-by: Rui Qiao <[email protected]>
* [Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495)
* [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631)
* [Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434)
* [Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515) Co-authored-by: Cyrus Leung <[email protected]>
* [Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527)
* [Bugfix] Fix TP > 1 for new granite (vllm-project#8544) Signed-off-by: Joe Runde <[email protected]>
* [doc] improve installation doc (vllm-project#8550) Co-authored-by: Andy Dai <[email protected]>
* [CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520)
* [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012)
* [CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540)
* [Misc] Add argument to disable FastAPI docs (vllm-project#8554)
* [CI/Build] Avoid CUDA initialization (vllm-project#8534)
* [CI/Build] Update Ruff version (vllm-project#8469) Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157) Co-authored-by: Nick Hill <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Simon Mo <[email protected]>
* [Core] *Prompt* logprobs support in Multi-step (vllm-project#8199)
* [Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543) Signed-off-by: Russell Bryant <[email protected]>
* [Model] Support Solar Model (vllm-project#8386) Co-authored-by: Michael Goin <[email protected]>
* [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380) Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Michael Goin <[email protected]>
* [Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039)
* [BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572)
* [Bugfix] add `dead_error` property to engine client (vllm-project#8574) Signed-off-by: Joe Runde <[email protected]>
* [Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573) Co-authored-by: [email protected]
* [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (vllm-project#8545)
* Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593)
* [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616)
* [MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615)
* [Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584)
* [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577)
* [Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619)
* [Doc] Add documentation for GGUF quantization (vllm-project#8618)
* Create SECURITY.md (vllm-project#8642)
* [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551)
* [Misc] guard against change in cuda library name (vllm-project#8609)
* [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571)
* [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474)
* [Core] Support Lora lineage and base model metadata management (vllm-project#6315)
* [Model] Add OLMoE (vllm-project#7922)
* [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670)
* [Bugfix] Validate SamplingParam n is an int (vllm-project#8548)
* [Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649)
* [Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556)
* [Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640)
* [Doc] neuron documentation update (vllm-project#8671) Signed-off-by: omrishiv <[email protected]>
* [Hardware][AWS] update neuron to 2.20 (vllm-project#8676) Signed-off-by: omrishiv <[email protected]>
* [Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496)
* [Core] Rename `PromptInputs` and `inputs` (vllm-project#8673)
* [MISC] add support custom_op check (vllm-project#8557) Co-authored-by: youkaichao <[email protected]>
* [Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675)
* [beam search] add output for manually checking the correctness (vllm-project#8684)
* [Kernel] Build flash-attn from source (vllm-project#8245)
* [VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687)
* [Doc] Fix typo in AMD installation guide (vllm-project#8689)
* [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646)
* [dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518)
* [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643)
* [Bugfix] Refactor composite weight loading logic (vllm-project#8656)
* [ci][build] fix vllm-flash-attn (vllm-project#8699)
* [Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407)
* [Misc] Use NamedTuple in Multi-image example (vllm-project#8705) Signed-off-by: Alex-Brooks <[email protected]>
* [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703)
* [Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486) Co-authored-by: litianjian <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701)
* [build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713)
* [misc] upgrade mistral-common (vllm-project#8715)
* [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702)
* [Bugfix] Fix CPU CMake build (vllm-project#8723) Co-authored-by: Yuan <[email protected]>
* [Bugfix] fix docker build for xpu (vllm-project#8652)
* [Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657) Signed-off-by: Alex-Brooks <[email protected]>
* [Hardware][CPU] Refactor CPU model runner (vllm-project#8729)
* [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733)
* [Model] Support pp for qwen2-vl (vllm-project#8696)
* [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (vllm-project#8707)
* [CI/Build] use setuptools-scm to set __version__ (vllm-project#4738) Co-authored-by: youkaichao <[email protected]>
* [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701) Co-authored-by: mgoin <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* [Kernel][LoRA] Add assertion for punica sgmv kernels (vllm-project#7585)
* [Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575) Signed-off-by: Russell Bryant <[email protected]>
* Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562)
* Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335)
* [Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674)
* Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728)
* re-implement beam search on top of vllm core (vllm-project#8726) Co-authored-by: Brendan Wong <[email protected]>
* Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750)
* [MISC] Skip dumping inputs when unpicklable (vllm-project#8744)
* [Core][Model] Support loading weights by ID within models (vllm-project#7931)
* [Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]>
* [Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558)
* [Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661) Co-authored-by: mgoin <[email protected]>
* [Frontend] Batch inference for llm.chat() API (vllm-project#8648) Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]>
* [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748)
* [CI/Build] fix setuptools-scm usage (vllm-project#8771)
* [misc] soft drop beam search (vllm-project#8763)
* [Misc] Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768)
* [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047) Signed-off-by: Travis Johnson <[email protected]>
* [Core] Adding Priority Scheduling (vllm-project#5958)
* [Bugfix] Use heartbeats instead of health checks (vllm-project#8583)
* Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780)
* [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776)
* Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752)
* [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250)
* [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770)
* [Bugfix] load fc bias from config for eagle (vllm-project#8790)

---------

Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: omrishiv <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Dipika <[email protected]>
Co-authored-by: lewtun <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: sasha0552 <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Kevin Lin <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: chenqianfzh <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Alexey Kondratiev(AMD) <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Jiaxin Shan <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Alexander Matveev <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: Robert Shaw <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Geun, Lim <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: 盏一 <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Co-authored-by: Amit Garg <[email protected]>
Co-authored-by: William Lin <[email protected]>
Co-authored-by: Niklas Muennighoff <[email protected]>
Co-authored-by: saumya-saran <[email protected]>
Co-authored-by: Pastel! <[email protected]>
Co-authored-by: omrishiv <[email protected]>
Co-authored-by: zyddnys <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Huazhong Ji <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: litianjian <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Yuan <[email protected]>
Co-authored-by: Yan Ma <[email protected]>
Co-authored-by: Li, Jiang <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Jani Monoses <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: Brendan Wong <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Peter Salas <[email protected]>
Co-authored-by: Hanzhi Zhou <[email protected]>
Co-authored-by: Andy <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Archit Patke <[email protected]>
Co-authored-by: zifeitong <[email protected]>
Co-authored-by: sohamparikh <[email protected]>
1 parent 1572362 commit b4930c2


342 files changed (+14,836 / -6,876 lines)

.buildkite/nightly-benchmarks/benchmark-pipeline.yaml

Lines changed: 1 addition & 2 deletions
@@ -8,8 +8,7 @@ steps:
 containers:
 - image: badouralix/curl-jq
 command:
-- sh
-- .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
+- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
 - wait
 - label: "A100"
 agents:

.buildkite/nightly-benchmarks/scripts/wait-for-image.sh

Lines changed: 3 additions & 1 deletion
@@ -2,9 +2,11 @@
 TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
 URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"

+TIMEOUT_SECONDS=10
+
 retries=0
 while [ $retries -lt 1000 ]; do
-if [ $(curl -s -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then
+if [ $(curl -s --max-time $TIMEOUT_SECONDS -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then
 exit 0
 fi
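
The `--max-time` flag added above bounds each individual curl probe, so a stalled connection to the registry can no longer hang the 1000-iteration wait loop. A minimal standalone sketch of the same pattern (the URL, sleep interval, and retry bookkeeping here are illustrative, not the CI script's actual values):

#!/bin/bash
# Poll a manifest URL until it returns HTTP 200, capping each probe's duration.
TIMEOUT_SECONDS=10                                          # per-request cap, as in the change above
URL="https://example.com/v2/demo-repo/manifests/demo-tag"   # placeholder URL

retries=0
while [ $retries -lt 1000 ]; do
    # --max-time aborts the request after TIMEOUT_SECONDS even if the server stops responding
    status=$(curl -s --max-time "$TIMEOUT_SECONDS" -L -o /dev/null -w "%{http_code}" "$URL")
    if [ "$status" -eq 200 ]; then
        exit 0
    fi
    retries=$((retries + 1))
    sleep 5
done
exit 1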

.buildkite/run-amd-test.sh

Lines changed: 11 additions & 0 deletions
@@ -83,6 +83,7 @@ if [[ $commands == *" kernels "* ]]; then
 --ignore=kernels/test_encoder_decoder_attn.py \
 --ignore=kernels/test_flash_attn.py \
 --ignore=kernels/test_flashinfer.py \
+--ignore=kernels/test_gguf.py \
 --ignore=kernels/test_int8_quant.py \
 --ignore=kernels/test_machete_gemm.py \
 --ignore=kernels/test_mamba_ssm.py \
@@ -93,6 +94,16 @@ if [[ $commands == *" kernels "* ]]; then
 --ignore=kernels/test_sampler.py"
 fi

+#ignore certain Entrypoints tests
+if [[ $commands == *" entrypoints/openai "* ]]; then
+commands=${commands//" entrypoints/openai "/" entrypoints/openai \
+--ignore=entrypoints/openai/test_accuracy.py \
+--ignore=entrypoints/openai/test_audio.py \
+--ignore=entrypoints/openai/test_encoder_decoder.py \
+--ignore=entrypoints/openai/test_embedding.py \
+--ignore=entrypoints/openai/test_oot_registration.py "}
+fi
+
 PARALLEL_JOB_COUNT=8
 # check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
 if [[ $commands == *"--shard-id="* ]]; then
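
The new entrypoints/openai block uses the same bash trick as the existing kernels block: `${commands//pattern/replacement}` rewrites the test command in place, appending `--ignore=` flags only when the marker substring is present. A small illustration of that mechanism (the command string and the single ignore flag are made up for the example):

#!/bin/bash
# Append an --ignore flag whenever the command targets entrypoints/openai.
commands="pytest -v -s entrypoints/openai "

if [[ $commands == *" entrypoints/openai "* ]]; then
    # ${var//old/new} replaces every occurrence of `old` in $var with `new`
    commands=${commands//" entrypoints/openai "/" entrypoints/openai --ignore=entrypoints/openai/test_accuracy.py "}
fi

echo "$commands"
# pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_accuracy.py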

.buildkite/test-pipeline.yaml

Lines changed: 14 additions & 9 deletions
@@ -43,13 +43,15 @@ steps:
 fast_check: true
 source_file_dependencies:
 - vllm/
+- tests/mq_llm_engine
 - tests/async_engine
 - tests/test_inputs
 - tests/multimodal
 - tests/test_utils
 - tests/worker
 commands:
-- pytest -v -s async_engine # Async Engine
+- pytest -v -s mq_llm_engine # MQLLMEngine
+- pytest -v -s async_engine # AsyncLLMEngine
 - NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py
 - pytest -v -s test_inputs.py
 - pytest -v -s multimodal
@@ -82,7 +84,7 @@ steps:
 - label: Entrypoints Test # 20min
 working_dir: "/vllm-workspace/tests"
 fast_check: true
-#mirror_hardwares: [amd]
+mirror_hardwares: [amd]
 source_file_dependencies:
 - vllm/
 commands:
@@ -163,13 +165,6 @@ steps:
 - python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
 - python3 offline_inference_encoder_decoder.py

-- label: torch compile integration test
-source_file_dependencies:
-- vllm/
-commands:
-- pytest -v -s ./compile/test_full_graph.py
-- pytest -v -s ./compile/test_wrapper.py
-
 - label: Prefix Caching Test # 7min
 #mirror_hardwares: [amd]
 source_file_dependencies:
@@ -259,6 +254,13 @@ steps:
 - export VLLM_WORKER_MULTIPROC_METHOD=spawn
 - bash ./run-tests.sh -c configs/models-small.txt -t 1

+- label: Encoder Decoder tests # 5min
+source_file_dependencies:
+- vllm/
+- tests/encoder_decoder
+commands:
+- pytest -v -s encoder_decoder
+
 - label: OpenAI-Compatible Tool Use # 20 min
 fast_check: false
 mirror_hardwares: [ amd ]
@@ -348,7 +350,10 @@ steps:
 - vllm/executor/
 - vllm/model_executor/models/
 - tests/distributed/
+- vllm/compilation
 commands:
+- pytest -v -s ./compile/test_full_graph.py
+- pytest -v -s ./compile/test_wrapper.py
 - VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
 - TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
 # Avoid importing model tests that cause CUDA reinitialization error

.github/workflows/ruff.yml

Lines changed: 2 additions & 2 deletions
@@ -25,10 +25,10 @@ jobs:
 - name: Install dependencies
 run: |
 python -m pip install --upgrade pip
-pip install ruff==0.1.5 codespell==2.3.0 tomli==2.0.1 isort==5.13.2
+pip install -r requirements-lint.txt
 - name: Analysing the code with ruff
 run: |
-ruff .
+ruff check .
 - name: Spelling check with codespell
 run: |
 codespell --toml pyproject.toml
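
Since the workflow now installs its pinned tools from requirements-lint.txt and calls the `ruff check` subcommand, the same checks can presumably be reproduced locally along these lines (a sketch run from the repository root):

# Install the pinned lint toolchain and run the same checks the workflow runs.
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
ruff check .                      # newer ruff releases expect the explicit `check` subcommand
codespell --toml pyproject.toml   # spelling check, as in the workflow's final step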

.github/workflows/scripts/build.sh

Lines changed: 1 addition & 0 deletions
@@ -15,5 +15,6 @@ $python_executable -m pip install -r requirements-cuda.txt
 export MAX_JOBS=1
 # Make sure release wheels are built for the following architectures
 export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
+export VLLM_FA_CMAKE_GPU_ARCHES="80-real;90-real"
 # Build
 $python_executable setup.py bdist_wheel --dist-dir=dist
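
The new `VLLM_FA_CMAKE_GPU_ARCHES` export complements `TORCH_CUDA_ARCH_LIST`: it restricts only the vllm-flash-attn build to the listed real architectures to keep the wheel smaller. A rough sketch of reproducing the release build by hand (assuming `python3` stands in for `$python_executable`):

# Mirror the release-wheel environment set up by this script.
export MAX_JOBS=1
export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"    # arch list for the main CUDA extensions
export VLLM_FA_CMAKE_GPU_ARCHES="80-real;90-real"            # arch list for vllm-flash-attn only
python3 -m pip install -r requirements-cuda.txt
python3 setup.py bdist_wheel --dist-dir=dist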

.gitignore

Lines changed: 7 additions & 2 deletions
@@ -1,5 +1,8 @@
-# vllm commit id, generated by setup.py
-vllm/commit_id.py
+# version file generated by setuptools-scm
+/vllm/_version.py
+
+# vllm-flash-attn built from source
+vllm/vllm_flash_attn/

 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -12,6 +15,8 @@ __pycache__/
 # Distribution / packaging
 .Python
 build/
+cmake-build-*/
+CMakeUserPresets.json
 develop-eggs/
 dist/
 downloads/

CMakeLists.txt

Lines changed: 86 additions & 25 deletions
@@ -1,5 +1,16 @@
 cmake_minimum_required(VERSION 3.26)

+# When building directly using CMake, make sure you run the install step
+# (it places the .so files in the correct location).
+#
+# Example:
+# mkdir build && cd build
+# cmake -G Ninja -DVLLM_PYTHON_EXECUTABLE=`which python3` -DCMAKE_INSTALL_PREFIX=.. ..
+# cmake --build . --target install
+#
+# If you want to only build one target, make sure to install it manually:
+# cmake --build . --target _C
+# cmake --install . --component _C
 project(vllm_extensions LANGUAGES CXX)

 # CUDA by default, can be overridden by using -DVLLM_TARGET_DEVICE=... (used by setup.py)
@@ -13,6 +24,9 @@ include(${CMAKE_CURRENT_LIST_DIR}/cmake/utils.cmake)
 # Suppress potential warnings about unused manually-specified variables
 set(ignoreMe "${VLLM_PYTHON_PATH}")

+# Prevent installation of dependencies (cutlass) by default.
+install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" ALL_COMPONENTS)
+
 #
 # Supported python versions. These versions will be searched in order, the
 # first match will be selected. These should be kept in sync with setup.py.
@@ -70,19 +84,6 @@ endif()
 find_package(Torch REQUIRED)

 #
-# Add the `default` target which detects which extensions should be
-# built based on platform/architecture. This is the same logic that
-# setup.py uses to select which extensions should be built and should
-# be kept in sync.
-#
-# The `default` target makes direct use of cmake easier since knowledge
-# of which extensions are supported has been factored in, e.g.
-#
-# mkdir build && cd build
-# cmake -G Ninja -DVLLM_PYTHON_EXECUTABLE=`which python3` -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=../vllm ..
-# cmake --build . --target default
-#
-add_custom_target(default)
 message(STATUS "Enabling core extension.")

 # Define _core_C extension
@@ -100,8 +101,6 @@ define_gpu_extension_target(
 USE_SABI 3
 WITH_SOABI)

-add_dependencies(default _core_C)
-
 #
 # Forward the non-CUDA device extensions to external CMake scripts.
 #
@@ -167,6 +166,8 @@ if(NVCC_THREADS AND VLLM_GPU_LANG STREQUAL "CUDA")
 list(APPEND VLLM_GPU_FLAGS "--threads=${NVCC_THREADS}")
 endif()

+include(FetchContent)
+
 #
 # Define other extension targets
 #
@@ -190,8 +191,11 @@ set(VLLM_EXT_SRC
 "csrc/torch_bindings.cpp")

 if(VLLM_GPU_LANG STREQUAL "CUDA")
-include(FetchContent)
 SET(CUTLASS_ENABLE_HEADERS_ONLY ON CACHE BOOL "Enable only the header library")
+
+# Set CUTLASS_REVISION manually -- its revision detection doesn't work in this case.
+set(CUTLASS_REVISION "v3.5.1" CACHE STRING "CUTLASS revision to use")
+
 FetchContent_Declare(
 cutlass
 GIT_REPOSITORY https://github.com/nvidia/cutlass.git
@@ -219,6 +223,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 "csrc/quantization/gguf/gguf_kernel.cu"
 "csrc/quantization/fp8/fp8_marlin.cu"
 "csrc/custom_all_reduce.cu"
+"csrc/permute_cols.cu"
 "csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu"
 "csrc/quantization/cutlass_w8a8/scaled_mm_c2x.cu"
 "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu")
@@ -283,6 +288,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 csrc/quantization/machete/machete_pytorch.cu)
 endif()

+message(STATUS "Enabling C extension.")
 define_gpu_extension_target(
 _C
 DESTINATION vllm
@@ -310,9 +316,15 @@ set(VLLM_MOE_EXT_SRC

 if(VLLM_GPU_LANG STREQUAL "CUDA")
 list(APPEND VLLM_MOE_EXT_SRC
+"csrc/moe/marlin_kernels/marlin_moe_kernel.h"
+"csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.h"
+"csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.cu"
+"csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.h"
+"csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.cu"
 "csrc/moe/marlin_moe_ops.cu")
 endif()

+message(STATUS "Enabling moe extension.")
 define_gpu_extension_target(
 _moe_C
 DESTINATION vllm
@@ -323,7 +335,6 @@ define_gpu_extension_target(
 USE_SABI 3
 WITH_SOABI)

-
 if(VLLM_GPU_LANG STREQUAL "HIP")
 #
 # _rocm_C extension
@@ -343,16 +354,66 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
 WITH_SOABI)
 endif()

+# vllm-flash-attn currently only supported on CUDA
+if (NOT VLLM_TARGET_DEVICE STREQUAL "cuda")
+return()
+endif ()

-if(VLLM_GPU_LANG STREQUAL "CUDA" OR VLLM_GPU_LANG STREQUAL "HIP")
-message(STATUS "Enabling C extension.")
-add_dependencies(default _C)
+#
+# Build vLLM flash attention from source
+#
+# IMPORTANT: This has to be the last thing we do, because vllm-flash-attn uses the same macros/functions as vLLM.
+# Because functions all belong to the global scope, vllm-flash-attn's functions overwrite vLLMs.
+# They should be identical but if they aren't, this is a massive footgun.
+#
+# The vllm-flash-attn install rules are nested under vllm to make sure the library gets installed in the correct place.
+# To only install vllm-flash-attn, use --component vllm_flash_attn_c.
+# If no component is specified, vllm-flash-attn is still installed.

-message(STATUS "Enabling moe extension.")
-add_dependencies(default _moe_C)
+# If VLLM_FLASH_ATTN_SRC_DIR is set, vllm-flash-attn is installed from that directory instead of downloading.
+# This is to enable local development of vllm-flash-attn within vLLM.
+# It can be set as an environment variable or passed as a cmake argument.
+# The environment variable takes precedence.
+if (DEFINED ENV{VLLM_FLASH_ATTN_SRC_DIR})
+set(VLLM_FLASH_ATTN_SRC_DIR $ENV{VLLM_FLASH_ATTN_SRC_DIR})
 endif()

-if(VLLM_GPU_LANG STREQUAL "HIP")
-message(STATUS "Enabling rocm extension.")
-add_dependencies(default _rocm_C)
+if(VLLM_FLASH_ATTN_SRC_DIR)
+FetchContent_Declare(vllm-flash-attn SOURCE_DIR ${VLLM_FLASH_ATTN_SRC_DIR})
+else()
+FetchContent_Declare(
+vllm-flash-attn
+GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
+GIT_TAG 013f0c4fc47e6574060879d9734c1df8c5c273bd
+GIT_PROGRESS TRUE
+)
 endif()
+
+# Set the parent build flag so that the vllm-flash-attn library does not redo compile flag and arch initialization.
+set(VLLM_PARENT_BUILD ON)
+
+# Ensure the vllm/vllm_flash_attn directory exists before installation
+install(CODE "file(MAKE_DIRECTORY \"\${CMAKE_INSTALL_PREFIX}/vllm/vllm_flash_attn\")" COMPONENT vllm_flash_attn_c)
+
+# Make sure vllm-flash-attn install rules are nested under vllm/
+install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY FALSE)" COMPONENT vllm_flash_attn_c)
+install(CODE "set(OLD_CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
+install(CODE "set(CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}/vllm/\")" COMPONENT vllm_flash_attn_c)
+
+# Fetch the vllm-flash-attn library
+FetchContent_MakeAvailable(vllm-flash-attn)
+message(STATUS "vllm-flash-attn is available at ${vllm-flash-attn_SOURCE_DIR}")
+
+# Restore the install prefix
+install(CODE "set(CMAKE_INSTALL_PREFIX \"\${OLD_CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
+install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" COMPONENT vllm_flash_attn_c)
+
+# Copy over the vllm-flash-attn python files
+install(
+DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
+DESTINATION vllm/vllm_flash_attn
+COMPONENT vllm_flash_attn_c
+FILES_MATCHING PATTERN "*.py"
+)
+
+# Nothing after vllm-flash-attn, see comment about macros above
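
Taken together with the new comments at the top of the file, a direct CMake build of the extensions would presumably go as follows (a sketch assembled from those comments and the component names above; the flash-attention checkout path is a placeholder):

# Configure and build everything, then install the .so files next to the Python package.
mkdir build && cd build
cmake -G Ninja -DVLLM_PYTHON_EXECUTABLE=$(which python3) -DCMAKE_INSTALL_PREFIX=.. ..
cmake --build . --target install

# Build and install a single extension, e.g. only the _C target.
cmake --build . --target _C
cmake --install . --component _C

# Install only the vllm-flash-attn files pulled in via FetchContent.
cmake --install . --component vllm_flash_attn_c

# For local development, point the build at an existing flash-attention checkout
# (the environment variable takes precedence over the cache variable).
export VLLM_FLASH_ATTN_SRC_DIR=/path/to/flash-attention   # placeholder path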

Dockerfile

Lines changed: 7 additions & 7 deletions
@@ -48,6 +48,9 @@ RUN --mount=type=cache,target=/root/.cache/pip \
 # see https://github.com/pytorch/pytorch/pull/123243
 ARG torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0+PTX'
 ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
+# Override the arch list for flash-attn to reduce the binary size
+ARG vllm_fa_cmake_gpu_arches='80-real;90-real'
+ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches}
 #################### BASE BUILD IMAGE ####################

 #################### WHEEL BUILD IMAGE ####################
@@ -76,14 +79,13 @@ ENV MAX_JOBS=${max_jobs}
 ARG nvcc_threads=8
 ENV NVCC_THREADS=$nvcc_threads

-ARG buildkite_commit
-ENV BUILDKITE_COMMIT=${buildkite_commit}
-
 ARG USE_SCCACHE
 ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
 ARG SCCACHE_REGION_NAME=us-west-2
+ARG SCCACHE_S3_NO_CREDENTIALS=0
 # if USE_SCCACHE is set, use sccache to speed up compilation
 RUN --mount=type=cache,target=/root/.cache/pip \
+--mount=type=bind,source=.git,target=.git \
 if [ "$USE_SCCACHE" = "1" ]; then \
 echo "Installing sccache..." \
 && curl -L -o sccache.tar.gz https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-x86_64-unknown-linux-musl.tar.gz \
@@ -92,6 +94,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
 && rm -rf sccache.tar.gz sccache-v0.8.1-x86_64-unknown-linux-musl \
 && export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \
 && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
+&& export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
 && export SCCACHE_IDLE_TIMEOUT=0 \
 && export CMAKE_BUILD_TYPE=Release \
 && sccache --show-stats \
@@ -102,6 +105,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
 ENV CCACHE_DIR=/root/.cache/ccache
 RUN --mount=type=cache,target=/root/.cache/ccache \
 --mount=type=cache,target=/root/.cache/pip \
+--mount=type=bind,source=.git,target=.git \
 if [ "$USE_SCCACHE" != "1" ]; then \
 python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
 fi
@@ -180,10 +184,6 @@ FROM vllm-base AS test
 ADD . /vllm-workspace/

 # install development dependencies (for testing)
-# A newer setuptools is required for installing some test dependencies from source that do not publish python 3.12 wheels
-# This installation must complete before the test dependencies are collected and installed.
-RUN --mount=type=cache,target=/root/.cache/pip \
-python3 -m pip install "setuptools>=74.1.1"
 RUN --mount=type=cache,target=/root/.cache/pip \
 python3 -m pip install -r requirements-dev.txt
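
With the build arguments above, an image build can presumably enable sccache and override the architecture lists from the command line, roughly as sketched below. The tag is illustrative; the `--build-arg` names are the ARGs declared in this Dockerfile, and the new bind mounts of `.git` mean the build context must contain the repository's `.git` directory so setuptools-scm can derive the version.

# Illustrative build invocation; adjust values to your environment.
DOCKER_BUILDKIT=1 docker build . \
    --tag vllm-local-build \
    --build-arg torch_cuda_arch_list='8.0 9.0+PTX' \
    --build-arg vllm_fa_cmake_gpu_arches='80-real;90-real' \
    --build-arg nvcc_threads=8 \
    --build-arg USE_SCCACHE=1 \
    --build-arg SCCACHE_S3_NO_CREDENTIALS=1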
