
Commit 9ba8734

Merge branch 'upstream-main' into tms/add_mamba
2 parents b2a8cd8 + 98c12cf commit 9ba8734

File tree

325 files changed (+17020 / -3472 lines)


.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.409
+    value: 0.419
   - name: "exact_match,flexible-extract"
-    value: 0.406
+    value: 0.416
 limit: 1000
 num_fewshot: 5
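
This config is one of the lm-eval-harness accuracy checks run in CI; the commit raises the expected GSM8K scores for the QQQ-quantized Llama-3-8B model from 0.409/0.406 to 0.419/0.416. As a rough illustration only (a sketch, not the exact CI runner invocation; the checkpoint path is a placeholder), a config like this corresponds to an lm_eval run along these lines:

    lm_eval --model vllm \
        --model_args pretrained=<Meta-Llama-3-8B-QQQ-checkpoint>,tensor_parallel_size=1 \
        --tasks gsm8k --num_fewshot 5 --limit 1000

The reported exact_match,strict-match and exact_match,flexible-extract metrics are then compared against the value fields above.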

.buildkite/nightly-benchmarks/README.md

Lines changed: 5 additions & 4 deletions
@@ -34,17 +34,18 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
 
 Performance benchmark will be triggered when:
 - A PR being merged into vllm.
-- Every commit for those PRs with `perf-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
 
 Nightly benchmark will be triggered when:
-- Every commit for those PRs with `nightly-benchmarks` label.
+- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
 
 
 
 ## Performance benchmark details
 
-See [descriptions.md](tests/descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
+
+See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
 
 #### Latency test

@@ -68,7 +69,7 @@ Here is an example of one test inside `latency-tests.json`:
 
 In this example:
 - The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-benchmarks-suite.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
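
To make the `parameters` convention concrete, a hypothetical `latency-tests.json` entry matching the arguments quoted above might look like this (a sketch; the test name and exact fields in the repo's file may differ):

    [
        {
            "test_name": "latency_llama8B_tp1",
            "parameters": {
                "model": "meta-llama/Meta-Llama-3-8B",
                "tensor_parallel_size": 1,
                "load_format": "dummy",
                "num_iters_warmup": 5,
                "num_iters": 15
            }
        }
    ]

run-performance-benchmarks.sh then converts each underscored key into a dashed flag, yielding `benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`.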

.buildkite/nightly-benchmarks/benchmark-pipeline.yaml

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ steps:
       containers:
       - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
         command:
-        - bash .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
+        - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
         resources:
           limits:
             nvidia.com/gpu: 8
File renamed without changes.

.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py

Lines changed: 2 additions & 2 deletions
@@ -174,8 +174,8 @@ def results_to_json(latency, throughput, serving):
     # document the result
     with open(results_folder / "benchmark_results.md", "w") as f:
 
-        results = read_markdown(
-            "../.buildkite/nightly-benchmarks/tests/descriptions.md")
+        results = read_markdown("../.buildkite/nightly-benchmarks/" +
+                                "performance-benchmarks-descriptions.md")
         results = results.format(
             latency_tests_markdown_table=latency_md_table,
             throughput_tests_markdown_table=throughput_md_table,
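
The pattern here (not fully visible in the hunk) is a markdown template whose named placeholders are filled with generated tables via str.format. A minimal sketch of that idea, assuming read_markdown simply returns the file contents and that the template contains the placeholder names shown in the keyword arguments:

    from pathlib import Path

    def read_markdown(path: str) -> str:
        # assumption: the helper just loads the markdown file as a string
        return Path(path).read_text()

    template = read_markdown("../.buildkite/nightly-benchmarks/" +
                             "performance-benchmarks-descriptions.md")
    # the template is assumed to contain {latency_tests_markdown_table}
    # and {throughput_tests_markdown_table} placeholders
    report = template.format(
        latency_tests_markdown_table="| Test | Mean latency (ms) |\n|---|---|",
        throughput_tests_markdown_table="| Test | Tput (req/s) |\n|---|---|",
    )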

.buildkite/nightly-benchmarks/run-benchmarks-suite.sh renamed to .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh

Lines changed: 28 additions & 17 deletions
@@ -37,9 +37,9 @@ check_hf_token() {
 ensure_sharegpt_downloaded() {
   local FILE=ShareGPT_V3_unfiltered_cleaned_split.json
   if [ ! -f "$FILE" ]; then
-    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
+    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
   else
-    echo "$FILE already exists."
+    echo "$FILE already exists."
   fi
 }
 
@@ -68,11 +68,29 @@ wait_for_server() {
   done' && return 0 || return 1
 }
 
+kill_processes_launched_by_current_bash() {
+  # Kill all python processes launched from current bash script
+  current_shell_pid=$$
+  processes=$(ps -eo pid,ppid,command | awk -v ppid="$current_shell_pid" -v proc="$1" '$2 == ppid && $3 ~ proc {print $1}')
+  if [ -n "$processes" ]; then
+    echo "Killing the following processes matching '$1':"
+    echo "$processes"
+    echo "$processes" | xargs kill -9
+  else
+    echo "No processes found matching '$1'."
+  fi
+}
+
 kill_gpu_processes() {
-  # kill all processes on GPU.
 
-  ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
-  ps -e | grep pt_main_thread | awk '{print $1}' | xargs kill -9
+  ps -aux
+  lsof -t -i:8000 | xargs -r kill -9
+  pkill -f pt_main_thread
+  # this line doesn't work now
+  # ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
+  pkill -f python3
+  pkill -f /usr/bin/python3
+
 
   # wait until GPU memory usage smaller than 1GB
   while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do

@@ -82,11 +100,6 @@ kill_gpu_processes() {
   # remove vllm config file
   rm -rf ~/.config/vllm
 
-  # Print the GPU memory usage
-  # so that we know if all GPU processes are killed.
-  gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
-  # The memory usage should be 0 MB.
-  echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
 }
 
 upload_to_buildkite() {

@@ -104,7 +117,7 @@ upload_to_buildkite() {
   fi
 
   # Use the determined command to annotate and upload artifacts
-  $BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" < $RESULTS_FOLDER/benchmark_results.md
+  $BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" <$RESULTS_FOLDER/benchmark_results.md
   $BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
 }

@@ -156,7 +169,7 @@ run_latency_tests() {
       latency_command: $latency,
       gpu_type: $gpu
     }')
-    echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"
 
     # run the benchmark
     eval "$latency_command"

@@ -166,7 +179,6 @@ run_latency_tests() {
   done
 }
 
-
 run_throughput_tests() {
   # run throughput tests using `benchmark_throughput.py`
   # $1: a json file specifying throughput test cases

@@ -214,7 +226,7 @@ run_throughput_tests() {
       throughput_command: $command,
       gpu_type: $gpu
     }')
-    echo "$jq_output" > "$RESULTS_FOLDER/$test_name.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"
 
     # run the benchmark
     eval "$throughput_command"

@@ -246,7 +258,6 @@ run_serving_tests() {
       continue
     fi
 
-
     # get client and server arguments
     server_params=$(echo "$params" | jq -r '.server_parameters')
     client_params=$(echo "$params" | jq -r '.client_parameters')

@@ -324,7 +335,7 @@ run_serving_tests() {
       client_command: $client,
       gpu_type: $gpu
     }')
-    echo "$jq_output" > "$RESULTS_FOLDER/${new_test_name}.commands"
+    echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
 
   done

@@ -341,6 +352,7 @@ main() {
   # dependencies
   (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
   (which jq) || (apt-get update && apt-get -y install jq)
+  (which lsof) || (apt-get update && apt-get install -y lsof)
 
   # get the current IP address, required by benchmark_serving.py
   export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')

@@ -359,7 +371,6 @@ main() {
   run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
   run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json
 
-
   # postprocess benchmarking results
   pip install tabulate pandas
   python3 $QUICK_BENCHMARK_ROOT/scripts/convert-results-json-to-markdown.py

.buildkite/run-amd-test.sh

Lines changed: 1 addition & 0 deletions
@@ -75,6 +75,7 @@ docker run \
         --network host \
         --shm-size=16gb \
         --rm \
+        -e HIP_VISIBLE_DEVICES=0 \
         -e HF_TOKEN \
         -v ${HF_CACHE}:${HF_MOUNT} \
         -e HF_HOME=${HF_MOUNT} \

.buildkite/run-tpu-test.sh

Lines changed: 1 addition & 2 deletions
@@ -12,5 +12,4 @@ remove_docker_container
 # For HF_TOKEN.
 source /etc/environment
 # Run a simple end-to-end example.
-docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu \
-    python3 /workspace/vllm/examples/offline_inference_tpu.py
+docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"

.buildkite/test-pipeline.yaml

Lines changed: 45 additions & 14 deletions
@@ -86,15 +86,18 @@ steps:
   - vllm/
   commands:
   - pip install -e ./plugins/vllm_add_dummy_model
-  - pytest -v -s entrypoints/llm
+  - pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@a4987bba6e9e9b3f22bd3a6c1ecf0abd04fd5622#egg=lm_eval[api]
+  - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py
+  - pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
   - pytest -v -s entrypoints/openai
 
 - label: Distributed Tests (4 GPUs) # 10min
   working_dir: "/vllm-workspace/tests"
   num_gpus: 4
   fast_check: true
   source_file_dependencies:
-  - vllm/
+  - vllm/distributed/
+  - vllm/core/
   - tests/distributed
   - tests/spec_decode/e2e/test_integration_dist_tp4
   commands:

@@ -111,10 +114,10 @@ steps:
   commands:
   - pytest -v -s metrics
   - "pip install \
-    opentelemetry-sdk \
-    opentelemetry-api \
-    opentelemetry-exporter-otlp \
-    opentelemetry-semantic-conventions-ai"
+    'opentelemetry-sdk>=1.26.0,<1.27.0' \
+    'opentelemetry-api>=1.26.0,<1.27.0' \
+    'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
+    'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'"
   - pytest -v -s tracing
 
 ##### fast check tests #####

@@ -230,12 +233,13 @@ steps:
   parallelism: 4
 
 - label: Tensorizer Test # 11min
+  mirror_hardwares: [amd]
   soft_fail: true
   source_file_dependencies:
   - vllm/model_executor/model_loader
   - tests/tensorizer_loader
   commands:
-  - apt-get install -y curl libsodium23
+  - apt-get update && apt-get install -y curl libsodium23
   - export VLLM_WORKER_MULTIPROC_METHOD=spawn
   - pytest -v -s tensorizer_loader

@@ -283,11 +287,15 @@ steps:
   num_gpus: 2
   num_nodes: 2
   source_file_dependencies:
-  - vllm/
-  - tests/distributed/test_same_node
+  - vllm/distributed/
+  - vllm/engine/
+  - vllm/executor/
+  - vllm/model_executor/models/
+  - tests/distributed/
   commands:
   - # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
   - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py
+  - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
   - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
   - # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
   - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py

@@ -297,8 +305,11 @@ steps:
   working_dir: "/vllm-workspace/tests"
   num_gpus: 2
   source_file_dependencies:
-  - vllm/
-  - tests/distributed
+  - vllm/distributed/
+  - vllm/engine/
+  - vllm/executor/
+  - vllm/model_executor/models/
+  - tests/distributed/
   commands:
   - VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py
   - TARGET_TEST_SUITE=L4 pytest -v -s distributed/test_basic_distributed_correctness.py

@@ -311,13 +322,33 @@ steps:
   - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
   - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s distributed/test_utils.py
 
+- label: Multi-step Tests (4 GPUs) # 21min
+  working_dir: "/vllm-workspace/tests"
+  num_gpus: 4
+  source_file_dependencies:
+  - vllm/model_executor/layers/sampler.py
+  - vllm/sequence.py
+  - vllm/worker/worker_base.py
+  - vllm/worker/worker.py
+  - vllm/worker/multi_step_worker.py
+  - vllm/worker/model_runner_base.py
+  - vllm/worker/model_runner.py
+  - vllm/worker/multi_step_model_runner.py
+  - vllm/engine
+  - tests/multi_step
+  commands:
+  - pytest -v -s multi_step/test_correctness_async_llm.py
+  - pytest -v -s multi_step/test_correctness_llm.py
+
 - label: Pipeline Parallelism Test # 23min
   working_dir: "/vllm-workspace/tests"
   num_gpus: 4
   source_file_dependencies:
-  - vllm/
-  - tests/distributed/test_pp_cudagraph.py
-  - tests/distributed/test_pipeline_parallel
+  - vllm/distributed/
+  - vllm/engine/
+  - vllm/executor/
+  - vllm/model_executor/models/
+  - tests/distributed/
   commands:
   - pytest -v -s distributed/test_pp_cudagraph.py
   - pytest -v -s distributed/test_pipeline_parallel.py

.github/ISSUE_TEMPLATE/100-documentation.yml

Lines changed: 7 additions & 0 deletions
@@ -20,3 +20,10 @@ body:
   attributes:
     value: >
       Thanks for contributing 🎉!
+- type: checkboxes
+  id: askllm
+  attributes:
+    label: Before submitting a new issue...
+    options:
+      - label: Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
+        required: true
