@@ -269,6 +269,21 @@ python3 vllm/benchmarks/benchmark_serving.py \
    --num-prompts 10
```

### Running With Ramp-Up Request Rate

The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress-testing the
server or for finding the maximum throughput it can handle under a given latency budget.

Two ramp-up strategies are supported:
- `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially from a start value to an end value.

The following arguments control the ramp-up (an example invocation is sketched after this list):
- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.

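For example, a linear ramp-up run might look like the sketch below; the model and dataset flags are illustrative and should be replaced with the ones used in your own serving benchmark setup above:

```bash
# Illustrative sketch: ramp the request rate linearly from 1 RPS to 20 RPS
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset-name sharegpt \
    --dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --ramp-up-strategy linear \
    --ramp-up-start-rps 1 \
    --ramp-up-end-rps 20
```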
---
## Example - Offline Throughput Benchmark

@@ -387,3 +402,178 @@ python3 vllm/benchmarks/benchmark_throughput.py \
    --enable-lora \
    --lora-path yard1/llama-2-7b-sql-lora-test
```

---
## Example - Structured Output Benchmark

Benchmark the performance of structured output generation (JSON schema, grammar, regex, and choice constraints).

### Server Setup

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```
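Optionally, you can confirm the server is ready before starting a benchmark run. This sketch assumes the default OpenAI-compatible endpoint on `localhost:8000`:

```bash
# Should return the list of served models once the server is up
curl http://localhost:8000/v1/models
```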

### JSON Schema Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset json \
    --structured-output-ratio 1.0 \
    --request-rate 10 \
    --num-prompts 1000
```

### Grammar-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset grammar \
    --structure-type grammar \
    --request-rate 10 \
    --num-prompts 1000
```

### Regex-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset regex \
    --request-rate 10 \
    --num-prompts 1000
```

### Choice-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset choice \
    --request-rate 10 \
    --num-prompts 1000
```

### XGrammar Benchmark Dataset

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset xgrammar_bench \
    --request-rate 10 \
    --num-prompts 1000
```

---
## Example - Long Document QA Throughput Benchmark

Benchmark the performance of long document question-answering with prefix caching.

### Basic Long Document QA Test

```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 16 \
    --document-length 2000 \
    --output-len 50 \
    --repeat-count 5
```
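To quantify the benefit of prefix caching, one option is to rerun the same command without `--enable-prefix-caching` as a baseline (illustrative; all other flags unchanged):

```bash
# Baseline run without prefix caching, for comparison with the run above
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --num-documents 16 \
    --document-length 2000 \
    --output-len 50 \
    --repeat-count 5
```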

### Different Repeat Modes

```bash
# Random mode (default) - shuffle prompts randomly
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 8 \
    --document-length 3000 \
    --repeat-count 3 \
    --repeat-mode random

# Tile mode - repeat entire prompt list in sequence
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 8 \
    --document-length 3000 \
    --repeat-count 3 \
    --repeat-mode tile

# Interleave mode - repeat each prompt consecutively
python3 benchmarks/benchmark_long_document_qa_throughput.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-documents 8 \
    --document-length 3000 \
    --repeat-count 3 \
    --repeat-mode interleave
```

---
## Example - Prefix Caching Benchmark

Benchmark the efficiency of automatic prefix caching.

### Fixed Prompt with Prefix Caching

```bash
python3 benchmarks/benchmark_prefix_caching.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --enable-prefix-caching \
    --num-prompts 1 \
    --repeat-count 100 \
    --input-length-range 128:256
```

### ShareGPT Dataset with Prefix Caching

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

python3 benchmarks/benchmark_prefix_caching.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
    --enable-prefix-caching \
    --num-prompts 20 \
    --repeat-count 5 \
    --input-length-range 128:256
```

---
## Example - Request Prioritization Benchmark

Benchmark the performance of request prioritization in vLLM.

### Basic Prioritization Test

```bash
python3 benchmarks/benchmark_prioritization.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --input-len 128 \
    --output-len 64 \
    --num-prompts 100 \
    --scheduling-policy priority
```

### Multiple Sequences per Prompt

```bash
python3 benchmarks/benchmark_prioritization.py \
    --model meta-llama/Llama-2-7b-chat-hf \
    --input-len 128 \
    --output-len 64 \
    --num-prompts 100 \
    --scheduling-policy priority \
    --n 2
```