111 changes: 104 additions & 7 deletions docs/contributing/benchmarks.md
@@ -67,13 +67,13 @@ Legend:
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>

First start serving your model
First start serving your model:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```

Then run the benchmarking script
Then run the benchmarking script:

```bash
# download dataset
@@ -87,7 +87,7 @@ vllm bench serve \
--num-prompts 10
```

If successful, you will see the following output
If successful, you will see the following output:

```text
============ Serving Benchmark Result ============
@@ -125,7 +125,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you

```bash
# start server
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
vllm serve meta-llama/Llama-3.1-8B-Instruct
```

```bash
@@ -167,7 +167,7 @@ vllm bench serve \
##### InstructCoder Benchmark with Speculative Decoding

``` bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-config $'{"method": "ngram",
"num_speculative_tokens": 5, "prompt_lookup_max": 5,
"prompt_lookup_min": 2}'
@@ -184,7 +184,7 @@ vllm bench serve \
##### Spec Bench Benchmark with Speculative Decoding

``` bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-config $'{"method": "ngram",
"num_speculative_tokens": 5, "prompt_lookup_max": 5,
"prompt_lookup_min": 2}'
@@ -366,7 +366,6 @@ Total num output tokens: 1280

``` bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
vllm bench throughput \
--dataset-name=hf \
--dataset-path=likaixin/InstructCoder \
@@ -781,6 +780,104 @@ This should be seen as an edge case, and if this behavior can be avoided by sett

</details>

#### Embedding Benchmark

Benchmark the performance of embedding requests in vLLM.

<details class="admonition abstract" markdown="1">
<summary>Show more</summary>

##### Text Embeddings

Unlike generative models, which are benchmarked through the Completions or Chat Completions API,
embedding models use the Embeddings API, so set `--backend openai-embeddings` and `--endpoint /v1/embeddings`.

You can use any text dataset to benchmark the model, such as ShareGPT.

Start the server:

```bash
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
```
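
Optionally, sanity-check the Embeddings API before benchmarking. The snippet below is a minimal sketch that assumes the server is listening on the default port 8000; the request and response follow the OpenAI-compatible embeddings schema.

```python
# Minimal sanity check of the Embeddings API (assumes the default port 8000).
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "jinaai/jina-embeddings-v3", "input": "vLLM is fast."},
)
resp.raise_for_status()
print(len(resp.json()["data"][0]["embedding"]))  # embedding dimensionality
```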

Run the benchmark:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model jinaai/jina-embeddings-v3 \
--backend openai-embeddings \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
```

##### Multi-modal Embeddings

Unlike generative models, which are benchmarked through the Completions or Chat Completions API,
multi-modal embedding models also use the Embeddings API, so set `--endpoint /v1/embeddings`. The backend to use depends on the model:

- CLIP: `--backend openai-embeddings-clip`
- VLM2Vec: `--backend openai-embeddings-vlm2vec`

For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format.
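
As a rough, text-only illustration (the helper names, instruction prefix, and payload shape below are assumptions, not vLLM's actual benchmarking helpers), such an implementation mainly wraps each prompt in the instruction format the model expects before posting it to `/v1/embeddings`; a real multi-modal backend would additionally attach the image data in whatever format the model requires.

```python
# Hypothetical sketch only: names and the instruction format are illustrative,
# not part of vLLM's benchmarking library.
import requests


def build_embeddings_payload(model: str, prompt: str) -> dict:
    # Wrap the raw prompt in whatever instruction format your model expects.
    instructed = f"Represent the given text for retrieval: {prompt}"
    return {"model": model, "input": instructed}


def request_embedding(api_url: str, model: str, prompt: str) -> list[float]:
    # Post to the OpenAI-compatible Embeddings API and return the vector.
    resp = requests.post(api_url, json=build_embeddings_payload(model, prompt))
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]


# Example usage (assumes a server started with `vllm serve <your model>`):
# request_embedding("http://localhost:8000/v1/embeddings", "<your model>", "hello")
```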

You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.

Serve and benchmark CLIP:

```bash
# Run this in another process
vllm serve openai/clip-vit-base-patch32

# Run these one by one after the server is up
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model openai/clip-vit-base-patch32 \
--backend openai-embeddings-clip \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json

vllm bench serve \
--model openai/clip-vit-base-patch32 \
--backend openai-embeddings-clip \
--endpoint /v1/embeddings \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat
```

Serve and benchmark VLM2Vec:

```bash
# Run this in another process
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--chat-template examples/template_vlm2vec_phi3v.jinja

# Run these one by one after the server is up
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench serve \
--model TIGER-Lab/VLM2Vec-Full \
--backend openai-embeddings-vlm2vec \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json

vllm bench serve \
--model TIGER-Lab/VLM2Vec-Full \
--backend openai-embeddings-vlm2vec \
--endpoint /v1/embeddings \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat
```

</details>

[](){ #performance-benchmarks }

## Performance Benchmarks
8 changes: 4 additions & 4 deletions vllm/benchmarks/datasets.py
@@ -1582,10 +1582,10 @@ def get_samples(args, tokenizer) -> list[SampleRequest]:
"like to add support for additional dataset formats."
)

if dataset_class.IS_MULTIMODAL and args.backend not in [
"openai-chat",
"openai-audio",
]:
if dataset_class.IS_MULTIMODAL and not (
args.backend in ("openai-chat", "openai-audio")
or "openai-embeddings-" in args.backend
):
# multi-modal benchmark is only available on OpenAI Chat
# endpoint-type.
raise ValueError(