@@ -67,13 +67,13 @@ Legend:
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>

- First start serving your model
+ First start serving your model:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```
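
If you want to confirm the server is ready before launching the benchmark, you can first list the served models (an optional check, assuming the server is reachable on the default `localhost:8000` address):

```bash
# optional readiness check; adjust host/port if you changed the defaults
curl http://localhost:8000/v1/models
```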

- Then run the benchmarking script
+ Then run the benchmarking script:

```bash
# download dataset
@@ -87,7 +87,7 @@ vllm bench serve \
  --num-prompts 10
```

- If successful, you will see the following output
+ If successful, you will see the following output:

```text
============ Serving Benchmark Result ============
@@ -125,7 +125,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you

```bash
# start server
- VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
+ vllm serve meta-llama/Llama-3.1-8B-Instruct
```

```bash
@@ -167,7 +167,7 @@ vllm bench serve \
##### InstructCoder Benchmark with Speculative Decoding

```bash
- VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
+ vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-config $'{"method": "ngram",
  "num_speculative_tokens": 5, "prompt_lookup_max": 5,
  "prompt_lookup_min": 2}'
@@ -184,7 +184,7 @@ vllm bench serve \
##### Spec Bench Benchmark with Speculative Decoding

```bash
- VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
+ vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-config $'{"method": "ngram",
  "num_speculative_tokens": 5, "prompt_lookup_max": 5,
  "prompt_lookup_min": 2}'
@@ -366,7 +366,6 @@ Total num output tokens: 1280

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
- VLLM_USE_V1=1 \
vllm bench throughput \
  --dataset-name=hf \
  --dataset-path=likaixin/InstructCoder \
@@ -781,6 +780,104 @@ This should be seen as an edge case, and if this behavior can be avoided by sett

</details>

+ #### Embedding Benchmark
+
+ Benchmark the performance of embedding requests in vLLM.
+
+ <details class="admonition abstract" markdown="1">
+ <summary>Show more</summary>
+
+ ##### Text Embeddings
+
+ Unlike generative models, which use the Completions or Chat Completions API,
+ embedding models use the Embeddings API, so set `--backend openai-embeddings` and `--endpoint /v1/embeddings`.
+
+ You can use any text dataset to benchmark the model, such as ShareGPT.
+
+ Start the server:
+
+ ```bash
+ vllm serve jinaai/jina-embeddings-v3 --trust-remote-code
+ ```
+
+ Run the benchmark:
+
+ ```bash
+ # download dataset
+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+ vllm bench serve \
+   --model jinaai/jina-embeddings-v3 \
+   --backend openai-embeddings \
+   --endpoint /v1/embeddings \
+   --dataset-name sharegpt \
+   --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
+ ```
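+
+ To sanity-check the endpoint outside of the benchmark script, you can also send a single request by hand (a minimal example, assuming the server is reachable on the default port 8000):
+
+ ```bash
+ # one-off embedding request against the OpenAI-compatible endpoint
+ curl http://localhost:8000/v1/embeddings \
+   -H "Content-Type: application/json" \
+   -d '{"model": "jinaai/jina-embeddings-v3", "input": "Hello, world!"}'
+ ```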
+
+ ##### Multi-modal Embeddings
+
+ Unlike generative models, which use the Completions or Chat Completions API,
+ embedding models use the Embeddings API, so set `--endpoint /v1/embeddings`. The backend to use depends on the model:
+
+ - CLIP: `--backend openai-embeddings-clip`
+ - VLM2Vec: `--backend openai-embeddings-vlm2vec`
+
+ For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format.
+
+ You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
+ For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
+
+ Serve and benchmark CLIP:
+
+ ```bash
+ # Run this in another process
+ vllm serve openai/clip-vit-base-patch32
+
+ # Run these one by one after the server is up
+ # download dataset
+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+ vllm bench serve \
+   --model openai/clip-vit-base-patch32 \
+   --backend openai-embeddings-clip \
+   --endpoint /v1/embeddings \
+   --dataset-name sharegpt \
+   --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
+
+ vllm bench serve \
+   --model openai/clip-vit-base-patch32 \
+   --backend openai-embeddings-clip \
+   --endpoint /v1/embeddings \
+   --dataset-name hf \
+   --dataset-path lmarena-ai/VisionArena-Chat
+ ```
+
+ Serve and benchmark VLM2Vec:
+
+ ```bash
+ # Run this in another process
+ vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
+   --trust-remote-code \
+   --chat-template examples/template_vlm2vec_phi3v.jinja
+
+ # Run these one by one after the server is up
+ # download dataset
+ # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+ vllm bench serve \
+   --model TIGER-Lab/VLM2Vec-Full \
+   --backend openai-embeddings-vlm2vec \
+   --endpoint /v1/embeddings \
+   --dataset-name sharegpt \
+   --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
+
+ vllm bench serve \
+   --model TIGER-Lab/VLM2Vec-Full \
+   --backend openai-embeddings-vlm2vec \
+   --endpoint /v1/embeddings \
+   --dataset-name hf \
+   --dataset-path lmarena-ai/VisionArena-Chat
+ ```
+
+ </details>
+
[](){ #performance-benchmarks }

## Performance Benchmarks