@@ -49,6 +49,17 @@ It can be used to benchmark any text generation server that exposes an OpenAI-co
 
 ## Get started
 
+### Install
+
+If you have [cargo](https://rustup.rs/) already installed:
+```bash
+cargo install --git https://github.com/huggingface/inference-benchmarker/
+```
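+
+To check that the install worked (assuming cargo's install directory, `~/.cargo/bin` by default, is on your `PATH`):
+```bash
+inference-benchmarker --help
+```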
+
+Or download the [latest released binary](https://github.com/huggingface/inference-benchmarker/releases/latest).
+
+Or run the prebuilt Docker image; a minimal sketch follows below.
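+
+A minimal sketch, reusing the `ghcr.io/huggingface/inference-benchmarker:latest` image and the flags from the benchmark step below; results are mounted into the current directory:
+```bash
+# HF_TOKEN is only needed for gated models such as the Llama tokenizers
+docker run \
+    --rm \
+    -it \
+    --net host \
+    -v $(pwd):/opt/inference-benchmarker/results \
+    -e "HF_TOKEN=$HF_TOKEN" \
+    ghcr.io/huggingface/inference-benchmarker:latest \
+    inference-benchmarker \
+    --tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
+    --url http://localhost:8080
+```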
+
 ### Run a benchmark
 
 #### 1. Start an inference server
@@ -76,24 +87,13 @@ docker run --runtime nvidia --gpus all \
     --model $MODEL
 ```
 
-#### 2. Run a benchmark using Docker image
+#### 2. Run a benchmark
 
 ```shell
-MODEL=meta-llama/Llama-3.1-8B-Instruct
-HF_TOKEN=<your HF READ token>
-# run a benchmark to evaluate the performance of the model for the chat use case
-# we mount results to the current directory
-$ docker run \
-    --rm \
-    -it \
-    --net host \
-    -v $(pwd):/opt/inference-benchmarker/results \
-    -e "HF_TOKEN=$HF_TOKEN" \
-    ghcr.io/huggingface/inference-benchmarker:latest \
-    inference-benchmarker \
-    --tokenizer-name "$MODEL" \
-    --url http://localhost:8080 \
-    --profile chat
+inference-benchmarker \
+    --tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
+    --url http://localhost:8080
 ```
 
 Results will be saved in JSON format in the current directory.
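+
+To skim a results file, one option is to pretty-print it with `jq` (the file name below is a placeholder; use whatever file your run writes):
+```bash
+jq . results.json
+```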
@@ -132,16 +132,7 @@ Available modes:
 Example running a benchmark at a fixed request rate:
 
 ```shell
-MODEL=meta-llama/Llama-3.1-8B-Instruct
-HF_TOKEN=<your HF READ token>
-$ docker run \
-    --rm \
-    -it \
-    --net host \
-    -v $(pwd):/opt/inference-benchmarker/results \
-    -e "HF_TOKEN=$HF_TOKEN" \
-    ghcr.io/huggingface/inference-benchmarker:latest \
-    inference-benchmarker \
+inference-benchmarker \
     --tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
     --max-vus 800 \
     --duration 120s \