README.md: 16 additions & 24 deletions

@@ -49,6 +49,17 @@ It can be used to benchmark any text generation server that exposes an OpenAI-co…

## Get started

### Install

If you have [cargo](https://rustup.rs/) already installed:
```bash
cargo install --git https://github.com/huggingface/inference-benchmarker/
```

Or download the [latest released binary](https://github.com/huggingface/inference-benchmarker/releases/latest).

Or run the Docker image (see the sketch below).

### Run a benchmark

#### 1. Start an inference server
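
The Install text above mentions the Docker image only in passing. A minimal sketch of that route, assuming the published `ghcr.io/huggingface/inference-benchmarker:latest` image, host networking so the container can reach a local server, and the current directory mounted for results (it mirrors the invocation this PR removes in the hunks below):

```bash
# HF_TOKEN is assumed to be set in the calling shell (needed for gated models/datasets)
docker run --rm -it \
  --net host \
  -v "$(pwd)":/opt/inference-benchmarker/results \
  -e "HF_TOKEN=$HF_TOKEN" \
  ghcr.io/huggingface/inference-benchmarker:latest \
  inference-benchmarker \
    --tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
    --url http://localhost:8080 \
    --profile chat
```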
@@ -76,22 +87,12 @@ docker run --runtime nvidia --gpus all \
--model $MODEL
```

-#### 2. Run a benchmark using Docker image
+#### 2. Run a benchmark

```shell
-MODEL=meta-llama/Llama-3.1-8B-Instruct
-HF_TOKEN=<your HF READ token>
# run a benchmark to evaluate the performance of the model for chat use case
-# we mount results to the current directory
-$ docker run \
-  --rm \
-  -it \
-  --net host \
-  -v $(pwd):/opt/inference-benchmarker/results \
-  -e "HF_TOKEN=$HF_TOKEN" \
-  ghcr.io/huggingface/inference-benchmarker:latest \
-  inference-benchmarker \
-  --tokenizer-name "$MODEL" \
+inference-benchmarker \
+  --tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
--url http://localhost:8080 \
--profile chat
```
@@ -132,16 +133,7 @@ Available modes:
Example running a benchmark at a fixed request rate:

```shell
-MODEL=meta-llama/Llama-3.1-8B-Instruct
-HF_TOKEN=<your HF READ token>
-$ docker run \
-  --rm \
-  -it \
-  --net host \
-  -v $(pwd):/opt/inference-benchmarker/results \
-  -e "HF_TOKEN=$HF_TOKEN" \
-  ghcr.io/huggingface/inference-benchmarker:latest \
-  inference-benchmarker \
+inference-benchmarker \
--tokenizer-name "meta-llama/Llama-3.1-8B-Instruct" \
--max-vus 800 \
--duration 120s \
src/main.rs: 1 addition & 1 deletion

@@ -197,7 +197,7 @@ async fn main() {
let stop_sender_clone = stop_sender.clone();
// get HF token
let token_env_key = "HF_TOKEN".to_string();
-let cache = hf_hub::Cache::default();
+let cache = hf_hub::Cache::from_env();
let hf_token = match std::env::var(token_env_key).ok() {
Some(token) => Some(token),
None => cache.token(),
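
Put together with the surrounding lines, the token lookup is a small self-contained pattern. A minimal sketch, assuming the `hf_hub` crate, where `Cache::from_env()` is expected to respect the ambient Hub configuration (for example a custom `HF_HOME`) rather than only the hard-coded default cache path:

```rust
use hf_hub::Cache;

/// Prefer an explicit HF_TOKEN environment variable, otherwise fall back to
/// whatever token is already stored in the local Hub cache.
fn resolve_hf_token() -> Option<String> {
    let cache = Cache::from_env();
    match std::env::var("HF_TOKEN").ok() {
        Some(token) => Some(token),
        None => cache.token(),
    }
}
```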
src/requests.rs: 1 addition & 1 deletion

@@ -547,7 +547,7 @@ impl ConversationTextRequestGenerator {
filename: String,
hf_token: Option<String>,
) -> anyhow::Result<PathBuf> {
-let api = ApiBuilder::new().with_token(hf_token).build()?;
+let api = ApiBuilder::from_env().with_token(hf_token).build()?;
let repo = api.dataset(repo_name);
let dataset = repo.get(&filename)?;
Ok(dataset)
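
The same builder change in isolation, as a minimal sketch assuming the sync `hf_hub` API; the dataset repository and filename below are placeholders for illustration, not the ones the project actually uses. `ApiBuilder::from_env()` is expected to pick up Hub-related settings from the environment, with an explicit token still taking precedence via `with_token`:

```rust
use hf_hub::api::sync::ApiBuilder;

fn main() -> anyhow::Result<()> {
    // Token resolution as in src/main.rs: an explicit HF_TOKEN wins, otherwise None.
    let hf_token = std::env::var("HF_TOKEN").ok();

    // Build the Hub client from the environment and download one dataset file.
    let api = ApiBuilder::from_env().with_token(hf_token).build()?;
    let repo = api.dataset("some-org/some-dataset".to_string()); // placeholder repo
    let local_path = repo.get("data.json")?; // placeholder filename
    println!("downloaded to {}", local_path.display());
    Ok(())
}
```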