
Conversation

@parfeniukink (Contributor) commented Feb 25, 2025

Set up the environment

  1. Run the model via vLLM or llama.cpp (a server-launch sketch follows this list)
  2. Execute the guidellm command
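
For step 1, either server works as long as it exposes an OpenAI-compatible API on the target port. A minimal sketch, assuming default builds; binary names, model paths, and flags are illustrative and may differ by version:

  # vLLM: serves an OpenAI-compatible API on the given port
  vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 --port 8080

  # llama.cpp: llama-server exposes the same /v1 endpoints for a local GGUF file
  ./llama-server -m Phi-3-mini-4k-instruct-q4.gguf --port 8080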

Command

guidellm --target "http://localhost:8080/v1" --model "Phi-3-mini-4k-instruct-q4.gguf" --tokenizer "hf-internal-testing/llama-tokenizer" --data-type emulated --data "prompt_tokens=128,generated_tokens=128" --rate-type constant --rate 8 --max-seconds 100 --batch-size 2
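
Here --data-type emulated synthesizes prompts of the requested token lengths, --rate-type constant --rate 8 holds a fixed request rate, --max-seconds 100 caps the run at 100 seconds, and --batch-size 2 is the parameter this PR introduces. Before benchmarking, the target can be sanity-checked with a plain OpenAI-compatible call (the /v1/models endpoint is standard for both servers):

  curl http://localhost:8080/v1/models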

Output

  Generating report... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (1/1) [ 0:01:40 < 0:00:00 ]
╭─ GuideLLM Benchmarks Report (stdout) ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ╭─ Benchmark Report 1 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│ │ Backend(type=openai_server, target=http://localhost:8080/v1, model=Phi-3-mini-4k-instruct-q4.gguf)                                                                                     │ │
│ │ Data(type=emulated, source=prompt_tokens=128,generated_tokens=128, tokenizer=hf-internal-testing/llama-tokenizer)                                                                      │ │
│ │ Rate(type=constant, rate=(8.0,))                                                                                                                                                       │ │
│ │ Limits(max_number=None requests, max_duration=100 sec)                                                                                                                                 │ │
│ │                                                                                                                                                                                        │ │
│ │                                                                                                                                                                                        │ │
│ │ Requests Data by Benchmark                                                                                                                                                             │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓                                                                                │ │
│ │ ┃ Benchmark                 ┃ Requests Completed ┃ Request Failed ┃ Duration  ┃ Start Time ┃ End Time ┃                                                                                │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩                                                                                │ │
│ │ │ constant@8.00 req/sec     │ 12/12              │ 0/12           │ 90.03 sec │ 21:46:41   │ 21:48:11 │                                                                                │ │
│ │ └───────────────────────────┴────────────────────┴────────────────┴───────────┴────────────┴──────────┘                                                                                │ │
│ │                                                                                                                                                                                        │ │
│ │ Tokens Data by Benchmark                                                                                                                                                               │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                                                  │ │
│ │ ┃ Benchmark                 ┃ Prompt ┃ Prompt (1%, 5%, 50%, 95%, 99%)    ┃ Output ┃ Output (1%, 5%, 50%, 95%, 99%)  ┃                                                                  │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩                                                                  │ │
│ │ │ constant@8.00 req/sec     │ 128.25 │ 128.0, 128.0, 128.0, 129.0, 129.0 │ 117.42 │ 56.3, 65.5, 128.0, 128.0, 128.0 │                                                                  │ │
│ │ └───────────────────────────┴────────┴───────────────────────────────────┴────────┴─────────────────────────────────┘                                                                  │ │
│ │                                                                                                                                                                                        │ │
│ │ Performance Stats by Benchmark                                                                                                                                                         │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
│ │ ┃                           ┃ Request Latency [1%, 5%, 10%, 50%, 90%, 95%,     ┃ Time to First Token [1%, 5%, 10%, 50%, 90%, 95%, ┃ Inter Token Latency [1%, 5%, 10%, 50%, 90%, 95%, ┃ │ │
│ │ ┃ Benchmark                 ┃ 99%] (sec)                                       ┃ 99%] (ms)                                        ┃ 99%] (ms)                                        ┃ │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
│ │ │ constant@8.00 req/sec     │ 7.76, 7.83, 7.91, 10.56, 16.01, 16.60, 17.16     │ 828.5, 830.6, 833.0, 4789.5, 9004.5, 9510.3,     │ 49.7, 51.1, 51.8, 55.0, 66.4, 70.4, 75.6         │ │ │
│ │ │                           │                                                  │ 9994.8                                           │                                                  │ │ │
│ │ └───────────────────────────┴──────────────────────────────────────────────────┴──────────────────────────────────────────────────┴──────────────────────────────────────────────────┘ │ │
│ │                                                                                                                                                                                        │ │
│ │ Performance Summary by Benchmark                                                                                                                                                       │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓                                            │ │
│ │ ┃ Benchmark                 ┃ Requests per Second ┃ Request Latency ┃ Time to First Token ┃ Inter Token Latency ┃ Output Token Throughput ┃                                            │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩                                            │ │
│ │ │ constant@8.00 req/sec     │ 0.13 req/sec        │ 11.60 sec       │ 4941.58 ms          │ 57.20 ms            │ 15.65 tokens/sec        │                                            │ │
│ │ └───────────────────────────┴─────────────────────┴─────────────────┴─────────────────────┴─────────────────────┴─────────────────────────┘                                            │ │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
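
As a quick cross-check (back-of-the-envelope, not part of the report), the summary figures follow from the tables above:

  12 requests / 90.03 sec ≈ 0.13 requests per second
  12 requests x 117.42 mean output tokens ≈ 1409 tokens; 1409 tokens / 90.03 sec ≈ 15.65 tokens/sec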

@parfeniukink parfeniukink self-assigned this Feb 25, 2025
@parfeniukink parfeniukink marked this pull request as draft February 25, 2025 20:01
@parfeniukink parfeniukink removed the request for review from markurtz February 25, 2025 20:01
@markurtz (Collaborator) commented:

Closing this out, as all this will do is run a set number of requests equal to the batch size in parallel. To add true batch support, we'll either need to run vLLM locally or go through the OpenAI batch processing API, which is a significant expansion in scope and work.
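
For context, the closed-out behavior amounts to concurrent dispatch rather than server-side batching. A rough sketch of what that means (hypothetical and simplified, not guidellm's actual code; the httpx client, endpoint, and payload fields are assumptions):

  import asyncio

  import httpx

  async def send_request(client: httpx.AsyncClient, prompt: str) -> dict:
      # One plain OpenAI-compatible completion request; payload fields are illustrative.
      resp = await client.post(
          "http://localhost:8080/v1/completions",
          json={"model": "Phi-3-mini-4k-instruct-q4.gguf",
                "prompt": prompt, "max_tokens": 128},
      )
      return resp.json()

  async def run_batch(prompts: list[str], batch_size: int = 2) -> list[dict]:
      # "Batching" here is just firing batch_size requests concurrently; the
      # server still sees batch_size independent requests, which is why true
      # batch support needs local vLLM or the OpenAI batch API instead.
      results: list[dict] = []
      async with httpx.AsyncClient(timeout=None) as client:
          for i in range(0, len(prompts), batch_size):
              chunk = prompts[i : i + batch_size]
              results += await asyncio.gather(*(send_request(client, p) for p in chunk))
      return results

  # Assumes a server is already running on localhost:8080.
  asyncio.run(run_batch(["Hello world"] * 4, batch_size=2))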

@markurtz markurtz closed this Mar 10, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in GuideLLM Kanban Board Mar 10, 2025
@markurtz markurtz deleted the parfeniukink/batch-size-cli-parameter branch April 21, 2025 15:02