[TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf #12337
Status: Open — reasonsolo wants to merge 7 commits into NVIDIA:main from reasonsolo:lizhiz/disagg-kvcache-router-fix
Diff stats: +313 −9
Commits (7):
- 8f43d77 — [None][fix] Fix KvCacheAwareRouter tokenization and add aiperf/router… (reasonsolo)
- a6fe38c — [None][fix] Add run_benchmark_aiperf.sh for aiperf-based disagg bench… (reasonsolo)
- 3c241a7 — [None][fix] Set token IDs on request after router tokenization to avo… (reasonsolo)
- 3e42cd2 — [None][test] Add multi-turn conversation routing test for KvCacheAwar… (reasonsolo)
- 32f7983 — [None][test] Parameterize multi-turn router test for completion and c… (reasonsolo)
- 47c2d0e — [None][feat] Add TRTLLM_KVCACHE_AWARE_ROUTER_HASH_TOKENS_PER_BLOCK en… (reasonsolo)
- 2d0e8d1 — [None][chore] Fix yapf formatting in router and benchmark files (reasonsolo)
New file: examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh (+102 lines)
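The script below takes ten positional arguments, so it relies on a POSIX detail worth noting: positional parameters past $9 must be written with braces, as in ${10}. A minimal standalone illustration:

```shell
#!/bin/sh
# Demonstrates why argument 10 must be written ${10}:
# "$10" parses as "$1" followed by a literal "0".
set -- a b c d e f g h i j
echo "unbraced: $10"    # expands $1 then appends "0"
echo "braced: ${10}"    # the actual tenth positional parameter
```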
```bash
#!/bin/bash

# aiperf-based benchmark script for disaggregated serving
# Args: model_name dataset_file multi_round num_gen_servers concurrency_list streaming log_path hostname port ucx_warmup_requests

set -e
set -u
trap 'echo "Error occurred at line $LINENO"; exit 1' ERR

if [ "$#" -lt 10 ]; then
    echo "Error: Missing required arguments, got $# arguments, args: $@"
    echo "Usage: $0 model_name dataset_file multi_round num_gen_servers concurrency_list streaming log_path hostname port ucx_warmup_requests"
    exit 1
fi

model_name=$1
dataset_file=$2
multi_round=$3
num_gen_servers=$4
concurrency_list=$5
streaming=$6
log_path=$7
hostname=$8
port=$9
ucx_warmup_requests=${10}

# Only the rank-0 SLURM process runs the load generator; all other ranks exit
if [[ ${SLURM_PROCID} != "0" ]]; then
    echo "Process id is ${SLURM_PROCID} for loadgen, exiting"
    exit 0
fi

# Always install/upgrade aiperf to ensure we have the version with the trust_remote_code fix
# (the container may ship an older version whose parallel_decode.py lacks trust_remote_code)
echo "Installing aiperf..."
pip install --force-reinstall --no-deps 'aiperf @ git+https://github.com/ai-dynamo/aiperf.git@ac3d91652e5e024bfb4ac38d48603423aad666bc'

# Warmup requests to establish UCX connections before the measured run
if [ "${ucx_warmup_requests}" -gt 0 ]; then
    echo "Warming up UCX connections with ${ucx_warmup_requests} small requests..."
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model_name} \
        --dataset-name random \
        --random-ids \
        --random-input-len 100 \
        --random-output-len 10 \
        --num-prompts ${ucx_warmup_requests} \
        --host ${hostname} \
        --port ${port} \
        --ignore-eos \
        --trust-remote-code \
        --non-streaming
    echo "UCX warmup done"
fi

# Trust remote code globally for custom tokenizers in parallel workers
export HF_HUB_TRUST_REMOTE_CODE=1

echo "Hostname: ${hostname}, Port: ${port}"
echo "Starting aiperf benchmark..."

concurrency_list=$(echo "${concurrency_list}" | tr ',' ' ')
for concurrency in ${concurrency_list}; do
    concurrency=$((concurrency))
    request_count=$((concurrency * multi_round))
    # benchmark_duration: 20 min (1200 s) per round
    benchmark_duration=$((multi_round * 1200))
    echo "Benchmarking with concurrency ${concurrency} ... ${request_count} requests, duration ${benchmark_duration}s"
    mkdir -p ${log_path}/concurrency_${concurrency}

    aiperf profile \
        -m ${model_name} \
        --tokenizer ${model_name} \
        --tokenizer-trust-remote-code \
        --url http://${hostname}:${port} \
        --streaming \
        --ui simple \
        --input-file ${dataset_file} \
        --artifact-dir ${log_path}/concurrency_${concurrency} \
        --concurrency ${concurrency} \
        --concurrency-ramp-duration 60 \
        --custom-dataset-type mooncake_trace \
        --benchmark-duration ${benchmark_duration} \
        --benchmark-grace-period 60 \
        --workers-max 200 \
        --request-timeout-seconds 1200 \
        --profile-export-level records \
        --extra-inputs ignore_eos:true \
        --request-count ${request_count} \
        --record-processors 8

    echo "Benchmark with concurrency ${concurrency} done"
done

# Fetch perf metrics from the disagg server
echo "Fetching perf metrics from http://${hostname}:${port}/perf_metrics ..."
curl -s "http://${hostname}:${port}/perf_metrics" > ${log_path}/perf_metrics.json 2>&1 || true
if [ -s "${log_path}/perf_metrics.json" ]; then
    echo "Perf metrics saved to ${log_path}/perf_metrics.json"
else
    echo "Warning: perf_metrics response was empty or endpoint not available"
fi
```
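The benchmark loop sizes each run from the concurrency and multi_round arguments: the request count scales with both, while the time budget depends only on the number of rounds. The arithmetic in isolation, with illustrative values:

```shell
#!/bin/sh
# Standalone sketch of the benchmark-sizing arithmetic used in the loop above.
# multi_round=3 and the concurrency values are illustrative only.
multi_round=3
for concurrency in 4 8; do
    request_count=$((concurrency * multi_round))   # one request per round per in-flight slot
    benchmark_duration=$((multi_round * 1200))     # 20 minutes (1200 s) per round
    echo "concurrency=${concurrency} requests=${request_count} duration=${benchmark_duration}s"
done
```

Note that benchmark_duration is a ceiling, not a target: aiperf stops at --request-count if the requests finish early, so higher concurrencies share the same time budget.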