Commit b1976c2

qiaoxj07 and Copilot authored
Add wide-ep benchmarking scripts (NVIDIA#5760)
Signed-off-by: Xianjie <[email protected]>
Signed-off-by: Xianjie Qiao <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 089fd55 commit b1976c2

File tree

13 files changed: +1010 −23 lines

File renamed without changes.
File renamed without changes.

examples/ep_load_balancer/report_load_statistics.py renamed to examples/wide_ep/ep_load_balancer/report_load_statistics.py

File renamed without changes.
File renamed without changes.
Lines changed: 129 additions & 0 deletions
# TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide-EP (wide expert parallelism) performance using the SLURM job scheduler.

## ⚠️ DISCLAIMER

**These scripts are currently not QA'ed and are provided for demonstration purposes only.**

Please note that:

- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use them at your own risk in production environments
- Always review and test the scripts thoroughly before running them in your specific environment

## Scripts Overview

### Core Scripts

1. **`submit.sh`** - Main entry point for submitting benchmark jobs
2. **`disaggr_torch.slurm`** - SLURM job script orchestrating the entire benchmark
3. **`gen_yaml.py`** - Generates configuration files for serving setup
4. **`start_server.sh`** - Starts the inference server
5. **`start_worker.sh`** - Starts the worker processes
6. **`run_benchmark.sh`** - Executes the benchmark workload
7. **`process_gen_iterlog.py`** - Processes benchmark results and generates reports

## Usage

### Prerequisites

Before running the scripts, ensure you have:

- Access to a SLURM cluster
- Container image with TensorRT-LLM installed
- Model files accessible on the cluster
- Required environment variables set

### Configuration

Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:

```bash
# In disaggr_torch.slurm
container_image=${container_image}  # Your container image
mount_dir=${mount_dir}              # Mount directory path
model_dir=${model_dir}              # Model directory path
```
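
For reference, a filled-in configuration might look like the sketch below. The image reference and paths are placeholders for illustration only; substitute values for your own cluster.

```bash
# In disaggr_torch.slurm -- illustrative placeholder values, not shipped defaults
container_image=/path/to/tensorrt_llm.sqsh   # enroot image file or registry reference
mount_dir=/lustre/project/wide_ep_bench      # shared filesystem visible to all nodes
model_dir=${mount_dir}/models/DeepSeek-R1    # model checkpoint directory
```
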
### Running Benchmarks

1. **Submit benchmark jobs**:
   ```bash
   ./submit.sh
   ```

2. **Monitor job progress**:
   ```bash
   squeue -u $USER
   ```

3. **View results**:
   Results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory.
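
Each configuration writes its logs into a subdirectory whose name encodes the parallelism mode, concurrency, EPLB slots, and MTP size (see `disaggr_torch.slurm`), so an in-flight run can be followed with something like the sketch below. The exact directory name depends on the parameters you submitted.

```bash
# Follow the benchmark log of one configuration while the job is running.
# The directory layout follows disaggr_torch.slurm; the names below are illustrative.
tail -f bm_20250703_deepseek-r1-1024-1024/dep16_concurrency1024_eplb256_mtp0/benchmark.log
```
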
## Script Details

### `submit.sh`

Main entry script that submits multiple SLURM jobs with different configurations:

- **DEP8**: 8-way parallelism for decode servers
- **DEP16**: 16-way parallelism with different EPLB slot configurations
- **DEP32**: 32-way parallelism for high-throughput scenarios

Parameters tested:

- Concurrency levels: 1x, 64x, 1024x multipliers
- EPLB slots: 0, 256, 288
- Different parallelism sizes
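
Each job submission ultimately passes fourteen positional arguments to `disaggr_torch.slurm`, in the order that script reads them. A hypothetical DEP16 submission, with illustrative values rather than the exact swept configurations, might look like this:

```bash
# Argument order expected by disaggr_torch.slurm ($1 .. $14):
#   num_ctx_servers ctx_tp_size ctx_batch_size ctx_max_num_tokens ctx_enable_attention_dp
#   num_gen_servers gen_tp_size gen_batch_size gen_max_num_tokens gen_enable_attention_dp
#   gen_gpu_memory_fraction eplb_num_slots mtp_size concurrency
sbatch disaggr_torch.slurm \
    1 4 4 4480 true \
    1 16 64 64 true \
    0.8 256 0 1024
```
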
### `disaggr_torch.slurm`

SLURM job script that:

1. Sets up container environment
2. Generates configuration files
3. Starts server and workers
4. Executes benchmarks
5. Cleans up processes

**Key parameters**:

- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests
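
The total GPU count for each phase follows directly from these parameters; the script computes it as in this excerpt (comments added here for clarity):

```bash
# Total GPUs per phase, as computed in disaggr_torch.slurm
ctx_gpus=$((num_ctx_servers * ctx_tp_size))   # GPUs serving the context (prefill) phase
gen_gpus=$((num_gen_servers * gen_tp_size))   # GPUs serving the generation (decode) phase
```
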
### `gen_yaml.py`

Generates YAML configuration files with:

- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)

**Key features**:

- Automatic node and task allocation
- Support for attention data parallelism
- MoE load balancing configuration
- Speculative decoding (MTP) support
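
The accepted flags can be read off the invocation in `disaggr_torch.slurm`; a stand-alone call with illustrative values might look roughly like this (the attention-DP switches and `--mtp_size` are only passed when enabled):

```bash
# Illustrative stand-alone invocation; flag names mirror the call in disaggr_torch.slurm.
python3 gen_yaml.py --config ./config.yaml \
    --model /path/to/DeepSeek-R1 \
    --num_ctx_servers 1 --ctx_tp_size 4 \
    --ctx_batch_size 4 --ctx_max_num_tokens 4480 \
    --num_gen_servers 1 --gen_tp_size 16 \
    --gen_batch_size 64 --gen_max_num_tokens 64 \
    --gen_gpu_memory_fraction 0.8 \
    --eplb_num_slots 256 \
    --ctx_enable_attention_dp --gen_enable_attention_dp
```
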
### `start_server.sh` & `start_worker.sh`

- **Server**: Starts the main inference server with API endpoint
- **Workers**: Starts MPI workers for distributed processing
- Support for profiling with NSight Systems
- Environment variable configuration for optimizations
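
Profiling is toggled from `disaggr_torch.slurm` through the `nsys_on` variable passed to `start_worker.sh`; it is empty (profiling disabled) by default, and the script ships a commented-out line for enabling it:

```bash
# In disaggr_torch.slurm: nsys_on is empty by default (profiling off).
nsys_on=""
# Uncommenting the line below (as shipped in the script) enables profiling;
# the value is presumably used by start_worker.sh as the profile output location.
# nsys_on=${full_logdir}
```
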
### `run_benchmark.sh`

Executes benchmarking using TensorRT-LLM's benchmark_serving tool:

- Downloads the ShareGPT dataset for realistic workloads
- Waits for server health checks (see the sketch below)
- Runs load testing with the specified concurrency
- Collects performance metrics
- Gracefully shuts down services
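
The health-check wait is a standard poll-until-ready loop. A minimal sketch, assuming the server exposes a `/health` endpoint on the hostname recorded in `config.yaml` and listens on port 8000 (both assumptions; the actual script may differ in details):

```bash
# Hypothetical readiness loop; the endpoint and port are assumptions for illustration.
until curl -sf "http://${hostname_value}:8000/health" > /dev/null; do
    echo "Waiting for the server to come up..."
    sleep 10
done
```
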
**Metrics collected**:

- Throughput (tokens/second)
- Latency (request completion time)
- Separate statistics for context-only and generation-only phases

### `process_gen_iterlog.py`

Post-processes benchmark results:

- Parses iteration logs from the workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis

Lines changed: 119 additions & 0 deletions
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --partition=${partition} # add your partition here
#SBATCH --account=${account} # add your account here
#SBATCH --time=01:00:00
#SBATCH --job-name=${job_name} # add your job name here

isl=1024
osl=1024
multi_round=1
gen_yaml_file=gen_yaml.py
container_image=${container_image} # add your container image here
mount_dir=${mount_dir} # add your mount directory here
workdir=${mount_dir}/bench-large-ep/slurm_scripts/
model_dir=${model_dir} # add your model directory here
logdir=${workdir}/bm_20250703_deepseek-r1-${isl}-${osl}/
streaming=false
mkdir -p ${logdir}

container_name=disaggr-test

num_ctx_servers=$1
ctx_tp_size=$2
ctx_batch_size=$3
ctx_max_num_tokens=$4
ctx_enable_attention_dp=$5
num_gen_servers=$6
gen_tp_size=$7
gen_batch_size=$8
gen_max_num_tokens=$9
gen_enable_attention_dp=${10}
gen_gpu_memory_fraction=${11}
eplb_num_slots=${12}
mtp_size=${13}
concurrency=${14}

sub_dir=${logdir}/dep${gen_tp_size}_concurrency${concurrency}_eplb${eplb_num_slots}_mtp${mtp_size}

ctx_gpus=$((num_ctx_servers * ctx_tp_size))
gen_gpus=$((num_gen_servers * gen_tp_size))

echo "enable_attention_dp: ${ctx_enable_attention_dp}, ${gen_enable_attention_dp}, gpu_memory_fraction: ${gen_gpu_memory_fraction}"

enable_pdl=false
if [ "${gen_enable_attention_dp}" = "false" ]; then
    enable_pdl=true
    echo "enable_pdl: ${enable_pdl}"
    sub_dir=${logdir}/tep${gen_tp_size}_concurrency${concurrency}_eplb${eplb_num_slots}_mtp${mtp_size}
fi

full_logdir=${sub_dir}
mkdir -p ${full_logdir}

# start the container
srun -l --container-image=${container_image} \
    --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix \
    echo "Container up."

# generate the yaml file
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    python3 ${workdir}/${gen_yaml_file} --config ${full_logdir}/config.yaml \
        --model ${model_dir} \
        --num_ctx_servers ${num_ctx_servers} \
        --ctx_tp_size ${ctx_tp_size} \
        --ctx_batch_size ${ctx_batch_size} \
        --ctx_max_num_tokens ${ctx_max_num_tokens} \
        --num_gen_servers ${num_gen_servers} \
        --gen_tp_size ${gen_tp_size} \
        --gen_batch_size ${gen_batch_size} \
        --gen_max_num_tokens ${gen_max_num_tokens} \
        --gen_gpu_memory_fraction ${gen_gpu_memory_fraction} \
        --eplb_num_slots ${eplb_num_slots} \
        $(if [ "${gen_enable_attention_dp}" = "true" ]; then echo "--gen_enable_attention_dp"; fi) \
        $(if [ "${ctx_enable_attention_dp}" = "true" ]; then echo "--ctx_enable_attention_dp"; fi) \
        $(if [ "${mtp_size}" -gt 0 ]; then echo "--mtp_size ${mtp_size}"; fi)

echo "YAML file generated."

hostname_value=$(grep '^hostname:' ${full_logdir}/config.yaml | awk -F': ' '{print $2}')
echo "server host name: $hostname_value"

# try to kill the server and workers
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    pkill -f "trtllm-serve" || true

nsys_on=""
# nsys_on=${full_logdir}

# start the workers
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    bash ${workdir}/start_worker.sh ${full_logdir}/config.yaml "${concurrency}" "${enable_pdl}" ${ctx_gpus} ${nsys_on} &> ${full_logdir}/output_workers.log &
# start the server
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap -N 1 -n 1 \
    -w ${hostname_value} \
    bash ${workdir}/start_server.sh ${full_logdir}/config.yaml &> ${full_logdir}/output_server.log &
# start benchmarking
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap -N 1 -n 1 \
    bash ${workdir}/run_benchmark.sh ${isl} ${osl} ${multi_round} ${model_dir} "${concurrency}" ${streaming} ${full_logdir}/ > ${full_logdir}/benchmark.log 2>&1

# try to kill the server and workers
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
wait

0 commit comments
