Commit 4545700

[None][chore] Move submit.sh to python and use yaml configuration (#8003)
Signed-off-by: Zero Zeng <[email protected]>
1 parent 87eb508 commit 4545700

10 files changed (+538, -988 lines)
# Disaggregated Inference Benchmark Scripts

This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM. The benchmark system uses Python for orchestration and YAML for configuration.

## Overview

The benchmarking process is orchestrated through a combination of Python scripts and YAML configuration:

1. `config.yaml`: The main configuration file that defines all benchmark parameters, including SLURM settings, hardware configuration, worker settings, and benchmark modes.
2. `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment based on the YAML configuration.
3. Python scripts for configuration and execution (see the sketch after this list):
   - Worker configuration generation
   - Server configuration generation
   - Benchmark execution and metrics collection
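
For orientation, here is a minimal sketch of how a Python submitter can read `config.yaml` and hand the job to SLURM. It is illustrative only: the file name, the `submit()` helper, and the exact key lookups are assumptions based on the configuration documented below, not the repository's actual entry point.

```python
# submit_sketch.py -- illustrative only; not the repository's actual submitter.
import subprocess

import yaml  # PyYAML


def submit(config_path: str = "config.yaml") -> None:
    """Read the YAML configuration and submit the SLURM script via sbatch."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    slurm = cfg["slurm"]  # section layout follows the config shown below
    cmd = [
        "sbatch",
        f"--partition={slurm['partition']}",
        f"--account={slurm['account']}",
        f"--time={slurm['job_time']}",
        f"--job-name={slurm['job_name']}",
        slurm["script_file"],  # e.g. disaggr_torch.slurm
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    submit()
```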

## Configuration (config.yaml)

The benchmark is configured through a YAML file with the following sections:

### 1. SLURM Configuration
```yaml
slurm:
  script_file: "disaggr_torch.slurm"
  partition: "<partition>"
  account: "<account>"
  job_time: "02:00:00"
  job_name: "<job_name>"
  numa_bind: true
```

### 2. Benchmark Mode
```yaml
benchmark:
  mode: "e2e"  # Options: e2e, gen_only
  use_nv_sa_benchmark: false
  multi_round: 8
  benchmark_ratio: 0.8
  streaming: true
```
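
The benchmark client runs once per concurrency level and repeats each level for `multi_round` rounds. A sketch of that outer loop (the `run_benchmark_once` stub is hypothetical, and so is the loop nesting; `concurrency_list` itself lives in the worker configuration shown later):

```python
# concurrency_loop_sketch.py -- one benchmark pass per concurrency level.
def run_benchmark_once(concurrency: int, round_idx: int) -> None:
    # Stub standing in for the client invocation that issues
    # `concurrency` parallel requests against the server.
    print(f"round {round_idx}: benchmarking at concurrency {concurrency}")


def run_all(concurrency_list: str, multi_round: int) -> None:
    # concurrency_list is a space-separated string, e.g. "16 32 64".
    for concurrency in (int(c) for c in concurrency_list.split()):
        for round_idx in range(multi_round):
            run_benchmark_once(concurrency=concurrency, round_idx=round_idx)


run_all("16", multi_round=8)  # values from the examples in this README
```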

### 3. Hardware Configuration
```yaml
hardware:
  gpus_per_node: 4
  num_ctx_servers: 1
  num_gen_servers: 1
```

### 4. Sequence Configuration
```yaml
sequence:
  input_length: 1024
  output_length: 1024
```

### 5. Environment Configuration
```yaml
environment:
  container_mount: "<container_mount>"  # Format: path1:path1,path2:path2
  container_image: "<container_image>"
  model_path: "<model_path>"
  trtllm_repo: "<trtllm_repo>"
  build_wheel: false
  dataset_file: "<dataset_file>"
  work_dir: "<full_path_to_work_dir>"
```
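
Since `container_mount` packs several mounts into one comma-separated string, a small pre-flight check can catch malformed entries before a job is queued. Below is a sketch under the `src:dst` format assumption from the comment above; the helper name is hypothetical.

```python
# check_mounts_sketch.py -- hypothetical pre-flight check for container_mount.
import os


def parse_container_mount(spec: str) -> list[tuple[str, str]]:
    """Split "src1:dst1,src2:dst2" into (src, dst) pairs and verify sources exist."""
    pairs = []
    for entry in spec.split(","):
        src, sep, dst = entry.partition(":")
        if not sep or not src or not dst:
            raise ValueError(f"malformed mount entry {entry!r}; expected src:dst")
        if not os.path.exists(src):
            raise FileNotFoundError(f"mount source does not exist: {src}")
        pairs.append((src, dst))
    return pairs
```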

### 6. Worker Configuration
The worker configuration section defines detailed settings for both context and generation workers:

```yaml
worker_config:
  concurrency_list: "16"
  eplb_num_slots: 0
  mtp_size: 0
  gen:
    tensor_parallel_size: 16
    pipeline_parallel_size: 1
    max_batch_size: 64
    max_num_tokens: 64
    enable_attention_dp: true
    # Additional generation worker settings...
  ctx:
    tensor_parallel_size: 4
    pipeline_parallel_size: 1
    max_batch_size: 4
    max_num_tokens: 4608
    enable_attention_dp: true
    # Additional context worker settings...
```
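
The earlier shell-based submitter derived the required node count from the parallel sizes and server counts; assuming the Python orchestration sizes jobs the same way, the arithmetic looks roughly like this (the helper is hypothetical):

```python
# node_count_sketch.py -- assumed sizing rule: tp * pp GPUs per server,
# rounded up to whole nodes, summed across server types.
import math


def nodes_needed(tp: int, pp: int, num_servers: int, gpus_per_node: int) -> int:
    return num_servers * math.ceil(tp * pp / gpus_per_node)


# With the example values above on 4-GPU nodes:
# ctx (tp=4, pp=1) -> 1 node; gen (tp=16, pp=1) -> 4 nodes; 5 nodes total.
total = nodes_needed(4, 1, 1, 4) + nodes_needed(16, 1, 1, 4)
print(total)  # 5
```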

## Running the Benchmark

Benchmark configuration is defined in `config.yaml`; execution is handled by the Python scripts and the SLURM script.

### Step 1: Configure the Benchmark

Edit `config.yaml` to set up your benchmark parameters. The configuration is organized into logical sections (a validation sketch follows the list):

1. SLURM settings (partition, account, time limits)
2. Hardware configuration (GPUs, server counts)
3. Benchmark parameters (mode, sequence lengths, streaming)
4. Environment settings (container, model paths)
5. Worker configurations (parallelism, batch sizes, memory settings)
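
The SLURM script validates this file when the job starts, so checking the section names up front can save a failed submission. A minimal sketch assuming the section layout documented above (the scripts' real validation may check far more):

```python
# validate_config_sketch.py -- minimal top-level check; illustrative only.
import yaml  # PyYAML

REQUIRED_SECTIONS = (
    "slurm", "benchmark", "hardware", "sequence", "environment", "worker_config",
)


def validate(config_path: str = "config.yaml") -> dict:
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    missing = [name for name in REQUIRED_SECTIONS if name not in cfg]
    if missing:
        raise KeyError(f"config.yaml is missing sections: {missing}")
    return cfg
```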

### Step 2: Launch the Benchmark

Launch the benchmark through SLURM:

```bash
sbatch disaggr_torch.slurm
```

The SLURM script will:
1. Read and validate the YAML configuration
2. Set up the container environment
3. Configure and start the workers and servers (readiness polling is sketched after this list)
4. Execute the benchmark
5. Collect and store metrics
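
Between steps 3 and 4, the benchmark waits until the disaggregated server reports healthy before sending traffic. A sketch of that readiness loop, assuming an HTTP health endpoint (the URL, port, and timeout here are placeholders, not the scripts' actual values):

```python
# wait_for_server_sketch.py -- poll a health endpoint until the server is up.
import time
import urllib.error
import urllib.request


def wait_healthy(url: str = "http://localhost:8000/health",
                 timeout_s: int = 1800, interval_s: int = 10) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; keep polling
        time.sleep(interval_s)
    raise TimeoutError(f"server at {url} never became healthy")
```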

### Benchmark Modes

The system supports two primary benchmark modes:

1. **End-to-End (e2e)**: Tests the complete pipeline, including both context and generation phases
2. **Generation Only (gen_only)**: Tests just the generation phase

Configure the mode in the YAML file:
```yaml
benchmark:
  mode: "e2e"  # or "gen_only"
```

### Metrics Collection

The benchmark system collects several performance metrics (derivations are sketched after the list):

- TTFT (Time to First Token)
- TPOT (Time Per Output Token)
- ITL (Inter-Token Latency)
- E2EL (End-to-End Latency)
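
These follow the usual definitions for streaming LLM benchmarks. The sketch below shows how they are typically derived from per-token arrival times; it is a reference computation, not the benchmark client's actual code.

```python
# metrics_sketch.py -- reference formulas for TTFT/TPOT/ITL/E2EL.
def summarize(request_start: float, token_times: list[float]) -> dict:
    """token_times: absolute arrival time of each generated token, in seconds."""
    ttft = token_times[0] - request_start   # Time to First Token
    e2el = token_times[-1] - request_start  # End-to-End Latency
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # Inter-Token Latency
    # Time Per Output Token: average decode latency, excluding the first token.
    tpot = (e2el - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}
```
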
Metrics are automatically collected and stored in the work directory specified in the configuration.

### Advanced Features

1. **NVIDIA SA Benchmark Integration**
   ```yaml
   benchmark:
     use_nv_sa_benchmark: true
   ```

2. **Profiling Support**
   ```yaml
   profiling:
     nsys_on: true
   ```

3. **Custom Worker Settings**
   The worker configuration section allows detailed customization of both context and generation workers, including:
   - Tensor and pipeline parallelism
   - Batch sizes and token limits
   - Memory management
   - Cache configuration
   - MoE settings (if applicable)

4. **Container and Build Options**
   ```yaml
   environment:
     build_wheel: true  # Build TensorRT-LLM from source
     container_mount: "path1:path1,path2:path2"
   ```

### Output and Logs

Benchmark results and logs are stored in the specified work directory, including:
- Performance metrics
- Worker and server logs
- Profiling data (if enabled)
- Error logs and diagnostics

The system automatically organizes outputs by benchmark run and configuration.
