# Disaggregated Inference Benchmark Scripts

This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM. The benchmark system uses Python for orchestration and YAML for configuration.
## Overview

The benchmarking process is orchestrated through a combination of Python scripts and YAML configuration:
1. `config.yaml`: The main configuration file that defines all benchmark parameters, including SLURM settings, hardware configuration, worker settings, and benchmark modes.
2. `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment based on the YAML configuration.
3. Python scripts for configuration and execution:
   - Worker configuration generation
   - Server configuration generation
   - Benchmark execution and metrics collection
## Configuration (config.yaml)

The benchmark is configured through a YAML file with the following sections:
### 1. SLURM Configuration

```yaml
slurm:
  script_file: "disaggr_torch.slurm"
  partition: "<partition>"
  account: "<account>"
  job_time: "02:00:00"
  job_name: "<job_name>"
  numa_bind: true
```
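For orientation, the mapping from the `slurm:` section onto an `sbatch` invocation can be sketched as follows. This is an illustration only: the sample partition/account values and the exact flags that `disaggr_torch.slurm` receives are assumptions, not taken from the scripts.

```python
import shlex

def build_sbatch_command(slurm_cfg: dict) -> str:
    # Map config.yaml's slurm section onto standard sbatch flags.
    args = [
        "sbatch",
        f"--partition={slurm_cfg['partition']}",
        f"--account={slurm_cfg['account']}",
        f"--time={slurm_cfg['job_time']}",
        f"--job-name={slurm_cfg['job_name']}",
        slurm_cfg["script_file"],
    ]
    return shlex.join(args)

cfg = {"script_file": "disaggr_torch.slurm", "partition": "batch",
       "account": "example_acct", "job_time": "02:00:00", "job_name": "disagg"}
print(build_sbatch_command(cfg))
# sbatch --partition=batch --account=example_acct --time=02:00:00 --job-name=disagg disaggr_torch.slurm
```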
### 2. Benchmark Mode

```yaml
benchmark:
  mode: "e2e"  # Options: e2e, gen_only
  use_nv_sa_benchmark: false
  multi_round: 8
  benchmark_ratio: 0.8
  streaming: true
```
### 3. Hardware Configuration

```yaml
hardware:
  gpus_per_node: 4
  num_ctx_servers: 1
  num_gen_servers: 1
```
### 4. Sequence Configuration

```yaml
sequence:
  input_length: 1024
  output_length: 1024
```
### 5. Environment Configuration

```yaml
environment:
  container_mount: "<container_mount>"  # Format: path1:path1,path2:path2
  container_image: "<container_image>"
  model_path: "<model_path>"
  trtllm_repo: "<trtllm_repo>"
  build_wheel: false
  dataset_file: "<dataset_file>"
  work_dir: "<full_path_to_work_dir>"
```
### 6. Worker Configuration

The worker configuration section defines detailed settings for both context and generation workers:

```yaml
worker_config:
  concurrency_list: "16"
  eplb_num_slots: 0
  mtp_size: 0
  gen:
    tensor_parallel_size: 16
    pipeline_parallel_size: 1
    max_batch_size: 64
    max_num_tokens: 64
    enable_attention_dp: true
    # Additional generation worker settings...
  ctx:
    tensor_parallel_size: 4
    pipeline_parallel_size: 1
    max_batch_size: 4
    max_num_tokens: 4608
    enable_attention_dp: true
    # Additional context worker settings...
```
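The node count a configuration requires follows from these settings: each server instance spans `tensor_parallel_size * pipeline_parallel_size` GPUs. A minimal sketch of that arithmetic (the round-up-per-group policy is an assumption; the actual scripts may pack servers onto nodes differently):

```python
import math

def nodes_needed(tp_size: int, pp_size: int, num_servers: int, gpus_per_node: int) -> int:
    # Total GPUs for this server group, rounded up to whole nodes.
    total_gpus = tp_size * pp_size * num_servers
    return math.ceil(total_gpus / gpus_per_node)

# With the sample values above: one gen server at TP=16 and one ctx server
# at TP=4, on 4-GPU nodes.
gen_nodes = nodes_needed(16, 1, 1, 4)  # 4 nodes
ctx_nodes = nodes_needed(4, 1, 1, 4)   # 1 node
```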
## Running the Benchmark

Configuration is defined in YAML and execution is handled by Python scripts driven from the SLURM job.

### Step 1: Configure the Benchmark
Edit the `config.yaml` file to set up your benchmark parameters. The configuration is organized into logical sections:

1. SLURM settings (partition, account, time limits)
2. Hardware configuration (GPUs, server counts)
3. Benchmark parameters (mode, sequence lengths, streaming)
4. Environment settings (container, model paths)
5. Worker configurations (parallelism, batch sizes, memory settings)
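A quick sanity check before submitting can catch a missing section early. The sketch below assumes the six top-level section names shown in this README; loading the file itself would use a YAML parser such as PyYAML, omitted here:

```python
REQUIRED_SECTIONS = {"slurm", "benchmark", "hardware",
                     "sequence", "environment", "worker_config"}

def missing_sections(cfg: dict) -> list:
    # Return the sorted names of required top-level sections absent from cfg.
    return sorted(REQUIRED_SECTIONS - cfg.keys())

partial = {"slurm": {}, "benchmark": {}}
print(missing_sections(partial))
# ['environment', 'hardware', 'sequence', 'worker_config']
```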
### Step 2: Launch the Benchmark

Launch the benchmark through SLURM:

```bash
sbatch disaggr_torch.slurm
```
The SLURM script will:

1. Read and validate the YAML configuration
2. Set up the container environment
3. Configure and start the workers and servers
4. Execute the benchmark
5. Collect and store metrics
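Step 3 implies a readiness wait before the benchmark client can run: the server must report healthy before requests are sent. A generic version of that wait, with the probe (for example, an HTTP GET against the server's health endpoint) injected as a callable so the loop itself is testable; this is a sketch, not the scripts' actual polling code:

```python
import time

def wait_for_health(probe, timeout_s=600.0, interval_s=5.0) -> bool:
    # Poll probe() until it returns True or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)

# Example: a probe that succeeds on its third call.
calls = {"n": 0}
def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3

ready = wait_for_health(fake_probe, timeout_s=10.0, interval_s=0.0)
```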
### Benchmark Modes

The system supports two benchmark modes:
1. **End-to-End (`e2e`)**: Tests the complete pipeline, including both context and generation phases
2. **Generation Only (`gen_only`)**: Tests only the generation phase

Configure the mode in the YAML file:

```yaml
benchmark:
  mode: "e2e"  # or "gen_only"
```
### Metrics Collection

The benchmark collects the following performance metrics:

- TTFT (Time to First Token)
- TPOT (Time Per Output Token)
- ITL (Inter-Token Latency)
- E2EL (End-to-End Latency)
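To make the definitions concrete, here is how these metrics relate for a single streamed request, given the request start time and per-token arrival timestamps. This is a sketch of the standard definitions, not the benchmark client's actual code:

```python
def latency_metrics(request_start, token_times):
    # token_times: arrival timestamps (seconds) of each output token.
    ttft = token_times[0] - request_start                        # Time to First Token
    e2el = token_times[-1] - request_start                       # End-to-End Latency
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # Inter-Token Latencies
    # TPOT: mean time per output token after the first one.
    tpot = (e2el - ttft) / (len(token_times) - 1) if len(token_times) > 1 else 0.0
    return {"ttft": ttft, "e2el": e2el, "itl": itl, "tpot": tpot}

m = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
# m["ttft"] == 0.5, m["e2el"] == 0.8, m["tpot"] ~= 0.1
```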
Metrics are automatically collected and stored in the work directory specified in the configuration.
### Advanced Features

1. **NVIDIA SA Benchmark Integration**
   ```yaml
   benchmark:
     use_nv_sa_benchmark: true
   ```

2. **Profiling Support**
   ```yaml
   profiling:
     nsys_on: true
   ```

3. **Custom Worker Settings**
   The worker configuration section allows detailed customization of both context and generation workers, including:
   - Tensor and pipeline parallelism
   - Batch sizes and token limits
   - Memory management
   - Cache configuration
   - MoE settings (if applicable)

4. **Container and Build Options**
   ```yaml
   environment:
     build_wheel: true  # Build TensorRT-LLM from source
     container_mount: "path1:path1,path2:path2"
   ```
### Output and Logs

Benchmark results and logs are stored in the specified work directory, including:

- Performance metrics
- Worker and server logs
- Profiling data (if enabled)
- Error logs and diagnostics

The system automatically organizes outputs by benchmark run and configuration.
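The exact directory layout is not specified here; as an illustration of "organized by run and configuration", a naming scheme along these lines (entirely hypothetical) keeps results from different sweep points separate:

```python
from pathlib import Path

def run_dir(work_dir, cfg):
    # Hypothetical naming: encode server counts, parallelism, and benchmark mode.
    hw, wc = cfg["hardware"], cfg["worker_config"]
    name = (f"ctx{hw['num_ctx_servers']}_tp{wc['ctx']['tensor_parallel_size']}"
            f"_gen{hw['num_gen_servers']}_tp{wc['gen']['tensor_parallel_size']}"
            f"_{cfg['benchmark']['mode']}")
    return Path(work_dir) / name

cfg = {"hardware": {"num_ctx_servers": 1, "num_gen_servers": 1},
       "worker_config": {"ctx": {"tensor_parallel_size": 4},
                         "gen": {"tensor_parallel_size": 16}},
       "benchmark": {"mode": "e2e"}}
print(run_dir("/workspace/bench", cfg))
```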