# TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide expert parallelism (wide-EP) performance using the SLURM job scheduler.

## ⚠️ DISCLAIMER

**These scripts are currently not QA'ed and are provided for demonstration purposes only.**

Please note that:

- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test scripts thoroughly before running in your specific environment

## Scripts Overview

### Core Scripts

1. **`submit.sh`** - Main entry point for submitting benchmark jobs
2. **`disaggr_torch.slurm`** - SLURM job script orchestrating the entire benchmark
3. **`gen_yaml.py`** - Generates configuration files for serving setup
4. **`start_server.sh`** - Starts the inference server
5. **`start_worker.sh`** - Starts the worker processes
6. **`run_benchmark.sh`** - Executes the benchmark workload
7. **`process_gen_iterlog.py`** - Processes benchmark results and generates reports

## Usage

### Prerequisites

Before running the scripts, ensure you have:
- Access to a SLURM cluster
- Container image with TensorRT-LLM installed
- Model files accessible on the cluster
- Required environment variables set
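
A quick pre-flight check from a login node can catch missing pieces early (an illustrative sketch; `container_image` and `model_dir` are placeholders that must match the values used in `disaggr_torch.slurm`):

```bash
# Illustrative pre-flight checks; the variables must match what you set in
# disaggr_torch.slurm. ${var:?...} aborts with a message if the variable is unset.
command -v sbatch > /dev/null || { echo "SLURM client tools not found"; exit 1; }
[ -e "${container_image:?container_image is not set}" ] || echo "warning: container image not found"
[ -d "${model_dir:?model_dir is not set}" ] || echo "warning: model directory not found"
```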

### Configuration

Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:

```bash
# In disaggr_torch.slurm
container_image=${container_image}  # Your container image
mount_dir=${mount_dir}              # Mount directory path
model_dir=${model_dir}              # Model directory path
```
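
For example, you might export these before submitting (hypothetical values; substitute paths that exist on your cluster):

```bash
# Hypothetical values -- substitute paths that exist on your cluster.
export container_image=/shared/images/tensorrt_llm.sqsh
export mount_dir=/lustre/myproject
export model_dir=/lustre/myproject/models/DeepSeek-R1
```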

### Running Benchmarks

1. **Submit benchmark jobs**:
   ```bash
   ./submit.sh
   ```

2. **Monitor job progress**:
   ```bash
   squeue -u $USER
   ```

3. **View results**:
   Results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory, where `{isl}` and `{osl}` are the input and output sequence lengths.

## Script Details

### `submit.sh`
Main entry script that submits multiple SLURM jobs with different configurations:
- **DEP8**: 8-way parallelism for decode servers
- **DEP16**: 16-way parallelism with different EPLB slot configurations
- **DEP32**: 32-way parallelism for high-throughput scenarios

Parameters tested (see the sweep sketch below):
- Concurrency levels: 1x, 64x, 1024x multipliers
- EPLB slots: 0, 256, 288
- Different parallelism sizes
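
The sweep has roughly this shape (a simplified sketch, not the script verbatim; the multiplier base and the arguments passed to the SLURM script are assumptions):

```bash
# Simplified sketch of the submit.sh sweep: one SLURM job per combination of
# DEP size and concurrency multiplier. The real script passes more arguments.
for dep_size in 8 16 32; do             # DEP8 / DEP16 / DEP32
    for mult in 1 64 1024; do           # concurrency multipliers
        concurrency=$((dep_size * mult))    # assumed: multiplier scales with DEP size
        sbatch disaggr_torch.slurm "${dep_size}" "${concurrency}"
    done
done
```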

### `disaggr_torch.slurm`
SLURM job script that:
1. Sets up the container environment
2. Generates configuration files
3. Starts the server and workers
4. Executes benchmarks
5. Cleans up processes

**Key parameters**:
- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests
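
A single submission might look like this (hypothetical; `sbatch` forwards the submission environment to the job by default, but check how the script actually reads its parameters):

```bash
# Hypothetical submission -- parameters passed via the job environment;
# the script's real interface may differ.
num_ctx_servers=4 ctx_tp_size=4 num_gen_servers=1 gen_tp_size=16 concurrency=1024 \
    sbatch disaggr_torch.slurm
```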

### `gen_yaml.py`
Generates YAML configuration files with:
- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)

**Key features**:
- Automatic node and task allocation
- Support for attention data parallelism
- MoE load balancing configuration
- Speculative decoding (MTP) support
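
The output follows the shape of TensorRT-LLM's disaggregated-serving configuration; below is a minimal sketch of what a generated file might contain (field values are placeholders, and the exact keys are determined by `gen_yaml.py` itself):

```bash
# Sketch of a generated config (placeholder values). Keys mirror TensorRT-LLM's
# public disaggregated-serving examples, not a verbatim dump of gen_yaml.py output.
cat > config.yaml <<'EOF'
hostname: node-0001
port: 8000
context_servers:
  num_instances: 4
  tensor_parallel_size: 4
  urls:
    - "node-0001:8001"
generation_servers:
  num_instances: 1
  tensor_parallel_size: 16
  urls:
    - "node-0002:8001"
EOF
```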

### `start_server.sh` & `start_worker.sh`
- **Server**: starts the main inference server with its API endpoint
- **Workers**: start the MPI workers for distributed processing
- Both support profiling with NSight Systems (see the sketch below)
- Both configure environment variables for optimizations
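
The profiling hook is typically just a conditional `nsys` wrapper around the launch command (a sketch; `SERVER_CMD` and `ENABLE_NSYS` are placeholder names, and the server invocation is assumed, not taken from the scripts):

```bash
# Sketch of an NSight Systems wrapper; SERVER_CMD and ENABLE_NSYS are
# placeholders, and the trtllm-serve invocation is an assumption.
SERVER_CMD="trtllm-serve disaggregated -c config.yaml"
if [ "${ENABLE_NSYS:-0}" = "1" ]; then
    nsys profile -o server_profile -t cuda,nvtx ${SERVER_CMD}
else
    ${SERVER_CMD}
fi
```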

### `run_benchmark.sh`
Executes the benchmark using TensorRT-LLM's `benchmark_serving` tool:
- Downloads the ShareGPT dataset for realistic workloads
- Waits for the server health check to pass (see the sketch below)
- Runs load testing at the specified concurrency
- Collects performance metrics
- Gracefully shuts down services
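
The health-check wait is a standard poll-until-ready loop; a minimal sketch (host and port are placeholders):

```bash
# Poll until the endpoint responds; host/port are placeholders and the
# /health route is an assumption based on common OpenAI-compatible servers.
until curl -sf "http://localhost:8000/health" > /dev/null; do
    echo "waiting for server to come up..."
    sleep 10
done
```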

**Metrics collected**:
- Throughput (tokens/second)
- Latency (request completion time)
- Context-only vs. generation-only statistics

### `process_gen_iterlog.py`
Post-processes benchmark results:
- Parses iteration logs from workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis
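
Typical usage after a run completes (hypothetical; check the script's argument parser for its real interface):

```bash
# Hypothetical invocation -- the positional argument is illustrative only.
python3 process_gen_iterlog.py bm_20250703_deepseek-r1-{isl}-{osl}/
```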