Commit b1976c2

qiaoxj07 and Copilot authored
Add wide-ep benchmarking scripts (NVIDIA#5760)
Signed-off-by: Xianjie <[email protected]>
Signed-off-by: Xianjie Qiao <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 089fd55 commit b1976c2

File tree

13 files changed: +1010 −23 lines

File renamed without changes.
File renamed without changes.

examples/ep_load_balancer/report_load_statistics.py renamed to examples/wide_ep/ep_load_balancer/report_load_statistics.py

File renamed without changes.
File renamed without changes.
Lines changed: 129 additions & 0 deletions
# TensorRT-LLM Wide-EP Benchmark Scripts

This directory contains scripts for benchmarking TensorRT-LLM wide-EP (wide expert parallelism) performance using the SLURM job scheduler.

## ⚠️ DISCLAIMER

**These scripts are currently not QA'ed and are provided for demonstration purposes only.**

Please note that:

- These scripts have not undergone formal quality assurance testing
- They are intended for demonstration and educational purposes
- Use them at your own risk in production environments
- Always review and test the scripts thoroughly before running them in your specific environment

## Scripts Overview

### Core Scripts

1. **`submit.sh`** - Main entry point for submitting benchmark jobs
2. **`disaggr_torch.slurm`** - SLURM job script orchestrating the entire benchmark
3. **`gen_yaml.py`** - Generates configuration files for serving setup
4. **`start_server.sh`** - Starts the inference server
5. **`start_worker.sh`** - Starts the worker processes
6. **`run_benchmark.sh`** - Executes the benchmark workload
7. **`process_gen_iterlog.py`** - Processes benchmark results and generates reports

## Usage

### Prerequisites

Before running the scripts, ensure you have:

- Access to a SLURM cluster
- Container image with TensorRT-LLM installed
- Model files accessible on the cluster
- Required environment variables set

### Configuration

Edit the following variables in `submit.sh` and `disaggr_torch.slurm`:

```bash
# In disaggr_torch.slurm
container_image=${container_image}  # Your container image
mount_dir=${mount_dir}              # Mount directory path
model_dir=${model_dir}              # Model directory path
```
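
For reference, a filled-in configuration might look like the sketch below. The image reference and paths are placeholders for illustration only; substitute values for your own cluster.

```bash
# In disaggr_torch.slurm -- illustrative placeholder values, not shipped defaults
container_image=/path/to/tensorrt_llm.sqsh   # enroot image file or registry reference
mount_dir=/lustre/project/wide_ep_bench      # shared filesystem visible to all nodes
model_dir=${mount_dir}/models/DeepSeek-R1    # model checkpoint directory
```
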
### Running Benchmarks

1. **Submit benchmark jobs**:
   ```bash
   ./submit.sh
   ```

2. **Monitor job progress**:
   ```bash
   squeue -u $USER
   ```

3. **View results**:
   Results are saved in the `bm_20250703_deepseek-r1-{isl}-{osl}/` directory.
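
Each configuration writes its logs into a subdirectory whose name encodes the parallelism mode, concurrency, EPLB slots, and MTP size (see `disaggr_torch.slurm`), so an in-flight run can be followed with something like the sketch below. The exact directory name depends on the parameters you submitted.

```bash
# Follow the benchmark log of one configuration while the job is running.
# The directory layout follows disaggr_torch.slurm; the names below are illustrative.
tail -f bm_20250703_deepseek-r1-1024-1024/dep16_concurrency1024_eplb256_mtp0/benchmark.log
```
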
## Script Details

### `submit.sh`

Main entry script that submits multiple SLURM jobs with different configurations:

- **DEP8**: 8-way parallelism for decode servers
- **DEP16**: 16-way parallelism with different EPLB slot configurations
- **DEP32**: 32-way parallelism for high-throughput scenarios

Parameters tested:

- Concurrency levels: 1x, 64x, 1024x multipliers
- EPLB slots: 0, 256, 288
- Different parallelism sizes
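
Each job submission ultimately passes fourteen positional arguments to `disaggr_torch.slurm`, in the order that script reads them. A hypothetical DEP16 submission, with illustrative values rather than the exact swept configurations, might look like this:

```bash
# Argument order expected by disaggr_torch.slurm ($1 .. $14):
#   num_ctx_servers ctx_tp_size ctx_batch_size ctx_max_num_tokens ctx_enable_attention_dp
#   num_gen_servers gen_tp_size gen_batch_size gen_max_num_tokens gen_enable_attention_dp
#   gen_gpu_memory_fraction eplb_num_slots mtp_size concurrency
sbatch disaggr_torch.slurm \
    1 4 4 4480 true \
    1 16 64 64 true \
    0.8 256 0 1024
```
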
### `disaggr_torch.slurm`

SLURM job script that:

1. Sets up container environment
2. Generates configuration files
3. Starts server and workers
4. Executes benchmarks
5. Cleans up processes

**Key parameters**:

- `num_ctx_servers`: Number of context servers
- `ctx_tp_size`: Tensor parallel size for context servers
- `num_gen_servers`: Number of generation servers
- `gen_tp_size`: Tensor parallel size for generation servers
- `concurrency`: Number of concurrent requests
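
The total GPU count for each phase follows directly from these parameters; the script computes it as in this excerpt (comments added here for clarity):

```bash
# Total GPUs per phase, as computed in disaggr_torch.slurm
ctx_gpus=$((num_ctx_servers * ctx_tp_size))   # GPUs serving the context (prefill) phase
gen_gpus=$((num_gen_servers * gen_tp_size))   # GPUs serving the generation (decode) phase
```
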
### `gen_yaml.py`

Generates YAML configuration files with:

- Server topology and resource allocation
- Network configuration (hostnames, ports)
- Memory and batch size settings
- Optimization parameters (CUDA graphs, KV cache)

**Key features**:

- Automatic node and task allocation
- Support for attention data parallelism
- MoE load balancing configuration
- Speculative decoding (MTP) support
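
The accepted flags can be read off the invocation in `disaggr_torch.slurm`; a stand-alone call with illustrative values might look roughly like this (the attention-DP switches and `--mtp_size` are only passed when enabled):

```bash
# Illustrative stand-alone invocation; flag names mirror the call in disaggr_torch.slurm.
python3 gen_yaml.py --config ./config.yaml \
    --model /path/to/DeepSeek-R1 \
    --num_ctx_servers 1 --ctx_tp_size 4 \
    --ctx_batch_size 4 --ctx_max_num_tokens 4480 \
    --num_gen_servers 1 --gen_tp_size 16 \
    --gen_batch_size 64 --gen_max_num_tokens 64 \
    --gen_gpu_memory_fraction 0.8 \
    --eplb_num_slots 256 \
    --ctx_enable_attention_dp --gen_enable_attention_dp
```
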
### `start_server.sh` & `start_worker.sh`

- **Server**: Starts the main inference server with API endpoint
- **Workers**: Starts MPI workers for distributed processing
- Support for profiling with NSight Systems
- Environment variable configuration for optimizations
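
Profiling is toggled from `disaggr_torch.slurm` through the `nsys_on` variable passed to `start_worker.sh`; it is empty (profiling disabled) by default, and the script ships a commented-out line for enabling it:

```bash
# In disaggr_torch.slurm: nsys_on is empty by default (profiling off).
nsys_on=""
# Uncommenting the line below (as shipped in the script) enables profiling;
# the value is presumably used by start_worker.sh as the profile output location.
# nsys_on=${full_logdir}
```
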
### `run_benchmark.sh`

Executes benchmarking using TensorRT-LLM's benchmark_serving tool:

- Downloads the ShareGPT dataset for realistic workloads
- Waits for server health checks (see the sketch below)
- Runs load testing with the specified concurrency
- Collects performance metrics
- Gracefully shuts down services
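
The health-check wait is a standard poll-until-ready loop. A minimal sketch, assuming the server exposes a `/health` endpoint on the hostname recorded in `config.yaml` and listens on port 8000 (both assumptions; the actual script may differ in details):

```bash
# Hypothetical readiness loop; the endpoint and port are assumptions for illustration.
until curl -sf "http://${hostname_value}:8000/health" > /dev/null; do
    echo "Waiting for the server to come up..."
    sleep 10
done
```
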
**Metrics collected**:

- Throughput (tokens/second)
- Latency (request completion time)
- Separate statistics for context-only and generation-only phases

### `process_gen_iterlog.py`

Post-processes benchmark results:

- Parses iteration logs from the workers
- Calculates throughput metrics
- Generates CSV reports
- Supports MTP (Multi-Token Prediction) analysis

Lines changed: 119 additions & 0 deletions
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --partition=${partition} # add your partition here
#SBATCH --account=${account} # add your account here
#SBATCH --time=01:00:00
#SBATCH --job-name=${job_name} # add your job name here

isl=1024
osl=1024
multi_round=1
gen_yaml_file=gen_yaml.py
container_image=${container_image} # add your container image here
mount_dir=${mount_dir} # add your mount directory here
workdir=${mount_dir}/bench-large-ep/slurm_scripts/
model_dir=${model_dir} # add your model directory here
logdir=${workdir}/bm_20250703_deepseek-r1-${isl}-${osl}/
streaming=false
mkdir -p ${logdir}

container_name=disaggr-test

num_ctx_servers=$1
ctx_tp_size=$2
ctx_batch_size=$3
ctx_max_num_tokens=$4
ctx_enable_attention_dp=$5
num_gen_servers=$6
gen_tp_size=$7
gen_batch_size=$8
gen_max_num_tokens=$9
gen_enable_attention_dp=${10}
gen_gpu_memory_fraction=${11}
eplb_num_slots=${12}
mtp_size=${13}
concurrency=${14}

sub_dir=${logdir}/dep${gen_tp_size}_concurrency${concurrency}_eplb${eplb_num_slots}_mtp${mtp_size}

ctx_gpus=$((num_ctx_servers * ctx_tp_size))
gen_gpus=$((num_gen_servers * gen_tp_size))

echo "enable_attention_dp: ${ctx_enable_attention_dp}, ${gen_enable_attention_dp}, gpu_memory_fraction: ${gen_gpu_memory_fraction}"

enable_pdl=false
if [ "${gen_enable_attention_dp}" = "false" ]; then
    enable_pdl=true
    echo "enable_pdl: ${enable_pdl}"
    sub_dir=${logdir}/tep${gen_tp_size}_concurrency${concurrency}_eplb${eplb_num_slots}_mtp${mtp_size}
fi

full_logdir=${sub_dir}
mkdir -p ${full_logdir}

# start the container
srun -l --container-image=${container_image} \
    --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix \
    echo "Container up."

# generate the yaml file
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    python3 ${workdir}/${gen_yaml_file} --config ${full_logdir}/config.yaml \
        --model ${model_dir} \
        --num_ctx_servers ${num_ctx_servers} \
        --ctx_tp_size ${ctx_tp_size} \
        --ctx_batch_size ${ctx_batch_size} \
        --ctx_max_num_tokens ${ctx_max_num_tokens} \
        --num_gen_servers ${num_gen_servers} \
        --gen_tp_size ${gen_tp_size} \
        --gen_batch_size ${gen_batch_size} \
        --gen_max_num_tokens ${gen_max_num_tokens} \
        --gen_gpu_memory_fraction ${gen_gpu_memory_fraction} \
        --eplb_num_slots ${eplb_num_slots} \
        $(if [ "${gen_enable_attention_dp}" = "true" ]; then echo "--gen_enable_attention_dp"; fi) \
        $(if [ "${ctx_enable_attention_dp}" = "true" ]; then echo "--ctx_enable_attention_dp"; fi) \
        $(if [ "${mtp_size}" -gt 0 ]; then echo "--mtp_size ${mtp_size}"; fi)

echo "YAML file generated."

hostname_value=$(grep '^hostname:' ${full_logdir}/config.yaml | awk -F': ' '{print $2}')
echo "server host name: $hostname_value"

# try to kill the server and workers
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    pkill -f "trtllm-serve" || true

nsys_on=""
# nsys_on=${full_logdir}

# start the workers
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    bash ${workdir}/start_worker.sh ${full_logdir}/config.yaml "${concurrency}" "${enable_pdl}" ${ctx_gpus} ${nsys_on} &> ${full_logdir}/output_workers.log &
# start the server
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap -N 1 -n 1 \
    -w ${hostname_value} \
    bash ${workdir}/start_server.sh ${full_logdir}/config.yaml &> ${full_logdir}/output_server.log &
# start benchmarking
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap -N 1 -n 1 \
    bash ${workdir}/run_benchmark.sh ${isl} ${osl} ${multi_round} ${model_dir} "${concurrency}" ${streaming} ${full_logdir}/ > ${full_logdir}/benchmark.log 2>&1

# try to kill the server and workers
srun -l --container-name=${container_name} \
    --container-mounts=${mount_dir}:${mount_dir} \
    --mpi=pmix --overlap \
    kill -9 $(ps aux | grep '[t]rtllm-serve' | awk '{print $2}') >/dev/null 2>&1 || true
wait

0 commit comments
