This directory contains scripts for benchmarking TensorRT-LLM performance with Dynamo using the SLURM job scheduler.
Please note that:
- These scripts have not undergone formal quality assurance testing
- These scripts were tested on GB200 systems. To run all configurations, you will need at least 16 nodes, with each node equipped with 4 GPUs.
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test scripts thoroughly before running in your specific environment
- In disaggregated mode, using the `--exclusive` flag to launch worker processes can impact runtime performance, so these scripts specify the nodelist explicitly in each `srun` call
- We are actively working on refining the configuration sweeps
- `submit_disagg.sh` - Main entry point for submitting benchmark jobs for disaggregated configurations. This includes the WideEP optimization for DEP>=16.
- `submit_agg.sh` - Main entry point for submitting benchmark jobs for aggregated configurations.
- `post_process.py` - Scans the aiperf results and produces a JSON file with an entry for each config point.
- `plot_performance_comparison.py` - Takes the JSON result files from the disaggregated and/or aggregated configuration sweeps and plots a Pareto line for better visualization.
For finer-grained details on how to launch TRTLLM backend workers with DeepSeek R1 on a GB200 SLURM cluster, please refer to multinode-examples.md. This guide shares similar assumptions with the multinode examples guide.
Before running the scripts, ensure you have:
- Access to a SLURM cluster
- Container image of Dynamo with TensorRT-LLM built using instructions from here.
- Model files accessible on the cluster
- Required environment variables set
Within the login node of the cluster, set the following variables:

```bash
# Set partition manually based on your slurm cluster's partition names
export SLURM_PARTITION=""

# Set account manually if this command doesn't work on your cluster
export SLURM_ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"

# Set a job name for your benchmarking runs
export SLURM_JOB_NAME=""

# NOTE: IMAGE must be set manually for now
# To build an image, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
export IMAGE="<dynamo_trtllm_image>"

# NOTE: In general, DeepSeek R1 is very large, so it is recommended to
# pre-download the model weights and save them in some shared location,
# NFS storage, HF_HOME, etc. and modify the `--model-path` below
# to reuse the pre-downloaded weights instead.
#
# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
#
# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
# https://huggingface.co/deepseek-ai/DeepSeek-R1
export MODEL_PATH="<path_to_model_weights>"

# The name the model will be served/queried under, matching what's
# returned by the /v1/models endpoint.
#
# By default this is inferred from MODEL_PATH, but when using locally downloaded
# model weights, it can be nice to have explicit control over the name.
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
```

Queue the SLURM jobs for aggregated configurations for DeepSeek R1:

```bash
./submit_agg.sh
```

Queue the SLURM jobs for disaggregated configurations for DeepSeek R1 without MTP:

```bash
./submit_disagg.sh mtp=off all
```

Queue the SLURM jobs for disaggregated configurations for DeepSeek R1 with MTP:

```bash
./submit_disagg.sh mtp=on all
```

The above jobs use the aiperf tool to benchmark each configuration point across different concurrency values. The results are stored in `dynamo_disagg-bm-8150-1024/<config-setup>/aiperf_artifacts` and `dynamo_agg-bm-8150-1024/<config-setup>/aiperf_artifacts` for the disaggregated and aggregated configurations, respectively.
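To confirm which configuration points actually produced results before post-processing, you can list the `aiperf_artifacts` directories. This is a small convenience sketch (the helper name is hypothetical; the directory layout is assumed from the paths above):

```bash
# Hypothetical helper: list aiperf_artifacts directories under a sweep
# directory so you can see which config points completed.
list_artifacts() {
  find "$1" -type d -name aiperf_artifacts 2>/dev/null | sort
}
```

For example, `list_artifacts dynamo_agg-bm-8150-1024` prints one line per completed config point.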
After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated `aiperf_artifacts` directories.
To run the post-processing script, use:
```bash
python3 post_process.py dynamo_agg-bm-8150-1024 --output-file agg_result.json
python3 post_process.py dynamo_disagg-bm-8150-1024 --output-file disagg_result.json
```

You can now use `plot_performance_comparison.py` as shown below to observe the performance.
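The exact schema of the result JSON is not documented here, but a quick way to sanity-check a result file is to count its config entries. A minimal sketch (the function name is hypothetical; `len` works whether the top level is a list of entries or a dict keyed by config):

```python
import json

# Rough illustration only: count the config entries in a post_process.py
# result file. Works for a top-level list or a top-level dict.
def count_config_points(path):
    with open(path) as f:
        results = json.load(f)
    return len(results)
```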
```bash
python3 plot_performance_comparison.py dynamo_agg-bm-8150-1024/agg_result.json dynamo_disagg-bm-8150-1024/disagg_result.json -o performance_plot.png
```

This script produces a scatter plot of all the configuration points, with one point per concurrency, on an Output Throughput per GPU vs. Output Throughput per User chart. It also includes the roofline Pareto line for both the aggregated and disaggregated setups.
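The Pareto line is the set of configuration points not dominated on both axes, i.e., no other point achieves both higher throughput per GPU and higher throughput per user. A minimal sketch of that selection (the function name and the `(per_user, per_gpu)` point format are illustrative assumptions, not the plotting script's actual API):

```python
# Sketch of the Pareto-frontier idea behind the plot: keep a point only if
# no other point is at least as good in both dimensions (higher is better).
def pareto_frontier(points):
    frontier = []
    for x, y in points:
        dominated = any(
            px >= x and py >= y and (px, py) != (x, y)
            for px, py in points
        )
        if not dominated:
            frontier.append((x, y))
    return frontier
```

Points on the frontier represent the best achievable trade-offs between the two throughput metrics.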
Refer to *Beyond the Buzz: A Pragmatic Take on Inference Disaggregation* to learn how to interpret these plots.
- Some jobs may time out if aiperf requires more time to complete all concurrency levels.
- Workers may encounter out-of-memory (OOM) errors during inference, especially with larger configurations.
- Configurations affected by these issues will result in missing data points on the performance plot.
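When a data point is missing from the plot, scanning the sweep directory for common failure signatures can help trace the cause. A hedged sketch (the helper name is hypothetical, and the patterns are assumptions about typical SLURM time-limit and CUDA OOM messages):

```bash
# Hypothetical helper: list log files under a sweep directory that contain
# common failure signatures (OOM or SLURM time-limit cancellation).
scan_failures() {
  grep -rliE "out of memory|DUE TO TIME LIMIT" "$1" 2>/dev/null \
    || echo "no failure signatures found in $1"
}
```

For example, `scan_failures dynamo_disagg-bm-8150-1024` prints the matching log files, one per line.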