
ONNX models with performance gaps [CUDA/TensorRT] #4539

@torsteingrindvik

Description

ONNX import via burn-onnx is able to handle more and more models.
As support grows, it becomes possible to measure how well these models perform.
In the Nvidia ecosystem, TensorRT generally gives the fastest inference available, to my knowledge.

Tables

| Model | Shape | GPU | Burn (ms) | TRT (ms) | TRT speedup | burn SHA | burn-onnx SHA |
|---|---|---|---|---|---|---|---|
| RF-DETR (large) | [1,3,560,560] | RTX 4090 | 25.97 | 2.62 | ~10x | 8bfa8f75 | 1262150 |
| RetinaFace | [1,3,768,1024] | RTX 4090 | 2.45 | 0.22 | ~11x | d63bd6a2 | 1262150 |
| FCN-ResNet50 | [1,3,520,924] | RTX 4090 | 19.31 | 1.12 | ~17x | d63bd6a2 | 1262150 |

How to check Burn

This is more involved and hard to generalize, but here is a suggestion based on making a standalone crate.

Cargo.toml

Details
```toml
[package]
name = "burn-bench"
edition = "2024"
publish = false

[dependencies]
burn = { git = "https://github.com/tracel-ai/burn", features = ["cuda"] }
burn-store = { git = "https://github.com/tracel-ai/burn" }
nvtx = "1.3" # for human-friendly report regions in profilers such as Nsight Systems

[build-dependencies]
burn-onnx = { git = "https://github.com/tracel-ai/burn-onnx" }
```

build.rs generates the model's Rust code and makes the git revisions available for report printing:

Details
```rust
use burn_onnx::ModelGen;
use std::path::Path;

fn main() {
    let onnx = std::env::var("ONNX_MODEL").expect("set ONNX_MODEL=/path/to/model.onnx");
    let stem = Path::new(&onnx)
        .file_stem()
        .unwrap()
        .to_str()
        .unwrap()
        .to_owned();

    println!("cargo:rerun-if-changed={onnx}");
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-changed=Cargo.lock");
    println!("cargo:rustc-env=MODEL_STEM={stem}");

    // Extract git revisions from Cargo.lock so the binary can report them.
    let lock = std::fs::read_to_string("Cargo.lock").unwrap_or_default();
    println!(
        "cargo:rustc-env=BURN_REV={}",
        git_rev(&lock, "tracel-ai/burn")
    );
    println!(
        "cargo:rustc-env=BURN_ONNX_REV={}",
        git_rev(&lock, "tracel-ai/burn-onnx")
    );

    ModelGen::new()
        .input(&onnx)
        .out_dir("model/")
        .run_from_script();
}

/// Find the first `source = "git+https://.../<repo>#<rev>"` and return the short rev.
fn git_rev(lock: &str, repo: &str) -> String {
    for line in lock.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("source = \"git+https://github.com/") {
            // Require `#` (or a `?branch=...` query) right after the repo name so
            // that "tracel-ai/burn" does not also match "tracel-ai/burn-onnx".
            if let Some(tail) = rest.strip_prefix(repo) {
                if tail.starts_with('#') || tail.starts_with('?') {
                    if let Some((_, hash)) = tail.rsplit_once('#') {
                        let rev = hash.trim_end_matches('"');
                        return rev[..rev.len().min(10)].to_owned();
                    }
                }
            }
        }
    }
    "unknown".to_owned()
}
```

src/main.rs

Details
```rust
use burn::prelude::*;
use std::time::{Duration, Instant};

type B = burn::backend::Cuda;

#[allow(warnings)]
mod model {
    include!(concat!(
        env!("OUT_DIR"),
        "/model/",
        env!("MODEL_STEM"),
        ".rs"
    ));
}

// Adjust these to match your model's input shape.
const BATCH: usize = 1;
const CHANNELS: usize = 3;
const HEIGHT: usize = 560;
const WIDTH: usize = 560;

const WARMUP: usize = 3;
const ITERATIONS: usize = 20;

fn main() {
    let device = Default::default();
    let model = model::Model::<B>::from_file(
        concat!(env!("OUT_DIR"), "/model/", env!("MODEL_STEM"), ".bpk"),
        &device,
    );
    println!("warmup={WARMUP}  iterations={ITERATIONS}");

    for _ in 0..WARMUP {
        let input = Tensor::<B, 4>::zeros([BATCH, CHANNELS, HEIGHT, WIDTH], &device);
        let _ = model.forward(input);
        let _ = B::sync(&device);
    }

    let mut times = Vec::with_capacity(ITERATIONS);
    let _iter_range = nvtx::range!("iterations");
    for i in 0..ITERATIONS {
        let input = Tensor::<B, 4>::zeros([BATCH, CHANNELS, HEIGHT, WIDTH], &device);
        let start = Instant::now();
        let _fwd_range = nvtx::range!("forward i={i}");
        let _ = model.forward(input);
        let _ = B::sync(&device);
        drop(_fwd_range);
        times.push(start.elapsed());
    }
    drop(_iter_range);

    report(&times);
}

fn report(times: &[Duration]) {
    println!(
        "burn={} burn-onnx={}",
        env!("BURN_REV"),
        env!("BURN_ONNX_REV")
    );
    println!(
        "model={}  input=[{BATCH}, {CHANNELS}, {HEIGHT}, {WIDTH}]",
        env!("MODEL_STEM")
    );

    let mut sorted = times.to_vec();
    sorted.sort();
    let n = sorted.len();
    let median = sorted[n / 2];
    let mean = sorted.iter().sum::<Duration>() / n as u32;
    let min = sorted[0];
    let max = sorted[n - 1];
    println!("median  {median:>10.2?}");
    println!("mean    {mean:>10.2?}");
    println!("min     {min:>10.2?}");
    println!("max     {max:>10.2?}");
}
```

The above reports, for example:

```
warmup=3  iterations=20
burn=502910e2a7 burn-onnx=19eedf7141
model=rf_detr  input=[1, 3, 560, 560]
median     25.97ms
mean       25.99ms
min        25.30ms
max        27.57ms
```

How to check TensorRT

`trtexec --onnx=my_model.onnx --best --useCudaGraph` will optimize the model and display benchmark numbers after a while.

  • my_model.onnx is the model you want to benchmark
  • --best lets TensorRT optimize the model with whatever precision (fp16, bf16, int4, ...) is appropriate
  • --useCudaGraph lets TensorRT, once optimized, launch the entire model as a single captured CUDA graph, so it does not need to round-trip to the CPU between kernel launches; this eliminates many small launch overheads
