
ONNX models with performance gaps [CUDA/TensorRT] #4539

@torsteingrindvik

Description

ONNX import via burn-onnx is able to handle more and more models.
As support grows, it becomes possible to measure how well these models perform.
In the Nvidia ecosystem, TensorRT generally gives the fastest inference available, to my knowledge.

Tables

| Model | Shape | GPU | Burn (ms) | TRT (ms) | TRT speedup | burn SHA | burn-onnx SHA |
|---|---|---|---|---|---|---|---|
| RF-DETR (large) | [1,3,560,560] | RTX 4090 | 25.97 | 2.62 | ~10x | 8bfa8f75 | 1262150 |
| RetinaFace | [1,3,768,1024] | RTX 4090 | 2.45 | 0.22 | ~11x | d63bd6a2 | 1262150 |
| FCN-ResNet50 | [1,3,520,924] | RTX 4090 | 19.31 | 1.12 | ~17x | d63bd6a2 | 1262150 |

How to check Burn

This is more involved and hard to generalize, but here is a suggestion based on making a standalone crate.

Cargo.toml

Details
```toml
[package]
name = "burn-bench"
edition = "2024"
publish = false

[dependencies]
burn = { git = "https://github.com/tracel-ai/burn", features = ["cuda"] }
burn-store = { git = "https://github.com/tracel-ai/burn" }
nvtx = "1.3" # for human-friendly report regions in profilers such as Nsight Systems

[build-dependencies]
burn-onnx = { git = "https://github.com/tracel-ai/burn-onnx" }
```

build.rs generates the model's Rust code and makes the git revisions available for report printing:

Details
```rust
use burn_onnx::ModelGen;
use std::path::Path;

fn main() {
    let onnx = std::env::var("ONNX_MODEL").expect("set ONNX_MODEL=/path/to/model.onnx");
    let stem = Path::new(&onnx)
        .file_stem()
        .unwrap()
        .to_str()
        .unwrap()
        .to_owned();

    println!("cargo:rerun-if-changed={onnx}");
    println!("cargo:rerun-if-changed=build.rs");
    println!("cargo:rerun-if-changed=Cargo.lock");
    println!("cargo:rustc-env=MODEL_STEM={stem}");

    // Extract git revisions from Cargo.lock so the binary can report them.
    let lock = std::fs::read_to_string("Cargo.lock").unwrap_or_default();
    println!(
        "cargo:rustc-env=BURN_REV={}",
        git_rev(&lock, "tracel-ai/burn")
    );
    println!(
        "cargo:rustc-env=BURN_ONNX_REV={}",
        git_rev(&lock, "tracel-ai/burn-onnx")
    );

    ModelGen::new()
        .input(&onnx)
        .out_dir("model/")
        .run_from_script();
}

/// Find the first `source = "git+https://.../<repo>#<rev>"` and return the short rev.
fn git_rev(lock: &str, repo: &str) -> String {
    for line in lock.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("source = \"git+https://github.com/") {
            // Require `#` (or a `?branch=...` query) right after the repo name so
            // that "tracel-ai/burn" does not also match "tracel-ai/burn-onnx".
            if let Some(tail) = rest.strip_prefix(repo) {
                if tail.starts_with('#') || tail.starts_with('?') {
                    if let Some((_, hash)) = tail.rsplit_once('#') {
                        let rev = hash.trim_end_matches('"');
                        return rev[..rev.len().min(10)].to_owned();
                    }
                }
            }
        }
    }
    "unknown".to_owned()
}
```

src/main.rs

Details
```rust
use burn::prelude::*;
use std::time::{Duration, Instant};

type B = burn::backend::Cuda;

#[allow(warnings)]
mod model {
    include!(concat!(
        env!("OUT_DIR"),
        "/model/",
        env!("MODEL_STEM"),
        ".rs"
    ));
}

// Adjust these to match your model's input shape.
const BATCH: usize = 1;
const CHANNELS: usize = 3;
const HEIGHT: usize = 560;
const WIDTH: usize = 560;

const WARMUP: usize = 3;
const ITERATIONS: usize = 20;

fn main() {
    let device = Default::default();
    let model = model::Model::<B>::from_file(
        concat!(env!("OUT_DIR"), "/model/", env!("MODEL_STEM"), ".bpk"),
        &device,
    );
    println!("warmup={WARMUP}  iterations={ITERATIONS}");

    for _ in 0..WARMUP {
        let input = Tensor::<B, 4>::zeros([BATCH, CHANNELS, HEIGHT, WIDTH], &device);
        let _ = model.forward(input);
        let _ = B::sync(&device);
    }

    let mut times = Vec::with_capacity(ITERATIONS);
    let _iter_range = nvtx::range!("iterations");
    for i in 0..ITERATIONS {
        let input = Tensor::<B, 4>::zeros([BATCH, CHANNELS, HEIGHT, WIDTH], &device);
        let start = Instant::now();
        let _fwd_range = nvtx::range!("forward i={i}");
        let _ = model.forward(input);
        let _ = B::sync(&device);
        drop(_fwd_range);
        times.push(start.elapsed());
    }
    drop(_iter_range);

    report(&times);
}

fn report(times: &[Duration]) {
    println!(
        "burn={} burn-onnx={}",
        env!("BURN_REV"),
        env!("BURN_ONNX_REV")
    );
    println!(
        "model={}  input=[{BATCH}, {CHANNELS}, {HEIGHT}, {WIDTH}]",
        env!("MODEL_STEM")
    );

    let mut sorted = times.to_vec();
    sorted.sort();
    let n = sorted.len();
    let median = sorted[n / 2];
    let mean = sorted.iter().sum::<Duration>() / n as u32;
    let min = sorted[0];
    let max = sorted[n - 1];
    println!("median  {median:>10.2?}");
    println!("mean    {mean:>10.2?}");
    println!("min     {min:>10.2?}");
    println!("max     {max:>10.2?}");
}
```

The above reports, for example:

```
warmup=3  iterations=20
burn=502910e2a7 burn-onnx=19eedf7141
model=rf_detr  input=[1, 3, 560, 560]
median     25.97ms
mean       25.99ms
min        25.30ms
max        27.57ms
```

How to check TensorRT

`trtexec --onnx=my_model.onnx --best --useCudaGraph` will optimize the model and display benchmark numbers after a while.

  • my_model.onnx is the model you want to benchmark
  • --best lets TensorRT optimize the model with whatever precision (fp16, bf16, int4, ...) is appropriate
  • --useCudaGraph lets TensorRT, once optimized, launch the entire model as a single captured CUDA graph, so it does not need to round-trip to the CPU between kernel launches; this eliminates many small launch overheads
