
Commit b5cde72

tianleiwu authored and guschmue committed
Dynamo export and improve benchmark script for SAM2 encoder (#23887)
### Description

* Add dynamo export for the SAM2 image encoder.
* Verify the fp32 onnx model with the CPU EP (to avoid error messages from the TRT EP).
* Update the benchmark script:
  - output ORT profiling
  - output torch compiled code and a unique kernel name for each compiled kernel
  - add an option for nightly package installation
  - uninstall existing ort packages before installing

The node metadata of the dynamo-exported model can help map nodes in the onnx model back to the pytorch modeling script. Graph optimization is not yet done on the dynamo-exported model, so the export is experimental for now.

### Motivation and Context

To support profiling of torch-compiled CUDA kernels.
1 parent 19716b1 commit b5cde72

File tree

8 files changed: +315 −135 lines

onnxruntime/python/tools/transformers/models/sam2/README.md

Lines changed: 21 additions & 10 deletions
````diff
@@ -96,8 +96,7 @@ We can create a conda environment then run GPU benchmark like the following:
 conda create -n sam2_gpu python=3.11 -y
 conda activate sam2_gpu
 install_dir=$HOME
-profiling=true
-bash benchmark_sam2.sh $install_dir gpu $profiling
+bash benchmark_sam2.sh $install_dir gpu
 ```
 
 or create a new conda environment for CPU benchmark:
````
````diff
@@ -107,16 +106,28 @@ conda activate sam2_cpu
 bash benchmark_sam2.sh $HOME cpu
 ```
 
-The first parameter is a directory to clone git repositories or install CUDA/cuDNN for benchmark.
-The second parameter can be either "gpu" or "cpu", which indicates the device to run benchmark.
-The third parameter is optional. Value "true" will enable profiling after running benchmarking on GPU.
+The usage of the script is as follows:
+```
+bash benchmark_sam2.sh <install_dir> <cpu_or_gpu> [profiling] [benchmarking] [nightly] [dynamo]
+```
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| install_dir | $HOME | a directory to clone git repositories or install CUDA/cuDNN for benchmark |
+| cpu_or_gpu | gpu | the device to run benchmark; the value can be either "gpu" or "cpu" |
+| profiling | false | whether to run GPU profiling |
+| benchmarking | true | whether to run benchmark |
+| nightly | false | install the onnxruntime nightly package instead of the official release |
+| dynamo | false | whether to export the image encoder using dynamo |
 
-The script will automatically install required packages in current conda environment, download checkpoints, export onnx,
-and run demo, benchmark and optionally run profiling.
+The dynamo export is experimental since graph optimization still needs extra work for this model.
 
-* The performance test result is in sam2_gpu.csv or sam2_cpu.csv, which can be loaded into Excel.
-* The demo output is sam2_demo_fp16_gpu.png or sam2_demo_fp32_cpu.png.
-* The profiling results are in *.nsys-rep or *.json files in current directory. Use Nvidia NSight System to view the *.nsys-rep file.
+Output files:
+* sam2_cpu_[timestamp].csv or sam2_gpu_[timestamp].csv has benchmark results. Use Excel to load the file to view it.
+* onnxruntime_image_[encoder|decoder].json has ONNX Runtime profiling results. Use `chrome://tracing` in Chrome browser to view it.
+* torch_image_[encoder|decoder].json has PyTorch profiling results. Use `chrome://tracing` in Chrome browser to view it.
+* sam2_fp16_profile_image_[encoder|decoder]_[ort|torch]_gpu.[nsys-rep|sqlite] has NVTX profiling. Use Nvidia NSight System to view it.
+* torch_image_encoder_compiled_code.txt has the compiled kernel code from PyTorch.
 
 ## Limitations
 - The exported image_decoder model does not support batch mode for now.
````
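The six positional parameters in the usage line above rely on bash default expansion. A minimal sketch of that defaulting pattern (variable names mirror the README table; this is an illustration, not the actual `benchmark_sam2.sh`):

```shell
# Sketch of positional-parameter defaults matching the README table.
# Run with no arguments, every parameter falls back to its documented default.
install_dir="${1:-$HOME}"
cpu_or_gpu="${2:-gpu}"
profiling="${3:-false}"
benchmarking="${4:-true}"
nightly="${5:-false}"
dynamo="${6:-false}"
echo "device=$cpu_or_gpu profiling=$profiling benchmarking=$benchmarking nightly=$nightly dynamo=$dynamo"
```

So `bash benchmark_sam2.sh $HOME gpu true` would enable profiling while keeping the remaining parameters at their defaults.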

onnxruntime/python/tools/transformers/models/sam2/benchmark_sam2.py

Lines changed: 14 additions & 1 deletion
```diff
@@ -46,6 +46,7 @@ def __init__(
         prefer_nhwc: bool = False,
         warm_up: int = 5,
         enable_nvtx_profile: bool = False,
+        enable_ort_profile: bool = False,
         enable_torch_profile: bool = False,
         repeats: int = 1000,
         verbose: bool = False,
```
```diff
@@ -74,6 +75,7 @@ def __init__(
         self.prefer_nhwc = prefer_nhwc
         self.warm_up = warm_up
         self.enable_nvtx_profile = enable_nvtx_profile
+        self.enable_ort_profile = enable_ort_profile
         self.enable_torch_profile = enable_torch_profile
         self.repeats = repeats
         self.verbose = verbose
```
```diff
@@ -317,6 +319,7 @@ def run_test(
         repeats=args.repeats,
         warm_up=args.warm_up,
         enable_nvtx_profile=args.enable_nvtx_profile,
+        enable_ort_profile=args.enable_ort_profile,
         enable_torch_profile=args.enable_torch_profile,
         torch_compile_mode=args.torch_compile_mode,
         verbose=False,
```
@@ -325,7 +328,7 @@ def run_test(
325328
if args.engine == "ort":
326329
sess_options = SessionOptions()
327330
sess_options.intra_op_num_threads = args.intra_op_num_threads
328-
if config.enable_nvtx_profile:
331+
if config.enable_ort_profile:
329332
sess_options.enable_profiling = True
330333
sess_options.log_severity_level = 4
331334
sess_options.log_verbosity_level = 0
@@ -349,6 +352,8 @@ def run_test(
349352
with nvtx.annotate("one_run"):
350353
_ = session.infer(input_dict)
351354
cudart.cudaProfilerStop()
355+
356+
if config.enable_ort_profile:
352357
session.ort_session.end_profiling()
353358

354359
if repeats == 0:
@@ -554,6 +559,14 @@ def _parse_arguments():
554559
help="Enable nvtx profiling. It will add an extra run for profiling before performance test.",
555560
)
556561

562+
parser.add_argument(
563+
"--enable_ort_profile",
564+
required=False,
565+
default=False,
566+
action="store_true",
567+
help="Enable ORT profiling.",
568+
)
569+
557570
parser.add_argument(
558571
"--enable_torch_profile",
559572
required=False,
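The new flag uses the standard argparse `store_true` pattern. A small standalone sketch of how such a flag behaves (the flag name is copied from the diff; the bare parser is illustrative):

```python
import argparse

# Minimal illustration of the store_true pattern used for --enable_ort_profile:
# the option is False by default and becomes True when the flag is present.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable_ort_profile",
    required=False,
    default=False,
    action="store_true",
    help="Enable ORT profiling.",
)

off = parser.parse_args([])
on = parser.parse_args(["--enable_ort_profile"])
print(off.enable_ort_profile, on.enable_ort_profile)  # False True
```

Note that with `action="store_true"`, the `default=False` is redundant but harmless; argparse already defaults such flags to False.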
