Commit 742595b

Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853)
### Description

Always set `use_io_binding=True` when using optimum.onnxruntime unless there is a special case.

### Motivation and Context

By default, `ORTModel` under optimum.onnxruntime chooses the appropriate `use_io_binding` value based on the provider and use case:

> use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to `True` if the execution provider is CUDAExecutionProvider. For [`~onnxruntime.ORTModelForCausalLM`], defaults to `True` on CPUExecutionProvider, in all other cases defaults to `False`.

For the Llama token benchmark, using IOBinding yields almost a 2x speedup, even on CPU. This is because this particular model produces a large number of outputs (>60). Without IOBinding, a copy is performed for each output from ORTValue to a numpy array, which adds significant overhead to the overall run time.

```
Evaluating Llama2 `model(inputs)` step with past_key_values

Before, w/o iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps

After, w/ iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```
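The benchmark figures above are self-consistent with the 1.69x claim in the commit title. A quick sketch, assuming throughput is computed as batch size divided by per-step latency (the helper name `throughput` is ours, not from the benchmark script):

```python
# Benchmark figures quoted above (Llama2, batch size 1, sequence length 512, CPU).
LATENCY_WITHOUT_IOBINDING = 0.4518657898902893  # seconds per step
LATENCY_WITH_IOBINDING = 0.2662619352340698     # seconds per step
BATCH_SIZE = 1


def throughput(batch_size: int, latency_s: float) -> float:
    """Tokens per second for one decoding step: each step emits batch_size tokens."""
    return batch_size / latency_s


before_tps = throughput(BATCH_SIZE, LATENCY_WITHOUT_IOBINDING)
after_tps = throughput(BATCH_SIZE, LATENCY_WITH_IOBINDING)
speedup = after_tps / before_tps  # ~1.697; the title truncates this to 1.69x

print(f"before: {before_tps:.2f} tps, after: {after_tps:.2f} tps, speedup: {speedup:.3f}x")
```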
1 parent d4fa4f0 commit 742595b

File tree

3 files changed: +6 −6 lines


onnxruntime/python/tools/transformers/models/llama/benchmark.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -243,7 +243,7 @@ def get_model(args: argparse.Namespace):
         decoder_file_name=decoder_file_name,
         decoder_with_past_file_name=decoder_with_past_file_name,
         use_auth_token=args.auth,
-        use_io_binding=(args.device != "cpu"),
+        use_io_binding=True,  # Large perf gain even for cpu due to avoiding output copy.
         use_merged=(True if decoder_file_name == "model.onnx" else None),
         provider=provider,
         provider_options=provider_options,
```
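The win here comes from avoiding one host-side copy per model output. A toy sketch of that effect, using plain Python buffers as a stand-in for the ORTValue-to-numpy copies (this is illustrative only, not onnxruntime code; the layer count of 32 assumes a Llama2-7B-style decoder, where logits plus a key and value tensor per layer gives the >60 outputs mentioned in the description):

```python
from copy import deepcopy

NUM_LAYERS = 32  # assumed Llama2-7B decoder depth
# One logits tensor plus a key and value cache tensor per layer -> 65 outputs per step.
outputs = [bytearray(1024) for _ in range(1 + 2 * NUM_LAYERS)]


def fetch_without_iobinding(outs):
    # Stand-in for the default path: every output is copied into a fresh buffer.
    return [deepcopy(o) for o in outs]


def fetch_with_iobinding(outs):
    # Stand-in for the bound path: the same buffers are handed back, no copy.
    return list(outs)


copied = fetch_without_iobinding(outputs)
bound = fetch_with_iobinding(outputs)
print(len(outputs))             # 65 tensors to move on every decoding step
print(copied[0] is outputs[0])  # False: a new buffer was allocated and filled
print(bound[0] is outputs[0])   # True: same buffer, zero-copy
```

Because this copy cost is paid once per output per decoding step, models with large KV caches feel it far more than single-output models, which is why the gain shows up even on CPU.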

onnxruntime/python/tools/transformers/models/stable_diffusion/benchmark.py

Lines changed: 4 additions & 4 deletions

```diff
@@ -315,29 +315,29 @@ def get_optimum_ort_pipeline(
             directory,
             provider=provider,
             session_options=None,
-            use_io_binding=False,
+            use_io_binding=False,  # Not supported by Optimum version 1.17.1 at the time of verification.
         )
     else:
         pipeline = ORTStableDiffusionPipeline.from_pretrained(
             directory,
             provider=provider,
-            use_io_binding=False,
+            use_io_binding=False,  # Not supported by Optimum version 1.17.1 at the time of verification.
         )
 elif "xl" in model_name:
     pipeline = ORTStableDiffusionXLPipeline.from_pretrained(
         model_name,
         export=True,
         provider=provider,
         session_options=None,
-        use_io_binding=False,
+        use_io_binding=False,  # Not supported by Optimum version 1.17.1 at the time of verification.
     )
     pipeline.save_pretrained(directory)
 else:
     pipeline = ORTStableDiffusionPipeline.from_pretrained(
         model_name,
         export=True,
         provider=provider,
-        use_io_binding=False,
+        use_io_binding=False,  # Not supported by Optimum version 1.17.1 at the time of verification.
     )
     pipeline.save_pretrained(directory)
```
onnxruntime/python/tools/transformers/models/whisper/benchmark.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -145,10 +145,10 @@ def get_model(args: argparse.Namespace):
     start_time = time.time()
     model = ORTModelForSpeechSeq2Seq.from_pretrained(
         args.hf_ort_dir_path,
-        use_io_binding=(args.device != "cpu"),
         provider=provider,
         provider_options=provider_options,
         session_options=sess_options,
+        use_io_binding=True,  # Avoid memory copy overhead
     )
     end_time = time.time()
```