Commit 742595b
authored
Speedup Llama2 cpu throughput in bench by 1.69x with iobinding (#19853)
### Description
Always set `use_io_binding=True` when using optimum.onnxruntime unless
there is a special case.
### Motivation and Context
By default, `ORTModel` under optimum.onnxruntime will choose the
appropriate `use_io_binding` value based on provider and use cases.
> use_io_binding (`Optional[bool]`, defaults to `None`):
> Whether to use IOBinding during inference to avoid memory copy between
the host and device, or between numpy/torch tensors and ONNX Runtime
ORTValue. Defaults to
> `True` if the execution provider is CUDAExecutionProvider. For
[~onnxruntime.ORTModelForCausalLM], defaults to `True` on
CPUExecutionProvider,
> in all other cases defaults to `False`.
For Llama token benchmark, using iobinding yields almost 2x speedup,
even on CPU. This is because this particular model yields a large number
of outputs (>60). Without iobinding, a copy is performed for each output
from ortvalue to numpy array. This adds significant overhead to the
overall run time.
```
Evaluating Llama2 `model(inputs)` step with past_key_values
Before, w/o iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.4518657898902893 s
Throughput: 2.2130464894073856 tps
After, w/ iobinding on cpu
Batch Size: 1
Sequence Length: 512
Latency: 0.2662619352340698 s
Throughput: 3.7557001871893703 tps
```1 parent d4fa4f0 commit 742595b
File tree
3 files changed
+6
-6
lines changed- onnxruntime/python/tools/transformers/models
- llama
- stable_diffusion
- whisper
3 files changed
+6
-6
lines changedLines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
243 | 243 | | |
244 | 244 | | |
245 | 245 | | |
246 | | - | |
| 246 | + | |
247 | 247 | | |
248 | 248 | | |
249 | 249 | | |
| |||
Lines changed: 4 additions & 4 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
315 | 315 | | |
316 | 316 | | |
317 | 317 | | |
318 | | - | |
| 318 | + | |
319 | 319 | | |
320 | 320 | | |
321 | 321 | | |
322 | 322 | | |
323 | 323 | | |
324 | | - | |
| 324 | + | |
325 | 325 | | |
326 | 326 | | |
327 | 327 | | |
328 | 328 | | |
329 | 329 | | |
330 | 330 | | |
331 | 331 | | |
332 | | - | |
| 332 | + | |
333 | 333 | | |
334 | 334 | | |
335 | 335 | | |
336 | 336 | | |
337 | 337 | | |
338 | 338 | | |
339 | 339 | | |
340 | | - | |
| 340 | + | |
341 | 341 | | |
342 | 342 | | |
343 | 343 | | |
| |||
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
145 | 145 | | |
146 | 146 | | |
147 | 147 | | |
148 | | - | |
149 | 148 | | |
150 | 149 | | |
151 | 150 | | |
| 151 | + | |
152 | 152 | | |
153 | 153 | | |
154 | 154 | | |
| |||
0 commit comments