Commit caa0094
backends/cuda: use async malloc/free (pytorch#14976)
Linux perf showed a device synchronize in aoti_torch_delete_tensor_object.
This change appears to significantly improve the latency self-reported by
voxtral_runner, as invoked in
https://github.com/pytorch/executorch/blob/main/.github/workflows/cuda.yml#L111-L172:
Baseline:
Run latency (ms):
audio_encoder: 575.797
token_embedding: 14.571
text_decoder: 3095.356
With this PR:
Run latency (ms):
audio_encoder: 175.807
token_embedding: 8.799
text_decoder: 344.367
1 parent 7395999 commit caa0094
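
To put the reported numbers in perspective, here is a small sketch that computes the per-stage and overall speedup from the latencies quoted in the commit message. The dictionary and function names are illustrative only and not part of the commit, and the last value is read as 344.367 ms on the assumption that the trailing digit belongs to the "1 parent" line that follows it.

```python
# Latencies in milliseconds, copied from the commit message.
baseline_ms = {
    "audio_encoder": 575.797,
    "token_embedding": 14.571,
    "text_decoder": 3095.356,
}
with_pr_ms = {
    "audio_encoder": 175.807,
    "token_embedding": 8.799,
    "text_decoder": 344.367,  # assumed split of the fused "344.3671 parent" text
}


def speedups(before, after):
    """Return before/after latency ratios keyed by stage name."""
    return {name: before[name] / after[name] for name in before}


per_stage = speedups(baseline_ms, with_pr_ms)
total_speedup = sum(baseline_ms.values()) / sum(with_pr_ms.values())

for name, factor in per_stage.items():
    print(f"{name}: {factor:.2f}x")
print(f"total: {total_speedup:.2f}x")
```

On these figures the ratios come out to roughly 3.3x for audio_encoder, 1.7x for token_embedding, and 9.0x for text_decoder, or about 7.0x over the summed latency. These are self-reported numbers from what appears to be a single run each, so they are best treated as indicative rather than as a rigorous benchmark.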
1 file changed, +8 -5 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 225 | 225 | |
| 226 | 226 | |
| 227 | 227 | |
| 228 | | - |
| | 228 | + |
| 229 | 229 | |
| 230 | 230 | |
| 231 | 231 | |
| | | |
| 328 | 328 | |
| 329 | 329 | |
| 330 | 330 | |
| 331 | | - |
| 332 | | - |
| 333 | | - |
| 334 | | - |
| | 331 | + |
| | 332 | + |
| | 333 | + |
| 335 | 334 | |
| | 335 | + |
| | 336 | + |
| | 337 | + |
| | 338 | + |
| 336 | 339 | |
| 337 | 340 | |
| 338 | 341 | |