Commit 0704ae3
Add a fast path for _clone_dim_order (#15815)
### Summary
Add a direct memcpy fast path for the portable _clone_dim_order op, as
it can be a performance bottleneck. I'd like to optimize these ops out
of the graph more aggressively, but this fast path should reduce the
perf impact significantly.
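To illustrate the idea, here is a minimal sketch of a memcpy fast path, not the actual change in this commit. It assumes the fast path applies when the input and output share the same dim order, which this description does not state explicitly. `SimpleTensor`, `strides_from_dim_order`, and `clone_dim_order` are hypothetical names for this example and are not ExecuTorch APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a tensor; not an ExecuTorch type.
struct SimpleTensor {
  std::vector<float> data;             // contiguous storage
  std::vector<std::size_t> sizes;      // logical shape
  std::vector<std::size_t> dim_order;  // dims listed outermost to innermost
};

// Element strides per logical dimension implied by a dim order.
std::vector<std::size_t> strides_from_dim_order(
    const std::vector<std::size_t>& sizes,
    const std::vector<std::size_t>& dim_order) {
  std::vector<std::size_t> strides(sizes.size(), 0);
  std::size_t running = 1;
  for (std::size_t i = dim_order.size(); i-- > 0;) {
    strides[dim_order[i]] = running;
    running *= sizes[dim_order[i]];
  }
  return strides;
}

// Copies `in` into `out`, honoring each tensor's dim order.
// Returns true when the memcpy fast path was taken (assumed condition:
// identical dim orders, so the memory layouts match byte for byte).
bool clone_dim_order(const SimpleTensor& in, SimpleTensor& out) {
  assert(in.sizes == out.sizes);
  assert(in.data.size() == out.data.size());

  if (in.dim_order == out.dim_order) {
    if (!in.data.empty()) {
      std::memcpy(out.data.data(), in.data.data(),
                  in.data.size() * sizeof(float));
    }
    return true;
  }

  // Slow path: visit every logical index and remap its offset.
  const auto in_strides = strides_from_dim_order(in.sizes, in.dim_order);
  const auto out_strides = strides_from_dim_order(out.sizes, out.dim_order);
  std::vector<std::size_t> idx(in.sizes.size(), 0);
  for (std::size_t n = 0; n < in.data.size(); ++n) {
    std::size_t in_off = 0;
    std::size_t out_off = 0;
    for (std::size_t d = 0; d < idx.size(); ++d) {
      in_off += idx[d] * in_strides[d];
      out_off += idx[d] * out_strides[d];
    }
    out.data[out_off] = in.data[in_off];
    // Advance the multi-index, last logical dimension fastest.
    for (std::size_t d = idx.size(); d-- > 0;) {
      if (++idx[d] < in.sizes[d]) break;
      idx[d] = 0;
    }
  }
  return false;
}
```

The per-element index arithmetic in the slow path is the kind of work a single bulk copy avoids when the layouts already match.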
### Test plan
Existing correctness tests for the _clone_dim_order implementation
should cover it.
For performance, I ran a quick test on an x86 server using a tensor of
shape (1, 128, 256, 256) in the default dim order. This is intended as a
quick smoke test, not a proper benchmark. I included numbers for both
optimized and debug builds. Optimized matters more, but very long debug
runs can be painful for development.
[Optimized Build]
Before: 27.9 ms
After: 6.4 ms
[Debug Build]
Before: 5947.01 ms
After: 7.2 ms
1 parent c247604, commit 0704ae3
1 file changed, +40 -8 lines changed
[Diff body not preserved; only line numbers survive. Hunks: 3 lines added at new lines 13-15; 24 lines added at new lines 25-48; old lines 58-64 replaced by new lines 85-96; old line 80 replaced by new line 112.]