Commit f767983
perf: use bfi.b32 for unaligned Q6K loads — 24 fewer instructions per super-block (Refs GH-131)
Replace shl+or byte assembly in ld_global_u32_unaligned with bfi.b32 bit-field
insert. Each unaligned u32 load saves 6 instructions (9 → 3 for the packing step).
With 4 unaligned loads per Q6K super-block, this reduces instruction overhead by 24
per super-block on sm_87 Jetson Orin.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>1 parent c4d2bea commit f767983
File tree
4 files changed
+42
-13
lines changed- .pmat
- trueno-gpu/src
- kernels/quantize/q6k
- ptx/builder
4 files changed
+42
-13
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | 2 | | |
3 | | - | |
| 3 | + | |
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
376 | 376 | | |
377 | 377 | | |
378 | 378 | | |
| 379 | + | |
| 380 | + | |
379 | 381 | | |
380 | 382 | | |
381 | 383 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
10 | | - | |
| 10 | + | |
11 | 11 | | |
12 | 12 | | |
13 | 13 | | |
| |||
154 | 154 | | |
155 | 155 | | |
156 | 156 | | |
157 | | - | |
| 157 | + | |
158 | 158 | | |
159 | 159 | | |
160 | 160 | | |
161 | 161 | | |
162 | 162 | | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
163 | 167 | | |
164 | 168 | | |
165 | 169 | | |
| |||
173 | 177 | | |
174 | 178 | | |
175 | 179 | | |
176 | | - | |
| 180 | + | |
177 | 181 | | |
178 | 182 | | |
179 | 183 | | |
180 | 184 | | |
181 | | - | |
182 | | - | |
183 | | - | |
184 | | - | |
185 | | - | |
186 | | - | |
187 | | - | |
188 | | - | |
189 | | - | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
190 | 191 | | |
191 | 192 | | |
192 | 193 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
167 | 167 | | |
168 | 168 | | |
169 | 169 | | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
| 191 | + | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
170 | 196 | | |
171 | 197 | | |
172 | 198 | | |
| |||
0 commit comments