Commit 9724ea9
Attention mask tweaks for better long context performance (#825)
* Parallelize mask
We see non-negligible prompt processing (PP) gains for long contexts. More importantly, the strange drop in performance observed for GPT-OSS at contexts >= 32k tokens is gone. (A threaded mask-build sketch follows the commit message.)
* With flash attention (FA) on, create the mask as f16 directly (sketched below)
* WIP
* Reduce KQ mask padding to 16
Why was it 64 in the first place? I don't observe any issues, while token generation (TG) performance for long contexts improves by 2-4%. (See the padding sketch below.)
---------
Co-authored-by: Iwan Kawrakow <[email protected]>

1 parent 1db0c49 · commit 9724ea9
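Editorial note on the first change: parallelizing the mask build. The sketch below is not the code from this commit; the function names, the n_tokens-rows-by-n_kv-columns mask layout, and the simple causal rule are illustrative assumptions. What it shows is that each token's mask row is independent of every other row, so the fill loop can be split across threads, which pays off at long contexts where n_kv, and hence each row, is large.

```cpp
// Minimal sketch (not the repo's code): build a causal KQ mask in parallel.
// Layout assumption: n_tokens rows of n_kv floats, 0 = visible, -INF = masked.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <thread>
#include <vector>

static void fill_kq_mask_rows(float * mask, int64_t n_kv,
                              int64_t row_begin, int64_t row_end,
                              const int32_t * pos /* position of each token */) {
    for (int64_t i = row_begin; i < row_end; ++i) {
        float * row = mask + i*n_kv;
        for (int64_t j = 0; j < n_kv; ++j) {
            // illustrative causal rule: cell j is visible iff it is not past token i
            row[j] = (j <= pos[i]) ? 0.0f : -INFINITY;
        }
    }
}

static void build_kq_mask_parallel(float * mask, int64_t n_kv, int64_t n_tokens,
                                   const int32_t * pos, int n_threads) {
    std::vector<std::thread> workers;
    const int64_t chunk = (n_tokens + n_threads - 1)/n_threads;
    for (int t = 0; t < n_threads; ++t) {
        const int64_t begin = t*chunk;
        const int64_t end   = std::min<int64_t>(begin + chunk, n_tokens);
        if (begin >= end) break;
        workers.emplace_back(fill_kq_mask_rows, mask, n_kv, begin, end, pos);
    }
    for (auto & w : workers) w.join();
}
```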
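For the second change: with flash attention the mask tensor is consumed as f16, so instead of filling an f32 buffer and converting it in a second pass, the values can be written as ggml_fp16_t directly. ggml_fp32_to_fp16 is ggml's scalar conversion helper; the function below is otherwise the same illustrative fill loop as above, not the commit's actual code.

```cpp
// Sketch: write the f16 mask directly on the flash-attention path.
#include "ggml.h"   // ggml_fp16_t, ggml_fp32_to_fp16
#include <cmath>
#include <cstdint>

static void fill_kq_mask_rows_f16(ggml_fp16_t * mask, int64_t n_kv,
                                  int64_t row_begin, int64_t row_end,
                                  const int32_t * pos) {
    // the mask only ever holds two values; convert them once, outside the hot loop
    const ggml_fp16_t zero    = ggml_fp32_to_fp16(0.0f);
    const ggml_fp16_t neg_inf = ggml_fp32_to_fp16(-INFINITY);
    for (int64_t i = row_begin; i < row_end; ++i) {
        ggml_fp16_t * row = mask + i*n_kv;
        for (int64_t j = 0; j < n_kv; ++j) {
            row[j] = (j <= pos[i]) ? zero : neg_inf;
        }
    }
}
```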
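For the padding change, a sketch of what the pad constant controls, assuming (as in mainline llama.cpp) that the flash-attention mask's token dimension is rounded up to a multiple of GGML_KQ_MASK_PAD; the constant and helper below are illustrative. During token generation n_tokens is 1, so with a pad of 64 the mask still holds 64·n_kv entries to fill every step; cutting the pad to 16 shrinks that by 4x, which is consistent with the 2-4% TG improvement reported above for long contexts.

```cpp
// Sketch: size of the (padded) KQ mask used by the flash-attention kernels.
#include <cstdint>

#define KQ_MASK_PAD 16   // illustrative constant; was 64 before this commit

// round x up to the next multiple of n (same shape as ggml's GGML_PAD macro)
static inline int64_t pad_up(int64_t x, int64_t n) { return (x + n - 1) & ~(n - 1); }

static int64_t kq_mask_elements(int64_t n_kv, int64_t n_tokens) {
    // mask is n_kv columns by pad_up(n_tokens, KQ_MASK_PAD) rows
    return n_kv * pad_up(n_tokens, KQ_MASK_PAD);
}
```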
3 files changed, +277 −25 lines changed