train.py: 14 additions & 3 deletions
```diff
@@ -32,15 +32,16 @@
 KV_HEADS = 4

 USE_SPARSE_ATTN = True
-USE_FLEX_FOR_FINE_SELECTION = True  # will push flex a bit; won't be efficient, as each layer needs its sparsity dynamically generated, but may be enough just to compare to full attention before going all-in on triton kernels
+USE_TRITON_NSA = True
+USE_FLEX_FOR_FINE_SELECTION = False  # will push flex a bit; won't be efficient, as each layer needs its sparsity dynamically generated, but may be enough just to compare to full attention before going all-in on triton kernels

 QUERY_HEADS_SHARE_SELECTION = False  # if set to False, each query head can look at a different segment of its corresponding key / value head in GQA
```
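For context, here is a minimal sketch of how flags like these typically gate the attention backend at model-construction time. This is an illustration, not the repo's actual wiring: `full_attend`, `triton_nsa_attend`, and `flex_fine_attend` are hypothetical stand-ins for dense attention, the fused Triton NSA kernel, and the flex-attention fine-selection path.

```python
import torch.nn.functional as F

# Hypothetical stand-ins for the three attention paths. In the real train.py
# these would be dense attention, the fused Triton NSA kernel, and the
# flex-attention path with a dynamically generated block mask per layer.
def full_attend(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)

def triton_nsa_attend(q, k, v):
    raise NotImplementedError("fused Triton NSA kernel goes here")

def flex_fine_attend(q, k, v):
    raise NotImplementedError("flex attention with per-layer block mask goes here")

def pick_attention_backend(use_sparse, use_triton_nsa, use_flex_fine):
    """Resolve the config flags to a single attention callable.

    USE_TRITON_NSA takes precedence over the slower flex comparison path.
    """
    if not use_sparse:
        return full_attend           # dense baseline
    if use_triton_nsa:
        return triton_nsa_attend     # fused Triton kernel
    if use_flex_fine:
        return flex_fine_attend      # comparison path before going all-in on triton
    return full_attend

# Matches the new config: sparse on, Triton NSA on, flex fine selection off.
attend = pick_attention_backend(True, True, False)
```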
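And a shape-level illustration of what `QUERY_HEADS_SHARE_SELECTION` changes under GQA; the query-head and block counts here are assumptions for the example, and the index-tensor layout is a guess at how per-head selection might be represented, not the repo's actual data structure.

```python
import torch

BATCH, SEQ_BLOCKS, NUM_SELECTED = 2, 64, 16
KV_HEADS = 4       # from the config above
QUERY_HEADS = 16   # assumed for illustration: 4 query heads per key/value head

# QUERY_HEADS_SHARE_SELECTION = True: one set of selected block indices per
# key/value head, broadcast to all query heads grouped under it.
shared_indices = torch.randint(0, SEQ_BLOCKS, (BATCH, KV_HEADS, NUM_SELECTED))

# QUERY_HEADS_SHARE_SELECTION = False: each query head carries its own
# indices, so it can attend to a different segment of its key/value head.
per_query_indices = torch.randint(0, SEQ_BLOCKS, (BATCH, QUERY_HEADS, NUM_SELECTED))
```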