tools/llm/README.md: 4 additions & 1 deletion
@@ -74,11 +74,14 @@ This codebase can be extended to
## Limitations
- We do not currently support sliding window attention (used in Gemma3 and Qwen3 models).
+ - **Flash Attention Limitation**: Some models (e.g., Eagle2-2B) internally use flash attention operations (`torch.ops.flash_attn._flash_attn_forward.default`), which require the `flash-attn` package to be installed. Without flash-attn, these models will fail to load or run properly.
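A minimal pre-flight check for this limitation is sketched below. It assumes only that the optional dependency is importable as `flash_attn` once installed; the check itself and the error message are illustrative, not part of the tool:

```python
# Minimal sketch: verify the flash-attn package is available before loading a
# model (e.g., Eagle2-2B) that lowers to
# torch.ops.flash_attn._flash_attn_forward.default.
import importlib.util

if importlib.util.find_spec("flash_attn") is None:
    raise RuntimeError(
        "This model requires the 'flash-attn' package "
        "(pip install flash-attn) to load and run."
    )
```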
- # Get the max sequence length from the first key_cache node. The order of nodes is: input_ids, is_causal, key_cache1, value_cache1, key_cache2, value_cache2, ...
+ # Get the max sequence length from the first key_cache node. The order of nodes is: input_ids, position_ids, key_cache1, value_cache1, ...
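As a hedged illustration of the updated comment, the sketch below reads the max sequence length from the first key_cache placeholder of an exported graph whose inputs follow the order (input_ids, position_ids, key_cache1, value_cache1, ...). The use of torch.fx placeholders, the `meta["val"]` fake-tensor metadata, and the (batch, num_heads, max_seq_len, head_dim) cache layout are assumptions for illustration, not the tool's confirmed implementation:

```python
# Hypothetical sketch: read max_seq_len from the first key_cache input of a
# torch.fx graph whose placeholders are ordered as
# (input_ids, position_ids, key_cache1, value_cache1, ...).
import torch

def get_max_seq_len(gm: torch.fx.GraphModule) -> int:
    placeholders = [n for n in gm.graph.nodes if n.op == "placeholder"]
    # key_cache1 is the third placeholder under the assumed input ordering.
    key_cache1 = placeholders[2]
    # Assumed cache shape: (batch, num_heads, max_seq_len, head_dim).
    return key_cache1.meta["val"].shape[2]
```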