feat: enable trtllm-gen attn speculative decoding verify by decode (#1453)
<!-- .github/pull_request_template.md -->
## 📌 Description
Enable the trtllm-gen attention decode path to verify speculative-decoding drafts, i.e. decode with `q_len > 1` query tokens per request.
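The verification step this decode path serves can be sketched in plain NumPy (a hypothetical helper, not the FlashInfer API): the target model decodes all `q_len` draft tokens of a request in one call, and the longest prefix of draft tokens matching the target's greedy predictions is accepted.

```python
import numpy as np

def verify_draft(draft_tokens, target_tokens):
    """Greedy speculative-decoding verification (illustrative sketch only).

    draft_tokens  : (q_len,) tokens proposed by the draft model
    target_tokens : (q_len,) argmax of the target model's logits at each
                    draft position, produced by one decode call that
                    processes all q_len query tokens of the request
    Returns (n_accepted, correction_token); correction_token is None when
    every draft token was accepted.
    """
    matches = draft_tokens == target_tokens
    if matches.all():
        return len(draft_tokens), None
    n_accepted = int(np.argmin(matches))  # index of the first mismatch
    return n_accepted, int(target_tokens[n_accepted])
```

With drafts `[5, 7, 9]` and target predictions `[5, 7, 2]`, the first two draft tokens are accepted and `2` is committed as the correction token.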
## 🔍 Related Issues
<!-- Link any related issues here -->
## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).
## Reviewer Notes
<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
---------
Co-authored-by: Zihao Ye <[email protected]>
The change extends the docstring of `run` (`r"""Compute batch decode attention between query and paged kv cache."""`):

```diff
@@ -1183,6 +1184,8 @@ def run(
         enable_pdl : bool
             Whether to enable Programmatic Dependent Launch (PDL). See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programmatic-dependent-launch-and-synchronization
             Only supported for >= sm90, and currently only for FA2 and CUDA core decode.
+        q_len_per_req : int
+            The number of query tokens per request, if not provided, will be set to ``1``.
```
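With `q_len_per_req > 1`, each request contributes several query rows to one decode call. A NumPy sketch of the resulting flattened layout (shapes are illustrative assumptions, not FlashInfer's documented layout):

```python
import numpy as np

# Hypothetical sizes for illustration.
batch_size, q_len_per_req, num_heads, head_dim = 2, 4, 8, 64

# Per-request draft queries: q_len_per_req tokens for each request...
q = np.random.rand(batch_size, q_len_per_req, num_heads, head_dim).astype(np.float32)

# ...flattened so the decode kernel sees batch_size * q_len_per_req query rows,
# each attending to its request's paged KV cache.
q_flat = q.reshape(batch_size * q_len_per_req, num_heads, head_dim)
print(q_flat.shape)  # (8, 8, 64)
```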