Speculative Decoding + ktransformers #1362

kingyaaayaaa · 2025-06-04T10:18:10Z

ktransformers leaves the GPU underutilized. A 24 GB GPU is enough to store and run a small 7B model, so that model could serve as the draft model for speculative decoding to accelerate inference. Is this idea feasible? Currently, the ktransformers backend does not support batch inference, which makes it challenging for the large model to verify the drafted tokens in parallel, but I don't think this is an unsolvable problem.
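To make the idea concrete, here is a minimal sketch of greedy speculative decoding. It does not use any ktransformers API. `draft_model` and `target_model` are hypothetical toy functions standing in for the 7B draft model and the large model, and the verification step is written as a loop where a real backend would need a single forward pass over `k + 1` positions, which is the multi-token capability the batching limitation affects.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(tokens: List[int]) -> int:
    # Toy stand-in for the large model (assumption, not a real LLM).
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_model(tokens: List[int]) -> int:
    # Toy stand-in for the small draft model: it agrees with the target
    # except when the context length is a multiple of 4.
    guess = target_model(tokens)
    return (guess + 1) % 11 if len(tokens) % 4 == 0 else guess


def greedy_decode(prompt: List[int], target: Model, max_new_tokens: int) -> List[int]:
    # Baseline: one target call per generated token.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    prompt: List[int], draft: Model, target: Model, max_new_tokens: int, k: int = 4
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target predicts the next token after each of the k + 1 prefixes.
        #    In a real system this is ONE forward pass over k + 1 positions;
        #    the loop here only simulates that pass.
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        target_passes += 1

        # 3. Accept the longest prefix of the proposal that the target agrees with.
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1

        # 4. Keep the accepted tokens plus the target's own token at the next
        #    position (a correction on mismatch, a bonus token otherwise).
        tokens.extend(proposal[:n_accept])
        tokens.append(verified[n_accept])
    return tokens[:goal], target_passes


if __name__ == "__main__":
    out, passes = speculative_decode([1, 2, 3], draft_model, target_model, 16)
    print(out == greedy_decode([1, 2, 3], target_model, 16), passes)
```

With greedy decoding the output matches what the target model would produce on its own, and the number of target passes drops whenever the draft's guesses are accepted. The sampling variant replaces the equality check in step 3 with a rejection-sampling test on the two models' token probabilities.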

Replies: 0 comments
