Made an attempt at implementing Speculative cascades #466
fblissjr started this conversation in Show and tell · Replies: 0 comments
RE: https://research.google/blog/speculative-cascades-a-hybrid-approach-for-smarter-faster-llm-inference/
Paper: https://arxiv.org/abs/2405.19261
My fork with the code: https://github.com/fblissjr/mlx-lm-playground/tree/spec-cascade
It's not at all performance-optimized (which is the whole point of the technique), and it isn't faster than baseline, let alone speculative decoding — I'm sure there's a ton of core stuff missing or flat-out wrong — but I thought I'd share in case anyone else is exploring it.
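For anyone skimming the fork, the core idea of a speculative cascade is that the draft model's tokens aren't verified by strict token matching (as in vanilla speculative decoding) but by a deferral rule that trades quality for speed, controlled by a parameter alpha. Here's a toy, self-contained sketch of that control flow — the deferral rule here (total-variation distance between draft and target distributions) and all the names are my own illustration, not the paper's exact rule or my fork's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def defer_to_target(q, p, alpha):
    """Toy deferral rule (an assumption, not the paper's exact rule):
    keep the draft token when the total-variation distance between the
    draft distribution q and the target distribution p is at most alpha.
    Larger alpha => keep more draft tokens (cheaper, potentially lower quality)."""
    tv = 0.5 * np.abs(p - q).sum()
    return tv > alpha

def cascade_step(q_logits, p_logits, alpha):
    """One decode step: emit the draft token unless the rule defers to the target."""
    q, p = softmax(q_logits), softmax(p_logits)
    if defer_to_target(q, p, alpha):
        return int(p.argmax()), "target"
    return int(q.argmax()), "draft"

# Count how often the draft model's token survives at each alpha,
# using identical synthetic logits per alpha for a fair comparison.
draft_kept = {}
for alpha in (0.2, 0.5, 0.9):
    rng = np.random.default_rng(0)
    kept = 0
    for _ in range(200):
        q_logits = rng.normal(size=32)
        p_logits = q_logits + 0.5 * rng.normal(size=32)  # target ~ draft + noise
        _, who = cascade_step(q_logits, p_logits, alpha)
        kept += who == "draft"
    draft_kept[alpha] = kept
```

Because the threshold is monotone in alpha over the same samples, the draft-acceptance count can only grow as alpha rises — which is the knob being swept in the runs below.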
Results (numbers were attached as screenshots in the original post):
- Baseline
- Baseline w/ speculative decoding
- Spec cascade @ alpha 0.2
- Spec cascade @ alpha 0.5
- Spec cascade @ alpha 0.9