Skip to content

Latest commit

 

History

History
25 lines (19 loc) · 1004 Bytes

File metadata and controls

25 lines (19 loc) · 1004 Bytes

Accelerating LLM inference with SPS and BiLD

Speculative Decoding has emerged as a pivotal approach in enhancing the efficiency of Large Language Models (LLMs), addressing the critical challenge of inference latency primarily caused by memory-bound computation limitations. The motivation is to explore how Speculative Decoding can be effectively adapted across different model series and configurations.

This repo aims to implement two algorithms: (1) Deepmind's Algorithm: Speculative Sampling (SpS) with Auto-Regressive Target and Draft Models (2) Big Little Decoder Algorithm(BiLD) with Fallback and Rollback Policies

To run the experiment:

python benchmark.py    \
 --target_model_name facebook/opt-6.7b     \
 --approx_model_name facebook/opt-125m     \
 --temperature 0     \
 --max_tokens 30    \
 --fallback_thres 0.6 \
 --rollback_thres 3

This is a project repo for eecs598:llm course.

speed up plot