[RFC]: Implement Elastic Speculation: Adaptive Draft Length + Confidence-Based Early Exit #4203

@HF-001

Motivation.

Elastic Speculation is an adaptive control layer for EAGLE speculative decoding. According to the linked write-up, it delivers double-digit latency improvements over fixed-length speculation while reducing KV-cache DRAM traffic. Reference: https://iluvatarlabs.com/blog/2025/11/elastic-speculation/#confidence-based-early-exit-cutting-speculative-kv-writes

Proposed Change.

Two independent features:

Adaptive Draft Length: Dynamically adjusts speculation depth based on acceptance rates
Confidence-Based Early Exit: Gates KV writes for low-confidence draft tokens
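
To make the two features concrete, here is a minimal sketch of how each could behave. It is only an illustration under stated assumptions, not the proposed implementation: the class `AdaptiveDraftLength`, the function `kv_write_mask`, the thresholds, and the smoothing factor are all hypothetical names and values, and the rule that every position after the first low-confidence token is also gated is an assumption about what "early exit" means here.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveDraftLength:
    """Hypothetical controller that picks the next draft length
    from a smoothed acceptance rate. All defaults are assumed values."""
    min_len: int = 1
    max_len: int = 8
    current: int = 4
    raise_above: float = 0.8   # grow the draft when acceptance is high
    lower_below: float = 0.4   # shrink the draft when acceptance is low
    ema_decay: float = 0.5     # weight given to the previous smoothed rate
    ema_rate: float = 0.5      # smoothed acceptance rate, starts neutral

    def update(self, num_accepted: int, num_drafted: int) -> int:
        """Record one verification step and return the next draft length."""
        if num_drafted > 0:
            rate = num_accepted / num_drafted
            self.ema_rate = (self.ema_decay * self.ema_rate
                             + (1.0 - self.ema_decay) * rate)
        if self.ema_rate > self.raise_above:
            self.current = min(self.current + 1, self.max_len)
        elif self.ema_rate < self.lower_below:
            self.current = max(self.current - 1, self.min_len)
        return self.current


def kv_write_mask(confidences: list[float], threshold: float = 0.5) -> list[bool]:
    """Return True for draft positions whose KV entries would be written.

    Assumption: once one draft token falls below the threshold, that
    position and all later ones are gated, since later tokens depend on it.
    """
    mask = []
    active = True
    for conf in confidences:
        if conf < threshold:
            active = False
        mask.append(active)
    return mask


if __name__ == "__main__":
    ctrl = AdaptiveDraftLength()
    for accepted, drafted in [(4, 4), (4, 4), (0, 5), (0, 5)]:
        print("next draft length:", ctrl.update(accepted, drafted))
    print("KV write mask:", kv_write_mask([0.9, 0.7, 0.3, 0.8]))
```

The two pieces share no state, which matches the description of them as independent features: the controller only consumes acceptance counts from the verifier, and the mask only consumes per-token draft confidences.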

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response
