[AutoDeploy][Feature] Overlap scheduling + speculative decoding in AutoDeploy #9421

@govind-ramnarayan

Description

🚀 The feature, motivation and pitch

This issue scopes out what it would take to support speculative decoding together with overlap scheduling in AutoDeploy.

Alternatives

No response

Additional context

The overlap scheduler is currently enabled for speculative decoding with the PyTorch backend only when using the chain drafter together with the TRTLLM attention backend. AutoDeploy does not support the TRTLLM attention backend. The question is: can AutoDeploy be made to support the overlap scheduler with speculative decoding without also having to support the TRTLLM attention backend?
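
To make the constraint above concrete, here is a minimal sketch of the gating condition as this issue describes it. Every name in it (`SpecDecodeSetup`, `overlap_scheduler_allowed`, and the string values) is hypothetical and chosen for illustration; none of it is taken from the TensorRT-LLM codebase, and the real check may be structured quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecDecodeSetup:
    """Hypothetical summary of a runtime setup (not a TensorRT-LLM type)."""

    backend: str            # placeholder values: "pytorch", "autodeploy"
    drafter: str            # placeholder value: "chain"
    attention_backend: str  # placeholder values: "trtllm", "non_trtllm"


def overlap_scheduler_allowed(setup: SpecDecodeSetup) -> bool:
    """Return True only for the combination this issue says is supported today.

    Assumption: the condition is a plain conjunction of the three
    requirements named in the issue text.
    """
    return (
        setup.backend == "pytorch"
        and setup.drafter == "chain"
        and setup.attention_backend == "trtllm"
    )


# The combination the issue describes as currently supported.
supported = SpecDecodeSetup("pytorch", "chain", "trtllm")

# AutoDeploy without the TRTLLM attention backend: the case being scoped.
autodeploy = SpecDecodeSetup("autodeploy", "chain", "non_trtllm")

print(overlap_scheduler_allowed(supported))
print(overlap_scheduler_allowed(autodeploy))
```

Under this reading, the feature request amounts to asking whether the attention-backend requirement can be relaxed for AutoDeploy, rather than adding TRTLLM attention support to AutoDeploy.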

Metadata

Labels

  • AutoDeploy<NV>: AutoDeploy Backend
  • Speculative Decoding<NV>: MTP/Eagle/Medusa/Lookahead/Prompt-Lookup-Decoding/Draft-Target-Model/ReDrafter
  • feature request: New feature or request. This includes new model, dtype, functionality support

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
