Conversation

@vklimkov-nvidia (Collaborator) commented:
Speeds up inference by speeding up sampling for multi-vocab decoding. We are currently CPU-bound during sampling, and the biggest improvement comes from running a single custom CFG kernel per decoder iteration instead of computing the updated logits for each request separately.
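A minimal sketch of the idea, assuming the standard classifier-free-guidance blend `u + scale * (c - u)`: one fused kernel launch covers every active request in the batch, rather than a per-request logit update driven from the host loop. The kernel and function names (`cfgBlendKernel`, `launchCfgBlend`) and the exact layout are hypothetical, not the code in this PR.

```cuda
// Hypothetical fused CFG logit update: one launch per decoder iteration
// for the whole batch, instead of a separate update per request.
#include <cuda_runtime.h>

__global__ void cfgBlendKernel(const float* __restrict__ condLogits,    // [numRequests, vocabSize]
                               const float* __restrict__ uncondLogits,  // [numRequests, vocabSize]
                               float* __restrict__ outLogits,           // [numRequests, vocabSize]
                               const float* __restrict__ guidanceScale, // [numRequests]
                               int numRequests,
                               int vocabSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = numRequests * vocabSize;
    if (idx >= total) return;

    int req = idx / vocabSize;
    float c = condLogits[idx];
    float u = uncondLogits[idx];
    // Standard CFG blend, applied element-wise across the vocabulary.
    outLogits[idx] = u + guidanceScale[req] * (c - u);
}

void launchCfgBlend(const float* condLogits, const float* uncondLogits,
                    float* outLogits, const float* guidanceScale,
                    int numRequests, int vocabSize, cudaStream_t stream) {
    // A single grid covers all requests, avoiding a host-side loop over requests.
    int total = numRequests * vocabSize;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    cfgBlendKernel<<<blocks, threads, 0, stream>>>(
        condLogits, uncondLogits, outLogits, guidanceScale, numRequests, vocabSize);
}
```

The point of the fusion is to remove the per-request CPU work from the sampling path: the blend is cheap on the GPU, and launching it once per decoder iteration keeps the host loop off the critical path.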

@rmittal-github (Owner) commented:

@styagi130 @anand-nv to review the changes
