Conversation

@vklimkov-nvidia (Collaborator) commented:
Speeds up inference by speeding up sampling for multi-vocab decoding. We are currently CPU-bound during sampling, and the biggest improvement comes from running a single custom CFG kernel per decoder iteration instead of computing the updated logits for each request separately.
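A minimal sketch of the idea, assuming the standard classifier-free-guidance blend `u + scale * (c - u)`: one fused kernel launch covers every active request in the batch, rather than a per-request logit update driven from the host loop. The kernel and function names (`cfgBlendKernel`, `launchCfgBlend`) and the exact layout are hypothetical, not the code in this PR.

```cuda
// Hypothetical fused CFG logit update: one launch per decoder iteration
// for the whole batch, instead of a separate update per request.
#include <cuda_runtime.h>

__global__ void cfgBlendKernel(const float* __restrict__ condLogits,    // [numRequests, vocabSize]
                               const float* __restrict__ uncondLogits,  // [numRequests, vocabSize]
                               float* __restrict__ outLogits,           // [numRequests, vocabSize]
                               const float* __restrict__ guidanceScale, // [numRequests]
                               int numRequests,
                               int vocabSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = numRequests * vocabSize;
    if (idx >= total) return;

    int req = idx / vocabSize;
    float c = condLogits[idx];
    float u = uncondLogits[idx];
    // Standard CFG blend, applied element-wise across the vocabulary.
    outLogits[idx] = u + guidanceScale[req] * (c - u);
}

void launchCfgBlend(const float* condLogits, const float* uncondLogits,
                    float* outLogits, const float* guidanceScale,
                    int numRequests, int vocabSize, cudaStream_t stream) {
    // A single grid covers all requests, avoiding a host-side loop over requests.
    int total = numRequests * vocabSize;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    cfgBlendKernel<<<blocks, threads, 0, stream>>>(
        condLogits, uncondLogits, outLogits, guidanceScale, numRequests, vocabSize);
}
```

The point of the fusion is to remove the per-request CPU work from the sampling path: the blend is cheap on the GPU, and launching it once per decoder iteration keeps the host loop off the critical path.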

@rmittal-github (Owner) commented:

@styagi130 @anand-nv to review the changes
