ikawrakow
Owner

This PR is analogous to #702 and implements the optimization for the CUDA back-end.

Here are performance comparisons between the main branch and this PR for Q4_0-quantized GPT-OSS-20B running on an RTX-4080 GPU:

Prompt processing

[Figure: u2pp]

Token generation

[Figure: u2tg]
