Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON #553
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR corresponds to PRs #531, #533, #534, #546, #549, #550, #552, and applies the on-the-fly repacking technique to
the 1-bit quants
IQ1_S
andIQ1_M
onARM_NEON
.Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max
IQ1_M
did not have a fasterIQK
implementation, so the 19 t/s is what one has within the standardggml
GEMM framework.