
Conversation

@drollings

  • Tweaked llama-quantize's --leave-output-tensor parameter's impact on llama_model_quantize_internal() to exclude any tensor named "*output.weight" instead of just "output.weight".

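For illustration, a minimal sketch of the wildcard-style match described above; the helper name and placement are made up for this example and are not the actual llama.cpp code:

```cpp
// Hypothetical helper illustrating the "*output.weight" match described above;
// the name is made up, not the actual llama_model_quantize_internal() code.
#include <string>

// Returns true for "output.weight" as well as e.g. "blk.0.attn_output.weight".
static bool name_ends_with_output_weight(const std::string & name) {
    static const std::string suffix = "output.weight";
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

With --leave-output-tensor set, any tensor whose name satisfies this check would then be skipped during re-quantization.
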
@ggerganov
Member

Can you provide an example model where this makes a difference?

@drollings
Author

I mostly work on Llama 3.1-based models. In every case where I quantized while trying to leave the embedding and output tensors at Q8, those tensors were re-quantized anyway, and the models came out as essentially pure Q4_K_M. The most straightforward examples: https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite, https://huggingface.co/akjindal53244/Llama-3.1-Storm-8B, https://huggingface.co/ValiantLabs/Llama3.1-8B-ShiningValiant2.

slaren and others added 27 commits October 18, 2024 16:01
* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print
…9798)

* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (ggml-org#9800)

* gguf : deprecate old FIM token KVs
* server : remove legacy system_prompt feature

ggml-ci

* readme : update [no ci]

* server : fix non-transformer logic + remove response from /props
* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)
  → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
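
As a rough, CPU-side illustration of the idea only (not the actual CUDA kernel code in dmmv.cu), fusing two 16-bit scalar loads into one 32-bit load looks roughly like this; the byte split assumes little-endian order:

```cpp
// Conceptual sketch: the real change uses CUDA vector types in dmmv.cu, but the
// idea of replacing narrow scalar loads with one wider load is the same.
#include <cstdint>
#include <cstring>

// scalar: two separate 16-bit loads
static inline void load_scalar(const uint16_t * src, uint16_t & a, uint16_t & b) {
    a = src[0];
    b = src[1];
}

// vectorized: one 32-bit load, then split (assumes little-endian layout)
static inline void load_vector(const uint16_t * src, uint16_t & a, uint16_t & b) {
    uint32_t packed;
    std::memcpy(&packed, src, sizeof(packed)); // single wider load
    a = static_cast<uint16_t>(packed & 0xFFFF);
    b = static_cast<uint16_t>(packed >> 16);
}
```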

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* Initial XTC commit

Adds the XTC sampler; not activated by default, but with recommended settings as the defaults.

* Cleanup

* Simplified chances calculation

To be more in line with the original implementation, the chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

Fixed RNG to be reproducible

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep` 

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized ones at the back. This should keep `min_keep` viable.

* Algorithm rework

1. Scan tokens from the top until the first non-penalizable one
2. Remove the last captured token (the least probable one above the threshold)
3. Shift all tokens to overwrite the remaining penalizable ones
4. Penalize them and put them at the bottom, as sketched below.
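
A rough sketch of this exclusion step (illustrative only, with made-up names, not the actual llama.cpp sampler); the candidates are assumed sorted by probability, highest first, and already normalized:

```cpp
// Returns the number of candidates that remain active; the penalized tokens
// are rotated to the back of the vector instead of being erased.
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct candidate { int id; float p; };

static size_t xtc_apply(std::vector<candidate> & cands, float threshold, float chance,
                        size_t min_keep, std::mt19937 & rng) {
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    if (dist(rng) >= chance) {
        return cands.size(); // the chance is rolled once, up front
    }
    // count tokens at or above the threshold
    size_t above = 0;
    while (above < cands.size() && cands[above].p >= threshold) {
        ++above;
    }
    if (above < 2) {
        return cands.size(); // nothing to exclude; keep the single top choice
    }
    // penalize all above-threshold tokens except the least probable one,
    // but never shrink the active set below min_keep
    size_t removed = above - 1;
    if (cands.size() - removed < min_keep) {
        removed = cands.size() > min_keep ? cands.size() - min_keep : 0;
    }
    std::rotate(cands.begin(), cands.begin() + removed, cands.end());
    return cands.size() - removed;
}
```

The real implementation differs in its details, but the flow mirrors the commit messages: roll the chance once, find the above-threshold range, keep its last element, and push the rest past the active range.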

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
Fix CANN compilation error after merging llama.cpp support for dynamically loadable backends.
This commit removes the buffer_id field from the leaf_alloc struct.

The motivation for this is that the field is only written to and never
read/used, as far as I can tell. Each tensor_alloc has a buffer_id field,
and this is what led me to look into this more closely, to understand what
the buffer_id in leaf_alloc was used for.
* server: fix the disappearance of the end of the text when streaming with stop strings

* simplify "send text" checks
…org#9903)

Prior to this commit, using a JSON Schema containing a string
with a `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
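
For reference, the cast pattern described above looks roughly like this (hypothetical function, not the actual llama-vocab.cpp code):

```cpp
#include <cstdio>
#include <vector>

// Casting symbols.size() to int inside the loop header avoids MSVC's C4267
// warning; this is safe as long as the container stays below INT_MAX entries.
static void process(const std::vector<char> & symbols) {
    for (int i = 0; i < (int) symbols.size(); ++i) {
        std::printf("symbol %d: %c\n", i, symbols[i]);
    }
}
```
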
…org#9875)

* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages work

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
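
A minimal sketch of the `posix_memalign` approach mentioned a few items above (illustrative only; the real ggml code adds platform-specific branches such as `vm_allocate` on macOS and `hbw_posix_memalign`):

```cpp
// Sketch: aligned allocation that can still be released with plain free(),
// which is the property the commits above rely on when switching allocators.
#include <stdio.h>
#include <stdlib.h>

static void * aligned_malloc(size_t size, size_t alignment) {
    void * ptr = nullptr;
    // alignment must be a power of two and a multiple of sizeof(void *)
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return nullptr;
    }
    return ptr;
}

int main() {
    void * buf = aligned_malloc(1 << 20, 64); // 1 MiB, 64-byte aligned
    printf("allocated at %p\n", buf);
    free(buf);
    return 0;
}
```
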
* vulkan : add backend registry / device interfaces

* llama : print devices used on model load
ShenghaiWang and others added 10 commits October 18, 2024 16:01
* llama : infill sampling handle very long tokens

ggml-ci

* cont : better indices

ggml-ci
This commit addresses the TODO in the code to rename the `batch_all`
parameter to `batch` in `llama_decode_internal`.
add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warning

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as a ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend
…rg#9705)

* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp
* rpc : refactor backend

Use structs for RPC request/response messages

* rpc : refactor server
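
To illustrate the design choice of using structs for RPC request/response messages, a hypothetical example follows; the struct and field names are made up and are not the actual llama.cpp RPC protocol:

```cpp
// A fixed-layout request/response pair; serializing a struct like this
// replaces ad-hoc packing of individual fields on the wire.
#include <cstdint>

struct rpc_get_tensor_req {
    uint64_t remote_ptr; // identifies the tensor buffer on the server
    uint64_t offset;     // byte offset into the buffer
    uint64_t size;       // number of bytes requested
};

struct rpc_get_tensor_rsp {
    uint64_t size;       // number of bytes that follow in the payload
};
```

Each side can then read or write a whole message as one fixed-size block, which keeps the wire format easier to validate.
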
@ggerganov
Member

ggerganov commented Oct 20, 2024

What name does the output tensor have for these models? Looking at the safetensors on HF, the original name is lm_head.weight so it should be converted to output.weight by the convert_hf_to_gguf.py script. So I don't see why this change would make a difference.

@drollings
Author

drollings commented Oct 21, 2024

I found an example for you, from running llama-quantize on a GGUF of TechxGenus/CursorCore-QW2.5-1.5B. In this example, the tensors you asked about are named "attn_output", and the patch changes the behavior only when --leave-output-tensor is used. To my understanding, the patch brings the behavior closer to what I would expect from --leave-output-tensor.

Without the patch:

```console
$ llama-quantize --allow-requantize --output-tensor-type Q6_K --token-embedding-type Q6_K --leave-output-tensor  /tmp/cursorcore-qwen2.5-1.5b:q8_0.gguf /tmp/cursorcore-qwen2.5-1.5b:q6_k.gguf Q4_K_M 2>&1 | grep output
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[   9/ 338]             blk.0.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
[  21/ 338]             blk.1.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
[  33/ 338]            blk.10.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
[  45/ 338]            blk.11.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
...
```

With the patch:

```console
$ llama-quantize --allow-requantize --output-tensor-type Q6_K --token-embedding-type Q6_K --leave-output-tensor  /tmp/cursorcore-qwen2.5-1.5b:q8_0.gguf /tmp/cursorcore-qwen2.5-1.5b:q6_k.gguf Q4_K_M 2>&1 | grep output
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[   9/ 338]             blk.0.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
[  21/ 338]             blk.1.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
[  33/ 338]            blk.10.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
[  45/ 338]            blk.11.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
```

@ggerganov
Member

I see now. The --leave-output-tensor flag is intended only for the final output tensor at the end of the transformer, usually named "output.weight". The output tensors in the attention layers (such as "blk.0.attn_output.weight") are not supposed to be affected by this flag. So the flag is currently functioning correctly, and the proposed change is not correct.

We can add functionality for excluding certain tensors in order to achieve what you are trying to do, and there is already some partial work in the PRs. But for now, no such functionality is exposed.

@drollings drollings closed this Oct 21, 2024
@drollings
Author

Thank you for your time and consideration. I look forward to the official features; in the meantime, I often quantize GGUFs with this patch, keeping all output and embedding tensors at Q8_0 while tuning others down to Q4_K_M or Q5_K_M. It will be interesting to see benchmarks on targeted quantizations like this.
