
Conversation

@drollings

  • Tweaked llama-quantize's --leave-output-tensor parameter's impact on llama_model_quantize_internal() to exclude any tensor named "*output.weight" instead of just "output.weight".

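For illustration, a minimal sketch of the wildcard-style match described above; the helper name and placement are made up for this example and are not the actual llama.cpp code:

```cpp
// Hypothetical helper illustrating the "*output.weight" match described above;
// the name is made up, not the actual llama_model_quantize_internal() code.
#include <string>

// Returns true for "output.weight" as well as e.g. "blk.0.attn_output.weight".
static bool name_ends_with_output_weight(const std::string & name) {
    static const std::string suffix = "output.weight";
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

With --leave-output-tensor set, any tensor whose name satisfies this check would then be skipped during re-quantization.
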
@ggerganov
Member

Can you provide an example model where this makes a difference?

@drollings
Author

I mostly work on Llama 3.1-based models. In every case where I quantized while trying to leave the embedding and output tensors at Q8, those tensors were re-quantized anyway, and the models came out as essentially pure Q4_K_M. The most straightforward examples: https://huggingface.co/arcee-ai/Llama-3.1-SuperNova-Lite, https://huggingface.co/akjindal53244/Llama-3.1-Storm-8B, https://huggingface.co/ValiantLabs/Llama3.1-8B-ShiningValiant2.

slaren and others added 27 commits October 18, 2024 16:01
* ggml : move more prints to the ggml log system

* show BLAS OpenMP warnings in all builds using debug print
…9798)

* llama : improve infill support

ggml-ci

* llama : add more FIM token strings

ggml-ci

* server : update prompt on slot restore (ggml-org#9800)

* gguf : deprecate old FIM token KVs
* server : remove legacy system_prompt feature

ggml-ci

* readme : update [no ci]

* server : fix non-transformer logic + remove response from /props
* server : remove self-extend

ggml-ci

* server : fix context limit check to use slot.n_past

ggml-ci
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bc947f541ae55e999ffdb4013441347d83b00feb?narHash=sha256-NOiTvBbRLIOe5F6RbHaAh6%2B%2BBNjsb149fGZd1T4%2BKBg%3D' (2024-10-04)
  → 'github:NixOS/nixpkgs/5633bcff0c6162b9e4b5f1264264611e950c8ec7?narHash=sha256-9UTxR8eukdg%2BXZeHgxW5hQA9fIKHsKCdOIUycTryeVw%3D' (2024-10-09)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* server : accept extra_context for the infill endpoint

ggml-ci

* server : update readme [no ci]

* server : use repo-level FIM pattern if possible

ggml-ci
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
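
As a rough, CPU-side illustration of the idea only (not the actual CUDA kernel code in dmmv.cu), fusing two 16-bit scalar loads into one 32-bit load looks roughly like this; the byte split assumes little-endian order:

```cpp
// Conceptual sketch: the real change uses CUDA vector types in dmmv.cu, but the
// idea of replacing narrow scalar loads with one wider load is the same.
#include <cstdint>
#include <cstring>

// scalar: two separate 16-bit loads
static inline void load_scalar(const uint16_t * src, uint16_t & a, uint16_t & b) {
    a = src[0];
    b = src[1];
}

// vectorized: one 32-bit load, then split (assumes little-endian layout)
static inline void load_vector(const uint16_t * src, uint16_t & a, uint16_t & b) {
    uint32_t packed;
    std::memcpy(&packed, src, sizeof(packed)); // single wider load
    a = static_cast<uint16_t>(packed & 0xFFFF);
    b = static_cast<uint16_t>(packed >> 16);
}
```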

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* Initial XTC commit

Adds the XTC sampler; not activated by default, but with recommended settings as the defaults.

* Cleanup

* Simplified chances calculation

To be more in line with the original implementation, the chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

Fixed RNG to be reproducible

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep` 

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized ones at the back. This should keep `min_keep` viable.

* Algorithm rework

1. Scan tokens from the top until the first non-penalizable one
2. Remove the last captured token (the least probable one above the threshold)
3. Shift all tokens to overwrite the remaining penalizable ones
4. Penalize them and put them at the bottom, as sketched below.
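
A rough sketch of this exclusion step (illustrative only, with made-up names, not the actual llama.cpp sampler); the candidates are assumed sorted by probability, highest first, and already normalized:

```cpp
// Returns the number of candidates that remain active; the penalized tokens
// are rotated to the back of the vector instead of being erased.
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct candidate { int id; float p; };

static size_t xtc_apply(std::vector<candidate> & cands, float threshold, float chance,
                        size_t min_keep, std::mt19937 & rng) {
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    if (dist(rng) >= chance) {
        return cands.size(); // the chance is rolled once, up front
    }
    // count tokens at or above the threshold
    size_t above = 0;
    while (above < cands.size() && cands[above].p >= threshold) {
        ++above;
    }
    if (above < 2) {
        return cands.size(); // nothing to exclude; keep the single top choice
    }
    // penalize all above-threshold tokens except the least probable one,
    // but never shrink the active set below min_keep
    size_t removed = above - 1;
    if (cands.size() - removed < min_keep) {
        removed = cands.size() > min_keep ? cands.size() - min_keep : 0;
    }
    std::rotate(cands.begin(), cands.begin() + removed, cands.end());
    return cands.size() - removed;
}
```

The real implementation differs in its details, but the flow mirrors the commit messages: roll the chance once, find the above-threshold range, keep its last element, and push the rest past the active range.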

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
Fix CANN compilation error after merging llama.cpp support for dynamically loadable backends.
This commit removes the buffer_id field from the leaf_alloc struct.

The motivation for this is that the field is only written to and never
read/used, as far as I can tell. Each tensor_alloc has a buffer_id field,
and this is what led me to look into this more closely, to understand what
the buffer_id in leaf_alloc was used for.
* server: fix the disappearance of the end of the text when streaming with stop strings

* simplify "send text" checks
…org#9903)

Prior to this commit, using a JSON Schema containing a string
with a `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
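
For reference, the cast pattern described above looks roughly like this (hypothetical function, not the actual llama-vocab.cpp code):

```cpp
#include <cstdio>
#include <vector>

// Casting symbols.size() to int inside the loop header avoids MSVC's C4267
// warning; this is safe as long as the container stays below INT_MAX entries.
static void process(const std::vector<char> & symbols) {
    for (int i = 0; i < (int) symbols.size(); ++i) {
        std::printf("symbol %d: %c\n", i, symbols[i]);
    }
}
```
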
…org#9875)

* fix: use `vm_allocate` to allocate CPU backend buffer on macOS

* fix: switch to `posix_memalign` to keep existing `free()` usages work

* feat: move `GGML_ALIGNED_MALLOC` to `ggml-backend-impl.h`, add support for `vm_allocate` on macOS

* style: formatting

* fix: move const outside of `#ifndef`

* style: formatting

* fix: unused var

* fix: transform `GGML_ALIGNED_MALLOC` and `GGML_ALIGNED_FREE` into functions and add them to `ggml-impl.h`

* fix: unused var

* fix: page align to `GGUF_DEFAULT_ALIGNMENT`

* fix: page align to `TENSOR_ALIGNMENT`

* fix: convert `TENSOR_ALIGNMENT` to a macro

* fix: increase page size to `32` on iOS

* fix: iOS page size

* fix: `hbw_posix_memalign` alignment
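
A minimal sketch of the `posix_memalign` approach mentioned a few items above (illustrative only; the real ggml code adds platform-specific branches such as `vm_allocate` on macOS and `hbw_posix_memalign`):

```cpp
// Sketch: aligned allocation that can still be released with plain free(),
// which is the property the commits above rely on when switching allocators.
#include <stdio.h>
#include <stdlib.h>

static void * aligned_malloc(size_t size, size_t alignment) {
    void * ptr = nullptr;
    // alignment must be a power of two and a multiple of sizeof(void *)
    if (posix_memalign(&ptr, alignment, size) != 0) {
        return nullptr;
    }
    return ptr;
}

int main() {
    void * buf = aligned_malloc(1 << 20, 64); // 1 MiB, 64-byte aligned
    printf("allocated at %p\n", buf);
    free(buf);
    return 0;
}
```
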
* vulkan : add backend registry / device interfaces

* llama : print devices used on model load
ShenghaiWang and others added 10 commits October 18, 2024 16:01
* llama : infill sampling handle very long tokens

ggml-ci

* cont : better indices

ggml-ci
This commit addresses the TODO in the code to rename the `batch_all`
parameter to `batch` in `llama_decode_internal`.
add intel amx isa detection

add vnni kernel for gemv cases

add vnni and amx kernel support for block_q8_0

code cleanup

fix packing B issue

enable openmp

fine tune amx kernel

switch to aten parallel pattern

add error message for nested parallelism

code cleanup

add f16 support in ggml-amx

add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS

update CMakeList

update README

fix some compilation warning

fix compiler warning when amx is not enabled

minor change

ggml-ci

move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp

ggml-ci

update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16

ggml-ci

add amx as a ggml-backend

update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h

minor change

update CMakeLists.txt

minor change

apply weight prepacking in set_tensor method in ggml-backend

fix compile error

ggml-ci

minor change

ggml-ci

update CMakeLists.txt

ggml-ci

add march dependency

minor change

ggml-ci

change ggml_backend_buffer_is_host to return false for amx backend

ggml-ci

fix supports_op

use device reg for AMX backend

ggml-ci

minor change

ggml-ci

minor change

fix rebase

set .buffer_from_host_ptr to be false for AMX backend
…rg#9705)

* implemented missing SYCL event APIs

* sycl : Added device and backend reg interfaces

* Restructured ggml-sycl.cpp
* rpc : refactor backend

Use structs for RPC request/response messages

* rpc : refactor server
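
To illustrate the design choice of using structs for RPC request/response messages, a hypothetical example follows; the struct and field names are made up and are not the actual llama.cpp RPC protocol:

```cpp
// A fixed-layout request/response pair; serializing a struct like this
// replaces ad-hoc packing of individual fields on the wire.
#include <cstdint>

struct rpc_get_tensor_req {
    uint64_t remote_ptr; // identifies the tensor buffer on the server
    uint64_t offset;     // byte offset into the buffer
    uint64_t size;       // number of bytes requested
};

struct rpc_get_tensor_rsp {
    uint64_t size;       // number of bytes that follow in the payload
};
```

Each side can then read or write a whole message as one fixed-size block, which keeps the wire format easier to validate.
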
@ggerganov
Member

ggerganov commented Oct 20, 2024

What name does the output tensor have for these models? Looking at the safetensors on HF, the original name is lm_head.weight so it should be converted to output.weight by the convert_hf_to_gguf.py script. So I don't see why this change would make a difference.

@drollings
Author

drollings commented Oct 21, 2024

I found an example for you, from running llama-quantize on a GGUF of TechxGenus/CursorCore-QW2.5-1.5B. In this example, the tensors you asked about are named "attn_output", and the patch changes the behavior only when --leave-output-tensor is used. To my understanding, the patch brings the behavior closer to what I would expect from --leave-output-tensor.

Without the patch:

```console
$ llama-quantize --allow-requantize --output-tensor-type Q6_K --token-embedding-type Q6_K --leave-output-tensor  /tmp/cursorcore-qwen2.5-1.5b:q8_0.gguf /tmp/cursorcore-qwen2.5-1.5b:q6_k.gguf Q4_K_M 2>&1 | grep output
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[   9/ 338]             blk.0.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
[  21/ 338]             blk.1.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
[  33/ 338]            blk.10.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
[  45/ 338]            blk.11.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, converting to q4_K .. size =     2.39 MiB ->     1.27 MiB
...
```

With the patch:

```console
$ llama-quantize --allow-requantize --output-tensor-type Q6_K --token-embedding-type Q6_K --leave-output-tensor  /tmp/cursorcore-qwen2.5-1.5b:q8_0.gguf /tmp/cursorcore-qwen2.5-1.5b:q6_k.gguf Q4_K_M 2>&1 | grep output
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[   9/ 338]             blk.0.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
[  21/ 338]             blk.1.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
[  33/ 338]            blk.10.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
[  45/ 338]            blk.11.attn_output.weight - [ 1536,  1536,     1,     1], type =   q8_0, size =    2.391 MB
```

@ggerganov
Member

I see now. The --leave-output-tensor flag is intended only for the final output tensor at the end of the transformer, usually named "output.weight". The output tensors in the attention layers (such as "blk.0.attn_output.weight") are not supposed to be affected by this flag. So the flag is currently functioning correctly, and the proposed change is not correct.

We can add functionality for excluding certain tensors in order to achieve what you are trying to do, and there is already some partial work in the PRs. But for now, no such functionality is exposed.

@drollings drollings closed this Oct 21, 2024
@drollings
Author

Thank you for your time and consideration. I look forward to the official features; in the meantime, I often quantize GGUFs with this patch, keeping all output and embedding tensors at Q8_0 while tuning others down to Q4_K_M or Q5_K_M. It will be interesting to see benchmarks on targeted quantizations like this.
