
Commit d1aa0cc

Authored by EAddario and compilade
imatrix: add option to display importance score statistics for a given imatrix file (ggml-org#12718)
* Add --show-statistics option
* Add --show-statistics logic
* Add tensor name parsing
* Tidy output format
* Fix typo in title
* Improve tensor influence ranking
* Add better statistics
* Change statistics' sort order
* Add Cosine Similarity
* Add header search path
* Change header search path to private
* Add weighted statistics per layer
* Update report title
* Refactor compute_statistics out of main
* Refactor compute_cossim out of load_imatrix
* Refactor compute_statistics out of load_imatrix
* Move imatrix statistics calculation into its own functions
* Add checks and validations
* Remove unnecessary include directory
* Rename labels
* Add m_stats getter and refactor compute_statistics out of load_imatrix
* Refactor variable names
* Minor cosmetic change
* Retrigger checks (empty commit)
* Rerun checks (empty commit)
* Fix unnecessary type promotion (Co-authored-by: compilade <[email protected]>)
* Reverting change to improve code readability
* Rerun checks (empty commit)
* Rerun checks (empty commit)
* Rerun checks - third time's the Charm 🤞 (empty commit)
* Minor cosmetic change
* Update README
* Fix typo
* Update README
* Rerun checks (empty commit)
* Re-implement changes on top of ggml-org#9400
* Update README.md
* Update README
* Update README.md (Co-authored-by: compilade <[email protected]>)
* Update README.md (Co-authored-by: compilade <[email protected]>)
* Update README.md
* Remove duplicate option in print_usage()
* Update README.md
* Update README.md (Co-authored-by: compilade <[email protected]>)
* Update README.md (Co-authored-by: compilade <[email protected]>)
* Remove input check
* Remove commented out code

---------

Co-authored-by: compilade <[email protected]>
1 parent: c8ade30

File tree: 4 files changed, +339 −22 lines


common/arg.cpp: 7 additions & 0 deletions

```diff
@@ -2655,6 +2655,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.i_chunk = value;
         }
     ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
+    add_opt(common_arg(
+        {"--show-statistics"},
+        string_format("show imatrix statistics and then exit (default: %s)", params.show_statistics ? "true" : "false"),
+        [](common_params & params) {
+            params.show_statistics = true;
+        }
+    ).set_examples({LLAMA_EXAMPLE_IMATRIX}));
     add_opt(common_arg(
         {"--parse-special"},
         string_format("parse special tokens (chat, tool, etc) (default: %s)", params.parse_special ? "true" : "false"),
```

common/common.h: 4 additions & 3 deletions

```diff
@@ -432,9 +432,10 @@ struct common_params {
     int32_t n_save_freq = 0; // save the imatrix every n_save_freq iterations
     int32_t i_chunk = 0; // start processing from this chunk
 
-    bool process_output = false; // collect data for the output tensor
-    bool compute_ppl = true; // whether to compute perplexity
-    bool parse_special = false; // whether to parse special tokens during imatrix tokenization
+    bool process_output  = false; // collect data for the output tensor
+    bool compute_ppl     = true;  // whether to compute perplexity
+    bool show_statistics = false; // show imatrix statistics per tensor
+    bool parse_special   = false; // whether to parse special tokens during imatrix tokenization
 
     // cvector-generator params
     int n_pca_batch = 100;
```

tools/imatrix/README.md: 72 additions & 14 deletions

````diff
@@ -1,34 +1,92 @@
 # llama.cpp/tools/imatrix
 
 Compute an importance matrix for a model and given text dataset. Can be used during quantization to enhance the quality of the quantized models.
-More information is available here: https://github.com/ggml-org/llama.cpp/pull/4861
+More information is available in <https://github.com/ggml-org/llama.cpp/pull/4861>.
 
 ## Usage
 
 ```
 ./llama-imatrix \
-    -m model.gguf -f some-text.txt [-o imatrix.gguf] [--process-output] \
-    [--no-ppl] [--chunk 123] [--output-frequency 10] [--save-frequency 0] \
-    [--in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf ...] \
-    [--parse-special]
+    -m model.gguf -f some-text.txt [-o imatrix.gguf] [--no-ppl] \
+    [--process-output] [--chunk 123] [--save-frequency 0] [--output-frequency 10] \
+    [--in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf ...] [--parse-special] \
+    [--show-statistics] [...]
 ```
 
-Here `-m` with a model name and `-f` with a file containing training data (such as e.g. `wiki.train.raw`) are mandatory.
+Here `-m | --model` with a model name and `-f | --file` with a file containing calibration data (such as `wiki.train.raw`) are mandatory.
 The parameters in square brackets are optional and have the following meaning:
-* `-o` (or `--output-file`) specifies the name of the file where the computed data will be stored. If missing `imatrix.gguf` is used.
-* `--verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. Default verbosity level is `1`.
-* `--output-frequency` specifies how often the so far computed result is saved to disk. Default is 10 (i.e., every 10 chunks)
+
+* `-h | --help` shows usage information and exits.
+* `-lv | --verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. Default verbosity level is `1`.
+* `-o | --output-file` specifies the name of the file where the computed data will be stored. If missing, `imatrix.gguf` is used.
+* `-ofreq | --output-frequency` specifies how often the results computed so far are saved to disk. Default is 10 (i.e., every 10 chunks).
 * `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never)
-* `--process-output` specifies if data will be collected for the `output.weight` tensor. My experience is that it is better to not utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
+* `--process-output` specifies if data will be collected for the `output.weight` tensor. Typically, it is better not to utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
+* `--in-file` one or more existing imatrix files to load and combine. Useful for merging files from multiple runs/datasets.
+* `--parse-special` enables parsing of special tokens (e.g., `<|im_start|>` in some models). Useful for models with custom tokenizers.
+* `--chunk | --from-chunk` skips the first `n` chunks of tokens from the input data. Useful for resuming or skipping initial low-quality data.
+* `--chunks` maximum number of chunks to process. Default is -1 for all available chunks.
+* `--no-ppl` disables the calculation of perplexity for the processed chunks. Useful if you want to speed up the processing and do not care about perplexity.
+* `--show-statistics` displays the imatrix file's statistics and then exits.
+
+For faster computation, make sure to use GPU offloading via the `-ngl | --n-gpu-layers` argument.
 
-For faster computation, make sure to use GPU offloading via the `-ngl` argument
+Recent versions of `llama-imatrix` store data in GGUF format by default. For the legacy format, use an extension other than `.gguf` when saving the output file. More information is available in <https://github.com/ggml-org/llama.cpp/pull/9400>.
 
-## Example
+## Examples
 
 ```bash
-# generate importance matrix (imatrix.gguf)
-./llama-imatrix -m ggml-model-f16.gguf -f train-data.txt -ngl 99
+# generate importance matrix using the default filename (imatrix.gguf), offloading 99 layers to GPU
+./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99
 
 # use the imatrix to perform a Q4_K_M quantization
 ./llama-quantize --imatrix imatrix.gguf ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m
 ```
+
+```bash
+# generate and save the imatrix using the legacy format
+./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -o imatrix-legacy-format.dat -ngl 99
+```
+
+```bash
+# convert the legacy (binary) imatrix format to the new (GGUF) format
+./llama-imatrix --in-file imatrix-legacy-format.dat -o imatrix-new-format.gguf
+```
+
+```bash
+# combine existing imatrices
+./llama-imatrix --in-file imatrix-prev-0.gguf --in-file imatrix-prev-1.gguf -o imatrix-combined.gguf
+```
+
+```bash
+# skip the first 5 chunks, save intermediates every 20 chunks and snapshots every 50, parsing special tokens
+./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --chunk 5 --output-frequency 20 --save-frequency 50 --parse-special
+```
+
+```bash
+# analyse an imatrix file and display summary statistics instead of running inference
+./llama-imatrix --in-file imatrix.gguf --show-statistics
+```
+
+`--show-statistics` will display the following statistics:
+
+#### Per tensor
+
+* Σ(Act²): sum of all squared activations (the importance scores)
+* Min & Max: minimum and maximum squared activation values
+* μ & σ: mean and standard deviation of the squared activations
+* % Active: proportion of elements whose average squared activation exceeds a small threshold (1e-5). Helpful to determine how alive/dormant the tensor is during inference
+* N: number of squared activations
+* Entropy: entropy of the squared activation distribution, in bits (standard Shannon entropy) $S = -\sum_{i=1}^N p_i \log_2 p_i$
+* E (norm): normalized entropy, $E_{norm} = \frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. These two metrics can be used to determine how well a prompt "exercises" the model's capabilities
+* ZD Score: z-score distribution as described in _3.1 Layer Importance Scores_ of [Layer-Wise Quantization](https://arxiv.org/abs/2406.17415)
+* CosSim: cosine similarity with respect to the previous layer's tensor, i.e., how similar the squared activations of the current layer are to those of the previous layer
+
+#### Per layer
+
+Weighted averages of Σ(Act²), ZD Score and CosSim are also calculated.
+
+#### Important note on the computed statistics
+
+When using these statistics, please note that they are computed on the squared activations, **not on the actual (raw) activations**.
+While the results are still useful, they are less reliable than those based on the raw values and, in the case of the cosine similarity, can be misleading if the tensor contains opposite vectors.
````
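To make the per-tensor definitions in the new README concrete, here is a minimal, self-contained sketch of how those figures could be derived from one tensor's squared-activation scores. It is only an illustration of the formulas above (using the stated `1e-5` active threshold and a population standard deviation), not the tool's actual implementation; the `tensor_stats` struct and `compute_stats` function are hypothetical names.

```cpp
// Illustrative only -- computes the per-tensor statistics described above
// from a vector of squared-activation scores. Not the actual imatrix code.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct tensor_stats {
    double sum = 0, min = 0, max = 0, mean = 0, stddev = 0;
    double active_pct   = 0; // % Active
    double entropy      = 0; // S, in bits
    double entropy_norm = 0; // E (norm)
};

static tensor_stats compute_stats(const std::vector<double> & act2) {
    tensor_stats s;
    if (act2.empty()) return s;
    const double n = (double) act2.size();
    s.min = s.max = act2[0];
    for (double v : act2) {
        s.sum += v;
        s.min  = std::min(s.min, v);
        s.max  = std::max(s.max, v);
        if (v > 1e-5) s.active_pct += 1.0; // "% Active" threshold
    }
    s.mean       = s.sum / n;
    s.active_pct = 100.0 * s.active_pct / n;
    for (double v : act2) {
        s.stddev += (v - s.mean) * (v - s.mean);
        const double p = v / s.sum;               // treat scores as a distribution p_i
        if (p > 0) s.entropy -= p * std::log2(p); // S = -sum p_i log2 p_i
    }
    s.stddev       = std::sqrt(s.stddev / n);
    s.entropy_norm = n > 1 ? s.entropy / std::log2(n) : 0.0; // E(norm) = S / log2 N
    return s;
}

int main() {
    const tensor_stats s = compute_stats({0.8, 0.1, 0.05, 0.05, 1e-7});
    std::printf("sum=%.3f mu=%.3f sigma=%.3f active=%.0f%% S=%.3f E(norm)=%.3f\n",
                s.sum, s.mean, s.stddev, s.active_pct, s.entropy, s.entropy_norm);
}
```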

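The cosine-similarity caveat in the README's closing note can also be seen directly in code. A minimal sketch of the per-layer comparison follows (again illustrative only; the real tool additionally handles tensor-name parsing and shape checks, which are not shown in this excerpt):

```cpp
// Illustrative only -- cosine similarity between the squared-activation
// vectors of the same tensor in two consecutive layers.
#include <cmath>
#include <vector>

static double cosine_similarity(const std::vector<double> & a, const std::vector<double> & b) {
    double dot = 0, na = 0, nb = 0;
    const size_t n = a.size() < b.size() ? a.size() : b.size();
    for (size_t i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0 || nb == 0) return 0.0; // undefined for zero vectors
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

Because squared activations are non-negative, every term of `dot` is non-negative and the result lies in [0, 1] rather than [−1, 1], which is exactly why layers with opposite raw activations can still appear highly similar.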