# llama.cpp/tools/imatrix

Compute an importance matrix for a model and given text dataset. Can be used during quantization to enhance the quality of the quantized models.
More information is [available here](https://github.com/ggml-org/llama.cpp/pull/4861).

## Usage

```
./llama-imatrix \
    -m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] \
    [--chunk 123] [--output-frequency 10] [--save-frequency 0] [--show-statistics] \
    [--no-ppl] [--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...] \
    [--parse-special] [...]
```

Here `-m | --model` with a model name and `-f | --file` with a file containing calibration data (such as `wiki.train.raw`) are mandatory.
The parameters in square brackets are optional and have the following meaning:
* `-h | --help` shows usage information and exits.
* `-lv | --verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, a message is written to `stderr` each time the results are saved. If `>=2`, a message is output each time data is collected for any tensor. The default verbosity level is `1`.
* `-o | --output-file` specifies the name of the file where the computed data will be stored. If missing, `imatrix.dat` is used.
* `-ofreq | --output-frequency` specifies how often the results computed so far are saved to disk. Default is 10 (i.e., every 10 chunks).
* `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never).
* `--process-output` specifies if data will be collected for the `output.weight` tensor. Typically, it is better not to utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
* `--in-file` specifies one or more existing imatrix files to load and combine. Useful for merging files from multiple runs/datasets.
* `--parse-special` enables parsing of special tokens (e.g., `<|im_start|>` in some models). Useful for models with custom tokenizers.
* `--chunk` skips the first `n` chunks of tokens from the input data. Useful for resuming or skipping initial low-quality data.
* `-n | --n-chunks` sets the maximum number of chunks to process. Default is -1, which processes all available chunks.
* `--no-ppl` disables the calculation of perplexity for the processed chunks. Useful if you want to speed up the processing and do not care about perplexity.
* `--show-statistics` displays statistics of the imatrix file.

For faster computation, make sure to use GPU offloading via the `-ngl | --n-gpu-layers` argument.

## Examples

```bash
# generate importance matrix using default filename (imatrix.dat), offloading 99 layers to GPU
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt -ngl 99

# use the imatrix to perform a Q4_K_M quantization
./llama-quantize --imatrix imatrix.dat ggml-model-f16.gguf ./ggml-model-q4_k_m.gguf q4_k_m
```

```bash
# combine existing imatrices
./llama-imatrix --in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat -o imatrix-combined.dat
```

```bash
# skip first 5 chunks, save intermediates every 20 chunks and snapshots every 50, parsing special tokens
./llama-imatrix -m ggml-model-f16.gguf -f calibration-data.txt --chunk 5 --output-frequency 20 --save-frequency 50 --parse-special
```

```bash
# analyse an imatrix file and display summary statistics instead of running inference
./llama-imatrix --in-file imatrix.dat --show-statistics
```

`--show-statistics` will display the following statistics:

#### Per tensor

* Σ(Act²): sum of all squared activations (the importance scores)
* Min & Max: minimum and maximum squared activation values
* μ & σ: mean and standard deviation of the squared activations
* % Active: proportion of elements whose average squared activation exceeds a small threshold (1e-5). Helpful to determine how alive/dormant the tensor is during inference
* N: number of squared activations
* Entropy: entropy of the squared activation distribution, in bits (standard Shannon entropy): $S = -\sum_{i=1}^N p_i \log_2 p_i$
* E (norm): normalized entropy, $E_{norm}=\frac{-\sum_{i=1}^N p_i \log_2 p_i}{\log_2 N}$. These two metrics can be used to determine how well a prompt "exercises" the model's capabilities
* ZD Score: z-score distribution as described in _3.1 Layer Importance Scores_ of [Layer-Wise Quantization](https://arxiv.org/abs/2406.17415)
* CosSim: cosine similarity with respect to the previous layer's tensor. Useful to determine how similar the squared activations of the current layer are to those of the previous layer
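
As a rough illustration of the two entropy metrics above, here is a short Python sketch (not the tool's actual implementation) that computes Shannon entropy and normalized entropy for a small, hypothetical vector of squared activations:

```python
import math

# toy vector of squared activations (hypothetical values, for illustration only)
act2 = [4.0, 1.0, 1.0, 2.0]

total = sum(act2)
p = [a / total for a in act2]  # normalize into a probability distribution

# Shannon entropy in bits: S = -sum(p_i * log2(p_i))
entropy = -sum(pi * math.log2(pi) for pi in p if pi > 0)

# normalized entropy: divide by log2(N), the maximum possible entropy for N elements
e_norm = entropy / math.log2(len(act2))

print(f"S = {entropy:.4f} bits, E(norm) = {e_norm:.4f}")
# prints: S = 1.7500 bits, E(norm) = 0.8750
```

A uniform distribution of squared activations would give E(norm) = 1, while a distribution concentrated on a single element would give E(norm) = 0.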

#### Per layer

Weighted averages of Σ(Act²), ZD Score and CosSim are also calculated.

#### Important note on the computed statistics

When using these statistics, please note that they are computed on the squared activations, **not on the actual (raw) activations**.
Whilst the results are still useful, they're less accurate than using the raw values and, in the case of cosine similarity, could be misleading if the tensor contains opposite vectors.
This limitation is due to the current implementation of the importance matrix, but a pull request ([use GGUF to store importance matrices](https://github.com/ggml-org/llama.cpp/pull/9400)) aims to address this.
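
To see why squaring can mislead the cosine similarity, consider this small Python sketch (illustrative only, with made-up vectors): two exactly opposite vectors have cosine similarity -1, but their element-wise squares are identical and therefore have cosine similarity 1, so the sign flip becomes invisible:

```python
import math

def cosine(u, v):
    # standard cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1.0, -2.0, 3.0]
v = [-1.0, 2.0, -3.0]  # exactly opposite to u

print(cosine(u, v))    # ≈ -1.0: the raw vectors point in opposite directions
print(cosine([a * a for a in u],
             [b * b for b in v]))  # ≈ 1.0: the squares are identical
```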
