LLAMAFILE-QUANTIZE(1)       General Commands Manual      LLAMAFILE-QUANTIZE(1)

NAME
     llamafile-quantize — large language model quantizer

SYNOPSIS
     llamafile-quantize [flags...] model-f32.gguf [model-quant.gguf] type
     [nthreads]

DESCRIPTION
     llamafile-quantize converts large language model weights from the
     float32 or float16 formats into smaller data types from 2 to 8 bits in
     size.
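
     For example, a typical invocation that converts float16 weights into
     the Q4_K_M format might look like this (the filenames here are
     illustrative, not prescribed by the tool):

           # hypothetical filenames; any float32/float16 GGUF input works
           llamafile-quantize llama-7b.f16.gguf llama-7b.Q4_K_M.gguf Q4_K_M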

OPTIONS
     The following flags are available:

     --allow-requantize
             Allows requantizing tensors that have already been quantized.
             Warning: this can severely reduce quality compared to
             quantizing from 16-bit or 32-bit weights.

     --leave-output-tensor
             Leaves output.weight un(re)quantized. This increases model
             size, but may also increase quality, especially when
             requantizing (see the example following this list).

     --pure  Disables k-quant mixtures and quantizes all tensors to the
             same type.
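
     For example, a requantization run that keeps output.weight at full
     precision could be invoked as follows (filenames are illustrative):

           # hypothetical filenames; requantizing already-quantized
           # weights may reduce quality compared to a 16/32-bit source
           llamafile-quantize --allow-requantize --leave-output-tensor \
                   model.Q8_0.gguf model.Q5_K_M.gguf Q5_K_M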

ARGUMENTS
     The following positional arguments are accepted:

     model-f32.gguf
             Is the input file, which contains the unquantized model
             weights in either the float32 or float16 format.

     model-quant.gguf
             Is the output file, which will contain quantized weights in
             the desired format. If this path isn't specified, it defaults
             to [inp path]/ggml-model-[ftype].gguf.

     type    Is the desired quantization format, which may be the integer
             id of a supported quantization type, or its name. See the
             QUANTIZATION TYPES section below for acceptable formats.

     nthreads
             Number of threads to use during computation (default:
             nproc/2). An example appears after this list.
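
     For example, passing the integer id 18 selects the same format as
     passing the name Q6_K, and a trailing thread count may be supplied
     (filenames are illustrative):

           # 18 is the id of Q6_K; run the quantization on 8 threads
           llamafile-quantize llama-7b.f16.gguf llama-7b.Q6_K.gguf 18 8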

QUANTIZATION TYPES
     The following quantization types are available. This table shows the
     ID of the quantization format, its name, the file size of 7B model
     weights that use it, and finally the amount of quality badness it
     introduces as measured by the llamafile-perplexity tool averaged over
     128 chunks with the TinyLLaMA 1.1B v1.0 Chat model. Rows are ordered
     in accordance with how recommended the quantization format is for
     general usage.

     -  18  Q6_K    5.6gb  +0.0446 ppl (q6 kawrakow)
     -   7  Q8_0    7.2gb  +0.0022 ppl (q8 gerganov)
     -   1  F16      14gb  +0.0000 ppl (best but biggest)
     -   8  Q5_0    4.7gb  +0.0817 ppl (q5 gerganov zero)
     -  17  Q5_K_M  4.8gb  +0.0836 ppl (q5 kawrakow medium)
     -  16  Q5_K_S  4.7gb  +0.1049 ppl (q5 kawrakow small)
     -  15  Q4_K_M  4.1gb  +0.3132 ppl (q4 kawrakow medium)
     -  14  Q4_K_S  3.9gb  +0.3408 ppl (q4 kawrakow small)
     -  13  Q3_K_L  3.6gb  +0.5736 ppl (q3 kawrakow large)
     -  12  Q3_K_M  3.3gb  +0.7612 ppl (q3 kawrakow medium)
     -  11  Q3_K_S  3.0gb  +1.3834 ppl (q3 kawrakow small)
     -  10  Q2_K    2.6gb  +4.2359 ppl (tiniest hallucinates most)
     -  32  BF16     14gb  +0.0000 ppl (canonical but cpu/cuda only)
     -   0  F32      27gb   9.0952 ppl (reference point)
     -   2  Q4_0    3.9gb  +0.3339 ppl (legacy)
     -   3  Q4_1    4.3gb  +0.4163 ppl (legacy)
     -   9  Q5_1    5.1gb  +0.1091 ppl (legacy)
     -  12  Q3_K    alias for Q3_K_M
     -  15  Q4_K    alias for Q4_K_M
     -  17  Q5_K    alias for Q5_K_M
     -  COPY        Only copy tensors, no quantizing.
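
     For example, the COPY type listed above rewrites a model without
     quantizing it, producing a fresh GGUF file with identical weights
     (filenames are illustrative):

           # hypothetical filenames; COPY copies tensors unmodified
           llamafile-quantize llama-7b.f16.gguf llama-7b.copy.gguf COPY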

SEE ALSO
     llamafile(1), llamafile-imatrix(1), llamafile-perplexity(1),
     llava-quantize(1), zipalign(1), unzip(1)

Llamafile Manual             December 5, 2023            LLAMAFILE-QUANTIZE(1)