I'm currently optimizing quantized models of GLM-4.6 for 192 GiB of VRAM.
I was trying to follow the steps for automatic quantization outlined in https://huggingface.co/turboderp/GLM-4.5-Air-exl3/discussions/2, however the results are not satisfactory and can even be a regression.
Base repo: https://huggingface.co/mratsim/glm-4.6-exl3
Quantized models and metrics:
| Quant | Size | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|
| 3bpw | 124 GiB | 0.32625636 | 0.30842110 | 4.36145115 | 0.8409 | 0.5497 | 0.3022 | 0.1527 | 0.0695 |
| 4bpw | 165 GiB | 0.15579397 | 0.15313307 | 4.64835933 | 0.8969 | 0.6892 | 0.4609 | 0.2840 | 0.1611 |
| 5bpw | 206 GiB | 0.11346048 | 0.10777174 | 4.46847223 | 0.9172 | 0.7553 | 0.5610 | 0.3868 | 0.2486 |
| 6bpw | 247 GiB | 0.08243355 | 0.07828716 | 4.46603787 | 0.9336 | 0.7970 | 0.6218 | 0.4600 | 0.3226 |
| 8bpw | 328 GiB | 0.06771311 | 0.06660905 | 4.61223994 | 0.9441 | 0.8221 | 0.6663 | 0.5155 | 0.3780 |
| FP16 | 656 GiB | | | 4.62864232 | | | | | |
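For context on the metrics: KL-divergence is computed token by token between the quantized and FP16 output distributions, and the Top-K columns report top-K agreement between the two models. Here is a minimal sketch of the definitions I assume model_diff uses (this is not its actual code):

```python
import torch
import torch.nn.functional as F

def compare_logits(logits_a: torch.Tensor, logits_b: torch.Tensor, top_k: int = 5):
    """Compare two models' logits of shape (num_tokens, vocab_size).
    Sketch of the metric definitions assumed above, not eval/model_diff.py."""
    logp_a = F.log_softmax(logits_a.float(), dim=-1)
    logp_b = F.log_softmax(logits_b.float(), dim=-1)
    # KL(A, B) = mean over tokens of sum_x P_A(x) * (log P_A(x) - log P_B(x))
    kl_ab = (logp_a.exp() * (logp_a - logp_b)).sum(-1).mean().item()
    kl_ba = (logp_b.exp() * (logp_b - logp_a)).sum(-1).mean().item()
    # Top-K agreement: fraction of tokens where both models have the same
    # set of K highest-probability tokens
    agreement = {}
    for k in range(1, top_k + 1):
        top_a = logp_a.topk(k, dim=-1).indices.sort(-1).values
        top_b = logp_b.topk(k, dim=-1).indices.sort(-1).values
        agreement[k] = (top_a == top_b).all(-1).float().mean().item()
    return kl_ab, kl_ba, agreement
```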
I created several measurement files with `-l3` to experiment with the tooling:
- 3vs4vs5 (`-l3`): json, markdown
- 4vs5vs6 (`-l3`): json, markdown
- 6vs8 (`-l3`): json
- 3vs4vs5vs6vs8 (`-l3`): json, markdown
However, when targeting for example 4.16bpw, using the 345 or 34568 measurements led to worse KL-divergence than the naive 4bpw quant.
model_diff 345
❯ python eval/model_diff.py -ma ~/AI/local_models_exl/glm-4.6-exl3-4.16bpw-opt-auto345 -mb ~/AI/huggingface-hub/hub/models--zai-org--GLM-4.6/snapshots/be72194883d968d7923a07e2f61681ea9a2826d1 -r 10 -d 1
-- model.embed_tokens rfn_err: 0.000000 max_diff/norm: 0.000000 sqnr: 75.242439 cos_err: 0.000000
-- model.layers.0 rfn_err: 0.004801 max_diff/norm: 0.000056 sqnr: 47.673629 cos_err: 0.000010
-- model.layers.1 rfn_err: 0.006056 max_diff/norm: 0.000175 sqnr: 45.368287 cos_err: 0.000017
-- model.layers.2 rfn_err: 0.002490 max_diff/norm: 0.003875 sqnr: 41.604403 cos_err: 0.000038
-- model.layers.3 rfn_err: 0.002862 max_diff/norm: 0.003873 sqnr: 37.062941 cos_err: 0.000130
-- model.layers.4 rfn_err: 0.003530 max_diff/norm: 0.003871 sqnr: 33.206389 cos_err: 0.000314
-- model.layers.5 rfn_err: 0.004183 max_diff/norm: 0.003867 sqnr: 31.620251 cos_err: 0.000482
-- model.layers.6 rfn_err: 0.004809 max_diff/norm: 0.003864 sqnr: 30.557184 cos_err: 0.000632
-- model.layers.7 rfn_err: 0.005505 max_diff/norm: 0.003859 sqnr: 29.591156 cos_err: 0.000784
-- model.layers.8 rfn_err: 0.006521 max_diff/norm: 0.003852 sqnr: 28.351475 cos_err: 0.000972
-- model.layers.9 rfn_err: 0.007369 max_diff/norm: 0.003842 sqnr: 28.059322 cos_err: 0.001046
-- model.layers.10 rfn_err: 0.008496 max_diff/norm: 0.003829 sqnr: 27.504467 cos_err: 0.001168
-- model.layers.11 rfn_err: 0.009969 max_diff/norm: 0.003813 sqnr: 26.910608 cos_err: 0.001298
-- model.layers.12 rfn_err: 0.011207 max_diff/norm: 0.003771 sqnr: 27.692484 cos_err: 0.001035
-- model.layers.13 rfn_err: 0.012240 max_diff/norm: 0.003732 sqnr: 28.013939 cos_err: 0.000952
-- model.layers.14 rfn_err: 0.013278 max_diff/norm: 0.003706 sqnr: 27.859965 cos_err: 0.000987
-- model.layers.15 rfn_err: 0.014488 max_diff/norm: 0.003688 sqnr: 27.475070 cos_err: 0.001077
-- model.layers.16 rfn_err: 0.015545 max_diff/norm: 0.003668 sqnr: 27.225535 cos_err: 0.001141
-- model.layers.17 rfn_err: 0.016788 max_diff/norm: 0.003648 sqnr: 26.888089 cos_err: 0.001236
-- model.layers.18 rfn_err: 0.018247 max_diff/norm: 0.003633 sqnr: 26.393631 cos_err: 0.001389
-- model.layers.19 rfn_err: 0.020064 max_diff/norm: 0.003626 sqnr: 25.652910 cos_err: 0.001635
-- model.layers.20 rfn_err: 0.021852 max_diff/norm: 0.003616 sqnr: 25.046091 cos_err: 0.001879
-- model.layers.21 rfn_err: 0.023647 max_diff/norm: 0.003599 sqnr: 24.566580 cos_err: 0.002099
-- model.layers.22 rfn_err: 0.026345 max_diff/norm: 0.003854 sqnr: 23.869807 cos_err: 0.002444
-- model.layers.23 rfn_err: 0.028943 max_diff/norm: 0.003795 sqnr: 23.606185 cos_err: 0.002636
-- model.layers.24 rfn_err: 0.031321 max_diff/norm: 0.003825 sqnr: 23.569389 cos_err: 0.002709
-- model.layers.25 rfn_err: 0.034268 max_diff/norm: 0.003744 sqnr: 23.486427 cos_err: 0.002777
-- model.layers.26 rfn_err: 0.037171 max_diff/norm: 0.003817 sqnr: 23.350620 cos_err: 0.002905
-- model.layers.27 rfn_err: 0.040736 max_diff/norm: 0.003748 sqnr: 22.980581 cos_err: 0.003174
-- model.layers.28 rfn_err: 0.045640 max_diff/norm: 0.003676 sqnr: 22.612336 cos_err: 0.003525
-- model.layers.29 rfn_err: 0.050268 max_diff/norm: 0.003668 sqnr: 22.567825 cos_err: 0.003679
-- model.layers.30 rfn_err: 0.056543 max_diff/norm: 0.004978 sqnr: 22.120069 cos_err: 0.004103
-- model.layers.31 rfn_err: 0.061133 max_diff/norm: 0.005270 sqnr: 21.677616 cos_err: 0.004516
-- model.layers.32 rfn_err: 0.064945 max_diff/norm: 0.005726 sqnr: 21.529574 cos_err: 0.004632
-- model.layers.33 rfn_err: 0.071593 max_diff/norm: 0.005477 sqnr: 20.724902 cos_err: 0.005612
-- model.layers.34 rfn_err: 0.074257 max_diff/norm: 0.005108 sqnr: 20.685567 cos_err: 0.005853
-- model.layers.35 rfn_err: 0.075816 max_diff/norm: 0.004674 sqnr: 20.856596 cos_err: 0.005850
-- model.layers.36 rfn_err: 0.077740 max_diff/norm: 0.003859 sqnr: 20.873847 cos_err: 0.005943
-- model.layers.37 rfn_err: 0.080754 max_diff/norm: 0.003677 sqnr: 20.737010 cos_err: 0.006021
-- model.layers.38 rfn_err: 0.082001 max_diff/norm: 0.003974 sqnr: 20.769845 cos_err: 0.006121
-- model.layers.39 rfn_err: 0.083756 max_diff/norm: 0.004280 sqnr: 20.915294 cos_err: 0.006209
-- model.layers.40 rfn_err: 0.086298 max_diff/norm: 0.005391 sqnr: 20.792369 cos_err: 0.006650
-- model.layers.41 rfn_err: 0.089854 max_diff/norm: 0.004608 sqnr: 20.584245 cos_err: 0.006890
-- model.layers.42 rfn_err: 0.093056 max_diff/norm: 0.005979 sqnr: 20.474941 cos_err: 0.007519
-- model.layers.43 rfn_err: 0.098234 max_diff/norm: 0.007441 sqnr: 20.304377 cos_err: 0.008440
-- model.layers.44 rfn_err: 0.103605 max_diff/norm: 0.007659 sqnr: 20.120711 cos_err: 0.009663
-- model.layers.45 rfn_err: 0.118800 max_diff/norm: 0.008464 sqnr: 19.820752 cos_err: 0.011479
-- model.layers.46 rfn_err: 0.126195 max_diff/norm: 0.008206 sqnr: 19.493114 cos_err: 0.012278
-- model.layers.47 rfn_err: 0.136338 max_diff/norm: 0.008316 sqnr: 19.135581 cos_err: 0.013439
-- model.layers.48 rfn_err: 0.140041 max_diff/norm: 0.011986 sqnr: 19.004967 cos_err: 0.013694
-- model.layers.49 rfn_err: 0.144284 max_diff/norm: 0.012071 sqnr: 18.764535 cos_err: 0.014448
-- model.layers.50 rfn_err: 0.151277 max_diff/norm: 0.013081 sqnr: 18.657672 cos_err: 0.015334
-- model.layers.51 rfn_err: 0.156728 max_diff/norm: 0.013309 sqnr: 18.483792 cos_err: 0.016585
-- model.layers.52 rfn_err: 0.165713 max_diff/norm: 0.029048 sqnr: 18.301432 cos_err: 0.017403
-- model.layers.53 rfn_err: 0.173107 max_diff/norm: 0.029158 sqnr: 18.150312 cos_err: 0.018510
-- model.layers.54 rfn_err: 0.177187 max_diff/norm: 0.030796 sqnr: 17.973192 cos_err: 0.018516
-- model.layers.55 rfn_err: 0.179450 max_diff/norm: 0.031450 sqnr: 17.971844 cos_err: 0.018647
-- model.layers.56 rfn_err: 0.183399 max_diff/norm: 0.031187 sqnr: 17.854060 cos_err: 0.019288
-- model.layers.57 rfn_err: 0.187285 max_diff/norm: 0.029505 sqnr: 17.642630 cos_err: 0.019657
-- model.layers.58 rfn_err: 0.192427 max_diff/norm: 0.030990 sqnr: 17.447164 cos_err: 0.021016
-- model.layers.59 rfn_err: 0.200882 max_diff/norm: 0.035351 sqnr: 17.230678 cos_err: 0.022081
-- model.layers.60 rfn_err: 0.205224 max_diff/norm: 0.038563 sqnr: 17.251613 cos_err: 0.021314
-- model.layers.61 rfn_err: 0.209168 max_diff/norm: 0.039627 sqnr: 17.034648 cos_err: 0.022098
-- model.layers.62 rfn_err: 0.212013 max_diff/norm: 0.040128 sqnr: 16.941532 cos_err: 0.022839
-- model.layers.63 rfn_err: 0.215969 max_diff/norm: 0.039015 sqnr: 16.759576 cos_err: 0.023463
-- model.layers.64 rfn_err: 0.220078 max_diff/norm: 0.041045 sqnr: 16.610650 cos_err: 0.024294
-- model.layers.65 rfn_err: 0.224325 max_diff/norm: 0.038901 sqnr: 16.392403 cos_err: 0.025255
-- model.layers.66 rfn_err: 0.227650 max_diff/norm: 0.041243 sqnr: 16.275767 cos_err: 0.025971
-- model.layers.67 rfn_err: 0.232001 max_diff/norm: 0.041406 sqnr: 16.051039 cos_err: 0.026556
-- model.layers.68 rfn_err: 0.235720 max_diff/norm: 0.043367 sqnr: 15.954840 cos_err: 0.026810
-- model.layers.69 rfn_err: 0.238157 max_diff/norm: 0.044546 sqnr: 15.872423 cos_err: 0.027087
-- model.layers.70 rfn_err: 0.240316 max_diff/norm: 0.041896 sqnr: 15.793191 cos_err: 0.027647
-- model.layers.71 rfn_err: 0.245099 max_diff/norm: 0.041933 sqnr: 15.583833 cos_err: 0.029003
-- model.layers.72 rfn_err: 0.250010 max_diff/norm: 0.041641 sqnr: 15.329537 cos_err: 0.030490
-- model.layers.73 rfn_err: 0.251969 max_diff/norm: 0.041205 sqnr: 15.246772 cos_err: 0.030898
-- model.layers.74 rfn_err: 0.255645 max_diff/norm: 0.040239 sqnr: 15.084509 cos_err: 0.031850
-- model.layers.75 rfn_err: 0.259927 max_diff/norm: 0.034118 sqnr: 14.850036 cos_err: 0.033164
-- model.layers.76 rfn_err: 0.263822 max_diff/norm: 0.033843 sqnr: 14.642419 cos_err: 0.034293
-- model.layers.77 rfn_err: 0.267549 max_diff/norm: 0.030560 sqnr: 14.446626 cos_err: 0.035578
-- model.layers.78 rfn_err: 0.268621 max_diff/norm: 0.024887 sqnr: 14.223235 cos_err: 0.036587
-- model.layers.79 rfn_err: 0.268082 max_diff/norm: 0.021839 sqnr: 14.077415 cos_err: 0.037175
-- model.layers.80 rfn_err: 0.271538 max_diff/norm: 0.020229 sqnr: 13.798549 cos_err: 0.038917
-- model.layers.81 rfn_err: 0.271883 max_diff/norm: 0.020948 sqnr: 13.608739 cos_err: 0.039357
-- model.layers.82 rfn_err: 0.272738 max_diff/norm: 0.018784 sqnr: 13.403214 cos_err: 0.040029
-- model.layers.83 rfn_err: 0.269626 max_diff/norm: 0.015710 sqnr: 13.318966 cos_err: 0.039273
-- model.layers.84 rfn_err: 0.271956 max_diff/norm: 0.010817 sqnr: 13.036554 cos_err: 0.040091
-- model.layers.85 rfn_err: 0.267653 max_diff/norm: 0.011755 sqnr: 12.957357 cos_err: 0.039251
-- model.layers.86 rfn_err: 0.262513 max_diff/norm: 0.010979 sqnr: 13.069283 cos_err: 0.037227
-- model.layers.87 rfn_err: 0.255142 max_diff/norm: 0.008756 sqnr: 13.163108 cos_err: 0.035242
-- model.layers.88 rfn_err: 0.247271 max_diff/norm: 0.006807 sqnr: 13.368607 cos_err: 0.032974
-- model.layers.89 rfn_err: 0.240385 max_diff/norm: 0.006985 sqnr: 13.581239 cos_err: 0.030954
-- model.layers.90 rfn_err: 0.244142 max_diff/norm: 0.021170 sqnr: 13.559732 cos_err: 0.031510
-- model.layers.91 rfn_err: 0.248637 max_diff/norm: 0.035771 sqnr: 13.362391 cos_err: 0.034299
-- model.norm rfn_err: 0.295658 max_diff/norm: 0.007286 sqnr: 12.223926 cos_err: 0.042538
-- A perplexity: 4.80377399
-- B perplexity: 4.62864232
-- A label in top-K:
K = 1: 0.6764
K = 2: 0.7846
K = 3: 0.8261
K = 4: 0.8526
K = 5: 0.8690
-- B label in top-K:
K = 1: 0.6833
K = 2: 0.7913
K = 3: 0.8322
K = 4: 0.8564
K = 5: 0.8715
-- Top-K agreement, A vs B:
K = 1: 0.8893
K = 2: 0.6739
K = 3: 0.4480
K = 4: 0.2723
K = 5: 0.1527
-- KL divergence (A, B): 0.18041060
-- KL divergence (B, A): 0.17299221
model_diff 456
❯ python eval/model_diff.py -ma ~/AI/local_models_exl/glm-4.6-exl3-4.16bpw-opt-auto456 -mb ~/AI/huggingface-hub/hub/models--zai-org--GLM-4.6/snapshots/be72194883d968d7923a07e2f61681ea9a2826d1 -r 10 -d 1
-- model.embed_tokens rfn_err: 0.000000 max_diff/norm: 0.000000 sqnr: 75.242439 cos_err: 0.000000
-- model.layers.0 rfn_err: 0.002532 max_diff/norm: 0.000049 sqnr: 53.135830 cos_err: 0.000003
-- model.layers.1 rfn_err: 0.003262 max_diff/norm: 0.000101 sqnr: 50.621208 cos_err: 0.000005
-- model.layers.2 rfn_err: 0.001239 max_diff/norm: 0.002313 sqnr: 47.223287 cos_err: 0.000010
-- model.layers.3 rfn_err: 0.001646 max_diff/norm: 0.002312 sqnr: 40.946838 cos_err: 0.000063
-- model.layers.4 rfn_err: 0.002302 max_diff/norm: 0.002310 sqnr: 36.608634 cos_err: 0.000175
-- model.layers.5 rfn_err: 0.002931 max_diff/norm: 0.002308 sqnr: 34.807614 cos_err: 0.000289
-- model.layers.6 rfn_err: 0.003636 max_diff/norm: 0.002306 sqnr: 33.145464 cos_err: 0.000426
-- model.layers.7 rfn_err: 0.004241 max_diff/norm: 0.002303 sqnr: 32.192519 cos_err: 0.000517
-- model.layers.8 rfn_err: 0.005101 max_diff/norm: 0.002299 sqnr: 30.917258 cos_err: 0.000640
-- model.layers.9 rfn_err: 0.005990 max_diff/norm: 0.002293 sqnr: 30.249710 cos_err: 0.000732
-- model.layers.10 rfn_err: 0.007019 max_diff/norm: 0.002286 sqnr: 29.545839 cos_err: 0.000824
-- model.layers.11 rfn_err: 0.008170 max_diff/norm: 0.002276 sqnr: 29.092319 cos_err: 0.000887
-- model.layers.12 rfn_err: 0.009106 max_diff/norm: 0.002249 sqnr: 29.963629 cos_err: 0.000690
-- model.layers.13 rfn_err: 0.010230 max_diff/norm: 0.002225 sqnr: 29.918137 cos_err: 0.000668
-- model.layers.14 rfn_err: 0.011259 max_diff/norm: 0.002210 sqnr: 29.628356 cos_err: 0.000711
-- model.layers.15 rfn_err: 0.012104 max_diff/norm: 0.002199 sqnr: 29.397842 cos_err: 0.000756
-- model.layers.16 rfn_err: 0.013175 max_diff/norm: 0.002187 sqnr: 29.014820 cos_err: 0.000820
-- model.layers.17 rfn_err: 0.014128 max_diff/norm: 0.002179 sqnr: 28.731121 cos_err: 0.000876
-- model.layers.18 rfn_err: 0.015550 max_diff/norm: 0.002172 sqnr: 28.092148 cos_err: 0.001006
-- model.layers.19 rfn_err: 0.016771 max_diff/norm: 0.002169 sqnr: 27.527229 cos_err: 0.001140
-- model.layers.20 rfn_err: 0.018327 max_diff/norm: 0.002162 sqnr: 26.867636 cos_err: 0.001319
-- model.layers.21 rfn_err: 0.020275 max_diff/norm: 0.002153 sqnr: 26.163940 cos_err: 0.001530
-- model.layers.22 rfn_err: 0.022516 max_diff/norm: 0.002310 sqnr: 25.514370 cos_err: 0.001758
-- model.layers.23 rfn_err: 0.024906 max_diff/norm: 0.002279 sqnr: 25.152491 cos_err: 0.001927
-- model.layers.24 rfn_err: 0.026701 max_diff/norm: 0.002259 sqnr: 25.221756 cos_err: 0.001960
-- model.layers.25 rfn_err: 0.029212 max_diff/norm: 0.002225 sqnr: 25.158134 cos_err: 0.002011
-- model.layers.26 rfn_err: 0.031587 max_diff/norm: 0.002192 sqnr: 25.047617 cos_err: 0.002098
-- model.layers.27 rfn_err: 0.034410 max_diff/norm: 0.002152 sqnr: 24.736043 cos_err: 0.002268
-- model.layers.28 rfn_err: 0.038380 max_diff/norm: 0.002098 sqnr: 24.358263 cos_err: 0.002505
-- model.layers.29 rfn_err: 0.041413 max_diff/norm: 0.002109 sqnr: 24.267287 cos_err: 0.002636
-- model.layers.30 rfn_err: 0.046239 max_diff/norm: 0.003405 sqnr: 23.912871 cos_err: 0.002900
-- model.layers.31 rfn_err: 0.049902 max_diff/norm: 0.003869 sqnr: 23.510458 cos_err: 0.003149
-- model.layers.32 rfn_err: 0.053044 max_diff/norm: 0.004326 sqnr: 23.385095 cos_err: 0.003226
-- model.layers.33 rfn_err: 0.057728 max_diff/norm: 0.003854 sqnr: 22.744909 cos_err: 0.003799
-- model.layers.34 rfn_err: 0.060344 max_diff/norm: 0.003758 sqnr: 22.662767 cos_err: 0.004014
-- model.layers.35 rfn_err: 0.062734 max_diff/norm: 0.003021 sqnr: 22.666019 cos_err: 0.004134
-- model.layers.36 rfn_err: 0.065708 max_diff/norm: 0.002442 sqnr: 22.512621 cos_err: 0.004308
-- model.layers.37 rfn_err: 0.068263 max_diff/norm: 0.001978 sqnr: 22.428269 cos_err: 0.004329
-- model.layers.38 rfn_err: 0.069652 max_diff/norm: 0.002430 sqnr: 22.391950 cos_err: 0.004446
-- model.layers.39 rfn_err: 0.071744 max_diff/norm: 0.004861 sqnr: 22.449684 cos_err: 0.004577
-- model.layers.40 rfn_err: 0.074143 max_diff/norm: 0.006124 sqnr: 22.333590 cos_err: 0.004949
-- model.layers.41 rfn_err: 0.075263 max_diff/norm: 0.004965 sqnr: 22.454522 cos_err: 0.004969
-- model.layers.42 rfn_err: 0.078318 max_diff/norm: 0.006107 sqnr: 22.280354 cos_err: 0.005505
-- model.layers.43 rfn_err: 0.082460 max_diff/norm: 0.007393 sqnr: 22.169781 cos_err: 0.006203
-- model.layers.44 rfn_err: 0.087518 max_diff/norm: 0.007651 sqnr: 21.980062 cos_err: 0.007232
-- model.layers.45 rfn_err: 0.101481 max_diff/norm: 0.008059 sqnr: 21.594333 cos_err: 0.008666
-- model.layers.46 rfn_err: 0.107300 max_diff/norm: 0.008223 sqnr: 21.367750 cos_err: 0.009180
-- model.layers.47 rfn_err: 0.116618 max_diff/norm: 0.007729 sqnr: 20.984628 cos_err: 0.010129
-- model.layers.48 rfn_err: 0.119954 max_diff/norm: 0.010136 sqnr: 20.803226 cos_err: 0.010325
-- model.layers.49 rfn_err: 0.123780 max_diff/norm: 0.011369 sqnr: 20.522477 cos_err: 0.010915
-- model.layers.50 rfn_err: 0.130849 max_diff/norm: 0.010430 sqnr: 20.358033 cos_err: 0.011756
-- model.layers.51 rfn_err: 0.136464 max_diff/norm: 0.012979 sqnr: 20.142103 cos_err: 0.012876
-- model.layers.52 rfn_err: 0.145445 max_diff/norm: 0.032631 sqnr: 19.989790 cos_err: 0.013518
-- model.layers.53 rfn_err: 0.152180 max_diff/norm: 0.033440 sqnr: 19.842374 cos_err: 0.014425
-- model.layers.54 rfn_err: 0.155730 max_diff/norm: 0.031477 sqnr: 19.661406 cos_err: 0.014340
-- model.layers.55 rfn_err: 0.158155 max_diff/norm: 0.030523 sqnr: 19.587984 cos_err: 0.014502
-- model.layers.56 rfn_err: 0.161787 max_diff/norm: 0.030526 sqnr: 19.442020 cos_err: 0.015048
-- model.layers.57 rfn_err: 0.164469 max_diff/norm: 0.028993 sqnr: 19.327276 cos_err: 0.015214
-- model.layers.58 rfn_err: 0.169204 max_diff/norm: 0.030337 sqnr: 19.129551 cos_err: 0.016320
-- model.layers.59 rfn_err: 0.177091 max_diff/norm: 0.034759 sqnr: 18.963173 cos_err: 0.017141
-- model.layers.60 rfn_err: 0.181524 max_diff/norm: 0.038588 sqnr: 18.934721 cos_err: 0.016556
-- model.layers.61 rfn_err: 0.184678 max_diff/norm: 0.038930 sqnr: 18.771039 cos_err: 0.017067
-- model.layers.62 rfn_err: 0.187594 max_diff/norm: 0.039985 sqnr: 18.643396 cos_err: 0.017719
-- model.layers.63 rfn_err: 0.190932 max_diff/norm: 0.038881 sqnr: 18.499075 cos_err: 0.018133
-- model.layers.64 rfn_err: 0.194831 max_diff/norm: 0.040846 sqnr: 18.309288 cos_err: 0.018819
-- model.layers.65 rfn_err: 0.198151 max_diff/norm: 0.040010 sqnr: 18.138121 cos_err: 0.019475
-- model.layers.66 rfn_err: 0.201245 max_diff/norm: 0.041287 sqnr: 17.998587 cos_err: 0.020078
-- model.layers.67 rfn_err: 0.204788 max_diff/norm: 0.041653 sqnr: 17.819825 cos_err: 0.020414
-- model.layers.68 rfn_err: 0.208380 max_diff/norm: 0.043768 sqnr: 17.684838 cos_err: 0.020661
-- model.layers.69 rfn_err: 0.210835 max_diff/norm: 0.044617 sqnr: 17.564163 cos_err: 0.020916
-- model.layers.70 rfn_err: 0.213159 max_diff/norm: 0.042813 sqnr: 17.451655 cos_err: 0.021447
-- model.layers.71 rfn_err: 0.217505 max_diff/norm: 0.042430 sqnr: 17.229203 cos_err: 0.022557
-- model.layers.72 rfn_err: 0.221413 max_diff/norm: 0.043271 sqnr: 17.025484 cos_err: 0.023622
-- model.layers.73 rfn_err: 0.223451 max_diff/norm: 0.042767 sqnr: 16.915597 cos_err: 0.023986
-- model.layers.74 rfn_err: 0.226731 max_diff/norm: 0.042294 sqnr: 16.730010 cos_err: 0.024748
-- model.layers.75 rfn_err: 0.230782 max_diff/norm: 0.035955 sqnr: 16.453500 cos_err: 0.025837
-- model.layers.76 rfn_err: 0.233688 max_diff/norm: 0.035002 sqnr: 16.281798 cos_err: 0.026598
-- model.layers.77 rfn_err: 0.236962 max_diff/norm: 0.030444 sqnr: 16.046927 cos_err: 0.027641
-- model.layers.78 rfn_err: 0.237959 max_diff/norm: 0.024984 sqnr: 15.775886 cos_err: 0.028450
-- model.layers.79 rfn_err: 0.237527 max_diff/norm: 0.022790 sqnr: 15.585866 cos_err: 0.028972
-- model.layers.80 rfn_err: 0.240708 max_diff/norm: 0.022630 sqnr: 15.287847 cos_err: 0.030421
-- model.layers.81 rfn_err: 0.241317 max_diff/norm: 0.023039 sqnr: 15.062294 cos_err: 0.030797
-- model.layers.82 rfn_err: 0.241877 max_diff/norm: 0.021682 sqnr: 14.841874 cos_err: 0.031316
-- model.layers.83 rfn_err: 0.239409 max_diff/norm: 0.019046 sqnr: 14.696652 cos_err: 0.030867
-- model.layers.84 rfn_err: 0.240849 max_diff/norm: 0.015107 sqnr: 14.417818 cos_err: 0.031428
-- model.layers.85 rfn_err: 0.236420 max_diff/norm: 0.015809 sqnr: 14.338910 cos_err: 0.030688
-- model.layers.86 rfn_err: 0.232031 max_diff/norm: 0.011901 sqnr: 14.424759 cos_err: 0.029161
-- model.layers.87 rfn_err: 0.224739 max_diff/norm: 0.010280 sqnr: 14.536193 cos_err: 0.027461
-- model.layers.88 rfn_err: 0.217719 max_diff/norm: 0.009120 sqnr: 14.719769 cos_err: 0.025702
-- model.layers.89 rfn_err: 0.211519 max_diff/norm: 0.007884 sqnr: 14.905172 cos_err: 0.024124
-- model.layers.90 rfn_err: 0.214941 max_diff/norm: 0.020400 sqnr: 14.854159 cos_err: 0.024660
-- model.layers.91 rfn_err: 0.218911 max_diff/norm: 0.033286 sqnr: 14.630923 cos_err: 0.027068
-- model.norm rfn_err: 0.262886 max_diff/norm: 0.007201 sqnr: 13.483061 cos_err: 0.033412
-- A perplexity: 4.73691167
-- B perplexity: 4.62864232
-- A label in top-K:
K = 1: 0.6770
K = 2: 0.7848
K = 3: 0.8273
K = 4: 0.8511
K = 5: 0.8695
-- B label in top-K:
K = 1: 0.6833
K = 2: 0.7913
K = 3: 0.8322
K = 4: 0.8564
K = 5: 0.8715
-- Top-K agreement, A vs B:
K = 1: 0.9033
K = 2: 0.7071
K = 3: 0.4890
K = 4: 0.3108
K = 5: 0.1849
-- KL divergence (A, B): 0.14056166
-- KL divergence (B, A): 0.14544851
model_diff 34568
❯ python eval/model_diff.py -ma ~/AI/local_models_exl/glm-4.6-exl3-4.16bpw-opt-auto34568 -mb ~/AI/huggingface-hub/hub/models--zai-org--GLM-4.6/snapshots/be72194883d968d7923a07e2f61681ea9a2826d1 -r 10 -d 0
-- model.embed_tokens rfn_err: 0.000000 max_diff/norm: 0.000000 sqnr: 75.242439 cos_err: 0.000000
-- model.layers.0 rfn_err: 0.004801 max_diff/norm: 0.000062 sqnr: 47.671122 cos_err: 0.000010
-- model.layers.1 rfn_err: 0.004016 max_diff/norm: 0.000087 sqnr: 49.378177 cos_err: 0.000007
-- model.layers.2 rfn_err: 0.001334 max_diff/norm: 0.002192 sqnr: 45.442032 cos_err: 0.000016
-- model.layers.3 rfn_err: 0.001858 max_diff/norm: 0.002191 sqnr: 38.377479 cos_err: 0.000102
-- model.layers.4 rfn_err: 0.002727 max_diff/norm: 0.002190 sqnr: 33.760068 cos_err: 0.000285
-- model.layers.5 rfn_err: 0.003467 max_diff/norm: 0.002188 sqnr: 32.155714 cos_err: 0.000442
-- model.layers.6 rfn_err: 0.004160 max_diff/norm: 0.002186 sqnr: 31.058913 cos_err: 0.000585
-- model.layers.7 rfn_err: 0.004934 max_diff/norm: 0.002184 sqnr: 29.988504 cos_err: 0.000735
-- model.layers.8 rfn_err: 0.006047 max_diff/norm: 0.002181 sqnr: 28.591395 cos_err: 0.000936
-- model.layers.9 rfn_err: 0.006868 max_diff/norm: 0.002176 sqnr: 28.399268 cos_err: 0.000991
-- model.layers.10 rfn_err: 0.008019 max_diff/norm: 0.002171 sqnr: 27.801595 cos_err: 0.001108
-- model.layers.11 rfn_err: 0.009546 max_diff/norm: 0.002165 sqnr: 27.133882 cos_err: 0.001246
-- model.layers.12 rfn_err: 0.010652 max_diff/norm: 0.002143 sqnr: 28.040759 cos_err: 0.000970
-- model.layers.13 rfn_err: 0.011709 max_diff/norm: 0.002124 sqnr: 28.326563 cos_err: 0.000899
-- model.layers.14 rfn_err: 0.012670 max_diff/norm: 0.002111 sqnr: 28.231108 cos_err: 0.000924
-- model.layers.15 rfn_err: 0.013916 max_diff/norm: 0.002101 sqnr: 27.798158 cos_err: 0.001019
-- model.layers.16 rfn_err: 0.014996 max_diff/norm: 0.002092 sqnr: 27.523086 cos_err: 0.001087
-- model.layers.17 rfn_err: 0.016273 max_diff/norm: 0.002157 sqnr: 27.159690 cos_err: 0.001186
-- model.layers.18 rfn_err: 0.017481 max_diff/norm: 0.002381 sqnr: 26.785971 cos_err: 0.001300
-- model.layers.19 rfn_err: 0.019318 max_diff/norm: 0.002553 sqnr: 25.996163 cos_err: 0.001543
-- model.layers.20 rfn_err: 0.021092 max_diff/norm: 0.002635 sqnr: 25.372894 cos_err: 0.001777
-- model.layers.21 rfn_err: 0.022896 max_diff/norm: 0.002310 sqnr: 24.867356 cos_err: 0.001995
-- model.layers.22 rfn_err: 0.025654 max_diff/norm: 0.002476 sqnr: 24.108428 cos_err: 0.002345
-- model.layers.23 rfn_err: 0.028275 max_diff/norm: 0.002308 sqnr: 23.814350 cos_err: 0.002542
-- model.layers.24 rfn_err: 0.030284 max_diff/norm: 0.002365 sqnr: 23.913594 cos_err: 0.002559
-- model.layers.25 rfn_err: 0.033280 max_diff/norm: 0.002338 sqnr: 23.783551 cos_err: 0.002638
-- model.layers.26 rfn_err: 0.036182 max_diff/norm: 0.002460 sqnr: 23.619819 cos_err: 0.002769
-- model.layers.27 rfn_err: 0.039503 max_diff/norm: 0.002432 sqnr: 23.293971 cos_err: 0.002997
-- model.layers.28 rfn_err: 0.044661 max_diff/norm: 0.002407 sqnr: 22.886405 cos_err: 0.003351
-- model.layers.29 rfn_err: 0.049707 max_diff/norm: 0.002701 sqnr: 22.810929 cos_err: 0.003526
-- model.layers.30 rfn_err: 0.055282 max_diff/norm: 0.005096 sqnr: 22.545093 cos_err: 0.003807
-- model.layers.31 rfn_err: 0.059929 max_diff/norm: 0.005373 sqnr: 22.005276 cos_err: 0.004237
-- model.layers.32 rfn_err: 0.063583 max_diff/norm: 0.005763 sqnr: 21.835262 cos_err: 0.004354
-- model.layers.33 rfn_err: 0.070059 max_diff/norm: 0.005580 sqnr: 21.004716 cos_err: 0.005292
-- model.layers.34 rfn_err: 0.072313 max_diff/norm: 0.005106 sqnr: 21.007866 cos_err: 0.005481
-- model.layers.35 rfn_err: 0.074280 max_diff/norm: 0.004701 sqnr: 21.088889 cos_err: 0.005552
-- model.layers.36 rfn_err: 0.075936 max_diff/norm: 0.004034 sqnr: 21.102996 cos_err: 0.005641
-- model.layers.37 rfn_err: 0.078887 max_diff/norm: 0.002713 sqnr: 20.932547 cos_err: 0.005731
-- model.layers.38 rfn_err: 0.080570 max_diff/norm: 0.003048 sqnr: 20.898941 cos_err: 0.005885
-- model.layers.39 rfn_err: 0.082182 max_diff/norm: 0.004528 sqnr: 21.057416 cos_err: 0.005952
-- model.layers.40 rfn_err: 0.084676 max_diff/norm: 0.005755 sqnr: 20.942988 cos_err: 0.006384
-- model.layers.41 rfn_err: 0.088541 max_diff/norm: 0.006502 sqnr: 20.700569 cos_err: 0.006700
-- model.layers.42 rfn_err: 0.091588 max_diff/norm: 0.007034 sqnr: 20.623119 cos_err: 0.007307
-- model.layers.43 rfn_err: 0.096719 max_diff/norm: 0.007337 sqnr: 20.485011 cos_err: 0.008261
-- model.layers.44 rfn_err: 0.102608 max_diff/norm: 0.007284 sqnr: 20.287901 cos_err: 0.009546
-- model.layers.45 rfn_err: 0.120273 max_diff/norm: 0.007885 sqnr: 19.981555 cos_err: 0.011458
-- model.layers.46 rfn_err: 0.127702 max_diff/norm: 0.008090 sqnr: 19.646974 cos_err: 0.012262
-- model.layers.47 rfn_err: 0.138129 max_diff/norm: 0.007496 sqnr: 19.283837 cos_err: 0.013442
-- model.layers.48 rfn_err: 0.141686 max_diff/norm: 0.009767 sqnr: 19.136065 cos_err: 0.013699
-- model.layers.49 rfn_err: 0.145800 max_diff/norm: 0.010983 sqnr: 18.885226 cos_err: 0.014449
-- model.layers.50 rfn_err: 0.153248 max_diff/norm: 0.009922 sqnr: 18.772170 cos_err: 0.015318
-- model.layers.51 rfn_err: 0.159106 max_diff/norm: 0.012947 sqnr: 18.588662 cos_err: 0.016536
-- model.layers.52 rfn_err: 0.169186 max_diff/norm: 0.029282 sqnr: 18.399767 cos_err: 0.017261
-- model.layers.53 rfn_err: 0.175998 max_diff/norm: 0.029242 sqnr: 18.306757 cos_err: 0.018261
-- model.layers.54 rfn_err: 0.179752 max_diff/norm: 0.028895 sqnr: 18.145131 cos_err: 0.018136
-- model.layers.55 rfn_err: 0.181809 max_diff/norm: 0.029798 sqnr: 18.126870 cos_err: 0.018249
-- model.layers.56 rfn_err: 0.185694 max_diff/norm: 0.029619 sqnr: 17.997456 cos_err: 0.018889
-- model.layers.57 rfn_err: 0.189593 max_diff/norm: 0.028585 sqnr: 17.758422 cos_err: 0.019273
-- model.layers.58 rfn_err: 0.194507 max_diff/norm: 0.030761 sqnr: 17.603675 cos_err: 0.020593
-- model.layers.59 rfn_err: 0.203304 max_diff/norm: 0.033140 sqnr: 17.375347 cos_err: 0.021648
-- model.layers.60 rfn_err: 0.208379 max_diff/norm: 0.039284 sqnr: 17.388323 cos_err: 0.020871
-- model.layers.61 rfn_err: 0.212062 max_diff/norm: 0.039628 sqnr: 17.168210 cos_err: 0.021596
-- model.layers.62 rfn_err: 0.214769 max_diff/norm: 0.041122 sqnr: 17.074055 cos_err: 0.022305
-- model.layers.63 rfn_err: 0.218505 max_diff/norm: 0.040156 sqnr: 16.887483 cos_err: 0.022940
-- model.layers.64 rfn_err: 0.222390 max_diff/norm: 0.042293 sqnr: 16.750882 cos_err: 0.023703
-- model.layers.65 rfn_err: 0.226219 max_diff/norm: 0.040061 sqnr: 16.523288 cos_err: 0.024654
-- model.layers.66 rfn_err: 0.229653 max_diff/norm: 0.042106 sqnr: 16.400857 cos_err: 0.025384
-- model.layers.67 rfn_err: 0.233888 max_diff/norm: 0.042852 sqnr: 16.173373 cos_err: 0.025956
-- model.layers.68 rfn_err: 0.238147 max_diff/norm: 0.044316 sqnr: 16.005759 cos_err: 0.026386
-- model.layers.69 rfn_err: 0.240992 max_diff/norm: 0.045643 sqnr: 15.866993 cos_err: 0.026807
-- model.layers.70 rfn_err: 0.242853 max_diff/norm: 0.042445 sqnr: 15.782427 cos_err: 0.027416
-- model.layers.71 rfn_err: 0.247450 max_diff/norm: 0.042845 sqnr: 15.575218 cos_err: 0.028779
-- model.layers.72 rfn_err: 0.252320 max_diff/norm: 0.041638 sqnr: 15.320354 cos_err: 0.030286
-- model.layers.73 rfn_err: 0.253841 max_diff/norm: 0.040118 sqnr: 15.239062 cos_err: 0.030684
-- model.layers.74 rfn_err: 0.257439 max_diff/norm: 0.039201 sqnr: 15.072786 cos_err: 0.031639
-- model.layers.75 rfn_err: 0.260855 max_diff/norm: 0.033231 sqnr: 14.843467 cos_err: 0.032934
-- model.layers.76 rfn_err: 0.264386 max_diff/norm: 0.032177 sqnr: 14.637920 cos_err: 0.034035
-- model.layers.77 rfn_err: 0.268455 max_diff/norm: 0.029274 sqnr: 14.373125 cos_err: 0.035574
-- model.layers.78 rfn_err: 0.269103 max_diff/norm: 0.022844 sqnr: 14.146247 cos_err: 0.036600
-- model.layers.79 rfn_err: 0.268272 max_diff/norm: 0.019497 sqnr: 14.004250 cos_err: 0.037214
-- model.layers.80 rfn_err: 0.271593 max_diff/norm: 0.019457 sqnr: 13.727531 cos_err: 0.038988
-- model.layers.81 rfn_err: 0.272284 max_diff/norm: 0.020697 sqnr: 13.532661 cos_err: 0.039480
-- model.layers.82 rfn_err: 0.273023 max_diff/norm: 0.017999 sqnr: 13.334535 cos_err: 0.040139
-- model.layers.83 rfn_err: 0.269905 max_diff/norm: 0.016367 sqnr: 13.260221 cos_err: 0.039398
-- model.layers.84 rfn_err: 0.272259 max_diff/norm: 0.012739 sqnr: 12.985926 cos_err: 0.040222
-- model.layers.85 rfn_err: 0.267887 max_diff/norm: 0.012591 sqnr: 12.916644 cos_err: 0.039367
-- model.layers.86 rfn_err: 0.262563 max_diff/norm: 0.012338 sqnr: 13.037847 cos_err: 0.037329
-- model.layers.87 rfn_err: 0.253985 max_diff/norm: 0.008802 sqnr: 13.196373 cos_err: 0.035012
-- model.layers.88 rfn_err: 0.246187 max_diff/norm: 0.007470 sqnr: 13.400582 cos_err: 0.032764
-- model.layers.89 rfn_err: 0.239511 max_diff/norm: 0.007428 sqnr: 13.604201 cos_err: 0.030782
-- model.layers.90 rfn_err: 0.243755 max_diff/norm: 0.021364 sqnr: 13.574289 cos_err: 0.031420
-- model.layers.91 rfn_err: 0.248954 max_diff/norm: 0.037087 sqnr: 13.363897 cos_err: 0.034413
-- model.norm rfn_err: 0.295701 max_diff/norm: 0.007558 sqnr: 12.214710 cos_err: 0.042486
-- A perplexity: 4.72839006
-- B perplexity: 4.62864232
-- A label in top-K:
K = 1: 0.6799
K = 2: 0.7870
K = 3: 0.8288
K = 4: 0.8525
K = 5: 0.8699
-- B label in top-K:
K = 1: 0.6833
K = 2: 0.7913
K = 3: 0.8322
K = 4: 0.8564
K = 5: 0.8715
-- Top-K agreement, A vs B:
K = 1: 0.8874
K = 2: 0.6695
K = 3: 0.4413
K = 4: 0.2641
K = 5: 0.1494
-- KL divergence (A, B): 0.17907805
-- KL divergence (B, A): 0.17151952
Manual tuning
❯ python eval/model_diff.py -ma ~/AI/local_models_exl/glm-4.6-exl3-4bpw -mb ~/AI/huggingface-hub/hub/models--zai-org--GLM-4.6/snapshots/be72194883d968d7923a07e2f61681ea9a2826d1 -r 10 -d 1 -or /home/beta/AI/local_models_exl/glm-4.6-overrides16.yaml
-- Overriding from: ~/AI/local_models_exl/glm-4.6-exl3-6bpw:
model.layers.*.self_attn.q_proj.*
model.layers.90.*.down_proj
model.layers.91.*.o_proj
model.layers.91.*.down_proj
-- Overriding from: ~/AI/local_models_exl/glm-4.6-exl3-8bpw:
model.layers.*.self_attn.k_proj.*
model.layers.*.self_attn.v_proj.*
model.layers.*.self_attn.o_proj.*
lm_head.*
model.layers.*.mlp.gate.*
model.layers.*.mlp.shared_experts.*
model.layers.*.input_layernorm.*
model.layers.*.post_attention_layernorm.*
model.layers.0.*
model.layers.1.*
model.layers.2.*
-- Overriding from: ~/AI/local_models/GLM-4.6:
model.embed_tokens.*
model.norm.*
-- model.embed_tokens rfn_err: 0.000000 max_diff/norm: 0.000000 sqnr: 75.242439 cos_err: 0.000000
-- model.layers.0 rfn_err: 0.000922 max_diff/norm: 0.000012 sqnr: 60.968014 cos_err: 0.000000
-- model.layers.1 rfn_err: 0.001415 max_diff/norm: 0.000037 sqnr: 57.364590 cos_err: 0.000001
-- model.layers.2 rfn_err: 0.000454 max_diff/norm: 0.000977 sqnr: 56.145621 cos_err: 0.000001
-- model.layers.3 rfn_err: 0.000944 max_diff/norm: 0.000977 sqnr: 44.181575 cos_err: 0.000037
-- model.layers.4 rfn_err: 0.001584 max_diff/norm: 0.000977 sqnr: 39.743400 cos_err: 0.000110
-- model.layers.5 rfn_err: 0.002195 max_diff/norm: 0.000976 sqnr: 37.732329 cos_err: 0.000191
-- model.layers.6 rfn_err: 0.002899 max_diff/norm: 0.000975 sqnr: 35.639971 cos_err: 0.000301
-- model.layers.7 rfn_err: 0.003581 max_diff/norm: 0.000973 sqnr: 34.182683 cos_err: 0.000388
-- model.layers.8 rfn_err: 0.004360 max_diff/norm: 0.000972 sqnr: 32.898145 cos_err: 0.000483
-- model.layers.9 rfn_err: 0.005223 max_diff/norm: 0.000969 sqnr: 31.969753 cos_err: 0.000569
-- model.layers.10 rfn_err: 0.006153 max_diff/norm: 0.000966 sqnr: 31.267616 cos_err: 0.000642
-- model.layers.11 rfn_err: 0.007325 max_diff/norm: 0.000962 sqnr: 30.534345 cos_err: 0.000714
-- model.layers.12 rfn_err: 0.008354 max_diff/norm: 0.000950 sqnr: 31.079183 cos_err: 0.000580
-- model.layers.13 rfn_err: 0.009427 max_diff/norm: 0.000941 sqnr: 30.953198 cos_err: 0.000566
-- model.layers.14 rfn_err: 0.010569 max_diff/norm: 0.000935 sqnr: 30.440350 cos_err: 0.000624
-- model.layers.15 rfn_err: 0.011708 max_diff/norm: 0.000930 sqnr: 29.920236 cos_err: 0.000700
-- model.layers.16 rfn_err: 0.012932 max_diff/norm: 0.000925 sqnr: 29.410718 cos_err: 0.000779
-- model.layers.17 rfn_err: 0.014147 max_diff/norm: 0.000921 sqnr: 28.946627 cos_err: 0.000859
-- model.layers.18 rfn_err: 0.015539 max_diff/norm: 0.000917 sqnr: 28.321188 cos_err: 0.000980
-- model.layers.19 rfn_err: 0.017031 max_diff/norm: 0.000915 sqnr: 27.600716 cos_err: 0.001143
-- model.layers.20 rfn_err: 0.018810 max_diff/norm: 0.000912 sqnr: 26.832355 cos_err: 0.001343
-- model.layers.21 rfn_err: 0.020567 max_diff/norm: 0.000907 sqnr: 26.246714 cos_err: 0.001518
-- model.layers.22 rfn_err: 0.022336 max_diff/norm: 0.000757 sqnr: 25.820743 cos_err: 0.001671
-- model.layers.23 rfn_err: 0.024470 max_diff/norm: 0.000746 sqnr: 25.525657 cos_err: 0.001801
-- model.layers.24 rfn_err: 0.026924 max_diff/norm: 0.000678 sqnr: 25.273643 cos_err: 0.001929
-- model.layers.25 rfn_err: 0.029546 max_diff/norm: 0.000745 sqnr: 25.167650 cos_err: 0.001997
-- model.layers.26 rfn_err: 0.031937 max_diff/norm: 0.000853 sqnr: 25.033906 cos_err: 0.002082
-- model.layers.27 rfn_err: 0.034765 max_diff/norm: 0.000776 sqnr: 24.707451 cos_err: 0.002252
-- model.layers.28 rfn_err: 0.037870 max_diff/norm: 0.000717 sqnr: 24.527701 cos_err: 0.002394
-- model.layers.29 rfn_err: 0.041269 max_diff/norm: 0.001308 sqnr: 24.316615 cos_err: 0.002560
-- model.layers.30 rfn_err: 0.045394 max_diff/norm: 0.002137 sqnr: 24.155106 cos_err: 0.002725
-- model.layers.31 rfn_err: 0.047965 max_diff/norm: 0.002808 sqnr: 23.986142 cos_err: 0.002851
-- model.layers.32 rfn_err: 0.050378 max_diff/norm: 0.004302 sqnr: 23.994594 cos_err: 0.002871
-- model.layers.33 rfn_err: 0.054449 max_diff/norm: 0.003877 sqnr: 23.418021 cos_err: 0.003344
-- model.layers.34 rfn_err: 0.057369 max_diff/norm: 0.004181 sqnr: 23.222167 cos_err: 0.003592
-- model.layers.35 rfn_err: 0.060328 max_diff/norm: 0.003586 sqnr: 23.067767 cos_err: 0.003784
-- model.layers.36 rfn_err: 0.062923 max_diff/norm: 0.002658 sqnr: 22.937624 cos_err: 0.003935
-- model.layers.37 rfn_err: 0.065036 max_diff/norm: 0.001946 sqnr: 22.917412 cos_err: 0.003923
-- model.layers.38 rfn_err: 0.066213 max_diff/norm: 0.002444 sqnr: 22.898321 cos_err: 0.004007
-- model.layers.39 rfn_err: 0.068519 max_diff/norm: 0.005002 sqnr: 22.891829 cos_err: 0.004148
-- model.layers.40 rfn_err: 0.071642 max_diff/norm: 0.006128 sqnr: 22.599321 cos_err: 0.004555
-- model.layers.41 rfn_err: 0.072421 max_diff/norm: 0.004967 sqnr: 22.827911 cos_err: 0.004509
-- model.layers.42 rfn_err: 0.075582 max_diff/norm: 0.005999 sqnr: 22.596518 cos_err: 0.004964
-- model.layers.43 rfn_err: 0.080700 max_diff/norm: 0.007218 sqnr: 22.346673 cos_err: 0.005688
-- model.layers.44 rfn_err: 0.085216 max_diff/norm: 0.007340 sqnr: 22.154708 cos_err: 0.006574
-- model.layers.45 rfn_err: 0.100086 max_diff/norm: 0.007178 sqnr: 21.748468 cos_err: 0.008123
-- model.layers.46 rfn_err: 0.106251 max_diff/norm: 0.007515 sqnr: 21.487101 cos_err: 0.008683
-- model.layers.47 rfn_err: 0.115341 max_diff/norm: 0.007114 sqnr: 21.109852 cos_err: 0.009572
-- model.layers.48 rfn_err: 0.118420 max_diff/norm: 0.009359 sqnr: 20.932110 cos_err: 0.009783
-- model.layers.49 rfn_err: 0.122065 max_diff/norm: 0.010556 sqnr: 20.663009 cos_err: 0.010339
-- model.layers.50 rfn_err: 0.128889 max_diff/norm: 0.009785 sqnr: 20.490835 cos_err: 0.011111
-- model.layers.51 rfn_err: 0.134414 max_diff/norm: 0.015034 sqnr: 20.265464 cos_err: 0.012154
-- model.layers.52 rfn_err: 0.142885 max_diff/norm: 0.031147 sqnr: 20.142341 cos_err: 0.012748
-- model.layers.53 rfn_err: 0.149482 max_diff/norm: 0.032118 sqnr: 20.008559 cos_err: 0.013626
-- model.layers.54 rfn_err: 0.152856 max_diff/norm: 0.030293 sqnr: 19.844967 cos_err: 0.013526
-- model.layers.55 rfn_err: 0.155421 max_diff/norm: 0.029456 sqnr: 19.756365 cos_err: 0.013712
-- model.layers.56 rfn_err: 0.159261 max_diff/norm: 0.029467 sqnr: 19.609270 cos_err: 0.014269
-- model.layers.57 rfn_err: 0.162118 max_diff/norm: 0.028479 sqnr: 19.486451 cos_err: 0.014453
-- model.layers.58 rfn_err: 0.166724 max_diff/norm: 0.029727 sqnr: 19.303817 cos_err: 0.015534
-- model.layers.59 rfn_err: 0.174411 max_diff/norm: 0.033494 sqnr: 19.132267 cos_err: 0.016314
-- model.layers.60 rfn_err: 0.178963 max_diff/norm: 0.041998 sqnr: 19.091227 cos_err: 0.015774
-- model.layers.61 rfn_err: 0.182064 max_diff/norm: 0.043140 sqnr: 18.916406 cos_err: 0.016259
-- model.layers.62 rfn_err: 0.184897 max_diff/norm: 0.044360 sqnr: 18.781384 cos_err: 0.016903
-- model.layers.63 rfn_err: 0.188195 max_diff/norm: 0.044802 sqnr: 18.633078 cos_err: 0.017350
-- model.layers.64 rfn_err: 0.191926 max_diff/norm: 0.045659 sqnr: 18.442857 cos_err: 0.018003
-- model.layers.65 rfn_err: 0.195238 max_diff/norm: 0.044285 sqnr: 18.269418 cos_err: 0.018676
-- model.layers.66 rfn_err: 0.198210 max_diff/norm: 0.045031 sqnr: 18.127506 cos_err: 0.019266
-- model.layers.67 rfn_err: 0.201853 max_diff/norm: 0.046032 sqnr: 17.947338 cos_err: 0.019628
-- model.layers.68 rfn_err: 0.205539 max_diff/norm: 0.048573 sqnr: 17.801687 cos_err: 0.019904
-- model.layers.69 rfn_err: 0.207829 max_diff/norm: 0.050512 sqnr: 17.672481 cos_err: 0.020172
-- model.layers.70 rfn_err: 0.210149 max_diff/norm: 0.047816 sqnr: 17.555425 cos_err: 0.020708
-- model.layers.71 rfn_err: 0.214243 max_diff/norm: 0.048081 sqnr: 17.333431 cos_err: 0.021768
-- model.layers.72 rfn_err: 0.218078 max_diff/norm: 0.047687 sqnr: 17.125988 cos_err: 0.022842
-- model.layers.73 rfn_err: 0.219998 max_diff/norm: 0.046232 sqnr: 17.017382 cos_err: 0.023200
-- model.layers.74 rfn_err: 0.223214 max_diff/norm: 0.044563 sqnr: 16.828606 cos_err: 0.023952
-- model.layers.75 rfn_err: 0.226931 max_diff/norm: 0.039279 sqnr: 16.565793 cos_err: 0.025023
-- model.layers.76 rfn_err: 0.229737 max_diff/norm: 0.038378 sqnr: 16.391543 cos_err: 0.025766
-- model.layers.77 rfn_err: 0.233201 max_diff/norm: 0.035251 sqnr: 16.154316 cos_err: 0.026848
-- model.layers.78 rfn_err: 0.233974 max_diff/norm: 0.027466 sqnr: 15.894395 cos_err: 0.027643
-- model.layers.79 rfn_err: 0.233457 max_diff/norm: 0.021765 sqnr: 15.712711 cos_err: 0.028180
-- model.layers.80 rfn_err: 0.236190 max_diff/norm: 0.020170 sqnr: 15.436672 cos_err: 0.029506
-- model.layers.81 rfn_err: 0.236443 max_diff/norm: 0.021151 sqnr: 15.230884 cos_err: 0.029833
-- model.layers.82 rfn_err: 0.236501 max_diff/norm: 0.019090 sqnr: 15.037964 cos_err: 0.030261
-- model.layers.83 rfn_err: 0.233880 max_diff/norm: 0.015689 sqnr: 14.922484 cos_err: 0.029713
-- model.layers.84 rfn_err: 0.234975 max_diff/norm: 0.010838 sqnr: 14.664852 cos_err: 0.030163
-- model.layers.85 rfn_err: 0.230226 max_diff/norm: 0.010916 sqnr: 14.602165 cos_err: 0.029352
-- model.layers.86 rfn_err: 0.225762 max_diff/norm: 0.012293 sqnr: 14.692855 cos_err: 0.027846
-- model.layers.87 rfn_err: 0.217746 max_diff/norm: 0.008983 sqnr: 14.855135 cos_err: 0.026032
-- model.layers.88 rfn_err: 0.210910 max_diff/norm: 0.007671 sqnr: 15.050432 cos_err: 0.024335
-- model.layers.89 rfn_err: 0.205014 max_diff/norm: 0.008238 sqnr: 15.238875 cos_err: 0.022822
-- model.layers.90 rfn_err: 0.208433 max_diff/norm: 0.021522 sqnr: 15.182524 cos_err: 0.023324
-- model.layers.91 rfn_err: 0.212039 max_diff/norm: 0.035094 sqnr: 14.952269 cos_err: 0.025479
-- model.norm rfn_err: 0.255019 max_diff/norm: 0.007259 sqnr: 13.788102 cos_err: 0.031612
-- A perplexity: 4.65080432
-- B perplexity: 4.62864232
-- A label in top-K:
K = 1: 0.6828
K = 2: 0.7886
K = 3: 0.8310
K = 4: 0.8557
K = 5: 0.8728
-- B label in top-K:
K = 1: 0.6833
K = 2: 0.7913
K = 3: 0.8322
K = 4: 0.8564
K = 5: 0.8715
-- Top-K agreement, A vs B:
K = 1: 0.9061
K = 2: 0.7136
K = 3: 0.5029
K = 4: 0.3273
K = 5: 0.2002
-- KL divergence (A, B): 0.13325199
-- KL divergence (B, A): 0.13198433
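The overrides file passed via `-or` above maps tensor-name glob patterns to donor models, as echoed at the top of the log. I don't remember the exact schema off-hand, so the key names below (`sources`, `model_dir`, `patterns`) are hypothetical; only the donor paths and patterns are the real ones from the log:

```yaml
# Hypothetical sketch of glm-4.6-overrides16.yaml; key names are illustrative,
# the donor models and glob patterns are those echoed in the log above.
sources:
  - model_dir: ~/AI/local_models_exl/glm-4.6-exl3-6bpw
    patterns:
      - model.layers.*.self_attn.q_proj.*
      - model.layers.90.*.down_proj
      - model.layers.91.*.o_proj
      - model.layers.91.*.down_proj
  - model_dir: ~/AI/local_models_exl/glm-4.6-exl3-8bpw
    patterns:
      - model.layers.*.self_attn.k_proj.*
      - model.layers.*.self_attn.v_proj.*
      - model.layers.*.self_attn.o_proj.*
      - lm_head.*
      - model.layers.*.mlp.gate.*
      - model.layers.*.mlp.shared_experts.*
      - model.layers.*.input_layernorm.*
      - model.layers.*.post_attention_layernorm.*
      - model.layers.0.*
      - model.layers.1.*
      - model.layers.2.*
  - model_dir: ~/AI/local_models/GLM-4.6
    patterns:
      - model.embed_tokens.*
      - model.norm.*
```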
The manual tuning was chosen from heuristics detailed in my model card, https://huggingface.co/mratsim/glm-4.6-exl3#quantization-theory-and-heuristics-for-manual-tuning:
- Down-projections can have spikes 2 orders of magnitude larger than the rest, especially in the second layer and the last 2 layers of a model (hence at least +2 bits needed)
- Dense FFN layers should be kept unquantized as they have a very large impact
- Self-attention has a large impact
- Cross-attention (I assume the router/gate?) has some impact
References
- *Why Do Some Inputs Break Low-Bit LLM Quantization?* (2025), Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia. https://arxiv.org/pdf/2506.12044
- *Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark* (2024), Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen. https://arxiv.org/pdf/2406.08155v1
- *Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness* (2023), Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. https://arxiv.org/pdf/2310.02410
- *Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models* (2025), Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello. https://arxiv.org/pdf/2504.21553
- *Systematic Outliers in Large Language Models* (2025), Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang. https://arxiv.org/pdf/2502.06415v2
Discussion and hypotheses
I suspect one of the two following reasons (or both) is at play:
- The optimization algorithm has no backtracking, i.e. it is single-pass and never weighs the current layer's importance against that of earlier layers.
- The optimization algorithm does not take synergies into account. Just like LLMs show emergent properties with scale, up-quantizing certain projections together may significantly improve KL-divergence even when each individual up-quant looks like noise in isolation. I've noticed that up-quantizing down-projections to 6-bit can let a 4-bit quant behave almost like a 6bpw one despite sitting at 4.85bpw, IIRC.
1. The optimization algorithm has no backtracking
The current optimization algorithm is the following (exllamav3/exllamav3/conversion/optimize_model.py, lines 65 to 100 at 2fc131e):

```python
def optimize(meas, base_numel, base_cost, target_cost, base_kld, num_q):
    groups = meas["groups"]
    num_groups = len(groups)
    solution = [0] * num_groups
    budget = target_cost - base_cost

    def adjust(dkld):
        if dkld > 0:
            return dkld
        return -((-dkld) ** 0.69)

    print(" -- Optimizing...")
    while True:
        best = None
        best_r = 0.0
        for i, g in enumerate(groups):
            cand = g["candidates"]
            s = solution[i]
            for j, c in enumerate(cand):
                if j < s or j >= num_q: continue
                dk = adjust(c["dkld"])
                db = c["dbits"]
                if s > 0:
                    dk -= adjust(cand[s - 1]["dkld"])
                    db -= cand[s - 1]["dbits"]
                r = 1e10 * dk / (db + 1)
                if r < best_r and budget > db:
                    best = i, j, db
                    best_r = r
        if best is None:
            break
        i, j, k = best
        solution[i] = j + 1
        budget -= k
    return solution
```
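Incidentally, this code reveals enough of the measurement schema to eyeball which groups buy the most KL improvement per extra bit. A small sketch under those assumptions (the `key` field and the file name are guesses on my part):

```python
import json

# Schema assumed from optimize() above: meas["groups"][i]["candidates"][j]
# carries "dbits" (extra bit cost) and "dkld" (KL-divergence delta); the
# "key" field naming each tensor group is an assumption, as is the file name.
with open("measurement.json") as f:
    meas = json.load(f)

rows = []
for g in meas["groups"]:
    for c in g["candidates"]:
        if c["dbits"] > 0:
            rows.append((c["dkld"] / c["dbits"], g.get("key", "?"), c["dbits"]))

# Most negative dkld per extra bit first, i.e. the best value for the budget
for ratio, key, dbits in sorted(rows)[:20]:
    print(f"{key:60} +{dbits:>12} bits, dkld/bit = {ratio:.3e}")
```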
The algorithm proceeds in a single forward pass, layer by layer, so it may happen that at layer 90 it still has plenty of budget left that would have been better spent in layer 1.
It might be an improvement to replace it with a technique from operations research or constraint programming (a sketch follows the links below), for example:
- https://en.wikipedia.org/wiki/Constrained_optimization
- https://en.wikipedia.org/wiki/Combinatorial_optimization
- https://en.wikipedia.org/wiki/Constraint_programming
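Concretely, the selection problem is a multiple-choice knapsack: pick at most one up-quant candidate per group to minimize the total (adjusted) KL delta under a global bit budget. Here is a rough dynamic-programming sketch over a discretized budget, reusing the `meas["groups"]` schema and the `adjust()` heuristic from the code above; it illustrates the idea and is not a drop-in replacement (pure Python, and `step` would need tuning to keep the DP table tractable):

```python
import math

def optimize_mckp(groups, budget_bits, num_q, step=1 << 20):
    """Multiple-choice knapsack sketch: pick at most one candidate per group,
    minimizing the total adjusted dkld under a global dbits budget.
    Unlike the greedy single pass, the DP is exact for the discretized budget.
    `step` trades budget resolution for DP size (~budget_bits / step buckets)."""
    def adjust(dkld):  # same heuristic as optimize() above
        return dkld if dkld > 0 else -((-dkld) ** 0.69)

    n = int(budget_bits // step) + 1
    INF = float("inf")
    dp = [0.0] + [INF] * (n - 1)  # dp[b]: best dkld sum spending exactly b buckets
    picks = []                    # picks[i][b]: candidate (+1) chosen for group i
    for g in groups:
        new_dp = list(dp)         # default transition: keep the baseline quant
        pick = [0] * n
        for j, c in enumerate(g["candidates"][:num_q]):
            w = math.ceil(c["dbits"] / step)  # round cost up so the budget holds
            v = adjust(c["dkld"])
            for b in range(n - 1, w - 1, -1):
                if dp[b - w] + v < new_dp[b]:
                    new_dp[b] = dp[b - w] + v
                    pick[b] = j + 1
        dp = new_dp
        picks.append(pick)
    # Walk back from the best final bucket to recover the per-group choices
    b = min(range(n), key=lambda i: dp[i])
    solution = [0] * len(groups)
    for i in range(len(groups) - 1, -1, -1):
        j = picks[i][b]
        solution[i] = j
        if j > 0:
            b -= math.ceil(groups[i]["candidates"][j - 1]["dbits"] / step)
    return solution
```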
2. Synergies
Unfortunately I don't know of any way to surface synergies (for example, raising all down-projections by 1 bit for a better outlier signal) from the measurements.json file, so we may want to leave that unaddressed and mention it as a limitation.
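One empirical way to probe for synergies, at the cost of extra measurement passes, would be to compare the measured KL delta of a joint up-quant against the sum of the individual deltas: a super-additive improvement indicates a synergy. A sketch of the idea, where `measure_kld` is a hypothetical callback that re-runs measurement with a given per-group bit assignment (nothing like it exists in exllamav3 as far as I know):

```python
def synergy(measure_kld, base, tweak_a, tweak_b):
    """Returns > 0 when the two up-quants together improve KL-divergence more
    than the sum of their individual effects (a super-additive synergy).
    `measure_kld` is a hypothetical callback: assignment dict -> KL-divergence.
    `base`, `tweak_a`, `tweak_b` map tensor-group keys to bit widths."""
    kld_base = measure_kld(base)
    d_a = measure_kld({**base, **tweak_a}) - kld_base
    d_b = measure_kld({**base, **tweak_b}) - kld_base
    d_ab = measure_kld({**base, **tweak_a, **tweak_b}) - kld_base
    return (d_a + d_b) - d_ab  # individual sum minus joint effect
```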
Overfitting
The last part I want to address is overfitting.
Quantization gets its calibration state from get_default_calibration, see exllamav3/exllamav3/conversion/convert_model.py, lines 228 to 239 at 2fc131e:
```python
def prepare_state(args, job_state, config, model, tokenizer):
    idx = job_state["next_module_idx"]
    if idx == 0:
        print(f" -- Preparing input state")
        state = get_default_calibration(args, tokenizer)
    else:
        if idx < len(model.modules):
            print(f" -- Resuming at: {model.modules[idx].key}")
        else:
            print(f" -- Resuming after: {model.modules[idx - 1].key}")
        state = load_tensor("ckpt/state.safetensors", args)
    return state
```
Measurement is done using the same calibration data (exllamav3/exllamav3/conversion/measure_model.py, lines 72 to 75 at 2fc131e):
```python
def prepare_state(args, job_state, config, model, tokenizer):
    print(f" -- Preparing input state")
    state = get_default_calibration(args, tokenizer)
    return state[:args["cal_rows"]]
```
And model_diff uses the wikitext dataset, which is also part of the calibration data (lines 33 to 40 at 2fc131e):
```python
@disk_lru_cache("get_dataset_text")
def get_dataset_text(spec: dict):
    assert spec["dataset"] == "wiki2", "Only wiki2 implemented atm"
    dataset_text = "\n\n".join(
        load_dataset("wikitext", "wikitext-2-raw-v1", split = "test")
        ["text"]
    )
    return dataset_text
```
This overfitting can explain why quantized models show lower perplexity in model_diff than the BF16 model.
Ideally there would be a hold-out dataset, never used for quantization, reserved for measurements and model_diff. One can argue that if we then pick our mixed quantization based on that hold-out set we would overfit to it as well, so model_diff would need yet another validation set.
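As a minimal sketch of the first step: evaluate on a split that calibration never touches. Assuming calibration and measurement only draw on the wikitext-2 test split quoted above, its validation split (or a different corpus entirely) could serve as the hold-out:

```python
from datasets import load_dataset

def get_holdout_text() -> str:
    """Hypothetical hold-out text for model_diff-style evaluation.
    Assumes calibration/measurement only use the wikitext-2 *test* split;
    if they draw on more than that, a different corpus would be safer."""
    rows = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
    return "\n\n".join(rows["text"])
```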