ggml-org
diff --git a/‎default.smarterquant.json‎
Lines changed: 4 additions & 196 deletions b/‎default.smarterquant.json‎
Lines changed: 4 additions & 196 deletions
diff --git a/‎docs/smarterquant.md‎
Lines changed: 66 additions & 0 deletions b/‎docs/smarterquant.md‎
Lines changed: 66 additions & 0 deletions
diff --git a/‎gguf-py/examples/writer.py‎
Lines changed: 34 additions & 22 deletions b/‎gguf-py/examples/writer.py‎
Lines changed: 34 additions & 22 deletions
diff --git a/‎src/llama-model-loader.cpp‎
Lines changed: 74 additions & 0 deletions b/‎src/llama-model-loader.cpp‎
Lines changed: 74 additions & 0 deletions
diff --git a/‎src/llama-model.cpp‎
Lines changed: 6 additions & 0 deletions b/‎src/llama-model.cpp‎
Lines changed: 6 additions & 0 deletions
diff --git a/‎src/llama-model.h‎
Lines changed: 7 additions & 0 deletions b/‎src/llama-model.h‎
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,66 @@
+# SmarterQuant Configuration File (`default.smarterquant.json`)
+
+The `default.smarterquant.json` file allows for fine-grained control over the quantization process for specific tensors within a model when using `llama-quantize`. It also provides the necessary information for the inference engine (e.g., `llama-cli`, `llama-server`) to correctly dequantize and use these custom-quantized tensors.
+
+## File Location
+
+When running `llama-quantize` or any inference executable (`llama-cli`, `llama-server`), this JSON file is expected to be present in the **current working directory**.
+
+## Format
+
+The file must be a valid JSON object. Each key in this top-level object is the exact name of a tensor in the model (e.g., `"blk.0.attn_q.weight"`). The value associated with each tensor name is a JSON array containing exactly two elements:
+
+1.  **Compression Types Array (Required for SmarterQuant processing):**
+    *   A JSON array of exactly four integers.
+    *   These integers correspond to `ggml_type` enum values (e.g., `0` for `GGML_TYPE_F32`, `8` for `GGML_TYPE_Q4_0`, `14` for `GGML_TYPE_Q4_K_M`, etc. Refer to `ggml.h` for the full list of `ggml_type` enums).
+    *   The four integers specify the quantization type to be used for the first four 256-column-wide blocks of the tensor, respectively.
+        *   `compression_types[0]`: For columns 0-255.
+        *   `compression_types[1]`: For columns 256-511.
+        *   `compression_types[2]`: For columns 512-767.
+        *   `compression_types[3]`: For columns 768-1023.
+    *   All subsequent blocks (from column 1024 onwards) will also use the type specified by `compression_types[3]`. This type will also be stored as the main GGUF tensor type.
+    *   If this array is empty or not an array of 4 integers, SmarterQuant block-specific quantization will not be applied for this tensor, even if other settings are present.
+
+2.  **Column Permutation Array (Optional):**
+    *   A JSON array of integers.
+    *   If non-empty, this array defines how the columns of the original tensor should be reordered *before* any quantization (including the block-specific quantization above) is applied.
+    *   The length of this array *must* exactly match the number of columns of the tensor (i.e., `tensor->ne[0]`).
+    *   The values in the array must be unique integers from `0` to `C-1` (where `C` is the number of columns), representing the original column index.
+    *   The new layout will be such that `new_column[j]` takes its data from `original_column[permutation_array[j]]`.
+    *   If this array is empty (`[]`), no column permutation is applied.
+
+## Example
+
+```json
+{
+    "blk.0.attn_q.weight": [
+        [8, 9, 12, 13],  // ggml_type for block 0, 1, 2, 3. Block 4+ uses type 13.
+                           // (e.g., 8 could be GGML_TYPE_Q4_0, 9 GGML_TYPE_Q4_1, etc.)
+        [ /* Large array of column indices, e.g., 0, 2, 1, 5, 4, ... up to tensor_ne0-1 */ ]
+    ],
+    "blk.1.ffn_down.weight": [
+        [14, 14, 14, 14],
+        []
+    ],
+    "output.weight": [
+        [2, 2, 2, 2],  // Example: Quantize first four blocks as Q8_0 (assuming 2 maps to Q8_0 in ggml.h)
+        []             // No permutation
+    ]
+}
+```
+
+In this example:
+-   `blk.0.attn_q.weight`: Will have its columns permuted according to the provided list. Its first 256 columns (after permutation) will be quantized with `ggml_type` 8, the next with type 9, then 12, then 13. Subsequent blocks will also use type 13.
+-   `blk.1.ffn_down.weight`: Will not have its columns permuted. All its blocks (first four and subsequent) will be quantized with `ggml_type` 14.
+-   `output.weight`: Will not be permuted. All its blocks will be quantized as `ggml_type` 2.
+
+## GGUF Metadata
+
+When `llama-quantize` processes a tensor using instructions from `default.smarterquant.json`, it stores the applied configuration in the GGUF file's metadata for that tensor. This allows the inference engine to correctly dequantize and use the tensor. The following keys are used:
+
+-   `tensor_name.smarterquant.enabled` (boolean): `true` if SmarterQuant processing was applied.
+-   `tensor_name.smarterquant.permutation` (string): A JSON string representation of the column permutation array used (e.g., `"[3,0,1,2]"`).
+-   `tensor_name.smarterquant.block_types` (string): A JSON string representation of the four compression types used for the initial blocks (e.g., `"[8,9,12,13]"`).
+
+The inference engine will prioritize GGUF metadata. If `default.smarterquant.json` is also present during inference, it's primarily used to get the *original* permutation and block type details if they were not perfectly reconstructible from GGUF metadata alone (though the current implementation aims to store them completely in GGUF).
+```
@@ -7,33 +7,45 @@
 # Necessary to load the local gguf package
 sys.path.insert(0, str(Path(__file__).parent.parent))
 
-from gguf import GGUFWriter  # noqa: E402
-
-
-# Example usage:
-def writer_example() -> None:
-    # Example usage with a file
-    gguf_writer = GGUFWriter("example.gguf", "llama")
-
-    gguf_writer.add_block_count(12)
-    gguf_writer.add_uint32("answer", 42)  # Write a 32-bit integer
-    gguf_writer.add_float32("answer_in_float", 42.0)  # Write a 32-bit float
-    gguf_writer.add_custom_alignment(64)
-
-    tensor1 = np.ones((32,), dtype=np.float32) * 100.0
-    tensor2 = np.ones((64,), dtype=np.float32) * 101.0
-    tensor3 = np.ones((96,), dtype=np.float32) * 102.0
-
-    gguf_writer.add_tensor("tensor1", tensor1)
-    gguf_writer.add_tensor("tensor2", tensor2)
-    gguf_writer.add_tensor("tensor3", tensor3)
+from gguf import GGUFWriter, GGMLQuantizationType  # noqa: E402
+
+
+# Create a tiny GGUF model for testing SmarterQuant
+def create_tiny_model_for_sq_test() -> None:
+    # Output file will be in the root directory for easy access by llama-quantize
+    gguf_writer = GGUFWriter("../../tiny_model.gguf", "llama") # arch is set here
+
+    # Minimal metadata
+    gguf_writer.add_block_count(1) # This should represent layer count for llama arch
+    gguf_writer.add_context_length(128) # Dummy
+    embedding_length = 512
+    head_count = 1
+    gguf_writer.add_embedding_length(embedding_length)
+    gguf_writer.add_feed_forward_length(1024) # Dummy
+    gguf_writer.add_head_count(head_count)
+    gguf_writer.add_head_count_kv(1) # Dummy
+    gguf_writer.add_rope_dimension_count(embedding_length // head_count)
+    gguf_writer.add_layer_norm_rms_eps(1e-5) # Required for llama arch
+    gguf_writer.add_file_type(1) # F16 == 1 (GGML_FTYPE_MOSTLY_F16)
+
+    # Tensor to be targeted by SmarterQuant
+    # Dimensions: 4 rows, 512 columns.
+    # 512 columns = two 256-column blocks.
+    tensor_data_sq = np.random.rand(4, 512).astype(np.float32)
+    gguf_writer.add_tensor("blk.0.attn_q.weight", tensor_data_sq)
+
+    # Another dummy tensor
+    other_tensor_data = np.random.rand(4, 256).astype(np.float32)
+    gguf_writer.add_tensor("blk.0.ffn_down.weight", other_tensor_data)
+
+    gguf_writer.add_uint32("answer", 42) # Dummy KV pair
 
     gguf_writer.write_header_to_file()
     gguf_writer.write_kv_data_to_file()
     gguf_writer.write_tensors_to_file()
 
     gguf_writer.close()
-
+    print("Created ../../tiny_model.gguf")
 
 if __name__ == '__main__':
-    writer_example()
+    create_tiny_model_for_sq_test()
@@ -1,6 +1,8 @@
 #include "llama-model-loader.h"
 
 #include "ggml.h"
+#include "json.hpp" // For nlohmann::json
+#include "llama-quant.h" // For SmarterQuantTensorInfo, SmarterQuantConfig
 
 #include <array>
 #include <cinttypes>
@@ -487,6 +489,43 @@ llama_model_loader::llama_model_loader(
         n_elements += ggml_nelements(cur);
         n_bytes    += ggml_nbytes(cur);
         weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), 0, meta.get(), cur));
+
+        // Load SmarterQuant metadata for this tensor if present in GGUF
+        // This will augment or override what was loaded from default.smarterquant.json
+        std::string key_sq_enabled_gguf = tensor_name + ".smarterquant.enabled";
+        int key_idx_gguf = gguf_find_key(meta.get(), key_sq_enabled_gguf.c_str());
+        if (key_idx_gguf != -1 && gguf_get_kv_type(meta.get(), key_idx_gguf) == GGUF_TYPE_BOOL) {
+            bool sq_enabled_gguf_val = gguf_get_val_bool(meta.get(), key_idx_gguf);
+            if (sq_enabled_gguf_val) {
+                LLAMA_LOG_INFO("%s: Tensor '%s' has SmarterQuant GGUF metadata.\n", __func__, tensor_name.c_str());
+                SmarterQuantTensorInfo sq_info; // Local temporary holder
+                sq_info.enabled = true;
+
+                std::string key_sq_perm_gguf = tensor_name + ".smarterquant.permutation";
+                key_idx_gguf = gguf_find_key(meta.get(), key_sq_perm_gguf.c_str());
+                if (key_idx_gguf != -1 && gguf_get_kv_type(meta.get(), key_idx_gguf) == GGUF_TYPE_STRING) {
+                    const char * perm_str_c = gguf_get_val_str(meta.get(), key_idx_gguf);
+                    try {
+                        nlohmann::json perm_json = nlohmann::json::parse(perm_str_c);
+                        if (perm_json.is_array()) {
+                            sq_info.column_permutation = perm_json.get<std::vector<int>>();
+                        } else { LLAMA_LOG_WARN("%s: GGUF perm metadata for '%s' not an array.\n", __func__, tensor_name.c_str()); }
+                    } catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF perm for '%s': %s\n", __func__, tensor_name.c_str(), e.what()); }
+                }
+
+                std::string key_sq_block_types_gguf = tensor_name + ".smarterquant.block_types";
+                key_idx_gguf = gguf_find_key(meta.get(), key_sq_block_types_gguf.c_str());
+                if (key_idx_gguf != -1 && gguf_get_kv_type(meta.get(), key_idx_gguf) == GGUF_TYPE_STRING) {
+                    const char * types_str_c = gguf_get_val_str(meta.get(), key_idx_gguf);
+                    try {
+                        nlohmann::json types_json = nlohmann::json::parse(types_str_c);
+                        if (types_json.is_array() && types_json.size() == 4) {
+                            sq_info.compression_types = types_json.get<std::vector<int8_t>>();
+                        } else { LLAMA_LOG_WARN("%s: GGUF block_types metadata for '%s' not an array of 4.\n", __func__, tensor_name.c_str()); }
+                    } catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF block_types for '%s': %s\n", __func__, tensor_name.c_str(), e.what()); }
+                }
+            }
+        }
     }
     uint16_t n_split = 0;
     get_key(llm_kv(LLM_KV_SPLIT_COUNT), n_split, false);
@@ -553,6 +592,41 @@ llama_model_loader::llama_model_loader(
                 n_elements += ggml_nelements(cur);
                 n_bytes    += ggml_nbytes(cur);
                 weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), idx, ctx_gguf.get(), cur));
+                // Load SmarterQuant metadata for this tensor if present in GGUF (for split tensors)
+                // Use tensor_name, as split_tensor_name was not defined and likely referred to the current tensor's name.
+                std::string key_sq_enabled_gguf_split = tensor_name + ".smarterquant.enabled";
+                int key_idx_gguf_split = gguf_find_key(ctx_gguf.get(), key_sq_enabled_gguf_split.c_str());
+                if (key_idx_gguf_split != -1 && gguf_get_kv_type(ctx_gguf.get(), key_idx_gguf_split) == GGUF_TYPE_BOOL) {
+                    bool sq_enabled_gguf_val_split = gguf_get_val_bool(ctx_gguf.get(), key_idx_gguf_split);
+                     if (sq_enabled_gguf_val_split) {
+                        LLAMA_LOG_INFO("%s: Tensor '%s' (split %d) has SmarterQuant GGUF metadata.\n", __func__, tensor_name.c_str(), idx);
+                        SmarterQuantTensorInfo sq_info; // Local temporary holder
+                        sq_info.enabled = true;
+
+                        std::string key_sq_perm_gguf_split = tensor_name + ".smarterquant.permutation";
+                        key_idx_gguf_split = gguf_find_key(ctx_gguf.get(), key_sq_perm_gguf_split.c_str());
+                        if (key_idx_gguf_split != -1 && gguf_get_kv_type(ctx_gguf.get(), key_idx_gguf_split) == GGUF_TYPE_STRING) {
+                            const char * perm_str_c = gguf_get_val_str(ctx_gguf.get(), key_idx_gguf_split);
+                            try {
+                                nlohmann::json perm_json = nlohmann::json::parse(perm_str_c);
+                                if (perm_json.is_array()) {
+                                    sq_info.column_permutation = perm_json.get<std::vector<int>>();
+                                } else { LLAMA_LOG_WARN("%s: GGUF perm metadata for '%s' (split %d) not an array.\n", __func__, tensor_name.c_str(), idx); }
+                            } catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF perm for '%s' (split %d): %s\n", __func__, tensor_name.c_str(), idx, e.what()); }
+                        }
+                        std::string key_sq_block_types_gguf_split = tensor_name + ".smarterquant.block_types";
+                        key_idx_gguf_split = gguf_find_key(ctx_gguf.get(), key_sq_block_types_gguf_split.c_str());
+                        if (key_idx_gguf_split != -1 && gguf_get_kv_type(ctx_gguf.get(), key_idx_gguf_split) == GGUF_TYPE_STRING) {
+                            const char * types_str_c = gguf_get_val_str(ctx_gguf.get(), key_idx_gguf_split);
+                            try {
+                                nlohmann::json types_json = nlohmann::json::parse(types_str_c);
+                                if (types_json.is_array() && types_json.size() == 4) {
+                                    sq_info.compression_types = types_json.get<std::vector<int8_t>>();
+                                } else { LLAMA_LOG_WARN("%s: GGUF block_types for '%s' (split %d) not an array of 4.\n", __func__, tensor_name.c_str(), idx); }
+                            } catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF block_types for '%s' (split %d): %s\n", __func__, tensor_name.c_str(), idx, e.what()); }
+                        }
+                    }
+                }
             }
         }
 
 
@@ -1,5 +1,9 @@
 #include "llama-model.h"
 
+#include "json.hpp" // For nlohmann::json - common/ is in include path
+#include <fstream>          // For std::ifstream
+#include <stdexcept>        // For std::runtime_error
+
 #include "llama-impl.h"
 #include "llama-mmap.h"
 #include "llama-batch.h"
@@ -12,6 +16,8 @@
 #include <algorithm>
 #include <cassert>
 #include <cmath>
+#include <fstream> // For std::ifstream
+#include <stdexcept> // For std::runtime_error
 #include <cfloat>
 #include <cstring>
 #include <cmath>
 
@@ -12,6 +12,9 @@
 #include <unordered_map>
 #include <vector>
 
+#include "json.hpp" // For SmarterQuantConfig parsing (nlohmann::json) - common/ is in include path
+#include "llama-quant.h" // For SmarterQuantConfig definition
+
 struct llama_cparams;
 struct llama_ubatch;
 struct llama_model_loader;
@@ -350,6 +353,10 @@ struct llama_model {
     // for quantize-stats only
     std::vector<std::pair<std::string, struct ggml_tensor *>> tensors_by_name;
 
+    // SmarterQuant configuration loaded from default.smarterquant.json (parsed during model load)
+    // And per-tensor metadata read from GGUF.
+    SmarterQuantConfig sq_config;
+
     int64_t t_load_us  = 0;
     int64_t t_start_us = 0;