Skip to content

Commit 455fd47

Browse files
Feat/sq packing p1 (#5)
* feat: Add infrastructure for SmarterQuant custom quantization Implements the following: - Parsing of `default.smarterquant.json` in llama-quantize and during model loading to define custom quantization strategies for tensors. - Defines `SmarterQuantTensorInfo` and `SmarterQuantConfigMap` to hold this configuration. - Modifies `llama-quantize` to: - Apply column permutation based on the configuration (data manipulation placeholder). - Select block-specific compression types for the first four 256-column blocks (quantization placeholder). - Write SmarterQuant configuration (permutation, block types, enabled flag) as GGUF metadata. - Modifies model loading to read this GGUF metadata and populate the `llama_model` with the SmarterQuant configuration. - Adds documentation for the `default.smarterquant.json` format. Note: The core logic for packing custom-quantized blocks in `llama-quantize` and for dequantizing/unpermuting these blocks in `ggml.c` (inference) is not yet implemented. Models quantized with these features will not run correctly until that is complete. * docs: Add todo.txt outlining remaining SmarterQuant work This file details the pending implementation tasks for the core quantization and dequantization logic required to make the SmarterQuant feature fully functional. * Update todo.txt * Implement SmarterQuant custom block quantization packing - Add llama_tensor_quantize_smarter_blocks to handle per-segment quantization based on SmarterQuantTensorInfo. - Integrate this into llama_model_quantize_impl. - Ensure imatrix is permuted along with f32_data before quantization. - Correct GGUF metadata handling for SmarterQuant tensors (base type set to compression_types[3]). - Numerous compilation fixes related to includes, type definitions, and removal of old C-style SmarterQuant parsing code. - Initial verification confirms correct packed data size calculation for the new SmarterQuant path. --------- Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
1 parent 4f9bbc4 commit 455fd47

File tree

10 files changed

+769
-351
lines changed

10 files changed

+769
-351
lines changed

default.smarterquant.json

Lines changed: 4 additions & 196 deletions
Large diffs are not rendered by default.

docs/smarterquant.md

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
# SmarterQuant Configuration File (`default.smarterquant.json`)
2+
3+
The `default.smarterquant.json` file allows for fine-grained control over the quantization process for specific tensors within a model when using `llama-quantize`. It also provides the necessary information for the inference engine (e.g., `llama-cli`, `llama-server`) to correctly dequantize and use these custom-quantized tensors.
4+
5+
## File Location
6+
7+
When running `llama-quantize` or any inference executable (`llama-cli`, `llama-server`), this JSON file is expected to be present in the **current working directory**.
8+
9+
## Format
10+
11+
The file must be a valid JSON object. Each key in this top-level object is the exact name of a tensor in the model (e.g., `"blk.0.attn_q.weight"`). The value associated with each tensor name is a JSON array containing exactly two elements:
12+
13+
1. **Compression Types Array (Required for SmarterQuant processing):**
14+
* A JSON array of exactly four integers.
15+
* These integers correspond to `ggml_type` enum values (e.g., `0` for `GGML_TYPE_F32`, `8` for `GGML_TYPE_Q4_0`, `14` for `GGML_TYPE_Q4_K_M`, etc. Refer to `ggml.h` for the full list of `ggml_type` enums).
16+
* The four integers specify the quantization type to be used for the first four 256-column-wide blocks of the tensor, respectively.
17+
* `compression_types[0]`: For columns 0-255.
18+
* `compression_types[1]`: For columns 256-511.
19+
* `compression_types[2]`: For columns 512-767.
20+
* `compression_types[3]`: For columns 768-1023.
21+
* All subsequent blocks (from column 1024 onwards) will also use the type specified by `compression_types[3]`. This type will also be stored as the main GGUF tensor type.
22+
* If this array is empty or not an array of 4 integers, SmarterQuant block-specific quantization will not be applied for this tensor, even if other settings are present.
23+
24+
2. **Column Permutation Array (Optional):**
25+
* A JSON array of integers.
26+
* If non-empty, this array defines how the columns of the original tensor should be reordered *before* any quantization (including the block-specific quantization above) is applied.
27+
* The length of this array *must* exactly match the number of columns of the tensor (i.e., `tensor->ne[0]`).
28+
* The values in the array must be unique integers from `0` to `C-1` (where `C` is the number of columns), representing the original column index.
29+
* The new layout will be such that `new_column[j]` takes its data from `original_column[permutation_array[j]]`.
30+
* If this array is empty (`[]`), no column permutation is applied.
31+
32+
## Example
33+
34+
```json
35+
{
36+
"blk.0.attn_q.weight": [
37+
[8, 9, 12, 13], // ggml_type for block 0, 1, 2, 3. Block 4+ uses type 13.
38+
// (e.g., 8 could be GGML_TYPE_Q4_0, 9 GGML_TYPE_Q4_1, etc.)
39+
[ /* Large array of column indices, e.g., 0, 2, 1, 5, 4, ... up to tensor_ne0-1 */ ]
40+
],
41+
"blk.1.ffn_down.weight": [
42+
[14, 14, 14, 14],
43+
[]
44+
],
45+
"output.weight": [
46+
[2, 2, 2, 2], // Example: Quantize first four blocks as Q8_0 (assuming 2 maps to Q8_0 in ggml.h)
47+
[] // No permutation
48+
]
49+
}
50+
```
51+
52+
In this example:
53+
- `blk.0.attn_q.weight`: Will have its columns permuted according to the provided list. Its first 256 columns (after permutation) will be quantized with `ggml_type` 8, the next with type 9, then 12, then 13. Subsequent blocks will also use type 13.
54+
- `blk.1.ffn_down.weight`: Will not have its columns permuted. All its blocks (first four and subsequent) will be quantized with `ggml_type` 14.
55+
- `output.weight`: Will not be permuted. All its blocks will be quantized as `ggml_type` 2.
56+
57+
## GGUF Metadata
58+
59+
When `llama-quantize` processes a tensor using instructions from `default.smarterquant.json`, it stores the applied configuration in the GGUF file's metadata for that tensor. This allows the inference engine to correctly dequantize and use the tensor. The following keys are used:
60+
61+
- `tensor_name.smarterquant.enabled` (boolean): `true` if SmarterQuant processing was applied.
62+
- `tensor_name.smarterquant.permutation` (string): A JSON string representation of the column permutation array used (e.g., `"[3,0,1,2]"`).
63+
- `tensor_name.smarterquant.block_types` (string): A JSON string representation of the four compression types used for the initial blocks (e.g., `"[8,9,12,13]"`).
64+
65+
The inference engine will prioritize GGUF metadata. If `default.smarterquant.json` is also present during inference, it's primarily used to get the *original* permutation and block type details if they were not perfectly reconstructible from GGUF metadata alone (though the current implementation aims to store them completely in GGUF).
66+
```

gguf-py/examples/writer.py

Lines changed: 34 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -7,33 +7,45 @@
77
# Necessary to load the local gguf package
88
sys.path.insert(0, str(Path(__file__).parent.parent))
99

10-
from gguf import GGUFWriter # noqa: E402
11-
12-
13-
# Example usage:
14-
def writer_example() -> None:
15-
# Example usage with a file
16-
gguf_writer = GGUFWriter("example.gguf", "llama")
17-
18-
gguf_writer.add_block_count(12)
19-
gguf_writer.add_uint32("answer", 42) # Write a 32-bit integer
20-
gguf_writer.add_float32("answer_in_float", 42.0) # Write a 32-bit float
21-
gguf_writer.add_custom_alignment(64)
22-
23-
tensor1 = np.ones((32,), dtype=np.float32) * 100.0
24-
tensor2 = np.ones((64,), dtype=np.float32) * 101.0
25-
tensor3 = np.ones((96,), dtype=np.float32) * 102.0
26-
27-
gguf_writer.add_tensor("tensor1", tensor1)
28-
gguf_writer.add_tensor("tensor2", tensor2)
29-
gguf_writer.add_tensor("tensor3", tensor3)
10+
from gguf import GGUFWriter, GGMLQuantizationType # noqa: E402
11+
12+
13+
# Create a tiny GGUF model for testing SmarterQuant
14+
def create_tiny_model_for_sq_test() -> None:
15+
# Output file will be in the root directory for easy access by llama-quantize
16+
gguf_writer = GGUFWriter("../../tiny_model.gguf", "llama") # arch is set here
17+
18+
# Minimal metadata
19+
gguf_writer.add_block_count(1) # This should represent layer count for llama arch
20+
gguf_writer.add_context_length(128) # Dummy
21+
embedding_length = 512
22+
head_count = 1
23+
gguf_writer.add_embedding_length(embedding_length)
24+
gguf_writer.add_feed_forward_length(1024) # Dummy
25+
gguf_writer.add_head_count(head_count)
26+
gguf_writer.add_head_count_kv(1) # Dummy
27+
gguf_writer.add_rope_dimension_count(embedding_length // head_count)
28+
gguf_writer.add_layer_norm_rms_eps(1e-5) # Required for llama arch
29+
gguf_writer.add_file_type(1) # F16 == 1 (GGML_FTYPE_MOSTLY_F16)
30+
31+
# Tensor to be targeted by SmarterQuant
32+
# Dimensions: 4 rows, 512 columns.
33+
# 512 columns = two 256-column blocks.
34+
tensor_data_sq = np.random.rand(4, 512).astype(np.float32)
35+
gguf_writer.add_tensor("blk.0.attn_q.weight", tensor_data_sq)
36+
37+
# Another dummy tensor
38+
other_tensor_data = np.random.rand(4, 256).astype(np.float32)
39+
gguf_writer.add_tensor("blk.0.ffn_down.weight", other_tensor_data)
40+
41+
gguf_writer.add_uint32("answer", 42) # Dummy KV pair
3042

3143
gguf_writer.write_header_to_file()
3244
gguf_writer.write_kv_data_to_file()
3345
gguf_writer.write_tensors_to_file()
3446

3547
gguf_writer.close()
36-
48+
print("Created ../../tiny_model.gguf")
3749

3850
if __name__ == '__main__':
39-
writer_example()
51+
create_tiny_model_for_sq_test()

src/llama-model-loader.cpp

Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
#include "llama-model-loader.h"
22

33
#include "ggml.h"
4+
#include "json.hpp" // For nlohmann::json
5+
#include "llama-quant.h" // For SmarterQuantTensorInfo, SmarterQuantConfig
46

57
#include <array>
68
#include <cinttypes>
@@ -487,6 +489,43 @@ llama_model_loader::llama_model_loader(
487489
n_elements += ggml_nelements(cur);
488490
n_bytes += ggml_nbytes(cur);
489491
weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), 0, meta.get(), cur));
492+
493+
// Load SmarterQuant metadata for this tensor if present in GGUF
494+
// This will augment or override what was loaded from default.smarterquant.json
495+
std::string key_sq_enabled_gguf = tensor_name + ".smarterquant.enabled";
496+
int key_idx_gguf = gguf_find_key(meta.get(), key_sq_enabled_gguf.c_str());
497+
if (key_idx_gguf != -1 && gguf_get_kv_type(meta.get(), key_idx_gguf) == GGUF_TYPE_BOOL) {
498+
bool sq_enabled_gguf_val = gguf_get_val_bool(meta.get(), key_idx_gguf);
499+
if (sq_enabled_gguf_val) {
500+
LLAMA_LOG_INFO("%s: Tensor '%s' has SmarterQuant GGUF metadata.\n", __func__, tensor_name.c_str());
501+
SmarterQuantTensorInfo sq_info; // Local temporary holder
502+
sq_info.enabled = true;
503+
504+
std::string key_sq_perm_gguf = tensor_name + ".smarterquant.permutation";
505+
key_idx_gguf = gguf_find_key(meta.get(), key_sq_perm_gguf.c_str());
506+
if (key_idx_gguf != -1 && gguf_get_kv_type(meta.get(), key_idx_gguf) == GGUF_TYPE_STRING) {
507+
const char * perm_str_c = gguf_get_val_str(meta.get(), key_idx_gguf);
508+
try {
509+
nlohmann::json perm_json = nlohmann::json::parse(perm_str_c);
510+
if (perm_json.is_array()) {
511+
sq_info.column_permutation = perm_json.get<std::vector<int>>();
512+
} else { LLAMA_LOG_WARN("%s: GGUF perm metadata for '%s' not an array.\n", __func__, tensor_name.c_str()); }
513+
} catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF perm for '%s': %s\n", __func__, tensor_name.c_str(), e.what()); }
514+
}
515+
516+
std::string key_sq_block_types_gguf = tensor_name + ".smarterquant.block_types";
517+
key_idx_gguf = gguf_find_key(meta.get(), key_sq_block_types_gguf.c_str());
518+
if (key_idx_gguf != -1 && gguf_get_kv_type(meta.get(), key_idx_gguf) == GGUF_TYPE_STRING) {
519+
const char * types_str_c = gguf_get_val_str(meta.get(), key_idx_gguf);
520+
try {
521+
nlohmann::json types_json = nlohmann::json::parse(types_str_c);
522+
if (types_json.is_array() && types_json.size() == 4) {
523+
sq_info.compression_types = types_json.get<std::vector<int8_t>>();
524+
} else { LLAMA_LOG_WARN("%s: GGUF block_types metadata for '%s' not an array of 4.\n", __func__, tensor_name.c_str()); }
525+
} catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF block_types for '%s': %s\n", __func__, tensor_name.c_str(), e.what()); }
526+
}
527+
}
528+
}
490529
}
491530
uint16_t n_split = 0;
492531
get_key(llm_kv(LLM_KV_SPLIT_COUNT), n_split, false);
@@ -553,6 +592,41 @@ llama_model_loader::llama_model_loader(
553592
n_elements += ggml_nelements(cur);
554593
n_bytes += ggml_nbytes(cur);
555594
weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), idx, ctx_gguf.get(), cur));
595+
// Load SmarterQuant metadata for this tensor if present in GGUF (for split tensors)
596+
// Use tensor_name, as split_tensor_name was not defined and likely referred to the current tensor's name.
597+
std::string key_sq_enabled_gguf_split = tensor_name + ".smarterquant.enabled";
598+
int key_idx_gguf_split = gguf_find_key(ctx_gguf.get(), key_sq_enabled_gguf_split.c_str());
599+
if (key_idx_gguf_split != -1 && gguf_get_kv_type(ctx_gguf.get(), key_idx_gguf_split) == GGUF_TYPE_BOOL) {
600+
bool sq_enabled_gguf_val_split = gguf_get_val_bool(ctx_gguf.get(), key_idx_gguf_split);
601+
if (sq_enabled_gguf_val_split) {
602+
LLAMA_LOG_INFO("%s: Tensor '%s' (split %d) has SmarterQuant GGUF metadata.\n", __func__, tensor_name.c_str(), idx);
603+
SmarterQuantTensorInfo sq_info; // Local temporary holder
604+
sq_info.enabled = true;
605+
606+
std::string key_sq_perm_gguf_split = tensor_name + ".smarterquant.permutation";
607+
key_idx_gguf_split = gguf_find_key(ctx_gguf.get(), key_sq_perm_gguf_split.c_str());
608+
if (key_idx_gguf_split != -1 && gguf_get_kv_type(ctx_gguf.get(), key_idx_gguf_split) == GGUF_TYPE_STRING) {
609+
const char * perm_str_c = gguf_get_val_str(ctx_gguf.get(), key_idx_gguf_split);
610+
try {
611+
nlohmann::json perm_json = nlohmann::json::parse(perm_str_c);
612+
if (perm_json.is_array()) {
613+
sq_info.column_permutation = perm_json.get<std::vector<int>>();
614+
} else { LLAMA_LOG_WARN("%s: GGUF perm metadata for '%s' (split %d) not an array.\n", __func__, tensor_name.c_str(), idx); }
615+
} catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF perm for '%s' (split %d): %s\n", __func__, tensor_name.c_str(), idx, e.what()); }
616+
}
617+
std::string key_sq_block_types_gguf_split = tensor_name + ".smarterquant.block_types";
618+
key_idx_gguf_split = gguf_find_key(ctx_gguf.get(), key_sq_block_types_gguf_split.c_str());
619+
if (key_idx_gguf_split != -1 && gguf_get_kv_type(ctx_gguf.get(), key_idx_gguf_split) == GGUF_TYPE_STRING) {
620+
const char * types_str_c = gguf_get_val_str(ctx_gguf.get(), key_idx_gguf_split);
621+
try {
622+
nlohmann::json types_json = nlohmann::json::parse(types_str_c);
623+
if (types_json.is_array() && types_json.size() == 4) {
624+
sq_info.compression_types = types_json.get<std::vector<int8_t>>();
625+
} else { LLAMA_LOG_WARN("%s: GGUF block_types for '%s' (split %d) not an array of 4.\n", __func__, tensor_name.c_str(), idx); }
626+
} catch (const std::exception& e) { LLAMA_LOG_WARN("%s: Failed to parse GGUF block_types for '%s' (split %d): %s\n", __func__, tensor_name.c_str(), idx, e.what()); }
627+
}
628+
}
629+
}
556630
}
557631
}
558632

src/llama-model.cpp

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,9 @@
11
#include "llama-model.h"
22

3+
#include "json.hpp" // For nlohmann::json - common/ is in include path
4+
#include <fstream> // For std::ifstream
5+
#include <stdexcept> // For std::runtime_error
6+
37
#include "llama-impl.h"
48
#include "llama-mmap.h"
59
#include "llama-batch.h"
@@ -12,6 +16,8 @@
1216
#include <algorithm>
1317
#include <cassert>
1418
#include <cmath>
19+
#include <fstream> // For std::ifstream
20+
#include <stdexcept> // For std::runtime_error
1521
#include <cfloat>
1622
#include <cstring>
1723
#include <cmath>

src/llama-model.h

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,9 @@
1212
#include <unordered_map>
1313
#include <vector>
1414

15+
#include "json.hpp" // For SmarterQuantConfig parsing (nlohmann::json) - common/ is in include path
16+
#include "llama-quant.h" // For SmarterQuantConfig definition
17+
1518
struct llama_cparams;
1619
struct llama_ubatch;
1720
struct llama_model_loader;
@@ -350,6 +353,10 @@ struct llama_model {
350353
// for quantize-stats only
351354
std::vector<std::pair<std::string, struct ggml_tensor *>> tensors_by_name;
352355

356+
// SmarterQuant configuration loaded from default.smarterquant.json (parsed during model load)
357+
// And per-tensor metadata read from GGUF.
358+
SmarterQuantConfig sq_config;
359+
353360
int64_t t_load_us = 0;
354361
int64_t t_start_us = 0;
355362

0 commit comments

Comments
 (0)