Conversation
708-145 (Contributor) commented Jun 26, 2025

Make sure to read the contributing guidelines before submitting a PR

708-145 and others added 10 commits April 14, 2025 14:53
* feat: Add infrastructure for SmarterQuant custom quantization

Implements the following:
- Parsing of `default.smarterquant.json` in llama-quantize and during model loading to define custom quantization strategies for tensors.
- Defines `SmarterQuantTensorInfo` and `SmarterQuantConfigMap` to hold this configuration.
- Modifies `llama-quantize` to:
    - Apply column permutation based on the configuration (data manipulation placeholder).
    - Select block-specific compression types for the first four 256-column blocks (quantization placeholder).
    - Write SmarterQuant configuration (permutation, block types, enabled flag) as GGUF metadata.
- Modifies model loading to read this GGUF metadata and populate the `llama_model` with the SmarterQuant configuration.
- Adds documentation for the `default.smarterquant.json` format.

Note: The core logic for packing custom-quantized blocks in `llama-quantize` and for dequantizing/unpermuting these blocks in `ggml.c` (inference) is not yet implemented. Models quantized with these features will not run correctly until that is complete.

* docs: Add todo.txt outlining remaining SmarterQuant work

This file details the pending implementation tasks for the core
quantization and dequantization logic required to make the
SmarterQuant feature fully functional.

* Update todo.txt

* Implement SmarterQuant custom block quantization packing

- Add llama_tensor_quantize_smarter_blocks to handle per-segment quantization based on SmarterQuantTensorInfo.
- Integrate this into llama_model_quantize_impl.
- Ensure imatrix is permuted along with f32_data before quantization.
- Correct GGUF metadata handling for SmarterQuant tensors (base type set to compression_types[3]).
- Numerous compilation fixes related to includes, type definitions, and removal of old C-style SmarterQuant parsing code.
- Initial verification confirms correct packed data size calculation for the new SmarterQuant path.

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
* Checkpoint: Refactor SmarterQuantTensorInfo and add headers

- Created C-compatible SmarterQuantTensorInfo in ggml-smarterquant-types.h
- Updated ggml.h, ggml-cpu.c, llama-quant.h, llama-quant.cpp,
  llama-model-loader.cpp, and llama-model.cpp to use the new struct.
- Added missing C++ headers and forward declarations to llama-quant.cpp
  in an attempt to resolve compilation errors.

Note: Codebase is not currently compiling due to issues in
llama-quant.cpp and an incorrect CMake build path used in the last
attempt. User will address compilation issues next.

* Fix compilation issues and implement SmarterQuant stubs

- Resolved various compilation errors in llama-quant.cpp related to includes, function definitions, and SmarterQuant logic.
- Implemented parsing for SmarterQuant JSON configuration in `load_smarter_quant_config`.
- Added a basic serial implementation for `llama_tensor_quantize_smarter_blocks`.
- Provided functional stubs for quantization helper functions within `llama-quant.cpp`.
- Ensured the public `llama_model_quantize` API correctly calls the implementation in `llama-quant.cpp`.
- Fixed a memory leak by adding a destructor to `llama_model` to free SmarterQuant permutation data.
- Verified that `ggml-cpu.c` and `llama-model.cpp` changes for SmarterQuant dequantization compile.
- The main library and all example tools now compile and link successfully.

* feat: Implement SmarterQuant numerical correctness tests and update todo

This commit introduces a new test suite for the SmarterQuant functionality
to verify the numerical correctness of the custom block quantization and
dequantization logic. It also updates todo.txt to reflect this progress.

Key changes:
- Added `tests/test-smarterquant.cpp` with a test case that:
  - Uses a sample F32 tensor with mixed quantization types (Q4_0, Q5_1, Q8_0, Q2_K).
  - Applies column permutation.
  - Quantizes using `llama_tensor_quantize_smarter_blocks`.
  - Dequantizes using `ggml_get_rows_smarterquant`.
  - Verifies the numerical output against the original F32 data.
- Updated `tests/CMakeLists.txt` to build the new test.
- Made `llama_tensor_quantize_smarter_blocks` in `src/llama-quant.cpp` non-static and added its declaration to `src/llama-quant.h`.
- Made `ggml_get_rows_smarterquant` in `ggml/src/ggml-cpu/ggml-cpu.c` non-static to allow direct testing by the new test suite.
- The implemented test passes, confirming the core CPU implementation of SmarterQuant (Tasks 1 and 2 from todo.txt) is working as expected for the tested scenario.
- Updated `todo.txt` to mark the CPU numerical correctness testing as DONE and outline further potential test enhancements.

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
* Implement remaining SmarterQuant JSON and GGUF tests

- Enhanced `tests/test-smarterquant.cpp` with edge cases for tensor column counts (128, 300, 512, 768) and different permutation patterns (identity, few swaps).
- Added `tests/test-smarterquant-gguf.cpp` for end-to-end GGUF testing, including metadata writing/reading and numerical verification through the quantization and model loading pipeline.
- Updated `todo.txt` to reflect test completion.

* docs: Analyze memory impact of SmarterQuant unpermutation buffer

Documents the memory usage of the temporary F32 buffer used during
the unpermutation step in `ggml_get_rows_smarterquant`.

The buffer is stack-allocated (`alloca`) with size `n_cols * sizeof(float)`.
For typical model dimensions, this is a minor memory footprint (e.g., 16-32KB)
and is short-lived. A potential concern for extremely large column counts
is noted, though not typical for current LLM weights.

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
* Refactor mul_mat for SmarterQuant permuted inference

Modify matrix multiplication (ggml_compute_forward_mul_mat_one_chunk)
to operate on SmarterQuant src0 tensors that have permuted columns
and per-segment quantization. This involves:
- Iterating through src0 segments.
- Determining segment-specific quantization types.
- On-the-fly quantization of corresponding src1 (activation)
  segments if src1 is F32.
- Performing dot products using the permuted, quantized src0 segments.

The resulting dst tensor from this operation is computed in a permuted
order, reflecting how src0's column permutation changes the effective
indexing of dst's elements.

Add a new function `ggml_unpermute_f32_inplace` to unpermute the
first dimension of an F32 tensor.

Update `ggml_compute_forward_mul_mat` to:
- Correctly manage src1 data preparation, ensuring that the
  SmarterQuant path in `ggml_compute_forward_mul_mat_one_chunk`
  receives F32 src1 data for its internal per-segment quantization.
- Call `ggml_unpermute_f32_inplace` on the dst tensor after the
  matrix multiplication if src0 was SmarterQuant processed, to
  unpermute the result vector as per the requirements.

* Fix compilation errors in ggml-cpu and test-smarterquant

- Defined GGML_MAX_BLOCK_SIZE in ggml-cpu.c and used it instead of the undeclared GGML_MAX_TYPE_SIZE.
- Corrected a typo in ggml_compute_forward_mul_mat, changing wdata_src1_quantized to wdata.
- Fixed an incorrect function call to quantize_src1_segment by removing an extra NULL argument.
- Added a forward declaration for ggml_unpermute_f32_inplace in ggml-cpu.c.
- Included <cinttypes> in tests/test-smarterquant.cpp to resolve PRId64 undeclared identifier errors.

---------

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
- ggml/src/ggml-cpu/ggml-cpu.c:
  - Fix cast discarding 'const' qualifier.
  - Remove unused variables 'bs' and 'nbw0'.
- src/llama-quant.cpp:
  - Fix comparison of integer expressions of different signedness.
  - Remove unused variables 'thread_src', 'thread_dst_char', and 'total_size_written'.
- ggml/src/ggml.c:
  - Remove braces around scalar initializer for `sq_info`.
  - Explicitly initialize the `padding` field.
- ggml/src/ggml-cpu/ggml-cpu.c:
  - Change `src1_segment_prepared_data` to `const void *` to fix assignment discards 'const' qualifier warning.
708-145 closed this Jun 26, 2025
github-actions bot added the documentation, testing, python, and ggml labels Jun 26, 2025
708-145 deleted the fix-compile-warnings branch June 26, 2025 11:36