@LLukas22 (Contributor) commented Nov 9, 2025

Disclaimer

This PR is still a work in progress and needs some polishing.
The main goal of opening it early is to gather feedback on whether this approach is heading in the right direction.


Overview

The primary goal of this PR is to introduce QuantizedDType support.
These data types allow regular tensors to store and operate on quantized data, making it much easier to add new quantization schemes to Candle in the future without relying on custom ops.


What’s Included

1. New candle-macros crates

  • candle-macros: contains the procedural macros that generate the dispatch code for QuantizedDTypes.
  • candle-macros-types: defines the traits that quantized types can implement to provide backend-specific support.

2. QuantizedType trait

Each quantization scheme implements the QuantizedType trait, which defines:

  • a static NAME
  • functions to calculate storage size for the quantized format
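
As a rough illustration, a trait along these lines could look like the following. This is a minimal standalone sketch, not the PR's actual trait: the method name `storage_size_in_bytes`, the toy `Q8Toy` type, and its 32-element block layout are all hypothetical.

```rust
// Hypothetical sketch of a QuantizedType-style trait; the exact
// signatures in this PR may differ.
trait QuantizedType {
    /// Static identifier for the quantization scheme.
    const NAME: &'static str;
    /// Bytes required to store `n_elements` values in this format.
    fn storage_size_in_bytes(n_elements: usize) -> usize;
}

/// Toy block quantization (illustrative only): 32 values per block,
/// each block storing 32 i8 values plus one f32 scale.
struct Q8Toy;

impl QuantizedType for Q8Toy {
    const NAME: &'static str = "q8_toy";
    fn storage_size_in_bytes(n_elements: usize) -> usize {
        let blocks = n_elements.div_ceil(32);
        blocks * (32 + std::mem::size_of::<f32>())
    }
}

fn main() {
    // 64 elements -> 2 blocks of (32 value bytes + 4-byte scale) each.
    assert_eq!(Q8Toy::storage_size_in_bytes(64), 72);
    println!("ok");
}
```

Because storage size depends on the scheme's block layout, it has to be computed by the type itself rather than derived from the element count alone.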

Quantizations can optionally implement one or more of the following backend traits:

  • QuantizedCpuOps
  • QuantizedCudaOps
  • QuantizedMetalOps

These traits define the de/quantization logic and backend-specific matmul implementations (e.g., f32 × quantized).
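
To make the shape of such a backend trait concrete, here is a self-contained toy version. Everything here is hypothetical: the method names, the per-tensor absmax i8 scheme (`I8Absmax`), and the default matmul that falls back to dequantize-then-naive-f32 — a real backend would dispatch to a fused kernel instead.

```rust
// Hypothetical sketch of the idea behind a QuantizedCpuOps-style trait:
// de/quantization plus a mixed-precision (f32 × quantized) matmul.
trait QuantizedCpuOps {
    fn quantize(src: &[f32]) -> Vec<u8>;
    fn dequantize(src: &[u8], n: usize) -> Vec<f32>;

    /// Fallback matmul: dequantize the RHS, then run a naive f32 kernel.
    /// A real backend would provide a fused quantized kernel here.
    fn matmul_f32(lhs: &[f32], rhs: &[u8], m: usize, k: usize, n: usize) -> Vec<f32> {
        let rhs = Self::dequantize(rhs, k * n);
        let mut out = vec![0f32; m * n];
        for i in 0..m {
            for j in 0..n {
                out[i * n + j] = (0..k).map(|p| lhs[i * k + p] * rhs[p * n + j]).sum();
            }
        }
        out
    }
}

/// Toy per-tensor absmax i8 scheme: one little-endian f32 scale,
/// followed by the i8 values.
struct I8Absmax;

impl QuantizedCpuOps for I8Absmax {
    fn quantize(src: &[f32]) -> Vec<u8> {
        let absmax = src.iter().fold(0f32, |a, &x| a.max(x.abs()));
        let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
        let mut out = scale.to_le_bytes().to_vec();
        out.extend(src.iter().map(|&x| (x / scale).round() as i8 as u8));
        out
    }
    fn dequantize(src: &[u8], n: usize) -> Vec<f32> {
        let scale = f32::from_le_bytes(src[..4].try_into().unwrap());
        src[4..4 + n].iter().map(|&b| (b as i8) as f32 * scale).collect()
    }
}

fn main() {
    let data = [1.0f32, -2.0, 0.5, 127.0];
    let q = I8Absmax::quantize(&data);
    let d = I8Absmax::dequantize(&q, data.len());
    for (a, b) in data.iter().zip(&d) {
        assert!((a - b).abs() <= 0.5, "round-trip error too large");
    }
    println!("ok");
}
```

Splitting the backends into separate optional traits means a scheme can ship CPU support first and add CUDA or Metal later without touching its core definition.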

3. The register_quantized_types! macro

This macro generates:

  • a QuantizedDType enum
  • all required dispatch functions to call backend ops efficiently (minimizing runtime overhead)

The enum is then integrated into Candle Core as a new DType::Quantized(QuantizedDType) variant.
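
To give a feel for the expansion, here is a hand-written approximation of the kind of code such a macro might generate. The variant names, the 32-element block layouts, and the byte counts are illustrative placeholders, not the PR's generated output.

```rust
// Hand-written sketch of what register_quantized_types! might expand
// to: an enum over the registered schemes plus match-based dispatch.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum QuantizedDType {
    Q8_0,
    Q4_0,
}

impl QuantizedDType {
    /// Dispatch via a plain match, so the per-call overhead is minimal
    /// (no trait objects or dynamic lookup).
    fn storage_size_in_bytes(self, n_elements: usize) -> usize {
        match self {
            // Hypothetical layouts: 32-element blocks with one f32 scale,
            // holding 8-bit and 4-bit values respectively.
            QuantizedDType::Q8_0 => n_elements.div_ceil(32) * 36,
            QuantizedDType::Q4_0 => n_elements.div_ceil(32) * 20,
        }
    }
}

fn main() {
    assert_eq!(QuantizedDType::Q8_0.storage_size_in_bytes(64), 72);
    assert_eq!(QuantizedDType::Q4_0.storage_size_in_bytes(64), 40);
    println!("ok");
}
```

Generating a closed enum (rather than boxing trait objects) is what keeps the dispatch cost down to a single `match`.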

Tensors using this type:

  • support most operations through implicit dequantization (currently to f32)
  • dispatch directly to backend-specific matmul implementations when data types match
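
An end-to-end usage sketch, assuming Candle's existing Tensor/Device API plus the new DType::Quantized variant from this PR (method names on the quantized side are illustrative and may not match the final API):

```rust
use candle_core::{DType, Device, Tensor};

fn run() -> candle_core::Result<()> {
    let t = Tensor::randn(0f32, 1f32, (4, 32), &Device::Cpu)?;
    // Quantize via to_dtype — per this PR, currently the only way to
    // create a quantized tensor. QuantizedDType is the PR's new enum.
    let q = t.to_dtype(DType::Quantized(QuantizedDType::Q8_0))?;
    // When the dtypes line up, matmul dispatches to the quantized kernel;
    // other ops implicitly dequantize to f32 first.
    let y = t.matmul(&q.t()?)?;
    Ok(())
}
```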

4. External Quantization Support

A register_external_quantized_type! macro is also included.
This will allow external crates to register their own quantization types without modifying Candle Core directly.
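
Usage from an external crate might look roughly like this (a sketch only — the macro's exact signature comes from the PR and may change):

```rust
// Hypothetical external-crate registration; trait impls elided.
use candle_macros::register_external_quantized_type;

struct MyQ4;
// ... implement QuantizedType + the desired backend traits for MyQ4 ...

register_external_quantized_type!(MyQ4);
```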


Current Limitations

  • Quantized tensors can currently only be created via .to_dtype() — I haven’t yet found a clean way to load them from files.
  • Using f32 as the intermediate type isn’t ideal for some backends (like CUDA) and may need refinement.
  • The Metal backend implementation is not yet complete, as I don’t have access to Apple hardware for testing.
