
[Question] Understanding QuantLinear scaling and weight quantization for C/FPGA implementation #1246

@L-IKEA


Hi, thanks so much for your positive reply on Bluesky! I truly appreciate your willingness to help.

As you suggested, I'm opening this issue with my questions and code examples about how Brevitas works under the hood (in this case, especially "brevitas.nn.QuantLinear"), since I'm aiming to reimplement it in C for deployment on an FPGA.

I’ve been working on deploying a PyTorch-based quantized neural network to an FPGA platform. During this process, I have been using the Brevitas library to perform quantization-aware training, particularly leveraging "brevitas.nn.QuantLinear" for fully connected layers.

To implement the trained model in hardware, I am converting all parameters from the ".pth" file into C header files. However, I have run into a few challenges in understanding the exact internal quantization mechanism of the QuantLinear layer, especially how the quantized weights and activations are computed, and how I can retrieve all the necessary parameters (such as the scale) to replicate the behavior in C.
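
For context, this is roughly how I export the parameters today (a simplified version of my own export script; the function name and layout are mine, nothing Brevitas-specific):

import numpy as np

def write_c_header(name, tensor, path):
    # Flatten a parameter tensor and dump it as a C array initializer.
    flat = np.asarray(tensor.detach().cpu()).flatten()
    with open(path, "w") as f:
        f.write(f"static const float {name}[{flat.size}] = {{\n")
        f.write(",\n".join(f"    {v:.8f}f" for v in flat))
        f.write("\n};\n")

This works for the float weights, but for the quantized path I also need the scales and integer values, which is where my questions below come from.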

For instance, in the following layer:

self.input_to_hidden_f = qnn.QuantLinear(
    model_input_size + 1,
    num_neurons,
    bias=False,
    weight_quant=WeightQuant,
    weight_bit_width=self.Weight_Quant_Bit,
    output_quant=ActQuant,
    return_quant_tensor=True
)

I examined the runtime behavior by printing the values in forward():

    def forward(self, input_01, input_02, input_03, input_04, input_05, input_06, input_07, input_08, input_09, input_10, input_11, input_12, input_13, input_14, input_15, input_16, input_17, input_18, input_19, input_20, input_21, input_22, input_23, input_24, integral_01, integral_02, integral_03, integral_04, integral_05, integral_06, integral_07, integral_08, integral_09, integral_10, integral_11, integral_12, integral_13, integral_14, integral_15, integral_16, integral_17, integral_18, integral_19, integral_20, integral_21, integral_22, integral_23):
              
        # Forward path
        combined_0101 = torch.cat((integral_01, input_01), dim=-1)
        print("1 identity:", combined_0101[0][0], combined_0101[0][0].type)
        
        forward_01_in = self.input_to_hidden_f(combined_0101)
        # qnn.QuantLinear(model_input_size + 1, num_neurons, bias=False, weight_quant=WeightQuant, weight_bit_width=self.Weight_Quant_Bit, output_quant = ActQuant,return_quant_tensor=True)
        print("Weight:", self.input_to_hidden_f.weight)
        print("Weight Type:", self.input_to_hidden_f.weight.dtype)
        print("Quantized Weight:", self.input_to_hidden_f.quant_weight().tensor)
        print("Weight Scale:", f"{model.input_to_hidden_f.quant_weight().scale:.6f}")
        
        forward_01_in_unquant = torch.matmul(combined_0101, self.input_to_hidden_f.weight.T)
        print("1-2 MatMul:", forward_01_in_unquant)
        
        print("2 fc linear:", forward_01_in[0][0][0], forward_01_in[0][0][0].type, forward_01_in[0][0][0].shape)
        
        forward_01_in = self.activation_01(forward_01_in)
        print("3 activation LReLU:", forward_01_in[0][0][0], forward_01_in[0][0][0].type, forward_01_in[0][0][0].shape)
        
        forward_01_in = self.identity_01(forward_01_in)
        print("4 identity:", forward_01_in[0][0][0], forward_01_in[0][0][0].type, forward_01_in[0][0][0].shape)

The quantized weight scale prints as 0.007812, and the quantized weights appear to be per-tensor-scaled int8 values. However, when I inspect the saved .pth checkpoint, this scale value does not appear to be stored anywhere I can find.
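
I searched the checkpoint keys with a quick loop like this (just my own inspection script, nothing Brevitas-specific):

import torch

state_dict = torch.load("model.pth", map_location="cpu")
for key, value in state_dict.items():
    # print anything that looks like a quantization scale
    if "scal" in key.lower():
        print(key, value)

The only scale-related entry I find is: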

input_to_hidden_f.output_quant.fused_activation_quant_proxy.tensor_quant.scaling_impl.value:

6.531342506408691

This value seems unrelated to the weight scale. Therefore, I would appreciate it if you could help me understand:

Where is the quantized weight scale (e.g., 0.007812) stored, and how is it derived?
Is there a way to retrieve it directly from the saved .pth model, rather than by rebuilding the model as I do below?
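
The only way I have found so far to recover that scale is to rebuild the network in Python, load the checkpoint, and call quant_weight() again, which makes me assume the weight scale is recomputed from the stored float weights rather than saved explicitly (please correct me if that is wrong). Roughly:

import torch

model = MyNetwork()  # placeholder: my model class, constructed with the same arguments as for training
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

qw = model.input_to_hidden_f.quant_weight()
print("weight scale:", qw.scale)   # should match the 0.007812 I see in forward()
print("int weights:", qw.int())    # integer (int8) representation of the weights
print("bit width:", qw.bit_width)

If the scale can instead be read straight out of the .pth file without rebuilding the model, that is exactly the part I am missing.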

How exactly does QuantLinear compute its output at runtime?
I understand it is something like:
output = quantize(quant_input @ quant_weight.T)
but I would like to know the precise order of operations: whether the input and weights are quantized before or after the matrix multiplication, and at which point the scaling is applied.
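
To make this concrete, below is the numerical model I am currently assuming (per-tensor, symmetric quantization; every name is my own, not a Brevitas internal), and I would like to confirm whether QuantLinear actually behaves like this:

import torch

def assumed_quant_linear(x, w_float, s_x, s_w, s_out, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    # 1) round and clamp input and weights to integers
    x_int = torch.clamp(torch.round(x / s_x), qmin, qmax)
    w_int = torch.clamp(torch.round(w_float / s_w), qmin, qmax)

    # 2) integer matmul (wide accumulator, e.g. int32 in hardware)
    acc = x_int @ w_int.T

    # 3) the accumulator is in units of s_x * s_w; convert back to real values
    y = acc * (s_x * s_w)

    # 4) output quantization (the output_quant / ActQuant step)
    y_int = torch.clamp(torch.round(y / s_out), qmin, qmax)
    return y_int * s_out

In particular, I am not sure whether step 4 is where the 6.531342506408691 value from the state dict comes in as s_out.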

What are the best practices for replicating this quantization behavior exactly in C?
If there are recommended approaches or documentation (especially regarding the fixed-point scaling and integer arithmetic used internally), I would be very grateful; my current plan is sketched below.
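
My current plan for the C side is to fold s_x * s_w / s_out into a single fixed-point multiplier and keep everything in integers. This is only my own assumption about a reasonable approach (sketched in Python for readability, to be ported to C), not something taken from Brevitas:

SHIFT = 16  # fixed-point fractional bits of the folded multiplier

def make_multiplier(s_x, s_w, s_out, shift=SHIFT):
    # Fold the three scales into one integer multiplier M, so that at
    # runtime y_out ~= (acc * M) >> shift needs no floating point.
    return int(round((s_x * s_w / s_out) * (1 << shift)))

def requantize(acc, multiplier, shift=SHIFT, qmin=-128, qmax=127):
    # acc: integer accumulator from the int8 x int8 matmul
    y = (acc * multiplier + (1 << (shift - 1))) >> shift  # approximate round-to-nearest
    return max(qmin, min(qmax, y))

If Brevitas already exposes the exact rounding and clamping rules it uses, a pointer to that code or documentation would help me make this bit-accurate.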

Understanding these details is crucial for me to ensure a bit-accurate hardware implementation, especially in the context of low-bit (e.g., 8-bit) quantization.

Thank you very much for your time and for your work on Brevitas — it has been an incredibly helpful library in bridging PyTorch and FPGA design.

If any part of my question was unclear due to my limited experience with this library or the field in general, I sincerely apologize. I would greatly appreciate any clarification or guidance you could provide.

Thank you again for your time and support.
