|
141 | 141 | "cell_type": "markdown", |
142 | 142 | "metadata": {}, |
143 | 143 | "source": [ |
144 | | - "From an algorithmic point of view then the two different implementation are doing the same thing. However, as it will become clearer in later tutorials, there are currently some scenarios where picking one style over the other can make a difference when it comes to exporting to a format such as standard ONNX. In the meantime, we can just keep in mind that both alternatives exist." |
| 144 | + "From an algorithmic point of view the two different implementation are doing the same thing. However, as it will become clearer in later tutorials, there are currently some scenarios where picking one style over the other can make a difference when it comes to exporting to a format such as standard ONNX. In the meantime, we can just keep in mind that both alternatives exist." |
145 | 145 | ] |
146 | 146 | }, |
147 | 147 | { |
|
251 | 251 | "cell_type": "markdown", |
252 | 252 | "metadata": {}, |
253 | 253 | "source": [ |
254 | | - "As expected, a `QuantIdentity` with quantization disabled behaves like an identity function also when a `QuantTensor` is passed in. However, depending on whather `return_quant_tensor` is set to `False` or not, quantization metadata might be stripped out, i.e. the input `QuantTensor` is going to be returned as an implicitly quantized `torch.Tensor`:" |
| 254 | + "As expected, a `QuantIdentity` with quantization disabled behaves like an identity function also when a `QuantTensor` is passed in. However, depending on whether `return_quant_tensor` is set to `False` or not, quantization metadata might be stripped out, i.e. the input `QuantTensor` is going to be returned as an implicitly quantized `torch.Tensor`:" |
255 | 255 | ] |
256 | 256 | }, |
257 | 257 | { |
|
625 | 625 | "source": [ |
626 | 626 | "Regarding some premade activation quantizers, such as `Uint8ActPerTensorFloat`, `ShiftedUint8ActPerTensorFloat`, and `Int8ActPerTensorFloat`, a word of caution that anticipates some of the themes of the next tutorial.\n", |
627 | 627 | "To minimize user interaction, Brevitas initializes scale and zero-point by collecting statistics for a number of training steps (by default 30). This can be seen as a sort of very basic calibration step, although it typically happens during training and with quantization already enabled. These statistics are accumulated in an exponential moving average that at end of the collection phase is used to initialize a learned *parameter*.\n", |
628 | | - "During the collection phase then, the quantizer behaves differently between `train()` and `eval()` mode. In `train()` mode, the statistics for that particular batch are returned. In `eval()` mode, the exponential moving average is returned. After the collection phase is over the learned parameter is returned in both execution modes.\n", |
| 628 | + "During the collection phase then, the quantizer behaves differently between `train()` and `eval()` mode. In `train()` mode, the statistics for that particular batch are returned. In `eval()` mode, the exponential moving average is returned. After the collection phase is over, the learned parameter is returned in both execution modes.\n", |
629 | 629 | "We can easily observe this behaviour with an example. Let's first define a quantized activation and two random input tensors:" |
630 | 630 | ] |
631 | 631 | }, |
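As a compressed sketch of the behaviour described above (the notebook's subsequent cells show the actual numbers; this assumes only the default premade activation quantizer):

```python
import torch
from brevitas.nn import QuantReLU

act = QuantReLU(return_quant_tensor=True)  # default Uint8ActPerTensorFloat quantizer
x1, x2 = torch.randn(8, 16), torch.randn(8, 16)

# During the statistics collection phase:
act.train()
print(act(x1).scale)  # scale computed from this batch's statistics
print(act(x2).scale)  # a different batch gives a different scale

act.eval()
print(act(x1).scale)  # scale taken from the exponential moving average instead
```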
|
818 | 818 | "cell_type": "markdown", |
819 | 819 | "metadata": {}, |
820 | 820 | "source": [ |
821 | | - "In all of the examples that have currently been looked at in this tutorial, we have used per-tensor quantization. I.e., the output tensor of the activation, if quantized, was always quantized on a per-tensor level, with a single scale and zero-point quantization parameter per output tensor. However, one can also do per-channel quantization, where each output channel of the tensor has its own quantization parameters. In the example below, we look at per-tensor quantization of an input tensor that has 3 channels and 256 elements in the height and width dimensions. We purposely mutate the 1st channel to have its dynamic range be 3 times larger than the other 2 channels. We then feed it through a `QuantReLU`, whose default behavior is to quantize at a per-tensor granularity." |
| 821 | + "In all of the examples that have looked at so far in this tutorial, we have used per-tensor quantization. I.e., the output tensor of the activation, if quantized, was always quantized on a per-tensor level, with a single scale and zero-point quantization parameter per output tensor. However, one can also do per-channel quantization, where each output channel of the tensor has its own quantization parameters. In the example below, we look at per-tensor quantization of an input tensor that has 3 channels and 256 elements in the height and width dimensions. We purposely mutate the 1st channel to have its dynamic range be 3 times larger than the other 2 channels. We then feed it through a `QuantReLU`, whose default behavior is to quantize at a per-tensor granularity." |
822 | 822 | ] |
823 | 823 | }, |
824 | 824 | { |
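For reference, a minimal sketch of that setup, assuming only the default `QuantReLU` configuration (the notebook's actual cells follow below in the full file):

```python
import torch
from brevitas.nn import QuantReLU

inp = torch.randn(1, 3, 256, 256)
inp[:, 0, :, :] *= 3.0  # inflate the dynamic range of the 1st channel

per_tensor_relu = QuantReLU(return_quant_tensor=True)  # per-tensor by default
out = per_tensor_relu(inp)
print(out.scale)  # a single scale shared by all 3 channels
```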
|
1069 | 1069 | "cell_type": "markdown", |
1070 | 1070 | "metadata": {}, |
1071 | 1071 | "source": [ |
1072 | | - "We can see that the number of elements in the quantization scale of the outputted tensor is now 3, matching those of the 3-channel tensor! Furthermore, we see that each channel has an 8-bit quantization range that matches its data distribution, which is much more ideal in terms of reducing quantization mismatch. However, it's important to note that some hardware providers don't efficiently support per-channel quantization in production, so it's best to check if your targetted hardware will allow per-channel quantization." |
| 1072 | + "We can see that the number of elements in the quantization scale of the output tensor is now 3, matching those of the 3-channel tensor! Furthermore, we see that each channel has an 8-bit quantization range that matches its data distribution, which is much more ideal in terms of reducing quantization mismatch. However, it's important to note that some hardware providers don't efficiently support per-channel quantization in production, so it's best to check if your targetted hardware will allow per-channel quantization." |
1073 | 1073 | ] |
1074 | 1074 | }, |
1075 | 1075 | { |
1076 | 1076 | "cell_type": "markdown", |
1077 | 1077 | "metadata": {}, |
1078 | 1078 | "source": [ |
1079 | 1079 | "Finally, a reminder that mixing things up is perfectly legal and encouraged in Brevitas.\n", |
1080 | | - "For example, a `QuantIdentity` with `act_quant=Int8ActPerTensorFloatMinMaxInit` is equivalent to a default `QuantHardTanh`, or conversely a `QuantHardTanh` with `act_quant=Int8ActPerTensorFloat` is equivalent to a default `QuantIdentity`. This is allowed by the fact that - as it will be explained in the next tutorial - the same layer can accept different keyword arguments when different quantizers are set. So a QuantIdentity with `act_quant=Int8ActPerTensorFloatMinMaxInit` is going to expect arguments `min_val` and `max_val` the same way a default `QuantHardTanh` would." |
| 1080 | + "For example, a `QuantIdentity` with `act_quant=Int8ActPerTensorFloatMinMaxInit` is equivalent to a default `QuantHardTanh`, or conversely a `QuantHardTanh` with `act_quant=Int8ActPerTensorFloat` is equivalent to a default `QuantIdentity`. This is allowed by the fact that - as it will be explained in the next tutorial - the same layer can accept different keyword arguments when different quantizers are set. So a `QuantIdentity` with `act_quant=Int8ActPerTensorFloatMinMaxInit` is going to expect arguments `min_val` and `max_val` the same way a default `QuantHardTanh` would." |
1081 | 1081 | ] |
1082 | 1082 | } |
1083 | 1083 | ], |
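A hedged illustration of that equivalence, assuming the premade quantizers are importable from `brevitas.quant` and that `Int8ActPerTensorFloatMinMaxInit` takes its range from `min_val`/`max_val` as described above:

```python
from brevitas.nn import QuantHardTanh, QuantIdentity
from brevitas.quant import Int8ActPerTensorFloat, Int8ActPerTensorFloatMinMaxInit

# Behaves like a default QuantHardTanh: the quantizer range is initialized from min/max.
hard_tanh_like = QuantIdentity(
    act_quant=Int8ActPerTensorFloatMinMaxInit, min_val=-1.0, max_val=1.0)

# Behaves like a default QuantIdentity: the scale is learned from statistics instead.
identity_like = QuantHardTanh(act_quant=Int8ActPerTensorFloat)
```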
|