Summary: Update the README with better examples, update the class and API documentation, and remove the unnecessary int_mm_fused_mul option from dynamic quant
Test Plan: python test/test.py
Reviewers:
Subscribers:
Tasks:
Tags:
[ghstack-poisoned]
README.md:

 # convert linear modules to quantized linear modules
-quant_api.apply_weight_only_int8_quant(model)
+quant_api.apply_dynamic_quant(model)

 # compile the model to improve performance
-torch.compile(model, mode='max-autotune')
-model(input)
+...
 ```

-### A8W8 Dynamic Quantization
+This technique works best when the torch._inductor.config.force_fuse_int_mm_with_mul option is enabled. This allows fusion of the int8*int8 -> int32 matmul and subsequent mul op, thereby avoiding materialization of the int32 intermediary tensor.

-Similar to the weight only api above, the `apply_dynamic_quant` function swaps all
-linear modules to dynamically quantized quantized linear modules.
+### A16W8 WeightOnly Quantization
+
+The `apply_weight_only_int8_quant` function swaps all
+linear modules to weight-only quantized linear modules.

 Example

 ```
+import torch
+from torchao.quantization import quant_api

 # some user model and example input
-...
+model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)

 # convert linear modules to quantized linear modules
-quant_api.apply_dynamic_quant(model)
+quant_api.apply_weight_only_int8_quant(model)

 # compile the model to improve performance
-...
+torch.compile(model, mode='max-autotune')
+model(input)
 ```

+This technique works best when the torch._inductor.config.use_mixed_mm option is enabled. This avoids dequantizing the weight tensor before the matmul, instead fusing the dequantization into the matmul, thereby avoiding materialization of a large floating point weight tensor.
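For context on the force_fuse_int_mm_with_mul note added above, here is a minimal sketch of the A8W8 dynamic quantization flow with that inductor flag enabled. It assumes the quant_api surface shown in the diff; the Sequential model, input shapes, and flag placement are illustrative assumptions, not part of this change.

```
import torch
import torch._inductor.config as inductor_config
from torchao.quantization import quant_api

# Let inductor fuse the int8*int8 -> int32 matmul with the following mul,
# so the int32 intermediate tensor is never materialized.
inductor_config.force_fuse_int_mm_with_mul = True

# Illustrative model and input (any module containing nn.Linear works).
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device="cuda")

# Swap nn.Linear modules for dynamically quantized linear modules.
quant_api.apply_dynamic_quant(model)

# Compile so the fusion can apply, then run.
model = torch.compile(model, mode="max-autotune")
model(input)
```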
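Similarly, a sketch of the A16W8 weight-only path with use_mixed_mm enabled, as the new closing paragraph of that section describes; again the model, shapes, and flag placement are assumptions for illustration, not prescribed by this PR.

```
import torch
import torch._inductor.config as inductor_config
from torchao.quantization import quant_api

# Fuse dequantization of the int8 weight into the matmul so a full
# floating point copy of the weight tensor is never materialized.
inductor_config.use_mixed_mm = True

model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device="cuda")

# Swap nn.Linear modules for weight-only int8 quantized linear modules.
quant_api.apply_weight_only_int8_quant(model)

model = torch.compile(model, mode="max-autotune")
model(input)
```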