
Commit 1bd3613

Update on "Documentation Updates"
Summary: Updating README with better examples, updating class and API documentation, and removing the unnecessary int_mm_fused_mul option from dynamic quant

Test Plan: python test/test.py

Reviewers:
Subscribers:
Tasks:
Tags:

[ghstack-poisoned]
1 parent d375485 commit 1bd3613

1 file changed: README.md (24 additions, 16 deletions)
@@ -27,49 +27,57 @@ torchao 0.0.1 <install dir>

 Relevant APIs can be found in torchao.quantization.quant_api

+Note: Depending on the technique being applied to the model, you may see a perf degradation.
+This is because quantization adds additional overhead to the model that is hopefully made up for
+with faster matmuls. If your matmuls are small enough (or have odd shapes), the overhead can be larger than the gain
+from the quantized matmul.

-### A16W8 WeightOnly Quantization
+### A8W8 Dynamic Quantization

-The `apply_weight_only_int8_quant` function swaps all
-linear modules to weight-only quantized linear modules.
+Similar to the weight only api above, the `apply_dynamic_quant` function swaps all
+linear modules to dynamically quantized linear modules.

 Example

 ```
-import torch
-from torchao.quantization import quant_api

 # some user model and example input
-model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
-input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')
+...

 # convert linear modules to quantized linear modules
-quant_api.apply_weight_only_int8_quant(model)
+quant_api.apply_dynamic_quant(model)

 # compile the model to improve performance
-torch.compile(model, mode='max-autotune')
-model(input)
+...
 ```

-### A8W8 Dynamic Quantization
+This technique works best when the torch._inductor.config.force_fuse_int_mm_with_mul option is enabled. This allows fusion of the int8*int8 -> int32 matmul and subsequent mul op, thereby avoiding materialization of the int32 intermediary tensor.

-Similar to the weight only api above, the `apply_dynamic_quant` function swaps all
-linear modules to dynamically quantized quantized linear modules.
+### A16W8 WeightOnly Quantization
+
+The `apply_weight_only_int8_quant` function swaps all
+linear modules to weight-only quantized linear modules.

 Example

 ```
+import torch
+from torchao.quantization import quant_api

 # some user model and example input
-...
+model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
+input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')

 # convert linear modules to quantized linear modules
-quant_api.apply_dynamic_quant(model)
+quant_api.apply_weight_only_int8_quant(model)

 # compile the model to improve performance
-...
+torch.compile(model, mode='max-autotune')
+model(input)
 ```

+This technique works best when the torch._inductor.config.use_mixed_mm option is enabled. This avoids dequantizing the weight tensor before the matmul, instead fusing the dequantization into the matmul, thereby avoiding materialization of a large floating point weight tensor.
+
 ## Other APIs

 ### A8W8 Dynamic Quantization by subclasses
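For reference, a minimal runnable sketch of the dynamic-quantization flow described in the new README text, with a toy model standing in for the elided `...` placeholders and the inductor flag from the README note enabled up front; the flag placement and the `torch.compile` reassignment are assumptions, not part of the committed example:

```
import torch
from torchao.quantization import quant_api

# Enable fusion of the int8*int8 -> int32 matmul with the subsequent mul,
# as recommended in the README text (assumed to be set before compiling).
torch._inductor.config.force_fuse_int_mm_with_mul = True

# Toy stand-ins for the user model and example input elided as "..." in the diff.
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# Convert linear modules to dynamically quantized linear modules.
quant_api.apply_dynamic_quant(model)

# Compile the model to improve performance, then run it.
model = torch.compile(model, mode='max-autotune')
model(input)
```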

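Similarly, a sketch of the weight-only path with the `use_mixed_mm` option from the README note enabled; setting the flag before compiling is an assumption, while the rest mirrors the committed example:

```
import torch
from torchao.quantization import quant_api

# Fuse the int8 weight dequantization into the matmul instead of
# materializing a full floating point weight tensor (per the README note).
torch._inductor.config.use_mixed_mm = True

# Same toy model and input as the committed weight-only example.
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# Convert linear modules to weight-only int8 quantized linear modules.
quant_api.apply_weight_only_int8_quant(model)

# Compile the model so inductor can apply the mixed-dtype matmul fusion.
model = torch.compile(model, mode='max-autotune')
model(input)
```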