
Commit fcb3e32

Author: George Ohashi
Parent: 98918b9

File tree: 1 file changed (+3 −3 lines)


README.md (3 additions & 3 deletions)
@@ -29,15 +29,15 @@
 PTQ is performed to reduce the precision of quantizable weights (e.g., linear layers) to a lower bit-width. Supported formats are:

 ##### [W4A16](./examples/quantization_w4a16/README.md)
-- Uses GPTQ to compress weights to 4 bits.
+- Uses GPTQ to compress weights to 4 bits. Requires calibration dataset.
 - Useful speed ups in low QPS regimes with more weight compression.
 - Recommended for any GPUs types.
 ##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md)
-- Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token quantization to compress activations to 8 bits.
+- Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and uses dynamic per-token quantization to compress activations to 8 bits. Requires calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM.
 - Useful for speed ups in high QPS regimes or offline serving on vLLM.
 - Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
 ##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md)
-- Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token quantization to compress activations to 8 bits.
+- Uses channel-wise quantization to compress weights to 8 bits, and uses dynamic per-token quantization to compress activations to 8 bits. Does not require calibration dataset. Activation quantization is carried out during inference on vLLM.
 - Useful for speed ups in high QPS regimes or offline serving on vLLM.
 - Recommended for NVIDIA GPUs with compute capability >8.9 (Hopper and Ada Lovelace).
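The W8A8 bullets in the diff above distinguish channel-wise weight quantization (one scale per output channel, computed offline) from dynamic per-token activation quantization (one scale per token row, computed at inference time). Both apply the same symmetric scale-and-round arithmetic, just along different rows. The following is a minimal pure-Python sketch of that shared INT8 arithmetic on toy data; function names are hypothetical and it does not model GPTQ or the llm-compressor implementation:

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    For a weight matrix, rows are output channels (channel-wise);
    for an activation matrix, rows are tokens (per-token dynamic).
    """
    q_rows, scales = [], []
    for row in matrix:
        # Map the largest-magnitude value in the row to +/-127.
        scale = (max(abs(v) for v in row) / 127) or 1.0  # avoid 0 for all-zero rows
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    """Recover approximate floats; error is at most half a quantization step."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

On a toy 2x3 matrix such as `[[0.5, -1.27, 0.0], [2.54, 1.0, -2.0]]`, a quantize/dequantize round trip recovers each value to within half of that row's scale, which is why per-row (rather than per-tensor) scales preserve accuracy when row magnitudes differ widely.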
