Enable quantization for bf16 model (pytorch#14558)

navsud · facebook-github-bot · commit 291121cb04c5 · 2025-09-24T15:04:27.000-07:00
Summary:

To save GPU memory `bfloat16` dtype is commonly used for training of LLMs. Currently, the quantizer ignores quantizing the nodes if they are not float32. This change enables quantization of bf16 nodes as well.

Differential Revision: D82866443
diff --git a/backends/qualcomm/quantizer/annotators.py b/backends/qualcomm/quantizer/annotators.py
@@ -68,7 +68,7 @@ def _is_float_tensor(node: Node):
         or not isinstance(node.meta["val"], FakeTensor)
     ):
         return False
-    return node.meta["val"].dtype == torch.float32
+    return node.meta["val"].dtype in (torch.bfloat16, torch.float32)
 
 
 def _mark_nodes_as_annotated(nodes: List[Node]):