[PROTON] Implement max bps method for XPU (#5617)

dev-tomek · web-flow · commit 8bbfd2199207 · 2025-12-08T14:02:43.000-05:00
This PR implements the theoretical memory bandwidth calculation for XPU
GPUs in Proton.

Remarks:

- The formula was derived and its results were compared against the
published XPU bandwidths.
- The multipliers in `xpu_arch_to_mem_type_multiplier` reflect the memory
type that each architecture uses (GDDR6, HBM2e).
- Only 3 arch mappings are included. To my knowledge the rest of the Intel
GPUs are integrated, so their bandwidth is system dependent. Perhaps some
exception handling should be added for unmapped archs, but the other
branches do not have it either.
- The result is in megabytes, aligned with the CUDA branch, but the
docstring and the HIP branch indicate that the returned value should be
in bytes. I think this should be aligned upstream.
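
For reference, a minimal standalone sketch of the formula described in the remarks above. The multipliers and the expression are copied from the diff; the helper names `xpu_max_bps` and `xpu_max_bps_or_none` are hypothetical and not part of Proton, and the input values are made up for illustration rather than read from a device. The second helper shows one possible way to handle an unmapped arch, under the assumption that returning `None` is acceptable to callers.

```python
# Multipliers copied from the diff in this PR.
xpu_arch_to_mem_type_multiplier = {
    "pvc": 2,
    "dg2": 8,
    "bmg": 8,
}


def xpu_max_bps(arch, bus_width, memory_clock_rate):
    """Same expression as the XPU branch of max_bps in this PR (hypothetical helper name)."""
    multiplier = xpu_arch_to_mem_type_multiplier[arch]  # raises KeyError for unmapped archs
    return multiplier * bus_width * memory_clock_rate * 1e3 / 8


def xpu_max_bps_or_none(arch, bus_width, memory_clock_rate):
    """Hypothetical wrapper: return None for an unmapped arch instead of raising KeyError."""
    if arch not in xpu_arch_to_mem_type_multiplier:
        return None
    return xpu_max_bps(arch, bus_width, memory_clock_rate)


# Illustrative inputs only, not measured device properties.
example = xpu_max_bps("pvc", bus_width=8192, memory_clock_rate=1600)
unmapped = xpu_max_bps_or_none("some-integrated-arch", bus_width=128, memory_clock_rate=1000)
print(example, unmapped)
```

Whether an unmapped arch should return `None`, fall back to a default multiplier, or keep raising is a design choice left open here; the PR itself keeps the plain dictionary lookup.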
diff --git a/third_party/proton/proton/specs.py b/third_party/proton/proton/specs.py
@@ -18,6 +18,12 @@
     'gfx950': 8.0 * 1e12,
 }
 
+xpu_arch_to_mem_type_multiplier = {
+    "pvc": 2,
+    "dg2": 8,
+    "bmg": 8,
+}
+
 # FP8 Matrix Performance(FLOPS/clock/CU)
 # For gfx90a we use the performance of INT8 since it doesn't support FP8 matrix operations.
 amd_fp8_flops_by_arch = {'gfx90a': 1024, 'gfx942': 4096, 'gfx950': 8192}
@@ -68,6 +74,4 @@ def max_bps(device_type, arch, bus_width, memory_clock_rate):
         return amd_bps_by_arch[arch]
     else:
         assert device_type == "XPU"
-        # FIXME: how to get correctly numbers on XPU?
-        # https://github.com/intel/intel-xpu-backend-for-triton/issues/5550
-        return 2 * bus_width * memory_clock_rate * 1e3 / 8
+        return xpu_arch_to_mem_type_multiplier[arch] * bus_width * memory_clock_rate * 1e3 / 8