Reformulating matrix multiplication scale equation to reduce math ops and improve power and performance. (pytorch#6437)

trivedivivek · facebook-github-bot · commit 50f551d9184a · 2024-10-23T08:19:49.000-07:00
Summary: This diff simplifies the matrix multiplication scale equation in the q_linear op.

The existing equation in the q_linear op is:
```
for i in K / 4
  sums[c] = mat1_tex . (qmat2(c) * scales[c])
  out += sums
```
where c ranges over [0, 4); out, sums, mat1_tex and qmat2(c) are vectors; and scales[c] is a scalar.

The dot product is compatible with scalar multiplication, as described in https://en.wikipedia.org/wiki/Dot_product, i.e. (c1 a) . (c2 b) = c1 c2 (a . b)

Thus, the multiplication can be rearranged as:
```
for i in K / 4
  sums[c] = (mat1_tex . qmat2(c)) * scales[c]
  out += sums
```

Using the distributive property of multiplication, i.e. ab + ac + ad + ... = a(b + c + d + ...), the scale can be factored out of the accumulation, so the code can be further simplified to:
```
for i in K / 4
  sums[c] = mat1_tex . qmat2(c)
  out += sums
out *= scales
```

This rearrangement replaces the four vec4 scale multiplications performed in every loop iteration with a single vec4 multiplication after the loop.

Reviewed By: SS-JIA

Differential Revision: D64479405
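The rearrangement described in the summary can be checked numerically outside the shader. The following is a minimal Python sketch, not the shader code itself: the function names `q_linear_before` and `q_linear_after`, the flat-list data layout, and the random test data are all assumptions made for illustration. It models one output texel (four output channels) and compares scaling inside the loop against scaling once after the loop.

```python
import math
import random


def q_linear_before(mat1, qmat2, scales):
    """Hypothetical model of the old loop: scale qmat2 inside every iteration."""
    out = [0.0] * 4
    for i in range(0, len(mat1), 4):
        mat1_tex = mat1[i:i + 4]
        for c in range(4):
            # sums[c] = mat1_tex . (qmat2(c) * scales[c])
            scaled = [q * scales[c] for q in qmat2[c][i:i + 4]]
            out[c] += sum(a * b for a, b in zip(mat1_tex, scaled))
    return out


def q_linear_after(mat1, qmat2, scales):
    """Hypothetical model of the new loop: accumulate unscaled, scale once."""
    out = [0.0] * 4
    for i in range(0, len(mat1), 4):
        mat1_tex = mat1[i:i + 4]
        for c in range(4):
            # sums[c] = mat1_tex . qmat2(c)
            out[c] += sum(a * b for a, b in zip(mat1_tex, qmat2[c][i:i + 4]))
    # out *= scales (component-wise, applied once after the loop)
    return [o * s for o, s in zip(out, scales)]


# Illustrative data only: K = 64, int8-like weights, small positive scales.
random.seed(0)
K = 64
mat1 = [random.uniform(-1.0, 1.0) for _ in range(K)]
qmat2 = [[float(random.randint(-128, 127)) for _ in range(K)] for _ in range(4)]
scales = [random.uniform(0.01, 0.1) for _ in range(4)]

before = q_linear_before(mat1, qmat2, scales)
after = q_linear_after(mat1, qmat2, scales)

# The two forms agree up to floating-point rounding.
assert all(
    math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-9) for x, y in zip(before, after)
)
```

The two forms are algebraically identical, but because floating-point arithmetic rounds after each operation, the results are expected to match only within a small tolerance rather than bit for bit, which is why the comparison uses `math.isclose`.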
diff --git a/backends/vulkan/runtime/graph/ops/glsl/q_8w_linear.glsl b/backends/vulkan/runtime/graph/ops/glsl/q_8w_linear.glsl
@@ -102,22 +102,20 @@ VEC4_T q_8w_linear(const ivec3 out_pos, const int K) {
 
   for (int i = 0; i < K; i += 4) {
     const VEC4_T mat1_tex = load_texel(t_mat1, mat1_pos);
-
     const VEC4_T sums = VEC4_T(
-        dot(mat1_tex, load_texel(t_qmat2, qmat2_pos) * scales.x),
-        dot(mat1_tex,
-            load_texel(t_qmat2, qmat2_pos + u16vec3(0, 1, 0)) * scales.y),
-        dot(mat1_tex,
-            load_texel(t_qmat2, qmat2_pos + u16vec3(0, 2, 0)) * scales.z),
-        dot(mat1_tex,
-            load_texel(t_qmat2, qmat2_pos + u16vec3(0, 3, 0)) * scales.w));
+        dot(mat1_tex, load_texel(t_qmat2, qmat2_pos)),
+        dot(mat1_tex, load_texel(t_qmat2, qmat2_pos + u16vec3(0, 1, 0))),
+        dot(mat1_tex, load_texel(t_qmat2, qmat2_pos + u16vec3(0, 2, 0))),
+        dot(mat1_tex, load_texel(t_qmat2, qmat2_pos + u16vec3(0, 3, 0))));
 
     outtex += sums;
 
     mat1_pos.x++;
     qmat2_pos.x++;
   }
 
+  outtex *= scales;
+
   return outtex;
 }