Optimize MlasComputeSoftmax with prefetch (#20393)

yihonglyu · web-flow · commit edffa2a180d4 · 2024-04-25T08:28:59.000-07:00
The prefetch intrinsic (_mm_prefetch) is used to anticipate memory
accesses: before each row is processed, the next row of the input
buffer is prefetched into cache. This reduces the impact of memory
latency on MlasComputeSoftmax. As a result, the worst-case latency of
the OCR model improves by approximately 50 ms, or about 3%.
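
The idea can be illustrated outside MLAS with a minimal, scalar sketch.
It is not the MLAS implementation: PrefetchRow and SoftmaxRows are
hypothetical names, the 64-byte cache line is an assumption (as in the
patch), and the prefetch compiles to a no-op on targets without SSE.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

#if defined(__SSE__) || defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAS_PREFETCH 1
#endif

// Hypothetical helper: issue one prefetch hint per cache line of a row
// holding D floats. The 64-byte line size is an assumption.
inline void PrefetchRow(const float* row, std::size_t D) {
#if defined(SKETCH_HAS_PREFETCH)
    constexpr std::size_t CacheLineSize = 64;
    constexpr std::size_t ElementsPerCacheLine = CacheLineSize / sizeof(float);
    for (std::size_t i = 0; i * ElementsPerCacheLine < D; i++) {
        _mm_prefetch(reinterpret_cast<const char*>(row) + i * CacheLineSize,
                     _MM_HINT_T0);
    }
#else
    (void)row;
    (void)D;
#endif
}

// Hypothetical scalar row-wise softmax over an N x D row-major matrix.
inline void SoftmaxRows(const float* input, float* output,
                        std::size_t N, std::size_t D) {
    assert(input != nullptr && output != nullptr);
    if (D == 0) {
        return;
    }
    for (std::size_t n = 0; n < N; n++) {
        const float* in = input + n * D;
        float* out = output + n * D;

        // Request the next row before working on the current one, so the
        // loads can overlap with the computation below.
        if (n + 1 < N) {
            PrefetchRow(in + D, D);
        }

        // Find the maximum value for the row (for numerical stability).
        float maximum = in[0];
        for (std::size_t d = 1; d < D; d++) {
            if (in[d] > maximum) {
                maximum = in[d];
            }
        }

        // Exponentiate, accumulate the sum, then normalize.
        float sum = 0.0f;
        for (std::size_t d = 0; d < D; d++) {
            out[d] = std::exp(in[d] - maximum);
            sum += out[d];
        }
        for (std::size_t d = 0; d < D; d++) {
            out[d] /= sum;
        }
    }
}
```

Unlike the patch, this sketch skips the prefetch on the last row so that
no address past the end of the buffer is formed. Prefetch hints do not
fault on invalid addresses, so this is a tidiness choice for the sketch
rather than a fix.
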
diff --git a/onnxruntime/core/mlas/lib/compute.cpp b/onnxruntime/core/mlas/lib/compute.cpp
@@ -850,8 +850,24 @@ Return Value:
     const float* Input = WorkBlock->Input + n * D;
     float* Output = WorkBlock->Output + n * D;
 
+#if defined(MLAS_SSE2_INTRINSICS)
+    // TODO: Use std::hardware_constructive_interference_size
+    constexpr size_t CacheLineSize = 64;
+    constexpr size_t ElementsPerCacheLine = CacheLineSize / sizeof(float);
+#endif
+
     while (CountN > 0) {
 
+#if defined(MLAS_SSE2_INTRINSICS)
+        //
+        // Prefetch the next row of the input buffer.
+        //
+
+        for (size_t i = 0; i * ElementsPerCacheLine < D; i++) {
+            _mm_prefetch((char*)(Input + D) + i * CacheLineSize, _MM_HINT_T0);
+        }
+#endif
+
         //
         // Find the maximum value for the row.
         //