Skip to content

Commit cc521fb

Browse files
authored
Update README.md
1 parent dc522b0 commit cc521fb

File tree

1 file changed

+1
-1
lines changed

1 file changed

+1
-1
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -295,7 +295,7 @@ def matrixMultiply1(matrix1: Tensor, matrix2: Tensor): Tensor = {
295295

296296
You can imagine the Scala collection functions as the code generator of the kernel program, thus the loop running in Scala collection will finally create unrolled loop in the kernel program.
297297

298-
The above `matrixMultiply1` will create a kernel program that contains a unrolled loop of each row and column of `matrix2`. Thus it runs very fast when `matrix1` is big and `matrix2` is small. Our benchmark shows that the above `matrixMultiply1` runs 13 times faster than ND4J's cuBLAS back-end, on a Titan X GPU, when `matrix1` is 65536×8 and `matrix2` is 8×8.
298+
The above `matrixMultiply1` will create a kernel program that contains a unrolled loop of each row and column of `matrix2`. Thus it runs very fast when `matrix1` is big and `matrix2` is small. Our benchmark shows that the above `matrixMultiply1` runs 2.15 times faster than ND4J's cuBLAS back-end, on a Titan X GPU, when `matrix1` is 65536×8 and `matrix2` is 8×8.
299299

300300
---
301301

0 commit comments

Comments
 (0)