**Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPUs, CPUs and other devices. It will be the primary back-end of the upcoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, addressing the performance problems we encountered in DeepLearning.scala 2.0 with [ND4J](http://nd4j.org/).
* Compute.scala can dynamically merge multiple operators into a single kernel program, which runs significantly faster when performing complex computations.
* Compute.scala manages data buffers and other native resources in a deterministic manner, consuming less memory.
### Creating an N-dimensional array
Import the types in the `gpu` or the `cpu` object, according to the OpenCL runtime you want to use.
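For example (a minimal sketch; the package paths below are assumed from the project's organization and should be verified against the Scaladoc):

```scala
// Run on GPU via the OpenCL GPU runtime (assumed package path):
import com.thoughtworks.compute.gpu._

// Or run on CPU via the OpenCL CPU runtime instead:
// import com.thoughtworks.compute.cpu._

// Either import brings the `Tensor` type and its operations into scope.
```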
Generally, when `join`ing *n* `Tensor`s of shape *a*<sub>0</sub> × *a*<sub>1</sub> × *a*<sub>2</sub> × ⋯ × *a*<sub>*i*</sub>, the shape of the resulting `Tensor` is *a*<sub>0</sub> × *a*<sub>1</sub> × *a*<sub>2</sub> × ⋯ × *a*<sub>*i*</sub> × *n*. For example, joining three 32 × 32 `Tensor`s yields a 32 × 32 × 3 `Tensor`.
#### Case study: fast matrix multiplication via `split` and `join`
By combining `split` and `join`, you can create complex computations in the following steps:
1. Use `split` to create `Seq`s from some of the dimensions of `Tensor`s.
2. Use Scala collection functions to manipulate the `Seq`s.
3. Use `join` to merge the transformed `Seq`s back into a `Tensor`.
For example, you can implement matrix multiplication in this style.
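A minimal sketch of such a `matrixMultiply1`, assuming `split(dimension)` returns a `Seq[Tensor]`, `broadcast(shape)` repeats a `Tensor` to the given shape, and `Tensor.join` joins a `Seq[Tensor]` as described above (check the project's Scaladoc for the exact signatures):

```scala
// matrix1 has shape i × j, matrix2 has shape j × k; the result has shape i × k.
def matrixMultiply1(matrix1: Tensor, matrix2: Tensor): Tensor = {
  val columns1 = matrix1.split(1) // j columns of matrix1, each of shape i
  val columns2 = matrix2.split(1) // k columns of matrix2, each of shape j

  val resultColumns = columns2.map { column2 =>
    // Scale each column of matrix1 by the corresponding scalar element of column2,
    // then sum the scaled columns to obtain one column of the result (shape i).
    (columns1 zip column2.split(0))
      .map { case (column1, scalar) =>
        column1 * scalar.broadcast(column1.shape)
      }
      .reduce[Tensor](_ + _)
  }

  // Joining k columns of shape i yields the i × k result.
  Tensor.join(resultColumns)
}
```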
You can think of the Scala collection functions as a code generator for the kernel program: a loop written with Scala collections ultimately produces an unrolled loop in the generated kernel program.
The above `matrixMultiply1` creates a kernel program that contains an unrolled loop over each row and column of `matrix2`, so it runs very fast when `matrix1` is large and `matrix2` is small. Our benchmark shows that `matrixMultiply1` runs 13 times faster than ND4J's cuBLAS back-end on a Titan X GPU when `matrix1` is 65536×8 and `matrix2` is 8×8.
---
You can also create another version of matrix multiplication that only unrolls the loop over each row of `matrix2`.
The final version of `matrixMultiply` will then have good performance for both small and large matrices.
## Benchmark
* [Compute.scala vs ND4J on an NVIDIA Titan X GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
* [Compute.scala on an AMD RX480 GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)
Some observations from the benchmark results:
* Compute.scala supports both NVIDIA and AMD GPUs, while ND4J does not support AMD GPUs.
* Compute.scala is faster than ND4J on large arrays or complex expressions.
* ND4J is faster than Compute.scala when performing a single simple operation on very small arrays.
* ND4J's reduced sum is faster than Compute.scala's.
* ND4J's `permute` and `broadcast` are extremely slow, resulting in a very low score in the convolution benchmark. (Unlike this benchmark, Deeplearning4j's convolution operation internally uses some undocumented variant of `permute` and `broadcast` in ND4J, which is not extremely slow.)