
Commit f18182a

Merge pull request #135 from Atry/readme
Add the example for `join` in documentation
2 parents 62fa450 + 73013bd commit f18182a

3 files changed: +136 −20 lines

README.md

Lines changed: 105 additions & 14 deletions
@@ -1,6 +1,6 @@
# Compute.scala

- **Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPU, CPU and other devices. It will be the primary back-end of the incoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, to address performance problems we encountered in DeepLearning.scala 2.0 with [nd4j](http://nd4j.org/).
+ **Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPU, CPU and other devices. It will be the primary back-end of the incoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, to address performance problems we encountered in DeepLearning.scala 2.0 with [ND4J](http://ND4J.org/).

* Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computation.
* Compute.scala manages data buffers and other native resources in a deterministic manner, consuming less memory.
@@ -56,7 +56,7 @@ Check [Compute.scala on Scaladex](https://index.scala-lang.org/thoughtworksinc/c

### Creating an N-dimensional array

- Import different the namespace object `gpu` or `cpu`, according to the OpenCL runtime you want to use.
+ Import the types in the `gpu` or `cpu` object, according to the OpenCL runtime you want to use.

``` scala
// For N-dimensional array on GPU
@@ -68,7 +68,7 @@ import com.thoughtworks.compute.gpu._
import com.thoughtworks.compute.cpu._
```

- In Compute.scala, an N-dimensional array is typed as `Tensor`, which can be created from `Seq` or `scala.Array`.
+ In Compute.scala, an N-dimensional array is typed as `Tensor`, which can be created from a `Seq` or an `Array`.

``` scala
val my2DArray: Tensor = Tensor(Array(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
@@ -203,7 +203,7 @@ By combining pure `Tensor`s along with the impure `cache` mechanism, we achieved

A `Tensor` can be `split` into smaller `Tensor`s along a specific dimension.

- For example, given a 3D tensor whose `shape` is 2x3x4,
+ For example, given a 3D tensor whose `shape` is 2×3×4,

``` scala
val my3DTensor = Tensor((0.0f until 24.0f by 1.0f).grouped(4).toSeq.grouped(3).toSeq)
@@ -214,10 +214,10 @@ val Array(2, 3, 4) = my3DTensor.shape
when `split`ting it at dimension #0,

``` scala
- val subtensors0 = my3DTensor.split(dimension = 0)
+ val subtensors0: Seq[Tensor] = my3DTensor.split(dimension = 0)
```

- then the result should be a `Seq` of two 3x4 tensors.
+ then the result should be a `Seq` of two 3×4 tensors.

``` scala
// Output: TensorSeq([[0.0,1.0,2.0,3.0],[4.0,5.0,6.0,7.0],[8.0,9.0,10.0,11.0]], [[12.0,13.0,14.0,15.0],[16.0,17.0,18.0,19.0],[20.0,21.0,22.0,23.0]])
@@ -227,30 +227,121 @@ println(subtensors0)
When `split`ting it at dimension #1,

``` scala
- val subtensors1 = my3DTensor.split(dimension = 1)
+ val subtensors1: Seq[Tensor] = my3DTensor.split(dimension = 1)
```

- then the result should be a `Seq` of three 2x4 tensors.
+ then the result should be a `Seq` of three 2×4 tensors.

``` scala
// Output: TensorSeq([[0.0,1.0,2.0,3.0],[12.0,13.0,14.0,15.0]], [[4.0,5.0,6.0,7.0],[16.0,17.0,18.0,19.0]], [[8.0,9.0,10.0,11.0],[20.0,21.0,22.0,23.0]])
println(subtensors1)
```

- Then you can use arbitrary Scala collection functions on Seq of subtensors.
+ Then you can use arbitrary Scala collection functions on the `Seq` of subtensors.
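
For instance, here is a minimal sketch that sums the three 2×4 subtensors element-wise with an ordinary collection method; it relies only on the `reduce[Tensor](_ + _)` idiom that also appears in the matrix-multiplication example later in this commit, and the printed output assumes the same formatting as the examples above:

``` scala
// Element-wise sum of the three 2×4 subtensors; `reduce` already yields a
// single Tensor, so no join is needed afterwards.
val summed: Tensor = subtensors1.reduce[Tensor](_ + _)

// Expected output: [[12.0,15.0,18.0,21.0],[48.0,51.0,54.0,57.0]]
println(summed)
```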

#### `join`

- TODO
+ Multiple `Tensor`s of the same `shape` can be merged into a larger `Tensor` via the `Tensor.join` function.

- #### Fast matrix multiplication from `split` and `join`
+ Given a `Seq` of three 2×2 `Tensor`s,

- TODO
+ ``` scala
+ val mySubtensors: Seq[Tensor] = Seq(
+   Tensor(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f))),
+   Tensor(Seq(Seq(5.0f, 6.0f), Seq(7.0f, 8.0f))),
+   Tensor(Seq(Seq(9.0f, 10.0f), Seq(11.0f, 12.0f))),
+ )
+ ```
+
+ when `join`ing them,
+ ``` scala
+ val merged: Tensor = Tensor.join(mySubtensors)
+ ```
+
+ then the result should be a 2×2×3 `Tensor`.
+
+ ``` scala
+ // Output: [[[1.0,5.0,9.0],[2.0,6.0,10.0]],[[3.0,7.0,11.0],[4.0,8.0,12.0]]]
+ println(merged.toString)
+ ```
+
+ Generally, when `join`ing *n* `Tensor`s of shape *a*<sub>0</sub> × *a*<sub>1</sub> × *a*<sub>2</sub> × ⋯ × *a*<sub>*i*</sub>, the shape of the resulting `Tensor` is *a*<sub>0</sub> × *a*<sub>1</sub> × *a*<sub>2</sub> × ⋯ × *a*<sub>*i*</sub> × *n*.
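
For the `join` example above, this rule can be checked directly; a minimal sketch, reusing the pattern-matching style this README applies to `shape`:

``` scala
// Three 2×2 tensors joined together gain a trailing dimension of size 3.
val Array(2, 2, 3) = merged.shape
```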
+
+ #### Case study: fast matrix multiplication via `split` and `join`
+
+ By combining `split` and `join`, you can build complex computations in the following steps:
+
+ 1. Use `split` to create `Seq`s from some of the dimensions of `Tensor`s.
+ 2. Use Scala collection functions to manipulate those `Seq`s.
+ 3. Use `join` to merge the transformed `Seq` back into a `Tensor`.
+
+ For example, you can implement matrix multiplication in this style.
+
+ ``` scala
+ def matrixMultiply1(matrix1: Tensor, matrix2: Tensor): Tensor = {
+   val columns1 = matrix1.split(1)
+   val columns2 = matrix2.split(1)
+   val resultColumns = columns2.map { column2: Tensor =>
+     (columns1 zip column2.split(0))
+       .map {
+         case (l: Tensor, r: Tensor) =>
+           l * r.broadcast(l.shape)
+       }
+       .reduce[Tensor](_ + _)
+   }
+   Tensor.join(resultColumns)
+ }
+ ```
+
+ You can think of the Scala collection functions as a code generator for the kernel program: a loop over a Scala collection ends up as an unrolled loop in the generated kernel.
+
+ The above `matrixMultiply1` creates a kernel program that contains an unrolled loop over each row and column of `matrix2`, so it runs very fast when `matrix1` is big and `matrix2` is small. Our benchmark shows that `matrixMultiply1` runs 13 times faster than ND4J's cuBLAS back-end on a Titan X GPU, when `matrix1` is 65536×8 and `matrix2` is 8×8.
+
+ ---
+
+ You can also create another version of matrix multiplication, which only unrolls the loop over the rows of `matrix2`.
+
+ ``` scala
+ def matrixMultiply2(matrix1: Tensor, matrix2: Tensor): Tensor = {
+   val Array(i, j) = matrix1.shape
+   val Array(`j`, k) = matrix2.shape
+   val broadcastMatrix1 = matrix1.broadcast(Array(i, j, k))
+   val broadcastMatrix2 = matrix2.reshape(Array(1, j, k)).broadcast(Array(i, j, k))
+   val product = broadcastMatrix1 * broadcastMatrix2
+   product.split(1).reduce[Tensor](_ + _)
+ }
+ ```
+
+ `matrixMultiply2` will run faster than `matrixMultiply1` when `matrix1` is small.
+
+ A sophisticated matrix multiplication should dynamically switch between the two implementations according to the matrix size.
+
+ ``` scala
+ val UnrollThreshold = 4000
+
+ def matrixMultiply(matrix1: Tensor, matrix2: Tensor): Tensor = {
+   if (matrix1.shape.head >= UnrollThreshold) {
+     matrixMultiply1(matrix1, matrix2)
+   } else {
+     matrixMultiply2(matrix1, matrix2)
+   }
+ }
+ ```
+
+ The final version of `matrixMultiply` will have good performance for both small and big matrices.
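
As a usage sketch (the operand values and the expected output are illustrative assumptions; only the `Tensor` constructor shown earlier and the `matrixMultiply` defined above are taken from this README):

``` scala
// A 2×3 matrix times a 3×2 matrix. matrix1.shape.head == 2 < UnrollThreshold,
// so this call takes the matrixMultiply2 path.
val lhs: Tensor = Tensor(Seq(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
val rhs: Tensor = Tensor(Seq(Seq(1.0f, 0.0f), Seq(0.0f, 1.0f), Seq(1.0f, 1.0f)))

// Expected value: [[4.0,5.0],[10.0,11.0]]
println(matrixMultiply(lhs, rhs))
```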

## Benchmark

- * [Compute.scala vs Nd4j on NVIDIA GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
- * [Compute.scala on AMD GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)
+ * [Compute.scala vs ND4J on an NVIDIA Titan X GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
+ * [Compute.scala on an AMD RX480 GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)
+
+ Some observations from the benchmark results:
+
+ * Compute.scala supports both NVIDIA and AMD GPUs, while ND4J does not support AMD GPUs.
+ * Compute.scala is faster than ND4J on large arrays or complex expressions.
+ * ND4J is faster than Compute.scala when performing a single simple operation on very small arrays.
+ * ND4J's reduced sum is faster than Compute.scala's.
+ * ND4J's `permute` and `broadcast` are extremely slow, causing a very low score in the convolution benchmark (unlike this benchmark, Deeplearning4j's convolution operation internally uses some undocumented variant of `permute` and `broadcast` in ND4J, which is not extremely slow).

## Future work

Tensors/src/test/scala/com/thoughtworks/compute/TensorsSpec.scala

Lines changed: 5 additions & 5 deletions
@@ -337,17 +337,17 @@ class TensorsSpec extends AsyncFreeSpec with Matchers {
import tensors._

def matrixMultiply(matrix1: Tensor, matrix2: Tensor): Tensor = {
-
  val columns1 = matrix1.split(1)
-
-  Tensor.join(matrix2.split(1).map { column2: Tensor =>
-    (columns1 zip column2.split(0))
+  val columns2 = matrix2.split(1)
+  val resultColumns = columns2.map { column2: Tensor =>
+    (columns1.view zip column2.split(0))
      .map {
        case (l: Tensor, r: Tensor) =>
          l * r.broadcast(l.shape)
      }
      .reduce[Tensor](_ + _)
-  })
+  }
+  Tensor.join(resultColumns)
}

matrixMultiply(

cpu/src/main/scala/com/thoughtworks/compute/cpu.scala

Lines changed: 26 additions & 1 deletion
@@ -21,7 +21,9 @@ import org.lwjgl.opencl.CL10.CL_DEVICE_TYPE_CPU
* my2DArray.toString should be("[[1.0,2.0],[3.0,4.0]]")
* }}}
*
- * @example Given a 3D tensor whose `shape` is 2x3x4,
+ * @example A `Tensor` can be `split` into smaller `Tensor`s along a specific dimension.
+ *
+ * Given a 3D tensor whose `shape` is 2x3x4,
*
* {{{
* val my3DTensor = Tensor((0.0f until 24.0f by 1.0f).grouped(4).toSeq.grouped(3).toSeq)
@@ -65,6 +67,29 @@ import org.lwjgl.opencl.CL10.CL_DEVICE_TYPE_CPU
* }
* }}}
*
+ * @example Multiple `Tensor`s of the same `shape` can be merged into a larger `Tensor` via the `Tensor.join` function.
+ *
+ * Given a `Seq` of three 2x2 `Tensor`s,
+ *
+ * {{{
+ * val mySubtensors: Seq[Tensor] = Seq(
+ *   Tensor(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f))),
+ *   Tensor(Seq(Seq(5.0f, 6.0f), Seq(7.0f, 8.0f))),
+ *   Tensor(Seq(Seq(9.0f, 10.0f), Seq(11.0f, 12.0f))),
+ * )
+ * }}}
+ *
+ * when `join`ing them,
+ * {{{
+ * val merged: Tensor = Tensor.join(mySubtensors)
+ * }}}
+ *
+ * then the result should be a 2x2x3 `Tensor`.
+ *
+ * {{{
+ * merged.toString should be("[[[1.0,5.0,9.0],[2.0,6.0,10.0]],[[3.0,7.0,11.0],[4.0,8.0,12.0]]]")
+ * merged.shape should be(Array(2, 2, 3))
+ * }}}
*
*
*/
