
Commit 62fa450

Merge pull request #126 from Atry/readme
Add System requirement and Project setup sections in README
2 parents 592da13 + 11f525b commit 62fa450

File tree: 6 files changed, +367 −13 lines changed

README.md

Lines changed: 248 additions & 6 deletions
@@ -1,21 +1,263 @@
# Compute.scala

**Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPU, CPU and other devices. It will be the primary back-end of the upcoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, to address performance problems we encountered in DeepLearning.scala 2.0 with [nd4j](http://nd4j.org/).

* Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computations.
* Compute.scala manages data buffers and other native resources in a deterministic manner, consuming less memory.
* All dimensional transformation operators (`permute`, `broadcast`, `reshape`, etc.) in Compute.scala are views, with no additional data buffer allocation.
* N-dimensional arrays in Compute.scala can be converted from / to JVM collections, which support higher-order functions like `map` / `reduce`, and can still run on GPU.

## Getting started

### System Requirements

Compute.scala is based on [LWJGL 3](https://www.lwjgl.org/)'s OpenCL binding, which supports AMD, NVIDIA and Intel GPUs and CPUs on Linux, Windows and macOS.

Make sure you have met the following system requirements before using Compute.scala.

* Linux, Windows or macOS
* JDK 8
* An OpenCL runtime

The performance of Compute.scala varies according to which OpenCL runtime you are using. For best performance, install an OpenCL runtime according to the following table.

|  | Linux | Windows | macOS |
| --- | --- | --- | --- |
| NVIDIA GPU | [NVIDIA GPU Driver](http://www.nvidia.com/drivers) | [NVIDIA GPU Driver](http://www.nvidia.com/drivers) | macOS's built-in OpenCL SDK |
| AMD GPU | [AMDGPU-PRO Driver](https://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Driver-for-Linux-Release-Notes.aspx) | [AMD OpenCL™ 2.0 Driver](https://support.amd.com/en-us/kb-articles/Pages/OpenCL2-Driver.aspx) | macOS's built-in OpenCL SDK |
| Intel or AMD CPU | [POCL](http://portablecl.org/) | [POCL](http://portablecl.org/) | [POCL](http://portablecl.org/) |

In particular, Compute.scala produces non-vectorized code, which relies on POCL's auto-vectorization feature for best performance when running on a CPU.

### Project setup

The artifacts of Compute.scala are published on Maven Central for Scala 2.11 and 2.12. Add the following settings to your `build.sbt` if you are using [sbt](https://www.scala-sbt.org/).

``` sbt
libraryDependencies += "com.thoughtworks.compute" %% "cpu" % "latest.release"
libraryDependencies += "com.thoughtworks.compute" %% "gpu" % "latest.release"

// Platform-dependent native runtime of the LWJGL core library
libraryDependencies += ("org.lwjgl" % "lwjgl" % "latest.release").jar().classifier {
  import scala.util.Properties._
  if (isMac) {
    "natives-macos"
  } else if (isLinux) {
    "natives-linux"
  } else if (isWin) {
    "natives-windows"
  } else {
    throw new MessageOnlyException(s"lwjgl does not support $osName")
  }
}
```

Check [Compute.scala on Scaladex](https://index.scala-lang.org/thoughtworksinc/compute.scala) and the [LWJGL customize tool](https://www.lwjgl.org/customize) for settings for Maven, Gradle and other build tools.

### Creating an N-dimensional array

Import the namespace object `gpu` or `cpu`, depending on the OpenCL runtime you want to use.

``` scala
// For N-dimensional arrays on GPU
import com.thoughtworks.compute.gpu._
```

``` scala
// For N-dimensional arrays on CPU
import com.thoughtworks.compute.cpu._
```

In Compute.scala, an N-dimensional array is typed as `Tensor`, which can be created from a `Seq` or a `scala.Array`.

``` scala
val my2DArray: Tensor = Tensor(Array(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
```

If you print out `my2DArray`,

``` scala
println(my2DArray)
```

then the output should be

```
[[1.0,2.0,3.0],[4.0,5.0,6.0]]
```

You can also use the `shape` method to get the number of dimensions and the size of each dimension.

``` scala
// Output 2 because my2DArray is a 2D array.
println(my2DArray.shape.length)

// Output 2 because the size of the first dimension of my2DArray is 2.
println(my2DArray.shape(0)) // 2

// Output 3 because the size of the second dimension of my2DArray is 3.
println(my2DArray.shape(1)) // 3
```

So `my2DArray` is a 2D array of size 2x3.

#### Scalar value

Note that a `Tensor` can also be a zero-dimensional array, which is simply a scalar value.

``` scala
val scalar = Tensor(42.0f)
println(scalar.shape.length) // 0
```

### Element-wise operators

Element-wise operators are performed on each element of the `Tensor` operands.

``` scala
val plus100 = my2DArray + Tensor.fill(100.0f, Array(2, 3))

println(plus100) // Output: [[101.0,102.0,103.0],[104.0,105.0,106.0]]
```

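Other element-wise operators follow the same pattern. Below is a minimal sketch combining the tensors above with the `*` operator and the `exp` function, both of which also appear in the Design section below; the variable names are only for illustration, and the printed values assume the 2x3 tensors defined above.

``` scala
// Element-wise multiplication of two tensors of the same shape.
val product = my2DArray * plus100
println(product) // Output: [[101.0,204.0,309.0],[416.0,525.0,636.0]]

// Element-wise math functions, e.g. the exponential function.
val exponential = exp(my2DArray)
println(exponential)
```
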
## Design

### Lazy-evaluation

`Tensor`s in Compute.scala are immutable and lazily evaluated. All operators that create `Tensor`s are pure: they allocate no data buffers and do not execute any time-consuming tasks. The actual computation is only performed when the final result is requested.

For example:

``` scala
val a = Tensor(Seq(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
val b = Tensor(Seq(Seq(7.0f, 8.0f, 9.0f), Seq(10.0f, 11.0f, 12.0f)))
val c = Tensor(Seq(Seq(13.0f, 14.0f, 15.0f), Seq(16.0f, 17.0f, 18.0f)))

val result: InlineTensor = a * b + c
```

All of the `Tensor`s, including `a`, `b`, `c` and `result`, are small JVM objects, and no computation has been performed so far.

``` scala
println(result.toString)
```

When `result.toString` is called, Compute.scala compiles the expression `a * b + c` into one kernel program and executes it.

Both `result` and the temporary variable `a * b` are `InlineTensor`s, indicating that their computation can be inlined into a more complex kernel program.

This approach is faster than other libraries because we don't have to execute two separate kernels for the multiplication and the addition.

Check the [Scaladoc](https://javadoc.io/page/com.thoughtworks.compute/tensors_2.12/latest/com/thoughtworks/compute/Tensors$Tensor.html) to see which operators return `InlineTensor` or its subtype `TransformedTensor`, which can be inlined into a more complex kernel program as well.

#### Caching

By default, when `result.toString` is called more than once, the expression `a * b + c` is executed more than once.

``` scala
println(result.toString)

// The computation is performed, again
println(result.toString)
```

Fortunately, we provide a `cache` method that eagerly fills a `NonInlineTensor` and keeps the filled data for reuse. You can convert `result` to a `NonInlineTensor`, which has a corresponding non-inline kernel program.

``` scala
val nonInlineTensor = result.nonInline
val cache = nonInlineTensor.cache

try {
  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  val tmp: InlineTensor = exp(nonInlineTensor)

  // The cache for nonInlineTensor is reused, but the exponential function is performed.
  println(tmp.toString)

  // The cache for nonInlineTensor is reused, but the exponential function is performed, again.
  println(tmp.toString)
} finally {
  cache.close()
}

// (a * b + c) is performed because the cache for nonInlineTensor has been closed.
println(nonInlineTensor.toString)
```

The data buffer allocated for `nonInlineTensor` is kept until `cache.close()` is invoked.

By combining pure `Tensor`s with the impure `cache` mechanism, we achieve the following goals:

* All `Tensor`s are pure, allocating zero data buffers when created.
* The computation of `Tensor`s can be merged, minimizing the number of intermediate data buffers and kernel programs.
* Developers can create `cache`s for `Tensor`s, as a deterministic way to manage the life-cycle of native resources.

### Scala collection interoperability

#### `split`

A `Tensor` can be `split` into a sequence of smaller `Tensor`s along a specific dimension.

For example, given a 3D tensor whose `shape` is 2x3x4,

``` scala
val my3DTensor = Tensor((0.0f until 24.0f by 1.0f).grouped(4).toSeq.grouped(3).toSeq)

val Array(2, 3, 4) = my3DTensor.shape
```

when you `split` it at dimension #0,

``` scala
val subtensors0 = my3DTensor.split(dimension = 0)
```

then the result should be a `Seq` of two 3x4 tensors.

``` scala
// Output: TensorSeq([[0.0,1.0,2.0,3.0],[4.0,5.0,6.0,7.0],[8.0,9.0,10.0,11.0]], [[12.0,13.0,14.0,15.0],[16.0,17.0,18.0,19.0],[20.0,21.0,22.0,23.0]])
println(subtensors0)
```

When you `split` it at dimension #1,

``` scala
val subtensors1 = my3DTensor.split(dimension = 1)
```

then the result should be a `Seq` of three 2x4 tensors.

``` scala
// Output: TensorSeq([[0.0,1.0,2.0,3.0],[12.0,13.0,14.0,15.0]], [[4.0,5.0,6.0,7.0],[16.0,17.0,18.0,19.0]], [[8.0,9.0,10.0,11.0],[20.0,21.0,22.0,23.0]])
println(subtensors1)
```

You can then use arbitrary Scala collection functions on the `Seq` of subtensors.
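For instance, here is a minimal sketch combining the subtensors with an ordinary `Seq` operation and the element-wise `+` operator shown earlier; the variable name is only for illustration, and the expected output assumes the 2x3x4 tensor defined above.

``` scala
// Sum the two 3x4 subtensors element-wise, using Seq's reduce.
val total: Tensor = subtensors0.reduce(_ + _)

// Output: [[12.0,14.0,16.0,18.0],[20.0,22.0,24.0,26.0],[28.0,30.0,32.0,34.0]]
println(total)
```
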
#### `join`

TODO

#### Fast matrix multiplication from `split` and `join`

TODO

## Benchmark

* [Compute.scala vs Nd4j on NVIDIA GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
* [Compute.scala on AMD GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)

## Future work

Currently, this project is only a minimum viable product. Many important features are still under development:

* Support tensors of elements other than single-precision floating-point ([#104](https://github.com/ThoughtWorksInc/Compute.scala/issues/104)).
* Add more OpenCL math functions ([#101](https://github.com/ThoughtWorksInc/Compute.scala/issues/101)).
* Further optimization of performance ([#62, #103](https://github.com/ThoughtWorksInc/Compute.scala/labels/performance)).

Contributions are welcome. Check the [good first issues](https://github.com/ThoughtWorksInc/Compute.scala/labels/good%20first%20issue) to start hacking.

Tensors/build.sbt

Lines changed: 2 additions & 0 deletions

@@ -24,3 +24,5 @@ libraryDependencies += "com.github.ghik" %% "silencer-lib" % "0.6"
libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.2.3" % Test

fork in Test := true

enablePlugins(Example)

Tensors/src/main/scala/com/thoughtworks/compute/Tensors.scala

Lines changed: 18 additions & 3 deletions

@@ -768,7 +768,18 @@ trait Tensors extends OpenCL {
    } with InlineTensor
  }

  /** Returns a new [[Tensor]] of the new shape, sharing the same data as this [[Tensor]].
    *
    * @note The data in this [[Tensor]] is considered to be in row-major order when [[reshape]] is performed.
    *
    *       You can create a column-major version of reshape by reversing the shape:
    *
    *       {{{
    *       def columnMajorReshape[Category <: Tensors](tensor: Category#Tensor, newShape: Array[Int]): Category#Tensor = {
    *         tensor.permute(tensor.shape.indices.reverse.toArray).reshape(newShape.reverse).permute(newShape.indices.reverse.toArray)
    *       }
    *       }}}
    *
    * @group delayed
    */
  def reshape(newShape: Array[Int]): NonInlineTensor = {

@@ -996,7 +1007,9 @@ trait Tensors extends OpenCL {

  private[compute] def doBuffer: Do[PendingBuffer[closure.JvmValue]]

  /** Returns a RAII-managed asynchronous task to read this [[Tensor]] into off-heap memory,
    * which is linearized in row-major order.
    *
    * @group slow
    */
  def flatBuffer: Do[FloatBuffer] = {

@@ -1011,7 +1024,9 @@
    }
  }

  /** Returns an asynchronous task to read this [[Tensor]] into a [[scala.Array]],
    * which is linearized in row-major order.
    *
    * @group slow
    */
  def flatArray: Future[Array[closure.JvmValue]] = {

cpu/build.sbt

Lines changed: 25 additions & 0 deletions

@@ -1 +1,26 @@
enablePlugins(Example)

import scala.meta._
exampleSuperTypes := exampleSuperTypes.value.map {
  case ctor"_root_.org.scalatest.FreeSpec" =>
    ctor"_root_.org.scalatest.AsyncFreeSpec"
  case otherTrait =>
    otherTrait
}

exampleSuperTypes += ctor"_root_.org.scalatest.Inside"

libraryDependencies += ("org.lwjgl" % "lwjgl" % "3.1.6" % Test).jar().classifier {
  import scala.util.Properties._
  if (isMac) {
    "natives-macos"
  } else if (isLinux) {
    "natives-linux"
  } else if (isWin) {
    "natives-windows"
  } else {
    throw new MessageOnlyException(s"lwjgl does not support $osName")
  }
}

fork := true
