
Commit 04890a8: Update README (1 parent 3926f9f)

1 file changed: README.md (161 additions, 4 deletions)
**Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPU, CPU and other devices. It will be the primary back-end of the upcoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, addressing the performance problems we encountered with [nd4j](http://nd4j.org/) in DeepLearning.scala 2.0.

* Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computations.
* Compute.scala manages data buffers and other native resources in a determinate way, consuming less memory.
* All dimensional transformation operators (`permute`, `broadcast`, `reshape`, etc.) in Compute.scala are views, with no additional data buffer allocation (see the sketch after this list).
* N-dimensional arrays in Compute.scala can be converted from / to JVM collections, which support higher-order functions like `map` / `reduce`, and can still run on the GPU.
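
As a quick illustration of the view behaviour described above, here is a minimal sketch that is not taken from the original README: it assumes `permute` takes the permuted dimension indices and `reshape` takes the new shape, both as an `Array[Int]`, so check the Scaladoc for the exact signatures.

``` scala
import com.thoughtworks.compute.cpu._

val matrix: Tensor = Tensor(Seq(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))

// Both calls below should create views over `matrix`;
// no additional data buffer is expected to be allocated for them.
val transposed = matrix.permute(Array(1, 0)) // assumed signature: permuted dimension indices
val flattened = matrix.reshape(Array(6))     // assumed signature: the new shape

println(transposed.shape.toList) // expected: List(3, 2)
println(flattened.shape.toList)  // expected: List(6)
```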

## Getting started

``` scala
libraryDependencies += "com.thoughtworks.compute" %% "cpu" % "latest.release"

libraryDependencies += "com.thoughtworks.compute" %% "gpu" % "latest.release"

// Platform dependent runtime of LWJGL core library
libraryDependencies += ("org.lwjgl" % "lwjgl" % "latest.release").jar().classifier {
  import scala.util.Properties._
  if (isMac) {
    "natives-macos"
  } // ... else branches for other platforms
}
```

Check [Compute.scala on Scaladex](https://index.scala-lang.org/thoughtworksinc/compute.scala) and the [LWJGL customize tool](https://www.lwjgl.org/customize) for settings for Maven, Gradle, and other build tools.

### Creating an N-dimensional array

Import the namespace object `gpu` or `cpu`, depending on which OpenCL runtime you want to use.

``` scala
// For N-dimensional array on GPU
import com.thoughtworks.compute.gpu._
```

``` scala
// For N-dimensional array on CPU
import com.thoughtworks.compute.cpu._
```

In Compute.scala, an N-dimensional array has the type `Tensor`, which can be created from a `Seq` or a `scala.Array`.

``` scala
val my2DArray: Tensor = Tensor(Array(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
```

If you print out `my2DArray`,

``` scala
println(my2DArray)
```

then the output should be

```
[[1.0,2.0,3.0],[4.0,5.0,6.0]]
```

You can also inspect the number of dimensions and the size of each dimension using the `shape` method.

``` scala
// Output 2 because my2DArray is a 2D array.
println(my2DArray.shape.length)

// Output 2 because the size of the first dimension of my2DArray is 2.
println(my2DArray.shape(0)) // 2

// Output 3 because the size of the second dimension of my2DArray is 3.
println(my2DArray.shape(1)) // 3
```

So `my2DArray` is a 2D array of size 2x3.

#### Scalar value

Note that a `Tensor` can be a zero-dimensional array, which is simply a scalar value.

``` scala
val scalar = Tensor(42.0f)
println(scalar.shape.length) // 0
```

### Element-wise operators

Element-wise operators are applied to each element of the `Tensor` operands.

``` scala
val plus100 = my2DArray + Tensor.fill(100.0f, Array(2, 3))

println(plus100) // Output [[101.0,102.0,103.0],[104.0,105.0,106.0]]
```
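
The same element-wise rule applies to the other arithmetic operators. The sketch below is not part of the original README; it only reuses the `*` operator that also appears in the Design section further down.

``` scala
// Element-wise multiplication: each element is multiplied by the
// corresponding element of the other operand.
val squared = my2DArray * my2DArray

println(squared) // expected output: [[1.0,4.0,9.0],[16.0,25.0,36.0]]
```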

## Design

### Lazy evaluation

`Tensor`s in Compute.scala are immutable and lazily evaluated. All operators that create `Tensor`s are pure: they allocate no data buffers and do not execute any time-consuming tasks. The actual computation is performed only when the final result is requested.

For example:

``` scala
val a = Tensor(Seq(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
val b = Tensor(Seq(Seq(7.0f, 8.0f, 9.0f), Seq(10.0f, 11.0f, 12.0f)))
val c = Tensor(Seq(Seq(13.0f, 14.0f, 15.0f), Seq(16.0f, 17.0f, 18.0f)))

val result: InlineTensor = a * b + c
```

All of the `Tensor`s, including `a`, `b`, `c` and `result`, are small JVM objects; no computation has been performed so far.

``` scala
println(result.toString)
```

When `result.toString` is called, Compute.scala compiles the expression `a * b + c` into one kernel program and executes it.

Both `result` and the temporary variable `a * b` are `InlineTensor`s, indicating that their computation can be inlined into a more complex kernel program.

This approach is faster than in other libraries because we don't have to execute two separate kernels for the multiplication and the addition.

Check the [Scaladoc](https://javadoc.io/page/com.thoughtworks.compute/tensors_2.12/latest/com/thoughtworks/compute/Tensors$Tensor.html) to see which operators return `InlineTensor` or its subtype `TransformedTensor`, which can be inlined into a more complex kernel program as well.

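
For instance, the following sketch (not from the original README) combines an element-wise operation with the `permute` operator mentioned in the feature list; the `permute` signature and the exact return types are assumptions, so check the Scaladoc before relying on them. The point is that the view and the arithmetic are expected to be fused into a single kernel program.

``` scala
// `a * b` is an InlineTensor according to the paragraph above.
val product: InlineTensor = a * b

// Assumed to be a transformed view of `product`; no computation happens here.
val transposedProduct = product.permute(Array(1, 0))

// Only now is one kernel program compiled and executed, covering both the
// element-wise multiplication and the permuted read.
println(transposedProduct)
```
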
#### Caching

By default, when `result.toString` is called more than once, the expression `a * b + c` is executed more than once.

``` scala
println(result.toString)

// The computation is performed, again.
println(result.toString)
```

Fortunately, Compute.scala provides a `cache` method that eagerly fills a `NonInlineTensor` and keeps the filled data for reuse. You can convert `result` to a `NonInlineTensor`, which has a corresponding non-inline kernel program.

``` scala
val nonInlineTensor = result.nonInline
val cache = nonInlineTensor.cache

try {
  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  val tmp: InlineTensor = exp(nonInlineTensor)

  // The cache for nonInlineTensor is reused, but the exponential function is performed.
  println(tmp.toString)

  // The cache for nonInlineTensor is reused, but the exponential function is performed, again.
  println(tmp.toString)
} finally {
  cache.close()
}

// (a * b + c) is performed again because the cache for nonInlineTensor has been closed.
println(nonInlineTensor.toString)
```

The data buffer allocated for `nonInlineTensor` is kept until `cache.close()` is invoked.

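
If you prefer not to write the `try` / `finally` by hand, a small loan-pattern helper can scope the cache to a block. The helper below is not part of Compute.scala; it is a sketch built only on the `nonInline`, `cache` and `close()` calls shown above, and it assumes the `NonInlineTensor` type is accessible from the imported namespace object.

``` scala
// Hypothetical helper: acquire the cache, run the block, always release the cache.
def withCache[A](tensor: NonInlineTensor)(body: NonInlineTensor => A): A = {
  val cache = tensor.cache
  try body(tensor)
  finally cache.close()
}

// Usage: the cached buffer is reused inside the block and released afterwards.
withCache(result.nonInline) { t =>
  println(t.toString)
  println(exp(t).toString) // the cache is reused; only the exponential is computed
}
```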

By combining pure `Tensor`s with the impure `cache` mechanism, we achieve the following goals:

* All `Tensor`s are pure and allocate no data buffers when they are created.
* The computations of `Tensor`s can be merged together, minimizing the number of intermediate data buffers and kernel programs.
* Developers can create `cache`s for `Tensor`s, as a determinate way to manage the life-cycle of native resources.

### Scala collection interoperability

TODO

## Benchmark

* [Compute.scala vs Nd4j on NVIDIA GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
* [Compute.scala on AMD GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)

## Future work

This project is currently only a minimum viable product. Many important features are still under development:

* Support for tensors of element types other than single-precision floating-point ([#104](https://github.com/ThoughtWorksInc/Compute.scala/issues/104)).
* Add more OpenCL math functions ([#101](https://github.com/ThoughtWorksInc/Compute.scala/issues/101)).
* Further optimization of performance ([#62, #103](https://github.com/ThoughtWorksInc/Compute.scala/labels/performance)).

Contributions are welcome. Check the [good first issues](https://github.com/ThoughtWorksInc/Compute.scala/labels/good%20first%20issue) to start hacking.
