**Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPUs, CPUs and other devices. It will be the primary back-end of the upcoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, addressing the performance problems we encountered in DeepLearning.scala 2.0 with [nd4j](http://nd4j.org/).

* Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computations.
* Compute.scala manages data buffers and other native resources deterministically, consuming less memory.
* All dimensional transformation operators (`permute`, `broadcast`, `reshape`, etc.) in Compute.scala are views, with no additional data buffer allocation.
* N-dimensional arrays in Compute.scala can be converted from / to JVM collections, which support higher-order functions like `map` / `reduce`, and can still run on the GPU.

Check [Compute.scala on Scaladex](https://index.scala-lang.org/thoughtworksinc/compute.scala) and [LWJGL customize tool](https://www.lwjgl.org/customize) for settings for Maven, Gradle and other build tools.
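
For sbt, the dependency would look roughly like the following sketch; the artifact name (`gpu` vs `cpu`) and the version placeholder are assumptions here, so confirm the exact coordinates on Scaladex:

```scala
// Hypothetical sbt setting: pick the artifact matching the namespace object you import
// (gpu or cpu), and replace "latest.release" with a concrete version from Scaladex.
libraryDependencies += "com.thoughtworks.compute" %% "gpu" % "latest.release"
```
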
### Creating an N-dimensional array
Import the namespace object `gpu` or `cpu`, depending on the OpenCL runtime you want to use.
```scala
// For N-dimensional arrays on GPU
import com.thoughtworks.compute.gpu._
```
```scala
// For N-dimensional arrays on CPU
import com.thoughtworks.compute.cpu._
```
In Compute.scala, an N-dimensional array is typed as `Tensor`, which can be created from `Seq` or `scala.Array`.
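
For example, a small 2×2 `Tensor` can be built from nested sequences. This is a minimal sketch with made-up values; check the Scaladoc of the `Tensor` companion object for the exact factory overloads:

```scala
// A 2x2 N-dimensional array built from nested Seqs (hypothetical values).
val my2DArray = Tensor(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f)))
```
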
`Tensor`s in Compute.scala are immutable and lazily evaluated. All operators that create `Tensor`s are pure: they allocate no data buffers and do not execute any time-consuming tasks. The actual computation is performed only when the final result is requested.
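
The following sketch defines the `a`, `b`, `c` and `result` used below, with made-up shapes and values:

```scala
// Hypothetical input tensors of the same shape; no data buffer is allocated here.
val a = Tensor(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f)))
val b = Tensor(Seq(Seq(5.0f, 6.0f), Seq(7.0f, 8.0f)))
val c = Tensor(Seq(Seq(9.0f, 10.0f), Seq(11.0f, 12.0f)))

// A lazily evaluated expression; no kernel program has been compiled or executed yet.
val result: InlineTensor = a * b + c
```
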
All the `Tensor`s, including `a`, `b`, `c` and `result`, are small JVM objects, and no computation has been performed so far.
```scala
println(result.toString)
```
When `result.toString` is called, Compute.scala compiles the expression `a * b + c` into one kernel program and executes it.

Both `result` and the temporary variable `a * b` are `InlineTensor`s, indicating that their computation can be inlined into a more complex kernel program.

This approach is faster than in other libraries because we don't have to execute two separate kernels for the multiplication and the addition.

Check the [Scaladoc](https://javadoc.io/page/com.thoughtworks.compute/tensors_2.12/latest/com/thoughtworks/compute/Tensors$Tensor.html) to see which operators return `InlineTensor` or its subtype `TransformedTensor`, both of which can be inlined into a more complex kernel program.

#### Caching
By default, when `result.toString` is called more than once, the expression `a * b + c` is executed more than once.
```scala
println(result.toString)

// The computation is performed, again
println(result.toString)
```
Fortunately, we provide a `cache` method that eagerly fills a `NonInlineTensor` and keeps the filled data for reuse. You can convert `result` to a `NonInlineTensor`, which has a corresponding non-inline kernel program.
```scala
val nonInlineTensor = result.nonInline
val cache = nonInlineTensor.cache

try {
  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  val tmp: InlineTensor = exp(nonInlineTensor)

  // The cache for nonInlineTensor is reused, but the exponential function is performed.
  println(tmp.toString)

  // The cache for nonInlineTensor is reused, but the exponential function is performed, again.
  println(tmp.toString)
} finally {
  cache.close()
}

// (a * b + c) is performed because the cache for nonInlineTensor has been closed.
println(nonInlineTensor.toString)
```
The data buffer allocated for `nonInlineTensor` is kept until `cache.close()` is invoked.

By combining pure `Tensor`s with the impure `cache` mechanism, we achieve the following goals:

* All `Tensor`s are pure, with zero data buffer allocation when they are created.
* The computations of `Tensor`s can be merged together, minimizing the number of intermediate data buffers and kernel programs.
* Developers can create `cache`s for `Tensor`s as a deterministic way to manage the life-cycle of native resources.

### Scala collection interoperability
TODO
## Benchmark
* [Compute.scala vs Nd4j on NVIDIA GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
* [Compute.scala on AMD GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)

## Future work
This project is currently only a minimum viable product. Many important features are still under development:

* Support tensors of elements other than single-precision floating-point ([#104](https://github.com/ThoughtWorksInc/Compute.scala/issues/104)).
* Add more OpenCL math functions ([#101](https://github.com/ThoughtWorksInc/Compute.scala/issues/101)).
* Further optimization of performance ([#62, #103](https://github.com/ThoughtWorksInc/Compute.scala/labels/performance)).

Contributions are welcome. Check [good first issues](https://github.com/ThoughtWorksInc/Compute.scala/labels/good%20first%20issue) to start hacking.