# Compute.scala

**Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPU, CPU and other devices. It will be the primary back-end of the incoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, to address performance problems we encountered in DeepLearning.scala 2.0 with [nd4j](http://nd4j.org/).

 * Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computation.
 * Compute.scala manages data buffers and other native resources in a determinate approach, consuming less memory.
 * All dimensional transformation operators (`permute`, `broadcast`, `reshape`, etc.) in Compute.scala are views, with no additional data buffer allocation.
 * N-dimensional arrays in Compute.scala can be converted from / to JVM collections, which support higher-order functions like `map` / `reduce`, and still run on the GPU.

## Getting started

### System requirements

Compute.scala is based on [LWJGL 3](https://www.lwjgl.org/)'s OpenCL binding, which supports AMD, NVIDIA and Intel GPUs and CPUs on Linux, Windows and macOS.

Make sure you have met the following system requirements before using Compute.scala.

 * Linux, Windows or macOS
 * JDK 8
 * An OpenCL runtime

The performance of Compute.scala varies according to which OpenCL runtime you are using. For best performance, install an OpenCL runtime according to the following table.

|  | Linux | Windows | macOS |
| --- | --- | --- | --- |
| NVIDIA GPU | [NVIDIA GPU Driver](http://www.nvidia.com/drivers) | [NVIDIA GPU Driver](http://www.nvidia.com/drivers) | macOS's built-in OpenCL SDK |
| AMD GPU | [AMDGPU-PRO Driver](https://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Driver-for-Linux-Release-Notes.aspx) | [AMD OpenCL™ 2.0 Driver](https://support.amd.com/en-us/kb-articles/Pages/OpenCL2-Driver.aspx) | macOS's built-in OpenCL SDK |
| Intel or AMD CPU | [POCL](http://portablecl.org/) | [POCL](http://portablecl.org/) | [POCL](http://portablecl.org/) |

In particular, Compute.scala produces non-vectorized code, which relies on POCL's auto-vectorization feature for best performance when running on a CPU.

### Project setup

The artifacts of Compute.scala are published to the Maven central repository for Scala 2.11 and 2.12. Add the following settings to your `build.sbt` if you are using [sbt](https://www.scala-sbt.org/).

``` sbt
libraryDependencies += "com.thoughtworks.compute" %% "cpu" % "latest.release"
libraryDependencies += "com.thoughtworks.compute" %% "gpu" % "latest.release"

// Platform-dependent native runtime of the LWJGL core library
libraryDependencies += ("org.lwjgl" % "lwjgl" % "latest.release").jar().classifier {
  import scala.util.Properties._
  if (isMac) {
    "natives-macos"
  } else if (isLinux) {
    "natives-linux"
  } else if (isWin) {
    "natives-windows"
  } else {
    throw new MessageOnlyException(s"lwjgl does not support $osName")
  }
}
```

Check [Compute.scala on Scaladex](https://index.scala-lang.org/thoughtworksinc/compute.scala) and the [LWJGL customize tool](https://www.lwjgl.org/customize) for settings for Maven, Gradle and other build tools.

### Creating an N-dimensional array

Import the namespace object `gpu` or `cpu`, according to the OpenCL runtime you want to use.

``` scala
// For N-dimensional arrays on GPU
import com.thoughtworks.compute.gpu._
```

``` scala
// For N-dimensional arrays on CPU
import com.thoughtworks.compute.cpu._
```

In Compute.scala, an N-dimensional array is typed as `Tensor`, which can be created from a `Seq` or a `scala.Array`.

``` scala
val my2DArray: Tensor = Tensor(Array(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
```

If you print out `my2DArray`,

``` scala
println(my2DArray)
```

then the output should be

```
[[1.0,2.0,3.0],[4.0,5.0,6.0]]
```

You can also print the number of dimensions and the size of each dimension using the `shape` method.

``` scala
// Output 2 because my2DArray is a 2D array.
println(my2DArray.shape.length)

// Output 2 because the size of the first dimension of my2DArray is 2.
println(my2DArray.shape(0)) // 2

// Output 3 because the size of the second dimension of my2DArray is 3.
println(my2DArray.shape(1)) // 3
```

So `my2DArray` is a 2D array of size 2x3.
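
The whole `shape` is an ordinary Scala array of sizes (the `split` example later in this document pattern matches it the same way), so you can destructure it directly. A small sketch using only what is shown above:

``` scala
// shape is a plain Array[Int], so it can be destructured with a pattern match.
val Array(numberOfRows, numberOfColumns) = my2DArray.shape
println(numberOfRows)    // 2
println(numberOfColumns) // 3
```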

#### Scalar value

Note that a `Tensor` can be a zero-dimensional array, which is simply a scalar value.

``` scala
val scalar = Tensor(42.0f)
println(scalar.shape.length) // 0
```

### Element-wise operators

Element-wise operators are performed on each element of the `Tensor` operands.

``` scala
val plus100 = my2DArray + Tensor.fill(100.0f, Array(2, 3))

println(plus100) // Output: [[101.0,102.0,103.0],[104.0,105.0,106.0]]
```
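
Element-wise math functions work the same way. For example, `exp` (which also appears in the caching example below) is applied to each element. A short sketch using the tensor defined above:

``` scala
// exp is applied to every element; the result keeps the 2x3 shape of my2DArray.
val exponential = exp(my2DArray)
println(exponential.shape.toList) // List(2, 3)
```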

## Design

### Lazy-evaluation

`Tensor`s in Compute.scala are immutable and lazily evaluated. All operators that create `Tensor`s are pure: they allocate no data buffers and do not execute any time-consuming tasks. The actual computation is only performed when the final result is requested.

For example:

``` scala
val a = Tensor(Seq(Seq(1.0f, 2.0f, 3.0f), Seq(4.0f, 5.0f, 6.0f)))
val b = Tensor(Seq(Seq(7.0f, 8.0f, 9.0f), Seq(10.0f, 11.0f, 12.0f)))
val c = Tensor(Seq(Seq(13.0f, 14.0f, 15.0f), Seq(16.0f, 17.0f, 18.0f)))

val result: InlineTensor = a * b + c
```

All the `Tensor`s, including `a`, `b`, `c` and `result`, are small JVM objects; no computation has been performed so far.

``` scala
println(result.toString)
```

When `result.toString` is called, Compute.scala compiles the expression `a * b + c` into one kernel program and executes it.

Both `result` and the intermediate value `a * b` are `InlineTensor`s, indicating that their computation can be inlined into a more complex kernel program.

This approach is faster than other libraries because we don't have to execute two separate kernels for the multiplication and the addition.
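
For instance, a longer expression built only from operators already shown in this document is still an `InlineTensor`, so it should be compiled into a single kernel program as well when its value is finally requested. A small sketch, reusing `a`, `b` and `c` from above:

``` scala
// Multiplication, addition and exp are all inlineable, so this whole expression
// can be merged into one kernel program instead of one kernel per operator.
val fused: InlineTensor = exp(a * b + c) * a

// The fused kernel is compiled and executed only here, when the result is requested.
println(fused)
```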

Check the [Scaladoc](https://javadoc.io/page/com.thoughtworks.compute/tensors_2.12/latest/com/thoughtworks/compute/Tensors$Tensor.html) to see which operators return `InlineTensor` or its subtype `TransformedTensor`, both of which can be inlined into a more complex kernel program.

#### Caching

By default, when `result.toString` is called more than once, the expression `a * b + c` is executed more than once.

``` scala
println(result.toString)

// The computation is performed, again
println(result.toString)
```

Fortunately, we provide a `cache` method that eagerly evaluates a `NonInlineTensor` and keeps the resulting data for reuse. You can convert `result` to a `NonInlineTensor`, which has a corresponding non-inline kernel program.

``` scala
val nonInlineTensor = result.nonInline
val cache = nonInlineTensor.cache

try {
  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  val tmp: InlineTensor = exp(nonInlineTensor)

  // The cache for nonInlineTensor is reused, but the exponential function is performed.
  println(tmp.toString)

  // The cache for nonInlineTensor is reused, but the exponential function is performed, again.
  println(tmp.toString)
} finally {
  cache.close()
}

// (a * b + c) is performed because the cache for nonInlineTensor has been closed.
println(nonInlineTensor.toString)
```

The data buffer allocated for `nonInlineTensor` is kept until `cache.close()` is invoked.
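
Because the cache must be released explicitly, a loan-pattern helper can make the life cycle harder to get wrong. This is only a sketch, not part of the library, assuming the `NonInlineTensor` type is accessible from the same namespace import, as `InlineTensor` is in the examples above:

``` scala
// Evaluate a non-inline tensor once, run a block against it, and always release
// the cached data buffer afterwards, even if the block throws.
def withCache[A](tensor: NonInlineTensor)(block: NonInlineTensor => A): A = {
  val cache = tensor.cache
  try block(tensor)
  finally cache.close()
}

withCache(result.nonInline) { cached =>
  println(cached) // Reuses the data buffer filled when the cache was created.
  println(cached) // Reuses the same data buffer again.
}
```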

By combining pure `Tensor`s with the impure `cache` mechanism, we achieve the following goals:

* All `Tensor`s are pure; creating them allocates no data buffers.
* The computation of `Tensor`s can be merged together, minimizing the number of intermediate data buffers and kernel programs.
* Developers can create `cache`s for `Tensor`s as a determinate way to manage the life cycle of native resources.

### Scala collection interoperability

#### `split`

A `Tensor` can be `split` into smaller `Tensor`s along a specific dimension.

For example, given a 3D tensor whose `shape` is 2x3x4,

``` scala
val my3DTensor = Tensor((0.0f until 24.0f by 1.0f).grouped(4).toSeq.grouped(3).toSeq)

val Array(2, 3, 4) = my3DTensor.shape
```

when you `split` it at dimension 0,

``` scala
val subtensors0 = my3DTensor.split(dimension = 0)
```

then the result should be a `Seq` of two 3x4 tensors.

``` scala
// Output: TensorSeq([[0.0,1.0,2.0,3.0],[4.0,5.0,6.0,7.0],[8.0,9.0,10.0,11.0]], [[12.0,13.0,14.0,15.0],[16.0,17.0,18.0,19.0],[20.0,21.0,22.0,23.0]])
println(subtensors0)
```

When you `split` it at dimension 1,

``` scala
val subtensors1 = my3DTensor.split(dimension = 1)
```

then the result should be a `Seq` of three 2x4 tensors.

``` scala
// Output: TensorSeq([[0.0,1.0,2.0,3.0],[12.0,13.0,14.0,15.0]], [[4.0,5.0,6.0,7.0],[16.0,17.0,18.0,19.0]], [[8.0,9.0,10.0,11.0],[20.0,21.0,22.0,23.0]])
println(subtensors1)
```

Then you can use arbitrary Scala collection functions on the `Seq` of subtensors.
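
For example, a small sketch that scales each 3x4 subtensor using only the operators shown above; each element of `subtensors0` is an ordinary `Tensor`:

``` scala
// Scale every subtensor element-wise; the result is a plain Scala Seq of Tensors.
val scaled: Seq[Tensor] = subtensors0.map { subtensor =>
  subtensor * Tensor.fill(2.0f, Array(3, 4))
}

// Each scaled subtensor keeps the 3x4 shape.
scaled.foreach(subtensor => println(subtensor.shape.toList)) // List(3, 4), twice
```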

#### `join`

TODO

#### Fast matrix multiplication from `split` and `join`

TODO

## Benchmark

* [Compute.scala vs Nd4j on NVIDIA GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
* [Compute.scala on AMD GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)

## Future work

This project is currently a minimum viable product. Many important features are still under development:

* Support tensors of elements other than single-precision floating-point ([#104](https://github.com/ThoughtWorksInc/Compute.scala/issues/104)).
* Add more OpenCL math functions ([#101](https://github.com/ThoughtWorksInc/Compute.scala/issues/101)).
* Further performance optimizations ([#62, #103](https://github.com/ThoughtWorksInc/Compute.scala/labels/performance)).

Contributions are welcome. Check the [good first issues](https://github.com/ThoughtWorksInc/Compute.scala/labels/good%20first%20issue) to start hacking.