**Compute.scala** is a Scala library for scientific computing with N-dimensional arrays in parallel on GPUs, CPUs and other devices. It will be the primary back-end of the upcoming [DeepLearning.scala](http://deeplearning.thoughtworks.school/) 3.0, addressing the performance problems we encountered in DeepLearning.scala 2.0 with [nd4j](http://nd4j.org/).

* Compute.scala can dynamically merge multiple operators into one kernel program, which runs significantly faster when performing complex computations.
* Compute.scala manages data buffers and other native resources deterministically, consuming less memory.
* All dimensional transformation operators (`permute`, `broadcast`, `reshape`, etc.) in Compute.scala are views, with no additional data buffer allocation.
* N-dimensional arrays in Compute.scala can be converted from / to JVM collections, which support higher-order functions like `map` / `reduce`, and can still run on the GPU.

Check [Compute.scala on Scaladex](https://index.scala-lang.org/thoughtworksinc/compute.scala) and [LWJGL customize tool](https://www.lwjgl.org/customize) for settings for Maven, Gradle and other build tools.
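
For sbt, the dependency would look roughly like the following sketch; the artifact name (`gpu` vs `cpu`) and the version placeholder are assumptions here, so confirm the exact coordinates on Scaladex:

```scala
// Hypothetical sbt setting: pick the artifact matching the namespace object you import
// (gpu or cpu), and replace "latest.release" with a concrete version from Scaladex.
libraryDependencies += "com.thoughtworks.compute" %% "gpu" % "latest.release"
```
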
### Creating an N-dimensional array
Import the namespace object `gpu` or `cpu`, depending on the OpenCL runtime you want to use.
```scala
// For N-dimensional arrays on GPU
import com.thoughtworks.compute.gpu._
```
```scala
// For N-dimensional arrays on CPU
import com.thoughtworks.compute.cpu._
```
In Compute.scala, an N-dimensional array is typed as `Tensor`, which can be created from `Seq` or `scala.Array`.
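
For example, a small 2×2 `Tensor` can be built from nested sequences. This is a minimal sketch with made-up values; check the Scaladoc of the `Tensor` companion object for the exact factory overloads:

```scala
// A 2x2 N-dimensional array built from nested Seqs (hypothetical values).
val my2DArray = Tensor(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f)))
```
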
`Tensor`s in Compute.scala are immutable and lazily evaluated. All operators that create `Tensor`s are pure: they allocate no data buffers and do not execute any time-consuming tasks. The actual computation is performed only when the final result is requested.
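
The following sketch defines the `a`, `b`, `c` and `result` used below, with made-up shapes and values:

```scala
// Hypothetical input tensors of the same shape; no data buffer is allocated here.
val a = Tensor(Seq(Seq(1.0f, 2.0f), Seq(3.0f, 4.0f)))
val b = Tensor(Seq(Seq(5.0f, 6.0f), Seq(7.0f, 8.0f)))
val c = Tensor(Seq(Seq(9.0f, 10.0f), Seq(11.0f, 12.0f)))

// A lazily evaluated expression; no kernel program has been compiled or executed yet.
val result: InlineTensor = a * b + c
```
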
All the `Tensor`s, including `a`, `b`, `c` and `result`, are small JVM objects, and no computation has been performed so far.
```scala
println(result.toString)
```
When `result.toString` is called, Compute.scala compiles the expression `a * b + c` into one kernel program and executes it.

Both `result` and the temporary variable `a * b` are `InlineTensor`s, indicating that their computation can be inlined into a more complex kernel program.

This approach is faster than in other libraries because we don't have to execute two separate kernels for the multiplication and the addition.

Check the [Scaladoc](https://javadoc.io/page/com.thoughtworks.compute/tensors_2.12/latest/com/thoughtworks/compute/Tensors$Tensor.html) to see which operators return `InlineTensor` or its subtype `TransformedTensor`, both of which can be inlined into a more complex kernel program.

#### Caching
By default, when `result.toString` is called more than once, the expression `a * b + c` is executed more than once.
```scala
println(result.toString)

// The computation is performed, again
println(result.toString)
```
Fortunately, we provide a `cache` method that eagerly fills a `NonInlineTensor` and keeps the filled data for reuse. You can convert `result` to a `NonInlineTensor`, which has a corresponding non-inline kernel program.
```scala
val nonInlineTensor = result.nonInline
val cache = nonInlineTensor.cache

try {
  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  // The cache is reused. No device-side computation is performed.
  println(nonInlineTensor.toString)

  val tmp: InlineTensor = exp(nonInlineTensor)

  // The cache for nonInlineTensor is reused, but the exponential function is performed.
  println(tmp.toString)

  // The cache for nonInlineTensor is reused, but the exponential function is performed, again.
  println(tmp.toString)
} finally {
  cache.close()
}

// (a * b + c) is performed because the cache for nonInlineTensor has been closed.
println(nonInlineTensor.toString)
```
The data buffer allocated for `nonInlineTensor` is kept until `cache.close()` is invoked.

By combining pure `Tensor`s with the impure `cache` mechanism, we achieve the following goals:

* All `Tensor`s are pure, with zero data buffer allocation when they are created.
* The computations of `Tensor`s can be merged together, minimizing the number of intermediate data buffers and kernel programs.
* Developers can create `cache`s for `Tensor`s as a deterministic way to manage the life-cycle of native resources.

### Scala collection interoperability
TODO
## Benchmark
* [Compute.scala vs Nd4j on NVIDIA GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/nvidia-gpu.json)
* [Compute.scala on AMD GPU](http://jmh.morethan.io/?source=https://thoughtworksinc.github.io/Compute.scala/benchmarks/amd-gpu.json)

## Future work
This project is currently only a minimum viable product. Many important features are still under development:

* Support tensors of elements other than single-precision floating-point ([#104](https://github.com/ThoughtWorksInc/Compute.scala/issues/104)).
* Add more OpenCL math functions ([#101](https://github.com/ThoughtWorksInc/Compute.scala/issues/101)).
* Further optimization of performance ([#62, #103](https://github.com/ThoughtWorksInc/Compute.scala/labels/performance)).

Contributions are welcome. Check [good first issues](https://github.com/ThoughtWorksInc/Compute.scala/labels/good%20first%20issue) to start hacking.