
Commit 59e4bc8 (parent c3815cf)

improve docs

2 files changed: 89 additions & 17 deletions

docs/src/index.md

Lines changed: 55 additions & 9 deletions
@@ -1,18 +1,30 @@
# GPUArrays Documentation

+GPUArrays is an abstract interface for GPU computations.
+Think of it as the AbstractArray interface in Julia Base, but for GPUs.
+It allows you to write generic Julia code for all GPU platforms and implements common algorithms for the GPU.
+Like Julia Base, this includes BLAS wrappers, FFTs, maps, broadcasts and mapreduces.
+So when you inherit from GPUArrays and overload the interface correctly, you get a lot
+of functionality for free.
+This allows multiple GPUArray implementations for different purposes, while
+maximizing the ability to share code.
+Currently there are two packages implementing the interface, namely [CLArrays](https://github.com/JuliaGPU/CLArrays.jl) and [CuArrays](https://github.com/JuliaGPU/CuArrays.jl).
+As the name suggests, the first implements the interface using OpenCL and the latter uses CUDA.
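To make the "inherit and overload" idea concrete, here is a hedged sketch of what a backend package would declare; the type name and fields are hypothetical, and `GPUArray` is assumed to be the exported abstract array type (as the `gpu_call` signature further down suggests):

```julia
using GPUArrays

# Hypothetical backend type: subtype the abstract GPUArray and implement the
# interface functions documented below to get broadcasts, mapreduce, etc. for free.
struct MyDeviceArray{T, N} <: GPUArray{T, N}
    data::Vector{T}          # stand-in for a real device buffer
    dims::NTuple{N, Int}
end

Base.size(A::MyDeviceArray) = A.dims
```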

-# Abstract GPU interface

-GPUArrays supports different platforms like CUDA and OpenCL, which all have different
-names for function that offer the same functionality on the hardware.
-E.g. how to call a function on the GPU, how to get the thread index etc.
-GPUArrays offers an abstract interface for these functions which are overloaded
-by the packages like [CLArrays](https://github.com/JuliaGPU/CLArrays.jl) and [CuArrays](https://github.com/JuliaGPU/CuArrays.jl).
+
+# The Abstract GPU interface
+
+Different GPU computation frameworks like CUDA and OpenCL have different
+names for accessing the same hardware functionality.
+E.g. how to launch a GPU kernel, how to get the thread index and so forth.
+GPUArrays offers a unified abstract interface for these functions.
This makes it possible to write generic code that can be run on all hardware.
-GPUArrays itself even contains a pure Julia implementation of this interface.
-The julia reference implementation is also a great way to debug your GPU code, since it
-offers many more errors and debugging information compared to the GPU backends - which
+GPUArrays itself even contains a pure [Julia implementation](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/jlbackend.jl) of this interface.
+The Julia reference implementation is a great way to debug your GPU code, since it
+offers more informative errors and debugging information compared to the GPU backends - which
mostly silently error or give cryptic errors (so far).
+
You can use the reference implementation by using the `GPUArrays.JLArray` type.
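For instance, a minimal hedged sketch of exercising the reference backend (array sizes and values are arbitrary, and `JLArray` is assumed to be constructible from a host `Array`):

```julia
using GPUArrays

a = GPUArrays.JLArray(rand(Float32, 1024))   # wrap host data in the pure-Julia backend
b = a .+ 1f0                                 # broadcasting runs through the GPUArrays machinery
s = sum(b)                                   # and so does mapreduce
```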

The functions that are currently part of the interface:
@@ -25,6 +37,8 @@ blockidx_*(state), blockdim_*(state), threadidx_*(state), griddim_*(state)
get_group_id, get_local_size, get_local_id, get_num_groups
```

+Higher level functionality:
+
```@docs
gpu_call(f, A::GPUArray, args::Tuple, configuration = length(A))
@@ -42,3 +56,35 @@ device(A::AbstractArray)

synchronize(A::AbstractArray)
```
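To make the entries above more concrete, here is a hedged sketch of a hand-written kernel launched through `gpu_call`; the kernel name and arguments are illustrative, `gpu_call`, `linear_index` and `synchronize` are assumed to be exported (qualify with `GPUArrays.` otherwise), and the bounds check follows the `linear_index` contract described in the interface docstrings:

```julia
using GPUArrays

# y .= a .* x .+ y, written as an explicit kernel for illustration
function axpy_kernel!(state, y, x, a)
    i = linear_index(state)      # linear index of this work item
    if i <= length(y)            # more work items than elements may be launched
        @inbounds y[i] = a * x[i] + y[i]
    end
    return
end

x = GPUArrays.JLArray(rand(Float32, 256))
y = GPUArrays.JLArray(zeros(Float32, 256))

gpu_call(axpy_kernel!, y, (y, x, 2f0))   # configuration defaults to length(y)
synchronize(y)                           # block until the kernel has finished
```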
+
+
+# The abstract TestSuite
+
+Since all array packages inheriting from GPUArrays need to offer the same functionality
+and interface, it makes sense to test them in the same way.
+This is why GPUArrays contains a test suite which can be called with the array type
+you want to test.
+
+You can run the test suite like this:
+
+```@example
+using GPUArrays, GPUArrays.TestSuite
+TestSuite.run_tests(MyGPUArrayType)
+```
+If you don't want to run the whole suite, you can also run parts of it:
+
+
+```@example
+Typ = JLArray
+GPUArrays.allowslow(false) # fail tests when slow indexing path into Array type is used.
+
+TestSuite.run_gpuinterface(Typ) # interface functions like gpu_call, threadidx, etc
+TestSuite.run_base(Typ) # basic functionality like launching a kernel on the GPU and Base operations
+TestSuite.run_blas(Typ) # tests the blas interface
+TestSuite.run_broadcasting(Typ) # tests the broadcasting implementation
+TestSuite.run_construction(Typ) # tests all kinds of different ways of constructing the array
+TestSuite.run_fft(Typ) # fft tests
+TestSuite.run_linalg(Typ) # linalg function tests
+TestSuite.run_mapreduce(Typ) # mapreduce sum, etc
+TestSuite.run_indexing(Typ) # indexing tests
+```

src/abstract_gpu_interface.jl

Lines changed: 34 additions & 8 deletions
@@ -13,24 +13,40 @@ end


"""
+    synchronize_threads(state)
+
in CUDA terms `__synchronize`
+in OpenCL terms: `barrier(CLK_LOCAL_MEM_FENCE)`
"""
function synchronize_threads(state)
    error("Not implemented")
end


"""
-inear_index(state)
+    linear_index(state)
+
+linear index of the current work item within a kernel launch (in OpenCL equal to get_global_id).

-linear index in a GPU kernel (equal to OpenCL.get_global_id)
"""
@inline function linear_index(state)
    UInt32((blockidx_x(state) - UInt32(1)) * blockdim_x(state) + threadidx_x(state))
end

"""
-Macro form of `linear_index`, which returns when out of bounds
+    linearidx(A, statesym = :state)
+
+Macro form of `linear_index`, which returns from the kernel when the index is out of bounds.
+So it can be used like this:
+```
+function kernel(state, A)
+    idx = @linearidx A state
+    # from here on it's safe to index into A with idx
+    @inbounds begin
+        A[idx] = ...
+    end
+end
+```
"""
macro linearidx(A, statesym = :state)
    quote
@@ -43,6 +59,8 @@ end


"""
+    cartesianidx(A, statesym = :state)
+
Like `@linearidx`, but returns an N-dimensional `NTuple{ndim(A), Cuint}` as index
"""
macro cartesianidx(A, statesym = :state)
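A hedged usage sketch for `@cartesianidx`, based only on the docstring above; the kernel and value are hypothetical, and the returned tuple is assumed to be splatted into the indexing expression:

```julia
# Fill an N-dimensional GPU array element-wise, one work item per element.
function fill_kernel!(state, A, val)
    idx = @cartesianidx A state      # NTuple index; returns early when out of bounds
    @inbounds A[idx...] = val
    return
end
```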
@@ -54,22 +72,28 @@ macro cartesianidx(A, statesym = :state)
end

"""
+    global_size(state)
+
Global size == blockdim * griddim == total number of kernel execution
"""
@inline function global_size(state)
    # TODO nd version
    griddim_x(state) * blockdim_x(state)
end
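A hedged sketch of how `global_size` is typically used, assuming the interface functions above are in scope (e.g. via `using GPUArrays`); the kernel is illustrative only:

```julia
# Grid-stride loop: global_size(state) is the total number of work items,
# so each work item strides by that amount until the whole array is covered.
function add_one_kernel!(state, A)
    i = linear_index(state)
    while i <= length(A)
        @inbounds A[i] += 1f0
        i += global_size(state)
    end
    return
end
```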

-
"""
+    device(A::AbstractArray)
+
Gets the device associated to the Array `A`
"""
function device(A::AbstractArray)
    # fallback is a noop, for backends not needing synchronization. This
    # makes it easier to write generic code that also works for AbstractArrays
end

+
"""
+    synchronize(A::AbstractArray)
+
Blocks until all operations are finished on `A`
"""
function synchronize(A::AbstractArray)
@@ -85,15 +109,17 @@ end


"""
+    gpu_call(f, A::GPUArray, args::Tuple, configuration = length(A))
+
Calls function `f` on the GPU.
`A` must be an GPUArray and will help to dispatch to the correct GPU backend
and supplies queues and contexts.
-Calls kernel with `kernel(state, args...)`, where state is dependant on the backend
-and can be used for e.g getting an index into A with `linear_index(state)`.
-Optionally, launch configuration can be supplied in the following way:
+Calls the kernel function with `kernel(state, args...)`, where state is dependent on the backend
+and can be used for getting an index into `A` with `linear_index(state)`.
+Optionally, a launch configuration can be supplied in the following way:

1) A single integer, indicating how many work items (total number of threads) you want to launch.
-in this case `linear_index(state)` will be a number in the range 1:configuration
+in this case `linear_index(state)` will be a number in the range `1:configuration`
2) Pass a tuple of integer tuples to define blocks and threads per blocks!

"""
