
Commit 1b564e4

Merge branch 'master' into sd/devices
2 parents 1cac3f2 + 4f35ef0 commit 1b564e4

2 files changed (+49, -61 lines)

README.md

Lines changed: 48 additions & 60 deletions
@@ -3,12 +3,15 @@
[![Build Status](https://travis-ci.org/JuliaGPU/GPUArrays.jl.svg?branch=master)](https://travis-ci.org/JuliaGPU/GPUArrays.jl)
[![Build status](https://ci.appveyor.com/api/projects/status/2aa4bvmq7e9rh338/branch/master?svg=true)](https://ci.appveyor.com/project/SimonDanisch/gpuarrays-jl-8n74h/branch/master)

+ [Benchmarks](https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/results/results.md)
+
GPU Array package for Julia's various GPU backends.
The compilation for the GPU is done with [CUDAnative.jl](https://github.com/JuliaGPU/CUDAnative.jl/),
and for OpenCL [Transpiler.jl](https://github.com/SimonDanisch/Transpiler.jl) is used.
In the future it is planned to replace the transpiler with an approach similar to the one
CUDAnative.jl uses (via LLVM + SPIR-V).

+
# Why another GPU array package in yet another language?

Julia offers countless advantages for a GPU array package.
@@ -33,9 +36,9 @@ end
This will result in one GPU kernel call to a function that combines the operations without any extra allocations.
This allows GPUArrays to offer a lot of functionality with minimal code.

- Also, when compiling Julia to the GPU, we can use all the cool features from Julia, e.g.
+ Also, when compiling Julia for the GPU, we can use all the cool features from Julia, e.g.
higher order functions, multiple dispatch, meta programming and generated functions.
- Checkout the examples, to see how this can be used to emit specialized code while not loosing flexibility:
+ Check out the examples to see how this can be used to emit specialized code while not losing flexibility:
[unrolling](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/examples/juliaset.jl),
[vector loads/stores](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/examples/vectorload.jl)

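As an aside, a minimal sketch of the fused broadcast the hunk above describes. The `GPUArray(rand(Float32, 1024))` construction and the implicit default backend are assumptions for illustration, not taken from this commit:

```Julia
using GPUArrays

a = GPUArray(rand(Float32, 1024))
b = GPUArray(rand(Float32, 1024))

# The dotted calls below fuse into a single broadcast, i.e. one GPU kernel
# launch, with no temporary arrays allocated for the intermediate results.
c = sin.(a ./ b) .+ 2f0
```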
@@ -44,7 +47,7 @@ In theory, we could go as far as inspecting user defined callbacks (we can get t

### Automatic Differentiation

- Because of neuronal netorks, automatic differentiation is super hyped right now!
+ Because of neural networks, automatic differentiation is super hyped right now!
Julia offers a couple of packages for that, e.g. [ReverseDiff](https://github.com/JuliaDiff/ReverseDiff.jl).
It heavily relies on Julia's strength to specialize generic code and dispatch to different implementations depending on the Array type, allowing almost overhead-free automatic differentiation.
Making this work with GPUArrays will be a bit more involved, but the
@@ -55,33 +58,25 @@ There is also [ReverseDiffSource](https://github.com/JuliaDiff/ReverseDiffSource

Current backends: OpenCL, CUDA, Julia Threaded

- Planned backends: OpenGL, Vulkan
-
Implemented for all backends:

```Julia
map(f, ::GPUArray...)
map!(f, dest::GPUArray, ::GPUArray...)

- # maps
- mapidx(f, A::GPUArray, args...) do idx, a, args...
-     # e.g
-     if idx < length(A)
-         a[idx+1] = a[idx]
-     end
- end
-
-
broadcast(f, ::GPUArray...)
broadcast!(f, dest::GPUArray, ::GPUArray...)

- # calls `f` on args, with queues, block heuristics and context taken from `array`
- # f can be a julia function or a tuple (String, Symbol),
- # being a C kernel source string + the name of the kernel function
- gpu_call(array::GPUArray, f, args::Tuple)
+ mapreduce(f, op, ::GPUArray...) # so support for sum/mean/minimum etc. comes for free

+ getindex, setindex!, push!, append!, splice!, copy!, reinterpret, convert
+
+ From (CL/CU)FFT
+ fft!/fft/ifft/ifft! and the matching plan_fft functions.
+ From (CL/CU)BLAS
+ gemm!, scal!, gemv! and the high-level functions that are implemented with these, like A * B, A_mul_B!, etc.
```
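Since `mapreduce` is listed as implemented for all backends, the usual reductions and the BLAS-backed operations should follow directly. A hedged sketch, assuming `A` and `B` are square `Float32` `GPUArray` matrices on the chosen backend:

```Julia
total = sum(A)       # falls out of mapreduce(identity, +, A)
small = minimum(A)   # likewise via mapreduce

C = A * B            # dispatches to the (CL/CU)BLAS gemm! wrappers mentioned above
A_mul_B!(C, A, B)    # in-place variant (Julia 0.6 naming)
```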
- Example for [gpu_call](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/examples/custom_kernels.jl)
+

# Usage

@@ -98,67 +93,60 @@ function test(a, b)
Complex64(sin(a / b))
end
complex_c = test.(c, b)
- fft!(complex_c) # fft!/ifft! is currently implemented for JLBackend and CLBackend
-
+ fft!(complex_c) # fft!/ifft!/plan_fft, plan_ifft, plan_fft!, plan_ifft!
+
+ """
+ When you program with GPUArrays, you can just write normal Julia functions, feed them to gpu_call, and depending on which backend you choose it will use Transpiler.jl or CUDAnative.
+ """
+ # Signature: global_size == cuda blocks, local_size == cuda threads
+ gpu_call(kernel::Function, DispatchDummy::GPUArray, args::Tuple, global_size = length(DispatchDummy), local_size = nothing)
+ with kernel looking like this:
+
+ function kernel(state, arg1, arg2, arg3) # args get splatted into the kernel call
+     # state always gets passed as the first argument and is needed to offer the same
+     # functionality across backends, even though they have very different ways of getting e.g. the thread index
+     # arg1 can be any gpu array - this is needed to dispatch to the correct intrinsics.
+     # if you call gpu_call without any further modifications to global/local size, this should give you a linear index into
+     # DispatchDummy
+     idx = linear_index(state, arg1::GPUArray)
+     arg1[idx] = arg2[idx] + arg3[idx]
+     return # kernel must return void
+ end
```
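To make the calling convention above concrete, a hedged usage sketch: the array construction is an assumption, while the `gpu_call` signature and `kernel` are the ones shown above:

```Julia
a = GPUArray(rand(Float32, 1024))  # illustrative constructor call
b = GPUArray(rand(Float32, 1024))
c = GPUArray(rand(Float32, 1024))

# `a` serves as the DispatchDummy: global_size defaults to length(a),
# so each thread computes one linear index and runs a[idx] = b[idx] + c[idx].
gpu_call(kernel, a, (a, b, c))
```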
+ Example for [gpu_call](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/examples/custom_kernels.jl)

- CLFFT, CUFFT, CLBLAS and CUBLAS will soon be supported.
- A prototype of generic support of these libraries can be found in [blas.jl](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/backends/blas.jl).
- The OpenCL backend already supports mat mul via `CLBLAS.gemm!` and `fft!`/`ifft!`.
- CUDAnative could support these easily as well, but we currently run into problems with the interactions of `CUDAdrv` and `CUDArt`.
-
-
- # Benchmarks
-
- We have only benchmarked Blackscholes and not much time has been spent to optimize our kernels yet.
- So please treat these numbers with care!
-
- [source](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/examples/blackscholes.jl)
-
- ![blackscholes](https://cdn.rawgit.com/JuliaGPU/GPUArrays.jl/91678a36/examples/blackscholes.svg)
-
- Interestingly, on the GTX950, the CUDAnative backend outperforms the OpenCL backend by a factor of 10.
- This is most likely due to the fact that LLVM is great at unrolling and vectorizing loops,
- while it seems that the nvidia OpenCL compiler isn't. So with our current primitive kernel,
- quite a bit of performance is missed out with OpenCL right now!
- This can be fixed by putting more effort into emitting specialized kernels, which should
- be straightforward with Julia's great meta programming and `@generated` functions.
-
-
- Times in a table:
+ # Currently supported subset of Julia Code

- | Backend | Time (s) for N = 10^7 | OP/s in million | Speedup |
- | ---- | ---- | ---- | ---- |
- | JLContext i3-4130 CPU @ 3.40GHz 1 threads | 1.0085 s | 10 | 1.0 |
- | JLContext i7-6700 CPU @ 3.40GHz 1 threads | 0.8773 s | 11 | 1.1 |
- | CLContext: i7-6700 CPU @ 3.40GHz 8 threads | 0.2093 s | 48 | 4.8 |
- | JLContext i7-6700 CPU @ 3.40GHz 8 threads | 0.1981 s | 50 | 5.1 |
- | CLContext: GeForce GTX 950 | 0.0301 s | 332 | 33.5 |
- | CUContext: GeForce GTX 950 | 0.0032 s | 3124 | 315.0 |
- | CLContext: FirePro w9100 | 0.0013 s | 7831 | 789.8 |
+ Working with immutable isbits types (not containing pointers) should be completely supported.
+ Non-allocating code (so no constructs like `x = [1, 2, 3]`). Note that tuples are isbits, so `x = (1, 2, 3)` works.
+ Transpiler/OpenCL has problems with putting GPU arrays into a struct on the GPU - so no views and actually no multidimensional indexing. For that, `size` is needed, which would need to be part of the array struct. A fix for that is in sight, though.

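A hedged illustration of what these restrictions mean in kernel code; `scaled_add` is hypothetical and only reuses the `gpu_call`/`linear_index` style shown earlier:

```Julia
function scaled_add(state, a, b)
    idx = linear_index(state, a)
    coeffs = (2f0, 0.5f0)      # a tuple is isbits, so this is fine on the GPU
    # tmp = [2f0, 0.5f0]       # would heap-allocate an Array - not supported in GPU code
    a[idx] = coeffs[1] * a[idx] + coeffs[2] * b[idx]
    return
end
```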
# TODO / up for grabs

- * stencil operations
+ * stencil operations, convolutions
* more tests and benchmarks
* tests that only switch the backend but use the same code
* performance improvements!!
- * implement push!, append!, resize!, getindex, setindex!
* interop between OpenCL, CUDA and OpenGL is there as a prototype, but needs proper hooking up via `Base.copy!` / `convert`
- * share implementation of broadcast etc. between backends. Currently they don't, since there are still subtle differences which should be eliminated over time!


# Installation

- For the cudanative backend, you need to install [CUDAnative.jl manually](https://github.com/JuliaGPU/CUDAnative.jl/#installation).
- The cudanative backend only works on 0.6, while the other backends also support Julia 0.5.
- Make sure to have CUDA and OpenCL drivers installed correctly.
+ I recently added a lot of features and bug fixes to the master branch, so you might want to check that out (`Pkg.checkout("GPUArrays")`).
+
+ For the cudanative backend, you need to install [CUDAnative.jl manually](https://github.com/JuliaGPU/CUDAnative.jl/#installation); it works only on OSX + Linux with a Julia source build.
+ Make sure to have the CUDA and/or OpenCL drivers installed correctly.
`Pkg.build("GPUArrays")` will pick those up and should include the working backends.
So if your system configuration changes, make sure to run `Pkg.build("GPUArrays")` again.
The rest should work automatically:
+
```Julia
Pkg.add("GPUArrays")
+ Pkg.checkout("GPUArrays") # optional, but recommended to check out the master branch
Pkg.build("GPUArrays") # should print out information about what backends are added
# Test it!
Pkg.test("GPUArrays")
```
+ If a backend is not supported by the hardware, you will see build errors while running `Pkg.add("GPUArrays")`.
+ Since GPUArrays selects only working backends when running `Pkg.build("GPUArrays")`,
+ **these errors can be ignored**.

REQUIRE

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ julia 0.6
StaticArrays
ColorTypes

- Transpiler 0.2
+ Transpiler 0.3
Sugar 0.3
Matcha 0.0.2
