When you program with GPUArrays, you can just write normal Julia functions, feed them to `gpu_call`, and depending on which backend you choose it will use Transpiler.jl or CUDAnative.jl.
```julia
# Signature: global_size == cuda blocks, local_size == cuda threads
function kernel(state, arg1, arg2, arg3) # args get splatted into the kernel call
    # state always gets passed as the first argument and is needed to offer the same
    # functionality across backends, even though they have very different ways of getting e.g. the thread index
    # arg1 can be any GPU array - this is needed to dispatch to the correct intrinsics
    # (arg1 acts as a "DispatchDummy"); if you call gpu_call without any further
    # modifications to global/local size, this should give you a linear index into arg1
    idx = linear_index(state, arg1::GPUArray)
    arg1[idx] = arg2[idx] + arg3[idx]
    return # kernel must return void
end
```
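To make the kernel above concrete, here is a minimal usage sketch. It assumes a call signature of the form `gpu_call(kernel, dispatch_array, args_tuple)` and a `GPUArray` constructor taking a host array — these are assumptions based on the description above, not a verified current API.

```julia
using GPUArrays

# Same shape as the kernel above: state first, then the splatted args
function vadd_kernel(state, out, a, b)
    idx = linear_index(state, out)  # linear index into `out`
    out[idx] = a[idx] + b[idx]
    return
end

# Hypothetical setup: upload host data to whichever backend is active
a   = GPUArray(rand(Float32, 1024))
b   = GPUArray(rand(Float32, 1024))
out = similar(a)

# Launch with the default global/local size derived from `out`
gpu_call(vadd_kernel, out, (out, a, b))
```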
CLFFT, CUFFT, CLBLAS and CUBLAS will soon be supported.
A prototype of generic support for these libraries can be found in [blas.jl](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/backends/blas.jl).
The OpenCL backend already supports matrix multiplication via `CLBLAS.gemm!` and `fft!`/`ifft!`.
CUDAnative could support these easily as well, but we currently run into problems with the interactions of `CUDAdrv` and `CUDArt`.
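As a sketch of what the OpenCL backend's BLAS/FFT support looks like from the user's side — `fft!`/`ifft!` and the `CLBLAS.gemm!` dispatch are named above; the `GPUArray` constructor and the `*` overload are assumptions here:

```julia
using GPUArrays

A = GPUArray(rand(Float32, 32, 32))
B = GPUArray(rand(Float32, 32, 32))

C = A * B   # on the OpenCL backend, dispatches to CLBLAS.gemm!

Ac = GPUArray(rand(Complex64, 32, 32))
fft!(Ac)    # in-place FFT, backed by CLFFT
ifft!(Ac)
```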
# Benchmarks

We have only benchmarked Blackscholes, and not much time has been spent optimizing our kernels yet.
Interestingly, on the GTX950, the CUDAnative backend outperforms the OpenCL backend by a factor of 10.
This is most likely because LLVM is great at unrolling and vectorizing loops, while the NVIDIA OpenCL compiler seems not to be. So with our current primitive kernel, quite a bit of performance is missed out on with OpenCL right now!
This can be fixed by putting more effort into emitting specialized kernels, which should be straightforward with Julia's great metaprogramming and `@generated` functions.

Times in a table:

| Backend | Time (s) for N = 10^7 | OP/s in million | Speedup |
| ------- | --------------------- | --------------- | ------- |

# Currently supported subset of Julia Code

* Working with immutable isbits types (not containing pointers) should be completely supported.
* Non-allocating code only (so no constructs like `x = [1, 2, 3]`). Note that tuples are isbits, so `x = (1, 2, 3)` works.
* Transpiler/OpenCL has problems with putting GPU arrays into a struct on the GPU - so no views and no multidimensional indexing for now. The latter needs `size`, which would have to be part of the array struct. A fix for this is in sight, though.
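The isbits restriction from the list above can be checked with plain Julia on the CPU before writing a kernel — this snippet only uses `isbits` from Base and runs without any GPU:

```julia
# Immutable struct containing only plain data - fine inside a kernel
struct Point
    x::Float32
    y::Float32
end

isbits(Point(1f0, 2f0))  # true: no pointers, safe for GPU code

# Tuples of isbits values are themselves isbits
t = (1, 2, 3)
isbits(t)                # true: `x = (1, 2, 3)` is allowed in kernels

# Arrays allocate and carry a pointer - not isbits, not allowed in kernels
v = [1, 2, 3]
isbits(v)                # false: `x = [1, 2, 3]` would not work
```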
I recently added a lot of features and bug fixes to the master branch.
Please check that out first and see [pull #37](https://github.com/JuliaGPU/GPUArrays.jl/pull/37) for a list of new features.
For the CUDAnative backend, you need to install [CUDAnative.jl manually](https://github.com/JuliaGPU/CUDAnative.jl/#installation); it works only on macOS and Linux with a Julia source build.
Make sure to have CUDA and/or OpenCL drivers installed correctly.