The code in this repository implements the conjugate gradient method for the specific case of the Laplace operator.
To understand the conjugate gradient method, let us consider the function
$$
f(\mathbf{x})=\frac{1}{2} \mathbf{x}^{\top} \mathbf{A} \mathbf{x}-\mathbf{x}^{\top} \mathbf{b}, \quad \mathbf{x} \in \mathbb{R}^n.
$$
We note that the gradient of $f$ is $\nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$, so for a symmetric positive-definite $\mathbf{A}$ the minimum of $f$ is attained exactly at the solution of $\mathbf{A}\mathbf{x} = \mathbf{b}$.
In exact arithmetic the method described below reaches this minimum in at most $n$ steps; because of numerical precision, however, the point that minimizes $f$ is only approached approximately, so the method is used iteratively until the residual is sufficiently small.
In order to do this, we start with an initial guess $\mathbf{x}_{(0)}$ and the corresponding residual $\mathbf{r}_{(0)} = \mathbf{b} - \mathbf{A}\mathbf{x}_{(0)}$.
The normal gradient method would now require a move along the residual (the direction of steepest descent) in every iteration; the conjugate gradient method instead moves along search directions that are $\mathbf{A}$-conjugate to all previous ones, so that progress made in earlier iterations is never undone.
So the full conjugate gradient method is given by:
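In the notation of [1], starting from $\mathbf{d}_{(0)} = \mathbf{r}_{(0)} = \mathbf{b} - \mathbf{A}\mathbf{x}_{(0)}$, one iteration reads
$$
\begin{aligned}
\alpha_{(i)} &= \frac{\mathbf{r}_{(i)}^{\top}\mathbf{r}_{(i)}}{\mathbf{d}_{(i)}^{\top}\mathbf{A}\mathbf{d}_{(i)}}, \\
\mathbf{x}_{(i+1)} &= \mathbf{x}_{(i)} + \alpha_{(i)}\mathbf{d}_{(i)}, \\
\mathbf{r}_{(i+1)} &= \mathbf{r}_{(i)} - \alpha_{(i)}\mathbf{A}\mathbf{d}_{(i)}, \\
\beta_{(i+1)} &= \frac{\mathbf{r}_{(i+1)}^{\top}\mathbf{r}_{(i+1)}}{\mathbf{r}_{(i)}^{\top}\mathbf{r}_{(i)}}, \\
\mathbf{d}_{(i+1)} &= \mathbf{r}_{(i+1)} + \beta_{(i+1)}\mathbf{d}_{(i)},
\end{aligned}
$$
which is repeated until $\|\mathbf{r}_{(i)}\|$ falls below a chosen tolerance.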
As shown in [1], "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain", the error at iteration $i$ satisfies
$$
\left\|e_{(i)}\right\|_A \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^i\left\|e_{(0)}\right\|_A
$$
where $\kappa$ is the condition number of $\mathbf{A}$, $e_{(i)} = \mathbf{x}_{(i)} - \mathbf{x}$ is the error after $i$ iterations, and $\|\cdot\|_A$ denotes the energy norm $\|e\|_A = \sqrt{e^{\top}\mathbf{A}e}$. Convergence is therefore fast when the condition number is small.
The idea behind preconditioned CG is then to solve an equivalent system whose matrix has a much smaller condition number.
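In the simplest formulation (see [1] and [5]), one solves
$$
\mathbf{M}^{-1}\mathbf{A}\mathbf{x} = \mathbf{M}^{-1}\mathbf{b},
$$
where the preconditioner $\mathbf{M} \approx \mathbf{A}$ is cheap to invert and $\kappa(\mathbf{M}^{-1}\mathbf{A})$ is ideally much smaller than $\kappa(\mathbf{A})$.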
In the following, some implementation details are described.
For the implementation of the reduction needed for the inner product, I referred to the document "Optimizing Parallel Reduction in CUDA" by Mark Harris. It uses sequential addressing, see the figure below:
This is the code that is used for it:
__global__ void reduceMulAddComplete(float *v, float *w, float *g_odata,
                                     unsigned int n, const unsigned int nthreads)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;
    unsigned int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // shared memory is per block; its size is given as the 3rd launch parameter
    // so that it can be allocated dynamically
    extern __shared__ float tmp[];

    // grid-stride loop: each thread accumulates two elements per iteration
    float sum = 0.0f;
    unsigned int i = idx;
    while (i < n)
    {
        sum += v[i] * w[i];
        if (i + blockDim.x < n) // guard the second element against reading past the end
            sum += v[i + blockDim.x] * w[i + blockDim.x];
        i += gridSize;
    }
    tmp[tid] = sum;
    __syncthreads();

    // in-place reduction in shared memory with sequential addressing
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
        {
            tmp[tid] += tmp[tid + stride];
        }
        // synchronize within the thread block
        __syncthreads();
    }

    // atomicAdd the result of each block to global memory
    if (tid == 0)
        atomicAdd(g_odata, tmp[0]);
}
Note that for the final addition, the library function atomicAdd is used. The atomicAdd
operation is performed atomically, which means it completes in one single uninterruptible step. This guarantees that no other thread can access the variable before the operation is complete, so the partial sums of all blocks are accumulated into the global result without race conditions.
This function can then be called using reduceMulAddComplete<<<nblocks, nthreads, nthreads*sizeof(float)>>>(v, w, bs, N, nthreads), where the third parameter of the launch configuration is the size in bytes of the dynamically allocated shared memory.
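Because the kernel accumulates into g_odata with atomicAdd, the output value has to be zeroed before every launch. A minimal host-side sketch, assuming a result buffer bs in managed memory (the wrapper name and launch parameters here are illustrative, not taken from the repository):
float dot_product_gpu(float *v, float *w, unsigned int n)
{
    const unsigned int nthreads = 256;
    // each block covers 2 * nthreads elements per grid-stride iteration
    const unsigned int nblocks = (n + 2 * nthreads - 1) / (2 * nthreads);

    float *bs;
    cudaMallocManaged(&bs, sizeof(float));
    *bs = 0.0f; // must be zeroed: the kernel only adds to it

    reduceMulAddComplete<<<nblocks, nthreads, nthreads * sizeof(float)>>>(v, w, bs, n, nthreads);
    cudaDeviceSynchronize(); // wait for all blocks before reading the result on the host

    float result = *bs;
    cudaFree(bs);
    return result;
}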
To call a C function from Julia, it can be wrapped inside a Julia function. For example, to call the C function
extern "C" float conjugate_gradient_gpu(float * b, float * x, int L, int d);
we have to compile a shared object file. Using CMake, this is done with the add_library(conjugate_gradient_gpu SHARED file.cu)
command.
Then we can load this shared library from Julia: lib = Libdl.dlopen("./build/libconjugate_gradient_gpu.so"). Now we can call C functions from Julia.
We can wrap it inside a Julia function for convenience:
# Define the wrapper function for `conjugate_gradient_gpu`
function conjugate_gradient_gpu(b::CuArray{Float32}, x::CuArray{Float32}, L, d)::Cfloat
    sym = Libdl.dlsym(lib, :conjugate_gradient_gpu)
    @ccall $sym(get_ptr(b)::CuPtr{Cfloat}, get_ptr(x)::CuPtr{Cfloat}, L::Cint, d::Cint)::Cfloat
end
The helper get_ptr uses Base.unsafe_convert to obtain a device pointer to a CUDA array:
function get_ptr(A)
    return Base.unsafe_convert(CuPtr{Cfloat}, A)
end
This allows us to use all the functionality of Julia, such as plotting, generating random arrays, interactivity, etc. We can even define test sets:
julia> @testset "indexing on GPU" begin
           @test neighbour_index_gpu(2,1,1,3,2,9,0) == 5
           # edges
           @test neighbour_index_gpu(2,0,1,3,2,9,0) == 9
           @test neighbour_index_gpu(3,0,-1,3,2,9,0) == 9
       end;
Test Summary:   | Pass  Total  Time
indexing on GPU |    3      3  0.1s
In this repository, however, the tests are implemented in C++.
There are currently two indexing methods implemented for calculating the position of a value in memory from its coordinates, as well as for calculating adjacent indices.
Given an array of coordinates, the linear index is computed like this:
int get_index(int *cords, int L, int d, int N) {
    // any coordinate outside the grid maps to the extra boundary element at index N
    for (int c = 0; c < d; c++) {
        int cord = cords[c];
        assert(cord < L + 1 && cord > -2);
        if (cord == -1 || cord == L) {
            return N;
        }
    }
    // row-major linear index: dimension i has stride L^i
    int ind = 0;
    int stride = 1;
    for (int i = 0; i < d; i++) {
        ind += stride * cords[i];
        stride *= L;
    }
    return ind;
}
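For illustration (this snippet is not part of the repository; the values follow directly from the definition above), in two dimensions with L = 3 and N = 9:
int inside[2] = {1, 2};               // x = 1, y = 2
int outside[2] = {-1, 0};             // x = -1 lies outside the grid
int i1 = get_index(inside, 3, 2, 9);  // 1 + 3 * 2 = 7
int i2 = get_index(outside, 3, 2, 9); // out of bounds -> returns N = 9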
However, for applying the Laplacian we can skip this conversion step: we never need to calculate the index for given coordinates, we only need the neighbouring indices of a given index.
__global__ void laplace_gpu(float *ddf, float *u, int d,
                            int L, int N, unsigned int index_mode)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < N)
    {
        // second-order finite-difference stencil, applied dimension by dimension
        float laplace_value = 0;
        for (int i = 0; i < d; i++)
        {
            laplace_value += -u[neighbour_index_gpu(ind, i, 1, L, d, N, index_mode)]
                             + 2 * u[neighbour_index_gpu(ind, i, 0, L, d, N, index_mode)]
                             - u[neighbour_index_gpu(ind, i, -1, L, d, N, index_mode)];
        }
        ddf[ind] = laplace_value;
    }
}
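The kernel relies on neighbour_index_gpu, which is not shown here. A minimal sketch of what it might look like for the naive (linear) index_mode, consistent with the test cases in the Julia test set above (the implementation in the repository may differ, in particular for the zig-zag mode described below):
__host__ __device__ int neighbour_index_gpu(int ind, int dim, int direction,
                                            int L, int d, int N,
                                            unsigned int index_mode)
{
    // naive row-major indexing: dimension `dim` has stride L^dim
    int stride = 1;
    for (int i = 0; i < dim; i++)
        stride *= L;

    int coord = (ind / stride) % L;    // coordinate along `dim`
    int new_coord = coord + direction; // direction is -1, 0 or +1

    if (new_coord < 0 || new_coord >= L)
        return N;                      // outside the grid: map to the boundary element

    return ind + direction * stride;
}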
One can either choose to precalculate a lookup table of all neighbouring indices for a given index, or calculate the neighbouring index every time. I chose to recalculate the neighbour index every time, because I expected the algorithm to be memory bound rather than compute bound, and the saved space can be used for computing larger systems, giving a better resolution/better approximation of the continuous Laplace operator.
With this naive way of indexing, however, a problem arises: the memory access pattern along the higher dimensions is strided, since neighbours along dimension $i$ are $L^i$ elements apart in memory, so cache locality is poor.
The idea is that if nearby sites are accessed at the same time, their values are likely already in the cache. Therefore, rather than assigning threads linearly, they are assigned in a fractal-like zig-zag pattern (a Z-order curve).
This is done by storing the coordinates interleaved bit by bit in a single integer.
For a 2D system, for example, every second bit of the mixed-coordinate integer belongs to the second coordinate and the remaining bits to the first coordinate. For traversing the grid you then simply add one to the bits belonging to the coordinate you want to step along. The pdep
and pext
instructions allow encoding and decoding these bits. They take two inputs, an integer and a mask. For pdep,
the integer is seen as a list of bits and the mask gets applied to it: the bits get scattered across the output integer according to the set positions in the mask. For pext,
the bits are extracted from the input integer according to the mask and gathered together.
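A small host-side sketch of this encoding, assuming two dimensions and using the x86 BMI2 intrinsics _pdep_u64 and _pext_u64 from immintrin.h (masks and function names here are illustrative, not taken from the repository):
#include <immintrin.h>
#include <stdint.h>

// even bits hold the first coordinate, odd bits the second (d = 2)
static const uint64_t X_MASK = 0x5555555555555555ull;
static const uint64_t Y_MASK = 0xaaaaaaaaaaaaaaaaull;

// interleave the bits of x and y into one mixed-coordinate (Z-order) index
uint64_t zorder_encode(uint64_t x, uint64_t y)
{
    return _pdep_u64(x, X_MASK) | _pdep_u64(y, Y_MASK);
}

// recover the individual coordinates again
uint64_t zorder_x(uint64_t z) { return _pext_u64(z, X_MASK); }
uint64_t zorder_y(uint64_t z) { return _pext_u64(z, Y_MASK); }

// step one site along x: decode, increment, re-encode only the x bits
uint64_t zorder_step_x(uint64_t z)
{
    return _pdep_u64(_pext_u64(z, X_MASK) + 1, X_MASK) | (z & Y_MASK);
}
Compiling this requires BMI2 support, e.g. -mbmi2 with gcc/clang.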
For the arrays on the GPU I am using managed memory, which hands the responsibility of copying the arrays over to the driver and also allows the arrays to be accessed from host code, which is useful for debugging.
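A minimal sketch of such an allocation (variable and function names are illustrative):
#include <cuda_runtime.h>

// allocate a field of N+1 floats that is visible to both host and device;
// the driver migrates the pages on demand
float *alloc_field(int N)
{
    float *u = nullptr;
    cudaMallocManaged(&u, (N + 1) * sizeof(float));
    for (int i = 0; i <= N; i++) // initialise directly from host code
        u[i] = 0.0f;
    return u;
}

// after a kernel has written to u:
//   cudaDeviceSynchronize();  // wait for the GPU before reading u on the host
//   cudaFree(u);              // release the allocation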
The scaling of the GPU version compared to the CPU version of the Laplace operator can be found below:
As can be seen, with a single thread the GPU version is slower than the CPU version, because the overhead from launching the kernels and the
atomicAdd
in the inner product is significant.
For multiple threads, however, the relative performance is much better: the Laplace operator takes about 1/10th of the time of the CPU version in the 500+ blocks range. For a small number of blocks, the overhead and the thread divergence from bounds checking are still significant.
For building, I use CMake. Therefore you have to run
mkdir build # if it does not exist yet
cd build
cmake ../src
make
Then you can either run ./tests
or ./main
.
Some additional, interactive functionality is provided by Julia. For this you have to run julia --project=. src/main.jl
. The Julia version used in this repository is 1.10.4
. For installing it on the HU cluster you have to download it using curl -fsSL https://install.julialang.org | sh
, which will also add it to your .bashrc. The Julia code interacts with the GPU code through the shared library described above.
Every function contains a docstring that was generated using GitHub Copilot and tweaked to match the function. The code was formatted using clang-format.
[1] Jonathan Richard Shewchuk, "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
[2] Generators for large sparse systems http://people.physik.hu-berlin.de/~leder/cp3/laplace.pdf
[3] Gene H. Golub, Charles F. Van Loan, "Matrix Computations", Johns Hopkins University Press, 1989
[4] William Briggs, Van Henson, Steve McCormick, "A Multigrid Tutorial, 2nd Edition", 2000, https://www.researchgate.net/publication/220690328_A_Multigrid_Tutorial_2nd_Edition
[5] Osamu Tatebe, "The Multigrid Preconditioned Conjugate Gradient Method", 1995, https://www.researchgate.net/publication/2818681_The_Multigrid_Preconditioned_Conjugate_Gradient_Method