Device-callable solvers ? #1876

trsxvz · 2025-06-26T16:57:27Z

trsxvz
Jun 26, 2025

Hi everyone,

Thank you in advance for taking the time to read this. Please excuse my lack of experience, I am a newcomer to the field of HPC.

I have recently been working on some CUDA-parallelized scientific code. I am working on an embarassingly parallel problem in the sense that I have to do some independent computations on a given number of points, let n. These computations vary depending on the studied constitutive law ; therefore, they are encapsulated (as well as their cuda loading/storing operations) in different functors that I pass to a generic cuda kernel launcher as follows :

inline void gpu_runner(Functor&& func, 
                       const int n, 
                       const int NBPG, 
                       const int NTPB, 
                       Args&&... args) {

    // timers, etc

    cuda_kernel<<<NBPG, NTPB, shmem_bytes>>>(func, n, std::forward<Args>(args)...);

    // end of timers, etc

}

Now here is that CUDA kernel, where each thread works on exactly one out of the total n independent points :

template<typename Functor, typename... Args>
__global__ void cuda_kernel(const Functor func, 
                            const int n,
                            Args... args) {
    
    // Global CUDA thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Grid stride loop
    for (; idx < n; idx += blockDim.x * gridDim.x) {
    
        // Computations done according to the passed functor (corresponding to a given constitutive law)
        // Embarassingly parallel problem : each thread does independent computations
        func(idx, n, std::forward<Args>(args)...);
        
    }
    
}

All of my functors do some basic computations, but some of them feature additional iterative methods such as Newton-Raphson, which requires the solving of an Ax=b linear system at each iteration. This resolving causes performance issues, and instead of optimizing it myself, I am looking for pre-made algebraic libraries that come up with already-optimised solvers.

However, due to the specific structure of the problem/code, I am looking for some device-callable solvers, that I could call from inside my cuda kernel. So my two questions here are :

Does anyone know if Ginkgo features some sort of device-callable solver that would match this particular case ?
If not : Does Ginkgo feature some device-callable solvers at all ?

If the answer to both these questions is no, I guess I will have to use a host-called batched-solver, which would be quite annoying given the fact that all n points do not require the same number of iterations (i.e., not the same number of batched-solver calls, which would mean a completely different code structure).

Answered by yhmtsai

Jun 27, 2025

Hi @trsxvz ,
Thanks for reaching out to us.
@pratikvn corrects me if I am wrong.
I think it is possible to call batch solver from device but you need to having the batch corresponding header into your searching path.

you can check the kernel we have for batch solver in https://github.com/ginkgo-project/ginkgo/blob/develop/common/cuda_hip/solver/batch_cg_kernels.hpp where apply_kernel is the actual kernel we call for batch CG solver.
However, you need to arrange your data to fit the function style.

https://github.com/ginkgo-project/ginkgo/blob/develop/core/solver/batch_dispatch.hpp gives the ways how we map the host type to the type used in kernel, so you do not need to prepare the class f…

View full answer

yhmtsai · 2025-06-27T14:04:24Z

yhmtsai
Jun 27, 2025
Collaborator

Hi @trsxvz ,
Thanks for reaching out to us.
@pratikvn corrects me if I am wrong.
I think it is possible to call batch solver from device but you need to having the batch corresponding header into your searching path.

you can check the kernel we have for batch solver in https://github.com/ginkgo-project/ginkgo/blob/develop/common/cuda_hip/solver/batch_cg_kernels.hpp where apply_kernel is the actual kernel we call for batch CG solver.
However, you need to arrange your data to fit the function style.

https://github.com/ginkgo-project/ginkgo/blob/develop/core/solver/batch_dispatch.hpp gives the ways how we map the host type to the type used in kernel, so you do not need to prepare the class from host type.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Device-callable solvers ? #1876

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Device-callable solvers ? #1876

Uh oh!

trsxvz Jun 26, 2025

Replies: 1 comment

Uh oh!

yhmtsai Jun 27, 2025 Collaborator

trsxvz
Jun 26, 2025

yhmtsai
Jun 27, 2025
Collaborator