# Accelerated Rootfinding on GPUs

NonlinearSolve.jl supports GPU acceleration on a wide array of devices, such as:

| GPU Manufacturer | GPU Kernel Language | Julia Support Package                              | Backend Type             |
|:---------------- |:------------------- |:-------------------------------------------------- |:------------------------ |
| NVIDIA           | CUDA                | [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)     | `CUDA.CUDABackend()`     |
| AMD              | ROCm                | [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) | `AMDGPU.ROCBackend()`    |
| Intel            | oneAPI              | [oneAPI.jl](https://github.com/JuliaGPU/oneAPI.jl) | `oneAPI.oneAPIBackend()` |
| Apple (M-Series) | Metal               | [Metal.jl](https://github.com/JuliaGPU/Metal.jl)   | `Metal.MetalBackend()`   |
To use NonlinearSolve.jl on GPUs, there are two distinct approaches:

1. You can build a `NonlinearProblem` / `NonlinearLeastSquaresProblem` where the elements
   of the problem, i.e. `u0` and `p`, live on the GPU. This makes the evaluations of `f`
   occur on the GPU, and all internal updates of the solvers take place on the GPU as well.
   This is the optimal form for large systems of nonlinear equations.
2. You can use SimpleNonlinearSolve.jl solvers inside KernelAbstractions.jl kernels. This
   builds problem-specific GPU kernels in order to parallelize the solution of the chosen
   nonlinear system over a large number of inputs. This is useful for cases where you have
   a small `NonlinearProblem` / `NonlinearLeastSquaresProblem` which you want to solve over
   a large number of initial guesses or parameters.

For a deeper dive into the computational difference between these techniques and why it
leads to different pros/cons, see the
[DiffEqGPU.jl technical paper](https://www.sciencedirect.com/science/article/abs/pii/S0045782523007156).
In particular, the second form is unique to NonlinearSolve.jl and offers orders of magnitude
performance improvements over libraries such as JAX and PyTorch, which are restricted to the
first form.

In this tutorial we will highlight both use cases in separate parts.

!!! note

    If you're looking for GPU-accelerated neural networks inside of nonlinear solvers,
    check out [DeepEquilibriumNetworks.jl](https://docs.sciml.ai/DeepEquilibriumNetworks/stable/).

## GPU Acceleration of Large Nonlinear Systems using GPUArrays

The simplest way to GPU-accelerate a large system is to place your `u0` and `p` values on
the GPU via GPUArrays. For example, CUDA.jl provides the `CuArray` type, which implements
standard array operations such as broadcasting and linear algebra. Since NonlinearSolve.jl
respects your chosen array types, if you make `u0` a type that lives on the GPU, then all
internal broadcasting and linear algebra takes place on the GPU.

This means the one major limitation is that you, as the user, must write your `f` to be
compatible with GPU arrays. These limitations are discussed in detail in the GPU libraries;
for example,
[the CuArray documentation discusses the operations available for CUDA arrays](https://cuda.juliagpu.org/stable/usage/array/).
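
As an illustrative sketch (the names `f_scalar` and `f_broadcast` are hypothetical, not from
the NonlinearSolve.jl docs), a residual written with scalar indexing into a buffer will not
work on a `CuArray`, while the same residual written with broadcasting works on any array
type:

```julia
# GPU-incompatible: scalar indexing into a buffer is disallowed on CuArray outside kernels
function f_scalar(u, p)
    out = similar(u)
    for i in eachindex(u)
        out[i] = u[i]^2 - p[i]
    end
    return out
end

# GPU-friendly: the same residual expressed purely with broadcasting
f_broadcast(u, p) = u .^ 2 .- p
```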

In practice, when this comes together, it looks like:

```julia
using NonlinearSolve, CUDA

f(u, p) = u .* u .- p
u0 = cu(ones(1000))       # `cu` moves the data to the GPU (Float32 by default)
p = cu(collect(1:1000))
prob = NonlinearProblem(f, u0, p)
sol = solve(prob, NewtonRaphson(), abstol = 1f-4)
```

Notice a few things here. First, nothing changes except the input array types. Second,
`cu` arrays default to `Float32` precision. Since NonlinearSolve.jl respects the user's
chosen types, the solve is carried out in `Float32`, which is why the tolerance above is
given as the `Float32` literal `1f-4`.
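
If you would rather keep `Float64` precision, a minimal sketch (assuming CUDA.jl and a GPU
with reasonable double-precision performance) is to construct the arrays with an explicit
element type:

```julia
using NonlinearSolve, CUDA

f(u, p) = u .* u .- p
u0 = CuArray{Float64}(ones(1000))       # explicitly Float64 on the GPU
p = CuArray{Float64}(collect(1:1000))
prob = NonlinearProblem(f, u0, p)
sol = solve(prob, NewtonRaphson(), abstol = 1e-8)  # a Float64-scale tolerance now makes sense
```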

## GPU Acceleration over Large Parameter Searches using KernelAbstractions.jl

If one has a "small" (200 equations or fewer) system of equations which they wish to solve
over many different inputs (parameters), then the kernel generation approach will be much
more efficient than the GPU-based array approach. In short, the GPU array approach strings
together standard GPU kernel calls (matrix multiply, +, etc.), where each operation is an
optimized GPU-accelerated call. In the kernel-building approach, we build one custom kernel
that is compiled specifically for our problem and run in parallel. This is equivalent to
having written custom CUDA code for our problem! The reason this is so much faster is that
each kernel call has launch overhead, and the kernel-building approach cuts everything down
to a single optimized call.

To do this, we use KernelAbstractions.jl. First we have to say "what" our kernel is: the
kernel is the operation you want to call in an embarrassingly parallel fashion over many
inputs. For this nonlinear solve, it is rebuilding our nonlinear problem with new parameters
and solving it. This function must be defined using the `KernelAbstractions.@kernel` macro.
This looks like:

```julia
using NonlinearSolve, StaticArrays
using KernelAbstractions # For writing vendor-agnostic kernels
using CUDA # If you have an NVIDIA GPU
using AMDGPU # If you have an AMD GPU
using Metal # If you have a Mac M-series device and want to use the built-in GPU
using oneAPI # If you have an Intel GPU

@kernel function parallel_nonlinearsolve_kernel!(result, @Const(prob), @Const(alg))
    i = @index(Global)
    prob_i = remake(prob; p = prob.p[i])
    sol = solve(prob_i, alg)
    @inbounds result[i] = sol.u
end
```

Note that `i = @index(Global)` is used to get a global index. That is, this kernel will be
launched over `N` work items (one per parameter set), and `i` identifies the current work
item. So this is saying: "for the ith call, take the ith parameter set, solve the problem
with those parameters, and store the solution as the ith result".
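
Conceptually, the kernel performs the equivalent of the following plain serial loop (a
sketch of the semantics only, not of how the work is actually launched on the device):

```julia
# Serial equivalent of what the kernel body does for each index i
for i in eachindex(prob.p)
    prob_i = remake(prob; p = prob.p[i])   # rebuild the problem with the ith parameter set
    result[i] = solve(prob_i, alg).u       # store the ith solution
end
```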

!!! note

    Because kernel code needs to be compilable to a GPU kernel, it has very strict
    specifications of what's allowed, because GPU cores are not as flexible as CPU cores.
    In general, this means that you need to avoid any runtime operations in kernel code,
    such as allocating vectors, dynamic dispatch, type instabilities, etc. The main thing
    to note is that most NonlinearSolve.jl algorithms will not be compatible with being
    used inside kernels. However, **SimpleNonlinearSolve.jl solvers are tested to be
    compatible**, and thus one should only choose SimpleNonlinearSolve.jl methods within
    kernels.

Once you have defined your kernel, you use KernelAbstractions.jl to launch it over all of
your inputs. This looks like:

```julia
function vectorized_solve(prob, alg; backend = CPU())
    # Allocate the output buffer on the chosen device, one solution per parameter set
    result = KernelAbstractions.allocate(backend, eltype(prob.p), length(prob.p))
    groupsize = min(length(prob.p), 1024)
    # Instantiate the kernel with a workgroup size and a static ndrange, then launch it
    kernel! = parallel_nonlinearsolve_kernel!(backend, groupsize, length(prob.p))
    kernel!(result, prob, alg)
    KernelAbstractions.synchronize(backend)
    return result
end
```

Now let's build a nonlinear system to test it on.

```julia
import NonlinearSolveBase # provides `ImmutableNonlinearProblem` (see the note below)

@inbounds function p2_f(x, p)
    out1 = x[1] + p[1] * x[2]
    out2 = sqrt(p[2]) * (x[3] - x[4])
    out3 = (x[2] - p[3] * x[3])^2
    out4 = sqrt(p[4]) * (x[1] - x[4]) * (x[1] - x[4])
    SA[out1, out2, out3, out4]
end

p = @SVector [@SVector(rand(Float32, 4)) for _ in 1:1024]
u0 = SA[1f0, 2f0, 3f0, 4f0]
prob = NonlinearSolveBase.ImmutableNonlinearProblem{false}(p2_f, u0, p)
```
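
Before launching on a GPU, it can be worth sanity-checking a single instance of the problem
on the CPU. A minimal sketch (the names `prob_one` and `sol_one` are just for illustration),
using the same solver that will run inside the kernel:

```julia
# Solve the problem for just the first parameter set on the CPU
prob_one = remake(prob; p = p[1])
sol_one = solve(prob_one, SimpleNewtonRaphson())
sol_one.u   # the solution for that parameter set
```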

!!! note

    Because the custom kernel needs to embed the code for our nonlinear problem, that code
    must also be written to be GPU compatible. In general, this means that you need to
    avoid any runtime operations in kernel code, such as allocating vectors, dynamic
    dispatch, type instabilities, etc. Thus to make this work, your `f` function should be
    non-allocating, your `u0` should be a StaticArray, and you must use
    `NonlinearSolveBase.ImmutableNonlinearProblem` (which is exactly the same as
    `NonlinearProblem` except it's immutable, to satisfy the requirements of GPU kernels).
    Also, it's recommended that for most GPUs you use `Float32` precision, because many
    GPUs are much slower at 64-bit floating point operations.

We then simply call our `vectorized_solve` function with the desired backend:

```julia
# Threaded CPU
vectorized_solve(prob, SimpleNewtonRaphson(); backend = CPU())
# AMD ROCm GPU
vectorized_solve(prob, SimpleNewtonRaphson(); backend = ROCBackend())
# NVIDIA CUDA GPU
vectorized_solve(prob, SimpleNewtonRaphson(); backend = CUDABackend())
# Intel GPU
vectorized_solve(prob, SimpleNewtonRaphson(); backend = oneAPI.oneAPIBackend())
# Mac M-Series, such as an M3 Max
vectorized_solve(prob, SimpleNewtonRaphson(); backend = Metal.MetalBackend())
```
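
The returned `result` is an array on the chosen device with one solution per parameter set.
As a hedged sketch (assuming, for instance, CUDA.jl), it can be copied back to host memory
for inspection:

```julia
result = vectorized_solve(prob, SimpleNewtonRaphson(); backend = CUDABackend())
host_result = Array(result)   # copy the device array of SVector solutions back to the CPU
host_result[1]                # solution corresponding to the first parameter set
```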

!!! warning

    The GPU-based calls will only work on your machine if you have a compatible GPU!