
Commit 75d6546

Merge branch 'main' into refactor-cuda-offload-factorizations
2 parents 9c67b43 + a26b9c8

7 files changed: +153, −17 lines

Project.toml

Lines changed: 5 additions & 2 deletions
@@ -28,9 +28,9 @@ StaticArraysCore = "1e83bf80-4336-4d27-bf5d-d5a4f845583c"
 UnPack = "3a884ed6-31ef-47d7-9d2a-63182c4928ed"
 
 [weakdeps]
+AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"
 BandedMatrices = "aae01518-5342-5314-be14-df237901396f"
 BlockDiagonals = "0a1fb500-61f7-11e9-3c65-f5ef3456f9f0"
-blis_jll = "6136c539-28a5-5bf0-87cc-b183200dce32"
 CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 CUDSS = "45b445bb-4962-46a0-9369-b4df9d0f772e"
 CUSOLVERRF = "a8cc9031-bad2-4722-94f5-40deabb4245c"
@@ -48,8 +48,10 @@ Pardiso = "46dd5b70-b6fb-5a00-ae2d-e8fea33afaf2"
 RecursiveFactorization = "f2c3362d-daeb-58d1-803e-2bc74f2840b4"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
 Sparspak = "e56a9233-b9d6-4f03-8d0f-1825330902ac"
+blis_jll = "6136c539-28a5-5bf0-87cc-b183200dce32"
 
 [extensions]
+LinearSolveAMDGPUExt = "AMDGPU"
 LinearSolveBLISExt = ["blis_jll", "LAPACK_jll"]
 LinearSolveBandedMatricesExt = "BandedMatrices"
 LinearSolveBlockDiagonalsExt = "BlockDiagonals"
@@ -71,12 +73,12 @@ LinearSolveSparseArraysExt = "SparseArrays"
 LinearSolveSparspakExt = ["SparseArrays", "Sparspak"]
 
 [compat]
+AMDGPU = "1"
 AllocCheck = "0.2"
 Aqua = "0.8"
 ArrayInterface = "7.7"
 BandedMatrices = "1.5"
 BlockDiagonals = "0.2"
-blis_jll = "0.9.0"
 CUDA = "5"
 CUDSS = "0.4"
 CUSOLVERRF = "0.2.6"
@@ -126,6 +128,7 @@ StaticArraysCore = "1.4.2"
 Test = "1"
 UnPack = "1"
 Zygote = "0.7"
+blis_jll = "0.9.0"
 julia = "1.10"
 
 [extras]
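The `[weakdeps]`/`[extensions]` wiring above makes AMDGPU.jl a weak dependency: `LinearSolveAMDGPUExt` is compiled and loaded only once both LinearSolve.jl and AMDGPU.jl are in the session. A minimal sketch of observing this from the REPL (assuming Julia ≥ 1.9, where package extensions and `Base.get_extension` are available):

```julia
using LinearSolve

# Before AMDGPU is loaded, the extension module does not exist yet.
Base.get_extension(LinearSolve, :LinearSolveAMDGPUExt)  # returns nothing

using AMDGPU  # triggers loading of LinearSolveAMDGPUExt

# Now the extension module is present and its solver methods are registered.
Base.get_extension(LinearSolve, :LinearSolveAMDGPUExt)  # returns the extension Module
```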

docs/src/solvers/solvers.md

Lines changed: 13 additions & 0 deletions
@@ -241,6 +241,19 @@ CudaOffloadLUFactorization
 CudaOffloadQRFactorization
 ```
 
+### AMDGPU.jl
+
+The following are GPU factorization routines for AMD GPUs using the ROCm stack.
+
+!!! note
+
+    Using these solvers requires adding the package AMDGPU.jl, i.e. `using AMDGPU`
+
+```@docs
+AMDGPUOffloadLUFactorization
+AMDGPUOffloadQRFactorization
+```
+
 ### CUSOLVERRF.jl
 
 !!! note

docs/src/tutorials/gpu.md

Lines changed: 17 additions & 6 deletions
@@ -39,18 +39,29 @@ sol.u
 This computation can be moved to the GPU by the following:
 
 ```julia
-using CUDA # Add the GPU library
+using CUDA # Add the GPU library for NVIDIA GPUs
 sol = LS.solve(prob, LS.CudaOffloadLUFactorization())
+# or
+sol = LS.solve(prob, LS.CudaOffloadQRFactorization())
 sol.u
 ```
 
-LinearSolve.jl provides two GPU offloading algorithms:
-- `CudaOffloadLUFactorization()` - Uses LU factorization (generally faster for well-conditioned problems)
-- `CudaOffloadQRFactorization()` - Uses QR factorization (more stable for ill-conditioned problems)
+For AMD GPUs, you can use the AMDGPU.jl package:
 
-!!! warning
-    The old `CudaOffloadFactorization()` is deprecated. Use `CudaOffloadLUFactorization()` or `CudaOffloadQRFactorization()` instead.
+```julia
+using AMDGPU # Add the GPU library for AMD GPUs
+sol = LS.solve(prob, LS.AMDGPUOffloadLUFactorization()) # LU factorization
+# or
+sol = LS.solve(prob, LS.AMDGPUOffloadQRFactorization()) # QR factorization
+sol.u
+```
 
+LinearSolve.jl provides multiple GPU offloading algorithms:
+- `CudaOffloadLUFactorization()` - Uses LU factorization on NVIDIA GPUs (generally faster for well-conditioned problems)
+- `CudaOffloadQRFactorization()` - Uses QR factorization on NVIDIA GPUs (more stable for ill-conditioned problems)
+- `AMDGPUOffloadLUFactorization()` - Uses LU factorization on AMD GPUs (generally faster for well-conditioned problems)
+- `AMDGPUOffloadQRFactorization()` - Uses QR factorization on AMD GPUs (more stable for ill-conditioned problems)
+-
 ## GPUArray Interface
 
 For more manual control over the factorization setup, you can use the
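Pulling the tutorial snippets above together, a minimal end-to-end sketch for the AMD path (assuming a ROCm-capable GPU and the `LS` alias for LinearSolve used earlier in the tutorial; the problem size and element type are illustrative):

```julia
import LinearSolve as LS
using AMDGPU

n = 1024
A = rand(Float32, n, n)  # dense CPU matrix; offloading pays off for large A
b = rand(Float32, n)
prob = LS.LinearProblem(A, b)

sol = LS.solve(prob, LS.AMDGPUOffloadLUFactorization())
sol.u  # solution vector, computed on the GPU and copied back to the CPU
```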

docs/src/tutorials/linear.md

Lines changed: 7 additions & 9 deletions
@@ -84,11 +84,10 @@ LinearSolve.jl specifically tests with the following cases:
 
 !!! note
 
-
-Choosing the most specific matrix structure that matches your specific system will give you the most performance.
-Thus if your matrix is symmetric, specifically building with `Symmetric(A)` will be faster than simply using `A`,
-and will generally lead to better automatic linear solver choices. Note that you can also choose the type for `b`,
-but generally a dense vector will be the fastest here and many solvers will not support a sparse `b`.
+    Choosing the most specific matrix structure that matches your specific system will give you the most performance.
+    Thus if your matrix is symmetric, specifically building with `Symmetric(A)` will be faster than simply using `A`,
+    and will generally lead to better automatic linear solver choices. Note that you can also choose the type for `b`,
+    but generally a dense vector will be the fastest here and many solvers will not support a sparse `b`.
 
 ## Using Matrix-Free Operators via SciMLOperators.jl
 
@@ -160,7 +159,6 @@ mfopA * sol.u - b
 
 !!! note
 
-
-Note that not all methods can use a matrix-free operator. For example, `LS.LUFactorization()` requires a matrix. If you use an
-invalid method, you will get an error. The methods particularly from KrylovJL are the ones preferred for these cases
-(and are defaulted to).
+    Note that not all methods can use a matrix-free operator. For example, `LS.LUFactorization()` requires a matrix. If you use an
+    invalid method, you will get an error. The methods particularly from KrylovJL are the ones preferred for these cases
+    (and are defaulted to).

ext/LinearSolveAMDGPUExt.jl

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+module LinearSolveAMDGPUExt
+
+using AMDGPU
+using LinearSolve: LinearSolve, LinearCache, AMDGPUOffloadLUFactorization,
+                   AMDGPUOffloadQRFactorization, init_cacheval, OperatorAssumptions
+using LinearSolve.LinearAlgebra, LinearSolve.SciMLBase
+
+# LU Factorization
+function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::AMDGPUOffloadLUFactorization;
+        kwargs...)
+    if cache.isfresh
+        fact = AMDGPU.rocSOLVER.getrf!(AMDGPU.ROCArray(cache.A))
+        cache.cacheval = fact
+        cache.isfresh = false
+    end
+
+    A_gpu, ipiv = cache.cacheval
+    b_gpu = AMDGPU.ROCArray(cache.b)
+
+    AMDGPU.rocSOLVER.getrs!('N', A_gpu, ipiv, b_gpu)
+
+    y = Array(b_gpu)
+    cache.u .= y
+    SciMLBase.build_linear_solution(alg, y, nothing, cache)
+end
+
+function LinearSolve.init_cacheval(alg::AMDGPUOffloadLUFactorization, A, b, u, Pl, Pr,
+        maxiters::Int, abstol, reltol, verbose::Bool,
+        assumptions::OperatorAssumptions)
+    AMDGPU.rocSOLVER.getrf!(AMDGPU.ROCArray(A))
+end
+
+# QR Factorization
+function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::AMDGPUOffloadQRFactorization;
+        kwargs...)
+    if cache.isfresh
+        A_gpu = AMDGPU.ROCArray(cache.A)
+        tau = AMDGPU.ROCVector{eltype(A_gpu)}(undef, min(size(A_gpu)...))
+        AMDGPU.rocSOLVER.geqrf!(A_gpu, tau)
+        cache.cacheval = (A_gpu, tau)
+        cache.isfresh = false
+    end
+
+    A_gpu, tau = cache.cacheval
+    b_gpu = AMDGPU.ROCArray(cache.b)
+
+    # Apply Q^T to b
+    AMDGPU.rocSOLVER.ormqr!('L', 'T', A_gpu, tau, b_gpu)
+
+    # Solve the upper triangular system
+    m, n = size(A_gpu)
+    AMDGPU.rocBLAS.trsv!('U', 'N', 'N', n, A_gpu, b_gpu)
+
+    y = Array(b_gpu[1:n])
+    cache.u .= y
+    SciMLBase.build_linear_solution(alg, y, nothing, cache)
+end
+
+function LinearSolve.init_cacheval(alg::AMDGPUOffloadQRFactorization, A, b, u, Pl, Pr,
+        maxiters::Int, abstol, reltol, verbose::Bool,
+        assumptions::OperatorAssumptions)
+    A_gpu = AMDGPU.ROCArray(A)
+    tau = AMDGPU.ROCVector{eltype(A_gpu)}(undef, min(size(A_gpu)...))
+    AMDGPU.rocSOLVER.geqrf!(A_gpu, tau)
+    (A_gpu, tau)
+end
+
+end
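For intuition, here is a standalone sketch of the LU offload pattern the extension implements, stripped of LinearSolve's cache machinery. It assumes a working ROCm setup and uses only the rocSOLVER calls that appear in the extension above:

```julia
using AMDGPU, LinearAlgebra

A = rand(100, 100)
b = rand(100)

# getrf! computes the pivoted LU factorization in place on the GPU and
# returns the factored matrix together with the pivot vector.
A_gpu, ipiv = AMDGPU.rocSOLVER.getrf!(AMDGPU.ROCArray(A))

# getrs! solves in place against the right-hand side using the stored factors.
b_gpu = AMDGPU.ROCArray(b)
AMDGPU.rocSOLVER.getrs!('N', A_gpu, ipiv, b_gpu)

# Copy the solution back to the CPU and sanity-check the residual.
x = Array(b_gpu)
@assert A * x ≈ b
```

The `solve!` methods above follow the same flow, with the factorization stashed in `cache.cacheval` so that repeated solves against new right-hand sides skip the refactorization.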

src/LinearSolve.jl

Lines changed: 1 addition & 0 deletions
@@ -257,6 +257,7 @@ export HYPREAlgorithm
 export CudaOffloadFactorization
 export CudaOffloadLUFactorization
 export CudaOffloadQRFactorization
+export AMDGPUOffloadLUFactorization, AMDGPUOffloadQRFactorization
 export MKLPardisoFactorize, MKLPardisoIterate
 export PanuaPardisoFactorize, PanuaPardisoIterate
 export PardisoJL

src/extension_algs.jl

Lines changed: 42 additions & 0 deletions
@@ -129,6 +129,48 @@ struct CudaOffloadFactorization <: AbstractFactorization
     end
 end
 
+"""
+`AMDGPUOffloadLUFactorization()`
+
+An offloading technique using LU factorization to GPU-accelerate CPU-based computations on AMD GPUs.
+Requires a sufficiently large `A` to overcome the data transfer costs.
+
+!!! note
+
+    Using this solver requires adding the package AMDGPU.jl, i.e. `using AMDGPU`
+"""
+struct AMDGPUOffloadLUFactorization <: LinearSolve.AbstractFactorization
+    function AMDGPUOffloadLUFactorization()
+        ext = Base.get_extension(@__MODULE__, :LinearSolveAMDGPUExt)
+        if ext === nothing
+            error("AMDGPUOffloadLUFactorization requires that AMDGPU is loaded, i.e. `using AMDGPU`")
+        else
+            return new{}()
+        end
+    end
+end
+
+"""
+`AMDGPUOffloadQRFactorization()`
+
+An offloading technique using QR factorization to GPU-accelerate CPU-based computations on AMD GPUs.
+Requires a sufficiently large `A` to overcome the data transfer costs.
+
+!!! note
+
+    Using this solver requires adding the package AMDGPU.jl, i.e. `using AMDGPU`
+"""
+struct AMDGPUOffloadQRFactorization <: LinearSolve.AbstractFactorization
+    function AMDGPUOffloadQRFactorization()
+        ext = Base.get_extension(@__MODULE__, :LinearSolveAMDGPUExt)
+        if ext === nothing
+            error("AMDGPUOffloadQRFactorization requires that AMDGPU is loaded, i.e. `using AMDGPU`")
+        else
+            return new{}()
+        end
+    end
+end
+
 ## RFLUFactorization
 
 """
