7 changes: 7 additions & 0 deletions docs/src/release_notes.md
@@ -1,5 +1,12 @@
# Release Notes

## Upcoming Changes

- `CudaOffloadFactorization` has been split into two algorithms:
- `CudaOffloadLUFactorization` - Uses LU factorization for better performance
- `CudaOffloadQRFactorization` - Uses QR factorization for better numerical stability
- `CudaOffloadFactorization` is now deprecated and emits a deprecation warning suggesting one of the new algorithms
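
A minimal migration sketch (the problem setup below is illustrative):

```julia
using LinearSolve, CUDA

A = rand(Float32, 1000, 1000)
b = rand(Float32, 1000)
prob = LinearProblem(A, b)

# Previously (now deprecated, emits a warning):
# sol = solve(prob, CudaOffloadFactorization())

sol = solve(prob, CudaOffloadLUFactorization())   # faster on well-conditioned problems
# sol = solve(prob, CudaOffloadQRFactorization()) # more robust on ill-conditioned problems
```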

## v2.0

- `LinearCache` changed from immutable to mutable. With this, the out of place interfaces like
15 changes: 9 additions & 6 deletions docs/src/solvers/solvers.md
@@ -23,12 +23,14 @@ use your base system BLAS which can be fast or slow depending on the hardware co

For very large dense factorizations, offloading to the GPU can be preferred. Metal.jl can be used
on Mac hardware to offload, and has a cutoff point of being faster at around size 20,000 x 20,000
matrices (and only supports Float32). `CudaOffloadFactorization` can be more efficient at a
much smaller cutoff, possibly around size 1,000 x 1,000 matrices, though this is highly dependent
on the chosen GPU hardware. `CudaOffloadFactorization` requires a CUDA-compatible NVIDIA GPU.
matrices (and only supports Float32). `CudaOffloadLUFactorization` and `CudaOffloadQRFactorization`
can be more efficient at a much smaller cutoff, possibly around size 1,000 x 1,000 matrices, though
this is highly dependent on the chosen GPU hardware. These algorithms require a CUDA-compatible NVIDIA GPU.
CUDA offload supports Float64 but most consumer GPU hardware will be much faster on Float32
(many are >32x faster for Float32 operations than Float64 operations) and thus for most hardware
this is only recommended for Float32 matrices.
this is only recommended for Float32 matrices. Choose `CudaOffloadLUFactorization` for better
performance on well-conditioned problems, or `CudaOffloadQRFactorization` for better numerical
stability on ill-conditioned problems.
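
As a rough sketch of the guidance above (the matrix size and the decision to drop to Float32 are illustrative, not hard rules):

```julia
using LinearSolve, CUDA

# Consumer GPUs are typically much faster in Float32, so convert a Float64
# system before offloading when the reduced precision is acceptable.
A = rand(4000, 4000)
b = rand(4000)
prob = LinearProblem(Float32.(A), Float32.(b))

sol = solve(prob, CudaOffloadLUFactorization())   # speed on well-conditioned systems
# sol = solve(prob, CudaOffloadQRFactorization()) # robustness on ill-conditioned systems
```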

!!! note

@@ -232,10 +234,11 @@ The following are non-standard GPU factorization routines.

!!! note

Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
Using these solvers requires adding the package CUDA.jl, i.e. `using CUDA`

```@docs
CudaOffloadFactorization
CudaOffloadLUFactorization
CudaOffloadQRFactorization
```

### AMDGPU.jl
8 changes: 8 additions & 0 deletions docs/src/tutorials/gpu.md
@@ -41,6 +41,8 @@ This computation can be moved to the GPU by the following:
```julia
using CUDA # Add the GPU library for NVIDIA GPUs
sol = LS.solve(prob, LS.CudaOffloadLUFactorization())
# or
sol = LS.solve(prob, LS.CudaOffloadQRFactorization())
sol.u
```

@@ -54,6 +56,12 @@ sol = LS.solve(prob, LS.AMDGPUOffloadQRFactorization()) # QR factorization
sol.u
```

LinearSolve.jl provides multiple GPU offloading algorithms:
- `CudaOffloadLUFactorization()` - Uses LU factorization on NVIDIA GPUs (generally faster for well-conditioned problems)
- `CudaOffloadQRFactorization()` - Uses QR factorization on NVIDIA GPUs (more stable for ill-conditioned problems)
- `AMDGPUOffloadLUFactorization()` - Uses LU factorization on AMD GPUs (generally faster for well-conditioned problems)
- `AMDGPUOffloadQRFactorization()` - Uses QR factorization on AMD GPUs (more stable for ill-conditioned problems)
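
A minimal sketch of picking an offload algorithm based on the available hardware (the `CUDA.functional()`/`AMDGPU.functional()` checks and the CPU fallback are illustrative; in practice you would load only the GPU package that matches your system):

```julia
import LinearSolve as LS
using CUDA, AMDGPU   # load only the backend matching your hardware

A = rand(Float32, 2000, 2000)
b = rand(Float32, 2000)
prob = LS.LinearProblem(A, b)

alg = if CUDA.functional()
    LS.CudaOffloadLUFactorization()      # NVIDIA GPU available
elseif AMDGPU.functional()
    LS.AMDGPUOffloadLUFactorization()    # AMD GPU available
else
    LS.LUFactorization()                 # CPU fallback
end

sol = LS.solve(prob, alg)
sol.u
```
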
## GPUArray Interface

For more manual control over the factorization setup, you can use the
44 changes: 43 additions & 1 deletion ext/LinearSolveCUDAExt.jl
@@ -5,7 +5,7 @@ using LinearSolve: LinearSolve, is_cusparse, defaultalg, cudss_loaded, DefaultLi
DefaultAlgorithmChoice, ALREADY_WARNED_CUDSS, LinearCache,
needs_concrete_A,
error_no_cudss_lu, init_cacheval, OperatorAssumptions,
CudaOffloadFactorization,
CudaOffloadFactorization, CudaOffloadLUFactorization, CudaOffloadQRFactorization,
SparspakFactorization, KLUFactorization, UMFPACKFactorization
using LinearSolve.LinearAlgebra, LinearSolve.SciMLBase, LinearSolve.ArrayInterface
using SciMLBase: AbstractSciMLOperator
@@ -35,6 +35,48 @@ function LinearSolve.error_no_cudss_lu(A::CUDA.CUSPARSE.CuSparseMatrixCSR)
nothing
end

function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::CudaOffloadLUFactorization;
kwargs...)
if cache.isfresh
fact = lu(CUDA.CuArray(cache.A))
cache.cacheval = fact
cache.isfresh = false
end
y = Array(ldiv!(CUDA.CuArray(cache.u), cache.cacheval, CUDA.CuArray(cache.b)))
cache.u .= y
SciMLBase.build_linear_solution(alg, y, nothing, cache)
end

function LinearSolve.init_cacheval(alg::CudaOffloadLUFactorization, A, b, u, Pl, Pr,
maxiters::Int, abstol, reltol, verbose::Bool,
assumptions::OperatorAssumptions)
T = eltype(A)
noUnitT = typeof(zero(T))
luT = LinearAlgebra.lutype(noUnitT)
ipiv = CuVector{Int32}(undef, 0)
info = zero(LinearAlgebra.BlasInt)
return LU{luT}(CuMatrix{Float64}(undef, 0, 0), ipiv, info)
end

function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::CudaOffloadQRFactorization;
kwargs...)
if cache.isfresh
fact = qr(CUDA.CuArray(cache.A))
cache.cacheval = fact
cache.isfresh = false
end
y = Array(ldiv!(CUDA.CuArray(cache.u), cache.cacheval, CUDA.CuArray(cache.b)))
cache.u .= y
SciMLBase.build_linear_solution(alg, y, nothing, cache)
end

function LinearSolve.init_cacheval(alg::CudaOffloadQRFactorization, A, b, u, Pl, Pr,
maxiters::Int, abstol, reltol, verbose::Bool,
assumptions::OperatorAssumptions)
qr(CUDA.CuArray(A))
end

# Keep the deprecated CudaOffloadFactorization working by forwarding to QR
function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::CudaOffloadFactorization;
kwargs...)
if cache.isfresh
6 changes: 3 additions & 3 deletions lib/LinearSolveAutotune/src/algorithms.jl
@@ -84,10 +84,10 @@ function get_gpu_algorithms(; skip_missing_algs::Bool = false)
# CUDA algorithms
if is_cuda_available()
try
push!(gpu_algs, CudaOffloadFactorization())
push!(gpu_names, "CudaOffloadFactorization")
push!(gpu_algs, CudaOffloadLUFactorization())
push!(gpu_names, "CudaOffloadLUFactorization")
catch e
msg = "CUDA hardware detected but CudaOffloadFactorization not available: $e. Load CUDA.jl package."
msg = "CUDA hardware detected but CudaOffloadLUFactorization not available: $e. Load CUDA.jl package."
if skip_missing_algs
@warn msg
else
5 changes: 4 additions & 1 deletion src/LinearSolve.jl
@@ -254,7 +254,10 @@ export KrylovJL, KrylovJL_CG, KrylovJL_MINRES, KrylovJL_GMRES,
export SimpleGMRES

export HYPREAlgorithm
export CudaOffloadFactorization, AMDGPUOffloadLUFactorization, AMDGPUOffloadQRFactorization
export CudaOffloadFactorization
export CudaOffloadLUFactorization
export CudaOffloadQRFactorization
export AMDGPUOffloadLUFactorization, AMDGPUOffloadQRFactorization
export MKLPardisoFactorize, MKLPardisoIterate
export PanuaPardisoFactorize, PanuaPardisoIterate
export PardisoJL
51 changes: 49 additions & 2 deletions src/extension_algs.jl
@@ -61,23 +61,70 @@ struct HYPREAlgorithm <: SciMLLinearSolveAlgorithm
end
end

"""
`CudaOffloadLUFactorization()`

An offloading technique used to GPU-accelerate CPU-based computations using LU factorization.
Requires a sufficiently large `A` to overcome the data transfer costs.

!!! note

Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
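
A minimal usage sketch of reusing the cached factorization across several right-hand sides (array sizes are illustrative):

```julia
using LinearSolve, CUDA

A = rand(Float32, 2000, 2000)
prob = LinearProblem(A, rand(Float32, 2000))

cache = init(prob, CudaOffloadLUFactorization())
sol1 = solve!(cache)              # factorizes A on the GPU

cache.b = rand(Float32, 2000)     # new right-hand side
sol2 = solve!(cache)              # reuses the cached LU factors
```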
"""
struct CudaOffloadLUFactorization <: AbstractFactorization
function CudaOffloadLUFactorization()
ext = Base.get_extension(@__MODULE__, :LinearSolveCUDAExt)
if ext === nothing
error("CudaOffloadLUFactorization requires that CUDA is loaded, i.e. `using CUDA`")
else
return new()
end
end
end

"""
`CudaOffloadQRFactorization()`

An offloading technique used to GPU-accelerate CPU-based computations using QR factorization.
Requires a sufficiently large `A` to overcome the data transfer costs.

!!! note

Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
"""
struct CudaOffloadQRFactorization <: AbstractFactorization
function CudaOffloadQRFactorization()
ext = Base.get_extension(@__MODULE__, :LinearSolveCUDAExt)
if ext === nothing
error("CudaOffloadQRFactorization requires that CUDA is loaded, i.e. `using CUDA`")
else
return new()
end
end
end

"""
`CudaOffloadFactorization()`

!!! warning

This algorithm is deprecated. Use `CudaOffloadLUFactorization` or `CudaOffloadQRFactorization` instead.

An offloading technique used to GPU-accelerate CPU-based computations.
Requires a sufficiently large `A` to overcome the data transfer costs.

!!! note

Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
"""
struct CudaOffloadFactorization <: LinearSolve.AbstractFactorization
struct CudaOffloadFactorization <: AbstractFactorization
function CudaOffloadFactorization()
Base.depwarn("`CudaOffloadFactorization` is deprecated, use `CudaOffloadLUFactorization` or `CudaOffloadQRFactorization` instead.", :CudaOffloadFactorization)
ext = Base.get_extension(@__MODULE__, :LinearSolveCUDAExt)
if ext === nothing
error("CudaOffloadFactorization requires that CUDA is loaded, i.e. `using CUDA`")
else
return new{}()
return new()
end
end
end
2 changes: 1 addition & 1 deletion test/gpu/cuda.jl
@@ -45,7 +45,7 @@ function test_interface(alg, prob1, prob2)
return
end

@testset "$alg" for alg in (CudaOffloadFactorization(), NormalCholeskyFactorization())
@testset "$alg" for alg in (CudaOffloadLUFactorization(), CudaOffloadQRFactorization(), NormalCholeskyFactorization())
test_interface(alg, prob1, prob2)
end

2 changes: 2 additions & 0 deletions test/resolve.jl
@@ -11,6 +11,8 @@ for alg in vcat(InteractiveUtils.subtypes(AbstractDenseFactorization),
if !(alg in [
DiagonalFactorization,
CudaOffloadFactorization,
CudaOffloadLUFactorization,
CudaOffloadQRFactorization,
CUSOLVERRFFactorization,
AppleAccelerateLUFactorization,
MetalLUFactorization