
Commit e775de6

Update documentation for CudaOffload factorization changes
- Updated GPU tutorial to show the new CudaOffloadLUFactorization/QRFactorization
- Updated solver documentation to explain both algorithms
- Added deprecation warning in documentation
- Updated release notes with upcoming changes
- Created example demonstrating usage of both new algorithms
- Explained when to use each algorithm (LU for performance, QR for stability)
1 parent 57fee72 commit e775de6

File tree

4 files changed: +121 -6 lines changed

docs/src/release_notes.md

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,12 @@
 # Release Notes
 
+## Upcoming Changes
+
+- `CudaOffloadFactorization` has been split into two algorithms:
+  - `CudaOffloadLUFactorization` - Uses LU factorization for better performance
+  - `CudaOffloadQRFactorization` - Uses QR factorization for better numerical stability
+- `CudaOffloadFactorization` is now deprecated and will show a warning suggesting to use one of the new algorithms
+
 ## v2.0
 
 - `LinearCache` changed from immutable to mutable. With this, the out of place interfaces like
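The deprecation path described in the release note (the old constructor keeps working but emits a warning pointing at the replacements) can be sketched in plain Julia. The names below are hypothetical stand-ins, not LinearSolve's actual types or implementation:

```julia
# Hypothetical stand-in for one of the new algorithm types
# (not LinearSolve's real struct).
struct NewLUAlg end

# Hypothetical deprecated constructor: emits a deprecation warning,
# then forwards to the replacement algorithm.
function OldAlg()
    Base.depwarn("`OldAlg` is deprecated; use `NewLUAlg` (or a QR variant) instead.", :OldAlg)
    return NewLUAlg()
end

alg = OldAlg()  # still usable, but flagged for migration
println(typeof(alg))
```

Because the old name forwards to a concrete replacement, existing scripts keep running unchanged while the warning nudges users toward an explicit choice.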

docs/src/solvers/solvers.md

Lines changed: 9 additions & 5 deletions
@@ -23,12 +23,14 @@ use your base system BLAS which can be fast or slow depending on the hardware co
 
 For very large dense factorizations, offloading to the GPU can be preferred. Metal.jl can be used
 on Mac hardware to offload, and has a cutoff point of being faster at around size 20,000 x 20,000
-matrices (and only supports Float32). `CudaOffloadFactorization` can be more efficient at a
-much smaller cutoff, possibly around size 1,000 x 1,000 matrices, though this is highly dependent
-on the chosen GPU hardware. `CudaOffloadFactorization` requires a CUDA-compatible NVIDIA GPU.
+matrices (and only supports Float32). `CudaOffloadLUFactorization` and `CudaOffloadQRFactorization`
+can be more efficient at a much smaller cutoff, possibly around size 1,000 x 1,000 matrices, though
+this is highly dependent on the chosen GPU hardware. These algorithms require a CUDA-compatible NVIDIA GPU.
 CUDA offload supports Float64 but most consumer GPU hardware will be much faster on Float32
 (many are >32x faster for Float32 operations than Float64 operations) and thus for most hardware
-this is only recommended for Float32 matrices.
+this is only recommended for Float32 matrices. Choose `CudaOffloadLUFactorization` for better
+performance on well-conditioned problems, or `CudaOffloadQRFactorization` for better numerical
+stability on ill-conditioned problems.
 
 !!! note
 
@@ -232,9 +234,11 @@ The following are non-standard GPU factorization routines.
 
 !!! note
 
-    Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
+    Using these solvers requires adding the package CUDA.jl, i.e. `using CUDA`
 
 ```@docs
+CudaOffloadLUFactorization
+CudaOffloadQRFactorization
 CudaOffloadFactorization
 ```
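The LU-versus-QR trade-off recommended in this doc change mirrors the behavior of the corresponding CPU factorizations, so it can be illustrated without a GPU. A minimal CPU-only sketch using the standard library's `lu` and `qr` (the matrix construction is illustrative, not taken from the docs):

```julia
using LinearAlgebra, Random

Random.seed!(42)

# Build a deliberately ill-conditioned 50x50 matrix by prescribing its
# singular values to span eight orders of magnitude.
U, S, V = svd(rand(50, 50))
A = U * Diagonal(exp10.(range(0, -8; length = 50))) * V'
b = rand(50)

x_lu = lu(A) \ b   # LU with partial pivoting: usually the fastest dense option
x_qr = qr(A) \ b   # Householder QR: the more robust choice as cond(A) grows

println("cond(A)     = ", cond(A))
println("LU residual = ", norm(A * x_lu - b))
println("QR residual = ", norm(A * x_qr - b))
```

For well-conditioned systems the two give essentially the same answer and LU wins on speed; as the condition number grows, QR's orthogonal transformations make it the safer default.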

docs/src/tutorials/gpu.md

Lines changed: 8 additions & 1 deletion
@@ -40,10 +40,17 @@ This computation can be moved to the GPU by the following:
 
 ```julia
 using CUDA # Add the GPU library
-sol = LS.solve(prob, LS.CudaOffloadFactorization())
+sol = LS.solve(prob, LS.CudaOffloadLUFactorization())
 sol.u
 ```
 
+LinearSolve.jl provides two GPU offloading algorithms:
+- `CudaOffloadLUFactorization()` - Uses LU factorization (generally faster for well-conditioned problems)
+- `CudaOffloadQRFactorization()` - Uses QR factorization (more stable for ill-conditioned problems)
+
+!!! warning
+    The old `CudaOffloadFactorization()` is deprecated. Use `CudaOffloadLUFactorization()` or `CudaOffloadQRFactorization()` instead.
+
 ## GPUArray Interface
 
 For more manual control over the factorization setup, you can use the

examples/cuda_offload_example.jl

Lines changed: 97 additions & 0 deletions
```julia
"""
Example demonstrating the new CudaOffloadLUFactorization and CudaOffloadQRFactorization algorithms.

This example shows how to use the new GPU offloading algorithms for solving linear systems
with different numerical properties.
"""

using LinearSolve
using LinearAlgebra
using Random

# Set random seed for reproducibility
Random.seed!(123)

println("CUDA Offload Factorization Examples")
println("=" ^ 40)

# Create a well-conditioned problem
println("\n1. Well-conditioned problem (condition number ≈ 10)")
A_good = rand(100, 100)
A_good = A_good + 10I # Make it well-conditioned
b_good = rand(100)
prob_good = LinearProblem(A_good, b_good)

println("   Matrix size: $(size(A_good))")
println("   Condition number: $(cond(A_good))")

# Try to use CUDA if available
try
    using CUDA

    # Solve with LU (faster for well-conditioned)
    println("\n   Solving with CudaOffloadLUFactorization...")
    sol_lu = solve(prob_good, CudaOffloadLUFactorization())
    println("   Solution norm: $(norm(sol_lu.u))")
    println("   Residual norm: $(norm(A_good * sol_lu.u - b_good))")

    # Solve with QR (more stable)
    println("\n   Solving with CudaOffloadQRFactorization...")
    sol_qr = solve(prob_good, CudaOffloadQRFactorization())
    println("   Solution norm: $(norm(sol_qr.u))")
    println("   Residual norm: $(norm(A_good * sol_qr.u - b_good))")
catch e
    println("\n   Note: CUDA.jl is not loaded. To use GPU offloading:")
    println("   1. Install CUDA.jl: using Pkg; Pkg.add(\"CUDA\")")
    println("   2. Add 'using CUDA' before running this example")
    println("   3. Ensure you have a CUDA-compatible NVIDIA GPU")
end

# Create an ill-conditioned problem
println("\n2. Ill-conditioned problem (condition number ≈ 1e6)")
A_bad = rand(50, 50)
# Make it ill-conditioned
U, S, V = svd(A_bad)
S[end] = S[1] / 1e6 # Create large condition number
A_bad = U * Diagonal(S) * V'
b_bad = rand(50)
prob_bad = LinearProblem(A_bad, b_bad)

println("   Matrix size: $(size(A_bad))")
println("   Condition number: $(cond(A_bad))")

try
    using CUDA

    # For ill-conditioned problems, QR is typically more stable
    println("\n   Solving with CudaOffloadQRFactorization (recommended for ill-conditioned)...")
    sol_qr_bad = solve(prob_bad, CudaOffloadQRFactorization())
    println("   Solution norm: $(norm(sol_qr_bad.u))")
    println("   Residual norm: $(norm(A_bad * sol_qr_bad.u - b_bad))")

    println("\n   Solving with CudaOffloadLUFactorization (may be less stable)...")
    sol_lu_bad = solve(prob_bad, CudaOffloadLUFactorization())
    println("   Solution norm: $(norm(sol_lu_bad.u))")
    println("   Residual norm: $(norm(A_bad * sol_lu_bad.u - b_bad))")
catch e
    println("\n   Skipping GPU tests (CUDA not available)")
end

# Demonstrate the deprecation warning
println("\n3. Testing deprecated CudaOffloadFactorization")
try
    using CUDA
    println("   Creating deprecated CudaOffloadFactorization...")
    alg = CudaOffloadFactorization() # This will show a deprecation warning
    println("   The deprecated algorithm still works but shows a warning above")
catch e
    println("   Skipping deprecation test (CUDA not available)")
end

println("\n" * "=" ^ 40)
println("Summary:")
println("- Use CudaOffloadLUFactorization for well-conditioned problems (faster)")
println("- Use CudaOffloadQRFactorization for ill-conditioned problems (more stable)")
println("- The old CudaOffloadFactorization is deprecated")
```
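As a side note, the singular-value trick the example uses to manufacture an ill-conditioned matrix (overwriting the smallest singular value with `S[1] / 1e6`) pins the condition number almost exactly, since `cond` of a square matrix is the ratio of its largest to smallest singular value. A quick standalone check, using only the standard library:

```julia
using LinearAlgebra, Random

Random.seed!(123)
U, S, V = svd(rand(50, 50))
S[end] = S[1] / 1e6          # smallest singular value := largest / 1e6
A = U * Diagonal(S) * V'

# cond(A) = largest / smallest singular value, so it lands at ≈ 1e6 here.
println("cond(A) ≈ ", cond(A))
```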
