
Adjoint of cholesky is hard-coded for the CPU  #1210


Description

@Red-Portal

Hi,

I've been attempting to differentiate through a Cholesky decomposition, which is common practice in Gaussian processes. The problem is that the current adjoint for cholesky is hard-coded to use the CPU version of trsm!.

See the following minimal working example:

using CUDA
using KernelAbstractions
using CUDAKernels
using LinearAlgebra

import Tullio
import Zygote

function main()
    N = 1024
    D = 16
    X = randn(Float32, D, N)
    y = randn(Float32, N)

    CUDA.allowscalar(true)
    X_dev = CuArray(X)
    y_dev = CuArray(y)
    @time begin
        ∇K = Zygote.gradient(cu(randn(Float32, D+2))) do θ
            ℓα     = θ[1:1]
            ℓϵ     = θ[2]
            logℓ   = θ[3:end]
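            # assemble the GP kernel matrix on the GPU with Tullio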
            Tullio.@tullio K[i,j] := exp(ℓα[1]*2 - (X_dev[k,i] - X_dev[k,j])^2 / exp(2*logℓ[k])) verbose=true
            K_ϵ      = K + cu(exp(ℓϵ)*I)
            K_ϵ_chol = cholesky(K_ϵ)
            α        = K_ϵ_chol \ y_dev
            dot(α, y_dev)
        end
    end
end

main()

output:

ERROR: ArgumentError: cannot take the CPU address of a CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
Stacktrace:
  [1] unsafe_convert(#unused#::Type{Ptr{Float32}}, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/5jdFl/src/array.jl:315
  [2] trsm!(side::Char, uplo::Char, transa::Char, diag::Char, alpha::Float32, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ LinearAlgebra.BLAS /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/blas.jl:1958
  [3] (::Zygote.var"#817#818"{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Cholesky{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}})(Δ::NamedTuple{(:uplo, :info, :factors), Tuple{Nothing, Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/lib/array.jl:603
  [4] (::Zygote.var"#3217#back#819"{Zygote.var"#817#818"{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Cholesky{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}}})(Δ::NamedTuple{(:uplo, :info, :factors), Tuple{Nothing, Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}})
    @ Zygote ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:67
  [5] Pullback
    @ ./REPL[33]:17 [inlined]
  [6] (::typeof(∂(λ)))(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
  [7] (::Zygote.var"#56#57"{typeof(∂(λ))})(Δ::Float32)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:41
  [8] gradient(f::Function, args::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:76
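
For what it's worth, the failure can be isolated to the trsm! call itself. A rough sketch (the matrix contents here are arbitrary, only the types matter):

using CUDA
using LinearAlgebra

A = CuArray(Matrix{Float32}(I, 4, 4))   # any nonsingular upper-triangular CuArray
B = CUDA.rand(Float32, 4, 4)

# This is what the pullback ends up calling; CPU BLAS needs a host pointer,
# so it throws the ArgumentError shown above:
# LinearAlgebra.BLAS.trsm!('R', 'U', 'T', 'N', 1f0, A, B)

# The CUBLAS wrapper with the same arguments accepts device arrays:
CUDA.CUBLAS.trsm!('R', 'U', 'T', 'N', 1f0, A, B)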

A simple fix is to use the following snippet:

@eval Zygote begin
    import CUDA
    @adjoint function cholesky(Σ::CUDA.CuArray; check = true)
        C = cholesky(Σ, check = check)
        C, function(Δ::NamedTuple)
            issuccess(C) || throw(PosDefException(C.info))
            U, Ū = C.U, Δ.factors

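            # work with plain dense CuArrays here; see the note on triangular
            # matrix products below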
            U_tru = triu(U.data)
            Ū_tru = triu(Ū.data)

            Σ̄ = similar(U.data)
            Σ̄ = mul!(Σ̄, Ū_tru, U_tru')
            Σ̄ = copytri!(Σ̄, 'U')
            Σ̄ = ldiv!(U, Σ̄)
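            # call the CUBLAS trsm! instead of the CPU LinearAlgebra.BLAS.trsm!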
            Σ̄ = CUDA.CUBLAS.trsm!('R', 'U', 'T', 'N', one(eltype(Σ)), U.data, Σ̄)
            Σ̄[diagind(Σ̄)] ./= 2
            return (UpperTriangular(Σ̄),)
        end
    end
end

The two calls to triu are necessary to work around a performance bug in the matrix multiplication between two triangular matrices. I didn't pursue the cause further, but multiplying two triangular matrices on the GPU seems to be roughly 100 times slower than a plain matrix multiplication. Any thoughts on the reason for this?
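
Something along these lines shows the difference (a rough sketch; the sizes are arbitrary and the exact factor will depend on the GPU and the CUDA.jl version). The slow case mirrors the mul!(Σ̄, Ū, U') product in the stock adjoint, the fast case the triu workaround above:

using CUDA
using LinearAlgebra

N   = 4096
A   = CUDA.randn(Float32, N, N)
B   = CUDA.randn(Float32, N, N)
out = similar(A)

# Product with a triangular wrapper, as in the stock adjoint's mul!(Σ̄, Ū, U'):
U = UpperTriangular(A)
CUDA.@sync mul!(out, B, U')            # warm-up
@time CUDA.@sync mul!(out, B, U')

# Same values as a plain dense CuArray, as in the triu workaround; this
# dispatches to the ordinary dense matmul:
U_tru = triu(A)
CUDA.@sync mul!(out, B, U_tru')        # warm-up
@time CUDA.@sync mul!(out, B, U_tru')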
