Skip to content

Add CUDA extension for GPU-accelerated solvers#574

Draft
zazabap wants to merge 1 commit intoJuliaManifolds:masterfrom
zazabap:cuda-gpu-extension
Draft

Add CUDA extension for GPU-accelerated solvers#574
zazabap wants to merge 1 commit intoJuliaManifolds:masterfrom
zazabap:cuda-gpu-extension

Conversation

@zazabap
Copy link

@zazabap zazabap commented Feb 11, 2026

Summary

Adds ManoptCUDAExt package extension enabling Manopt solvers to work transparently with CuArray-backed manifold points on NVIDIA GPUs.

What the extension does

  • Overrides ManifoldsBase.allocate for CuArray (4 methods) so workspace buffers stay on GPU
  • Patches linesearch_backtrack! to handle CPU workspace / GPU iterate type mismatches
  • Adds CUDA as a weak dependency ([weakdeps] + [extensions]) with compat CUDA = "5"

Files changed

File Description
ext/ManoptCUDAExt.jl Extension module (109 lines)
test/test_cuda_ext.jl GPU tests for allocate + solver paths
Project.toml CUDA weak dep, extension registration, compat
Changelog.md v0.5.33 entry

Solver and Stepsize Coverage

Tested

  • gradient_descent + ConstantLength on Euclidean
  • gradient_descent + ArmijoLinesearch (default) on Euclidean

Patched but not tested

  • NonmonotoneLinesearchStepsize — patched via same linesearch_backtrack! override
  • conjugate_gradient_descent — uses same Armijo default, should work

Likely work (no workspace arrays), not tested

  • DecreasingStepsize, PolyakStepsize, AdaptiveWNGradient, ConstantStepsize
  • WolfePowellBinaryLinesearchStepsize — allocating retract_fused, benefits from allocate override
  • Stateless direction rules: IdentityUpdateRule, ConjugateDescent, FletcherReeves
  • Scalar stopping criteria: GradientNormLess, CostLess, CostChangeLess, StepsizeLess

Will break — requires base-package changes

These stepsizes pre-allocate candidate_point::P via allocate_result(M, rand) at construction time (before iterate type is known) and use it directly in retract_fused!, bypassing linesearch_backtrack!:

  • WolfePowellLinesearchStepsize — affects quasi_Newton default
  • CubicBracketingLinesearchStepsizecandidate_point + candidate_direction
  • DomainBackTrackingStepsize — affects convex_bundle_method
  • NullStepBackTrackingStepsizecandidate_point + X
  • ProximalGradientMethodBacktrackingStepsize — affects proximal_gradient_method

Possible fix: lazy-init candidate_point on first solver step (when iterate type is known), or make the field non-parametric.

Will break — direction/stopping criterion issues

  • MomentumGradientRule, AverageGradientRule, NesterovRule — stored state + vector transport type mismatch
  • DirectionUpdateRuleStorage / StoreStateAction — stores CPU iterates
  • StopWhenIterateNaNany(isnan.(get_iterate(s))) triggers scalar indexing on CuArray

Manifolds.jl GPU Compatibility (tested locally)

The Manopt extension alone is sufficient for Euclidean and Sphere manifolds. Other manifolds need work in Manifolds.jl:

Manifold retract_fused inner/norm vector_transport distance
Euclidean all OK OK OK OK
Sphere Exp OK, Proj OK OK Parallel OK OK
Stiefel Polar OK, Exp FAIL OK FAIL FAIL
Rotations Exp FAIL, Proj FAIL OK FAIL FAIL
SPD Exp OK, Proj FAIL FAIL FAIL FAIL
Hyperbolic FAIL FAIL FAIL FAIL

Three failure modes in Manifolds.jl (not in scope for this PR):

  1. Scalar indexing (Stiefel exp, Rotations, Hyperbolic) — implementations use element-wise p[i,j] access or CPU-only eigen/exp
  2. copyto! ambiguity (SPD) — copyto!(CuArray, Symmetric{CuArray}) is ambiguous between LinearAlgebra and GPUArrays
  3. Missing GPU dispatch — matrix exp/log/eigen don't dispatch to CUSOLVER for CuArray

A ManifoldsCUDAExt in Manifolds.jl would be needed to address these.

Test plan

  • test/test_cuda_ext.jl — GPU allocate, GD + ConstantLength, GD + Armijo
  • conjugate_gradient_descent with CuArray
  • Matrix-valued iterates (Euclidean(n,n))
  • Float32 iterates
  • record=[:Cost] + return_state=true
  • Sphere manifold end-to-end
  • CI with CUDA runner (tested locally on RTX 3090)
  • Verify no regressions on CPU-only paths

Add ManoptCUDAExt package extension that enables Manopt solvers
(gradient_descent, conjugate_gradient_descent) to work transparently
with CuArray-backed manifold points. The extension overrides
ManifoldsBase.allocate for CuArray and patches linesearch_backtrack!
to handle CPU workspace / GPU iterate mismatches.

Supported stepsizes: ConstantLength, ArmijoLinesearch,
NonmonotoneLinesearch. Includes tests for both allocate and solver
paths on Euclidean manifolds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zazabap
Copy link
Author

zazabap commented Feb 11, 2026

Audit: Unresolved & Untested Items

What the extension currently covers

  • ManifoldsBase.allocate overrides for CuArray (4 methods)
  • linesearch_backtrack! patch for CPU workspace / GPU iterate mismatch

Solvers

  • gradient_descent (ConstantLength + Armijo)
  • conjugate_gradient_descent — claimed supported, not tested
  • quasi_Newtonwill break (default WolfePowellLinesearch uses candidate_point directly)
  • stochastic_gradient_descent, subgradient_method, FrankWolfe — likely work (scalar stepsizes), not tested
  • All other solvers (20+) — not tested

Stepsize Types

  • ConstantStepsize — works (no workspace)
  • ArmijoLinesearchStepsize — patched via linesearch_backtrack!
  • NonmonotoneLinesearchStepsize — patched, not tested
  • WolfePowellLinesearchStepsizewill break (uses retract_fused!(M, a.candidate_point, p, ...) directly, not via linesearch_backtrack!)
  • CubicBracketingLinesearchStepsizewill break (same pattern + candidate_direction)
  • WolfePowellBinaryLinesearchStepsize — uses allocating retract_fused, should work via allocate override, not tested
  • DomainBackTrackingStepsize, NullStepBackTrackingStepsize, ProximalGradientMethodBacktrackingStepsizewill break (pre-allocated candidate_point)
  • DecreasingStepsize, PolyakStepsize, AdaptiveWNGradient, DistanceOverGradients, AffineCovariant — likely work, not tested

Direction Update Rules

  • IdentityUpdateRule — works
  • MomentumGradientRule, AverageGradientRule, NesterovRulewill break (stored state + vector transport type mismatch)
  • DirectionUpdateRuleStoragewill break (StoreStateAction stores iterates)
  • Stateless coefficient rules (ConjugateDescent, FletcherReeves, etc.) — likely work, not tested

Stopping Criteria

  • StopAfterIteration — works
  • StopWhenIterateNaNwill break (any(isnan.(get_iterate(s))) on CuArray triggers scalar indexing)
  • StopWhenChangeLess, StopWhenEntryChangeLess — risk from StoreStateAction type mismatch
  • Scalar criteria (GradientNormLess, CostLess, etc.) — likely work, not tested

Root cause for "will break" items

All candidate_point issues stem from stepsize constructors calling allocate_result(M, rand) at construction time (before the iterate type is known), storing the result in a parametric field candidate_point::P where P = Vector{Float64}. The extension cannot fix this — it requires a base-package change (e.g. lazy-init or non-parametric field).

Missing test coverage

  • conjugate_gradient_descent with CuArray
  • Matrix-valued iterates (Euclidean(n,n))
  • Float32 iterates
  • record=[:Cost] + return_state=true
  • Non-Euclidean manifolds (Sphere, Stiefel)
  • Transpose/Adjoint CuArray wrappers (union CUDAManifoldPoint defined but unused)
  • CI with CUDA runner

Copy link
Member

@mateuszbaran mateuszbaran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this. I think it would be good to have a well-designed solution here so it will require a bit of discussion.

Comment on lines +60 to +77
function ManifoldsBase.allocate(::ManifoldsBase.AbstractManifold, p::CuArray{T,N}) where {T,N}
return CuArray{T,N}(undef, size(p))
end

function ManifoldsBase.allocate(
::ManifoldsBase.AbstractManifold, p::CuArray{T,N}, ::Type{S}
) where {T,N,S}
return CuArray{S,N}(undef, size(p))
end

# Allocate without manifold argument (used by some ManifoldsBase paths)
function ManifoldsBase.allocate(p::CuArray{T,N}) where {T,N}
return CuArray{T,N}(undef, size(p))
end

function ManifoldsBase.allocate(p::CuArray{T,N}, ::Type{S}) where {T,N,S}
return CuArray{S,N}(undef, size(p))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you write a test which is fixed by these methods? I would expect these to not be necessary. All added allocation tests pass on unpatched JuliaManifolds.

Comment on lines +87 to +100
function Manopt.linesearch_backtrack!(
M::ManifoldsBase.AbstractManifold,
q::Array, # CPU workspace (from stepsize.candidate_point)
f::TF,
p::CuArray, # GPU iterate
s,
decrease,
contract,
η;
kwargs...,
) where {TF}
q_gpu = ManifoldsBase.allocate(M, p)
return Manopt.linesearch_backtrack!(M, q_gpu, f, p, s, decrease, contract, η; kwargs...)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better solution would be to extend _produce_type to pass a point along the manifold, which would allow us to properly propagate the type. @kellertuer , what do you think about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Effectively _produce_type(stepsize, M) would be _produce_type(stepsize, M, p) and then ArmijoLinesearchStepsize constructor could use that p to initialize candidate_point::P = allocate(p).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is an internal function and we could even break that without having a breaking change, it would even be possible to do p = rand(M) or so in order pass the point type a well; but sure, we could do that.

@kellertuer
Copy link
Member

Thanks for the start. I can see that the initial post is for sure AI, a bit verbose, but besides that a bit of road map is a good idea for sure

Here is a few generic comments not directly related to specific code lines

Overrides ManifoldsBase.allocate

We should probably move those to ManifoldsBase if they are really necessary; similarly if manifold do need updates to be “CUDA ready” that should probably go into an extension within Manifolds.jl so other people using manifolds would benefit as well.

Possible fix: lazy-init candidate_point on first solver step

This is absolutely not possible. For performance reasons the point and tangent vectors types (with parameters P and T are parameters of all structs they appear in, so if we postpone their initialisation for the initialisation (also not first solver step but before that) we can not offer that parametric type any longer (probably have to type the field ::Any) and that would be a huge performance loss also for the CPU case.

Will break — direction/stopping criterion issues [...]

not sure what this refers to, but things breaking is not something I would prefer

Manifolds.jl GPU Compatibility

This should be a PR in Manifolds.jl then.

I do not understand the points listed last in the “test plan”.

# Note: WolfePowellLinesearchStepsize and CubicBracketingLinesearchStepsize
# use candidate_point directly (not via linesearch_backtrack!). GPU support
# for these stepsizes requires changes to the Manopt.jl base package to make
# the candidate_point field non-parametric or to add a dispatch hook. For now,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a parametric field? Unclear what this refers to, but I would prefer generic solutions that work with as many step sizes as possible; for best of cases all.

Comment on lines +32 to +37
p0_cpu = zeros(10)
result_cpu = gradient_descent(
M, f_cpu, grad_f_cpu, p0_cpu;
stopping_criterion=StopAfterIteration(100),
stepsize=ConstantLength(0.1),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the quadratic task above this is not necessary, you know the closed form solution to just be target.

@zazabap
Copy link
Author

zazabap commented Feb 13, 2026

  1. For the Manopt.jl, if using manifold structure defined in the Manifold.jl not on CUDA array, it might cause some problems like overhead. The computation cost is intended to be extremely high in such case. I am still trying to grab the core structure for how manifold.jl and manopt.jl interact so it will take me a bit more time to think a better method for implementation. For the table listed here it is mandate to update the original Manifold.jl for CUDA extension.
Manifold retract_fused inner/norm vector_transport distance
Euclidean all OK OK OK OK
Sphere Exp OK, Proj OK OK Parallel OK OK
Stiefel Polar OK, Exp FAIL OK FAIL FAIL
Rotations Exp FAIL, Proj FAIL OK FAIL FAIL
SPD Exp OK, Proj FAIL FAIL FAIL FAIL
Hyperbolic FAIL FAIL FAIL FAIL

I am also considering directly modify the Manifoldbase in the Manifold.jl. --> Since the update is made in the #577 I will follow it and later merge into my PR.

  1. For the lazy-init, break point and some other features. I will remove the opinions and keep the code more efficient in the coding region and resolve your comments, please give me slightly a bit more time.

I apologize the current PR is still a bit verbose. I will spend more time to review the changes and update the change.

@mateuszbaran
Copy link
Member

Please wait with further changes to Manopt.jl until #577 is finished, it will resolve many issues encountered here.

I think the changes required in Manifolds.jl are largely independent, so you could try addressing those.

@kellertuer
Copy link
Member

I am still trying to grab the core structure for how manifold.jl and manopt.jl interact

The general scheme and idea here is that these are very modular:

  1. ManifoldsBase.jl defines an interface for defining/using manifolds, e.g. it declares function signatures like exp(M, p, X) and exp!(M, q, p, X) (working in place of given memory q.)
  2. Manopt.jl only has this interface package ManifoldsBase.jl as a dependency and hence its algorithms aim to work (and hopefully do work) on arbitrary manifolds. Manopt is not aware of any concrete manifold, only the interface
  3. Manifolds.jl is a library of manifolds using the interface ManifoldsBase.jl.

The computation cost is intended to be extremely high in such case.

I do not understand this sentence. high cost is never intended.

Of the manifold you list, only two use specific types (not mandatory):

  • SPD uses an internal structure (optionally) to store the matrix sqrt (and its inverse) since they are computed very often, but if you do not use the and “just an abstract array” it should work at the cost of repeated svd computations
  • Hyperbolic works just on arrays in the Hyperboloid model. For Poincaré half-line and Poincaré the abstract array to represent these are wrapped in point/vector types to distinguish which of the three representation is used (Hyperboloid has an optional wrapper for completeness.

A probably interesting case that does always need an internal structure is the fixed rank manifold, but also that is just a parametric type storing 3 abstract arrays (truncated svd) – so that should also be possible (hopefully as easy as the others).

For the table, I do not understand what “OK” and “Fail” actually mean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants