
Conversation

@vchuravy
Member

@vchuravy vchuravy commented Dec 17, 2024

In order to eventually support GPU computation we need
to use Adapt.jl to allow GPU backend packages to swap
out host-array types like CuArray with device-side types
like CuDeviceArray.

Additionally, this will allow us to change the element type
of a simulation by using adapt(Array{Float32}, ...).
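The pattern described above can be sketched with a minimal example (MyContainer is a hypothetical type for illustration, not part of Trixi.jl):

```julia
using Adapt

# A container wrapping an array; Adapt.jl lets backend packages
# rebuild it around a different array type.
struct MyContainer{A <: AbstractArray}
    u::A
end

# Teach Adapt how to reconstruct the container after adapting its field.
Adapt.adapt_structure(to, c::MyContainer) = MyContainer(adapt(to, c.u))

c = MyContainer(zeros(Float64, 4))

# Change the element type on the host ...
c32 = adapt(Array{Float32}, c)

# ... or, with CUDA.jl loaded, swap Array for CuArray:
# cgpu = adapt(CuArray, c)
```

Here adapt(Array{Float32}, c) returns a MyContainer{Vector{Float32}}; the same call with a device array type such as CuArray would instead move the data to the GPU.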

@github-actions
Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.

Member

@ranocha ranocha left a comment

Should be close to final when tests are added

@vchuravy vchuravy marked this pull request as ready for review February 3, 2025 15:38
@vchuravy vchuravy requested review from benegee and ranocha February 3, 2025 15:38
@codecov

codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 68.44262% with 77 lines in your changes missing coverage. Please review.

Project coverage is 96.67%. Comparing base (43cd98d) to head (1acf007).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/solvers/dgsem_p4est/containers_parallel.jl 26.92% 19 Missing ⚠️
src/auxiliary/containers.jl 63.64% 16 Missing ⚠️
src/solvers/dgsem_p4est/containers.jl 83.53% 14 Missing ⚠️
src/solvers/dg.jl 60.00% 10 Missing ⚠️
src/auxiliary/vector_of_arrays.jl 35.71% 9 Missing ⚠️
ext/TrixiAMDGPUExt.jl 0.00% 6 Missing ⚠️
ext/TrixiCUDAExt.jl 0.00% 2 Missing ⚠️
src/semidiscretization/semidiscretization.jl 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2212      +/-   ##
==========================================
- Coverage   96.83%   96.67%   -0.16%     
==========================================
  Files         504      507       +3     
  Lines       41816    42010     +194     
==========================================
+ Hits        40490    40610     +120     
- Misses       1326     1400      +74     
Flag Coverage Δ
unittests 96.67% <68.44%> (-0.16%) ⬇️


Member

@ranocha ranocha left a comment

Thanks! Could you please adapt the t8code stuff as well? It causes the remaining failing CI tests like

  LoadError: MethodError: Cannot `convert` an object of type Vector{Vector{Int64}} to an object of type Trixi.VecOfArrays{Vector{Int64}}
  
  Closest candidates are:
    convert(::Type{T}, ::T) where T
     @ Base Base.jl:84
    (::Type{Trixi.VecOfArrays{T}} where T<:AbstractArray)(::Any)
     @ Trixi ~/work/Trixi.jl/Trixi.jl/src/auxiliary/vector_of_arrays.jl:12
  
  Stacktrace:
    [1] setproperty!(x::Trixi.P4estMPICache{Vector{Float64}, Vector{Int64}}, f::Symbol, v::Vector{Vector{Int64}})
      @ Base ./Base.jl:40
    [2] init_mpi_cache!(mpi_cache::Trixi.P4estMPICache{Vector{Float64}, Vector{Int64}}, mesh::Trixi.T8codeMesh{2, Float64, Static.True, 4, 4}, mpi_mesh_info::@NamedTuple{mpi_mortars::Trixi.P4estMPIMortarContainer{2, Float64, Float64, 3, 4, 5, Array{Float64, 5}, Vector{Float64}, Array}, mpi_interfaces::Trixi.P4estMPIInterfaceContainer{2, Float64, 4, Array{Float64, 4}, Vector{Int64}, Vector{Tuple{Symbol, Symbol}}, Vector{Float64}, Array}, global_mortar_ids::Vector{UInt128}, global_interface_ids::Vector{UInt128}, neighbor_ranks_mortar::Vector{Vector{Int64}}, neighbor_ranks_interface::Vector{Int64}}, nvars::Int64, nnodes::Int64, uEltype::Type)
      @ Trixi ~/work/Trixi.jl/Trixi.jl/src/solvers/dgsem_t8code/dg_parallel.jl:93

Member

@sloede sloede left a comment

Just a few minor comments/suggestions, then this should be ready to merge (once the other convo by Hendrik is resolved as well)

Comment on lines 11 to 30
struct VecOfArrays{T <: AbstractArray}
arrays::Vector{T}
end
Base.getindex(v::VecOfArrays, i::Int) = Base.getindex(v.arrays, i)
Base.IndexStyle(v::VecOfArrays) = Base.IndexStyle(v.arrays)
Base.size(v::VecOfArrays) = Base.size(v.arrays)
Base.length(v::VecOfArrays) = Base.length(v.arrays)
Base.eltype(v::VecOfArrays{T}) where {T} = T
function Adapt.adapt_structure(to, v::VecOfArrays)
return VecOfArrays([Adapt.adapt(to, arr) for arr in v.arrays])
end
function Base.convert(::Type{<:VecOfArrays}, v::Vector{<:AbstractArray})
VecOfArrays(v)
end
Member

Suggested change
struct VecOfArrays{T <: AbstractArray}
arrays::Vector{T}
end
Base.getindex(v::VecOfArrays, i::Int) = Base.getindex(v.arrays, i)
Base.IndexStyle(v::VecOfArrays) = Base.IndexStyle(v.arrays)
Base.size(v::VecOfArrays) = Base.size(v.arrays)
Base.length(v::VecOfArrays) = Base.length(v.arrays)
Base.eltype(v::VecOfArrays{T}) where {T} = T
function Adapt.adapt_structure(to, v::VecOfArrays)
return VecOfArrays([Adapt.adapt(to, arr) for arr in v.arrays])
end
function Base.convert(::Type{<:VecOfArrays}, v::Vector{<:AbstractArray})
VecOfArrays(v)
end
struct VectorOfArrays{T <: AbstractArray}
arrays::Vector{T}
end
Base.getindex(v::VectorOfArrays, i::Int) = Base.getindex(v.arrays, i)
Base.IndexStyle(v::VectorOfArrays) = Base.IndexStyle(v.arrays)
Base.size(v::VectorOfArrays) = Base.size(v.arrays)
Base.length(v::VectorOfArrays) = Base.length(v.arrays)
Base.eltype(v::VectorOfArrays{T}) where {T} = T
function Adapt.adapt_structure(to, v::VectorOfArrays)
return VectorOfArrays([Adapt.adapt(to, arr) for arr in v.arrays])
end
function Base.convert(::Type{<:VectorOfArrays}, v::Vector{<:AbstractArray})
VectorOfArrays(v)
end

We try to avoid abbreviations unless really necessary for sanity 😬 Also, it would be consistent with the file name.

Note, however, that I do not have a super strong opinion here, thus if you deem VecOfArrays to be better, we can stay like this.

Member

I just saw that RecursiveArrayTools.jl seems to already have a VectorOfArrays type (https://github.com/trixi-framework/Trixi.jl/pull/2150/files#diff-6058d487915aa254cfe1a5f8da6315b8c1fef4da04df38b0089a599803e28e5aR63). Should we use that one instead?

If not, then of course we should use a different name for our own data structure, and keeping VecOfArrays as the name seems reasonable to me. In that case, it might be good to leave a short note in the comment why we do not want to use the SciML package's type for future reference.

Member

using RecursiveArrayTools: VectorOfArray is added in #2150 for DGMulti support

Member

Yeah, I saw that. I was just wondering whether their data type could (or should) be used instead of rolling our own.

Member Author

RecursiveArrayTools VectorOfArray is not fit for purpose... They define:

function Adapt.adapt_structure(to, VA::AbstractVectorOfArray)
    Adapt.adapt(to, Array(VA))
end

Which is not what we want.
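The difference can be sketched with a minimal, hypothetical comparison (DesiredVoA stands in for this PR's VecOfArrays): the SciML definition above collapses the nested structure into one dense Array before adapting, while the behavior wanted here adapts each inner array and preserves the vector-of-arrays layout:

```julia
using Adapt

struct DesiredVoA{T <: AbstractArray}
    arrays::Vector{T}
end

# Wanted: adapt each inner array, keep the outer structure intact.
function Adapt.adapt_structure(to, v::DesiredVoA)
    return DesiredVoA([Adapt.adapt(to, arr) for arr in v.arrays])
end

v = DesiredVoA([rand(3), rand(5)])   # ragged: inner lengths differ
v32 = adapt(Array{Float32}, v)

# v32 is still a DesiredVoA with Float32 inner arrays. The SciML
# definition would instead call Array(VA) first, which flattens the
# nested layout (and cannot even represent ragged inner arrays).
```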

Contributor

There have been some recent updates and it seems we can now actually use RecursiveArrayTools's VectorOfArray instead of our own. This still needs testing.

Comment on lines +197 to +208
uArray <: DenseArray{uEltype, NDIMSP2},
IdsMatrix <: DenseMatrix{Int},
Member

Why the restriction to dense types? (just curious)

Member Author

I actually don't know; this is a left-over from the original work.

@sloede
Member

sloede commented Mar 20, 2025

What's the status of this PR?

@vchuravy
Member Author

vchuravy commented Apr 21, 2025

Okay, current status:

  1. Removed AbstractHeterogenousContainer and replaced it with a storage_type function that uses Adapt infrastructure for wrapped arrays.
  2. Started writing some docs on heterogeneous computing & Trixi
  3. Added a minimal extension on CUDA.jl
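As a rough illustration of the storage_type idea (a hypothetical sketch, not the actual Trixi.jl implementation): a trait-like function that reports the underlying array type, unwrapping wrapped arrays such as views by recursing to their parent:

```julia
# Hypothetical sketch of a storage_type-style trait function.
storage_type(x::AbstractArray) = typeof(x)

# Unwrap wrapped arrays by asking their parent for its storage.
storage_type(x::SubArray) = storage_type(parent(x))
storage_type(x::Base.ReshapedArray) = storage_type(parent(x))

a = zeros(2, 3)
v = view(a, :, 1)

storage_type(a)  # Matrix{Float64}
storage_type(v)  # also Matrix{Float64}: the parent's storage
```

With such a function, swapping the storage of a whole simulation reduces to an adapt call with the desired array type, while storage_type lets code query what a container currently holds.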

Open questions:

  • Should I untangle the whole VecOfArrays business into a separate PR? I am not all that happy with that implementation, but it is something we need to address for full CUDA support.
  • Can I add CUDA.jl as a test dependency, or should I pursue more the strategy in add minimal CUDA pipeline #2367 where I add it on-demand?
  • Right now we are not adapting the mesh, similar to remake, is that okay?

@vchuravy vchuravy force-pushed the vc/adapt branch 2 times, most recently from db49594 to e632e4c Compare April 22, 2025 07:30
@sloede
Member

sloede commented Apr 22, 2025

Started writing some docs on heterogeneous computing & Trixi

That's great news, excellent!

Should I untangle the whole VecOfArrays business into a separate PR? I am not all that happy with that implementation, but it is something we need to address for full CUDA support.

If it's only with a reasonable overhead, I'd prefer to have a separate PR to make it easier to understand and review the code.

Can I add CUDA.jl as a test dependency, or should I pursue more the strategy in add minimal CUDA pipeline #2367 where I add it on-demand?

Puh. Do you have a rough idea about how much overhead (downloading + precompilation) this will add to the non-CUDA CI runs? That is, are we talking about tens of seconds or rather minutes?

For the time being, my take would be to hold off adding it as a test dependency for now, since it will also impact people testing locally.

Right now we are not adapting the mesh, similar to remake, is that okay?

It should be OK. The mesh in Trixi.jl is only used as an "infrastructure provider", i.e., it allows the solvers to figure out which cells exist, who their neighbors are, where their neighbors live (when running with MPI parallelism) etc. during the initialization phase. Thus in a usual simulation run there are no performance-critical loops working directly on mesh data.

One thought I had: It might make sense to create a meta issue for upcoming (and deferred) changes for GPU/heterogeneous computing support. That is, a single issue to keep track of the various PRs, other issues, and maybe open questions/TODOs that are not living in their own issue.

@vchuravy
Member Author

Puh. Do you have a rough idea about how much overhead (downloading + precompilation) this will add to the non-CUDA CI runs? That is, are we talking about tens of seconds or rather minutes?

The most annoying thing is likely downloading the CUDA_Runtime_jll; other than that, we are talking seconds.

The biggest issue is that without it testing the storage_type=X change is almost impossible and would require significant mocking.

If it's only with a reasonable overhead, I'd prefer to have a separate PR to make it easier to understand and review the code.

That's somewhat related to the issue above, it is a bit of code that doesn't impact anything until you are actually trying to move the model to the GPU.

I could only do the real_type part of this PR, but that excludes quite a bit of complexity.

Thus in a usual simulation run there are no performance-critical loops working directly on mesh data.

Yeah, it's more a question if we are going to access data within it from the GPU, but I guess we don't know until we actually have a model running. As far as I can tell right now Lars didn't move it either.

@sloede
Member

sloede commented Apr 22, 2025

Puh. Do you have a rough idea about how much overhead (downloading + precompilation) this will add to the non-CUDA CI runs? That is, are we talking about tens of seconds or rather minutes?

The most annoying thing is likely downloading the CUDA_Runtime_jll; other than that, we are talking seconds.

The biggest issue is that without it testing the storage_type=X change is almost impossible and would require significant mocking.

OK, then I'd say let's give it a try and see where it takes us.

If it's only with a reasonable overhead, I'd prefer to have a separate PR to make it easier to understand and review the code.

That's somewhat related to the issue above, it is a bit of code that doesn't impact anything until you are actually trying to move the model to the GPU.

I could only do the real_type part of this PR , but that excludes quite a bit of complexity.

Hm, ok. I understand you're saying that we'd be able to split it, but then not really review the impact that it will have later on in the subsequent PRs? In that case, it might be best to keep it in one PR. In that case, please try to keep the PR as focused as possible (such that it does not blow up in size even more).

Thus in a usual simulation run there are no performance-critical loops working directly on mesh data.

Yeah, it's more a question if we are going to access data within it from the GPU, but I guess we don't know until we actually have a model running. As far as I can tell right now Lars didn't move it either.

Right. No, there shouldn't be any access of mesh data from the GPU during the regular rhs! invocation. All possible mesh accesses would occur in one of the callbacks.

@vchuravy vchuravy force-pushed the vc/adapt branch 2 times, most recently from 97877e6 to 15a898b Compare May 14, 2025 07:39
@vchuravy
Member Author

This looks like test_trixi_include is passing through an expression and is not getting it from the parent scope.

/var/lib/buildkite-agent/builds/gpuci-4/julialang/trixi-dot-jl/examples/p4est_2d_dgsem/elixir_advection_basic.jl
elixir_advection_basic.jl (Float32): Error During Test at /var/lib/buildkite-agent/builds/gpuci-4/julialang/trixi-dot-jl/test/test_trixi.jl:186
  Got exception outside of a @test
  LoadError: UndefVarError: `CuArray` not defined

ranocha
ranocha previously approved these changes Jul 16, 2025
Member

@ranocha ranocha left a comment

Thanks a lot!

@sloede Do you want to review this PR as well?

Member

@ranocha ranocha left a comment

Thanks a lot to everybody involved!

@ranocha ranocha merged commit 6b8d05d into main Jul 17, 2025
35 of 39 checks passed
@ranocha ranocha deleted the vc/adapt branch July 17, 2025 20:08
@vchuravy
Member Author

In the future, should I squash the PR beforehand? I normally adjust the commit message before I squash-merge so that it is useful (instead of my messy development history).

@JoshuaLampert
Member

We usually don't do that.
