Description
Describe the bug
When arrays are initialized on the CPU and passed to the GPU via ClimaCore.to_device, the spatial data (e.g., grid, grid.full_grid, etc.) appears to be copied independently for each array, so the internal consistency between them is lost. Consequently, data layouts that should be equivalent compare as unequal, which leads to bugs in subsequent operations or comparisons.
In particular, this affects layout-dependent operations like column_reduce! and column_accumulate! when they are called with the GPU versions of the input arrays.
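To illustrate the kind of mechanism I suspect, here is a minimal sketch in plain Julia, without ClimaCore. `ToyGrid` and `toy_to_device` are hypothetical stand-ins that I made up for this example; they are not ClimaCore types or functions, and I am only assuming that the real comparison ends up relying on object identity somewhere. The sketch shows that two objects wrapping independent copies of the same array hold equal values but do not compare equal under Julia's default `==`, which falls back to `===`.

```julia
# Hypothetical stand-in for a grid object; NOT a ClimaCore type.
struct ToyGrid
    local_geometry::Vector{Float64}
end

# Hypothetical stand-in for a device transfer that copies the underlying array.
toy_to_device(g::ToyGrid) = ToyGrid(copy(g.local_geometry))

grid = ToyGrid([1.0, 2.0, 3.0])

# Two handles to the same grid object compare equal.
shared_a = grid
shared_b = grid
println("shared == shared: ", shared_a == shared_b)

# Transferring each handle separately produces two independent copies.
moved_a = toy_to_device(shared_a)
moved_b = toy_to_device(shared_b)

# The copied arrays hold the same values...
println("values equal:     ", moved_a.local_geometry == moved_b.local_geometry)
# ...but they are different objects in memory...
println("same object:      ", moved_a.local_geometry === moved_b.local_geometry)
# ...so the default == on the wrapper (which falls back to ===) fails.
println("moved == moved:   ", moved_a == moved_b)
```

If something like this is what happens inside to_device, then either the shared spatial data would need to be deduplicated during the transfer, or the comparison would need to be based on values rather than identity.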
To Reproduce
The following code reproduces the issue and shows how the spatial data becomes inconsistent after calling to_device. In this minimal example, Fields.level(axes(...), 1) == axes(...) evaluates to true on the CPU but fails post-transfer, which should not occur.
I could also push this as a unit test to ClimaAtmos, if that is the preferred way to report it.
using ClimaCore: Fields, Geometry, Meshes, Quadratures
using ClimaCore
using ClimaCore.CommonSpaces
import ClimaAtmos as CA
import ClimaComms
using CUDA
const FT = Float64
comms_ctx = ClimaComms.SingletonCommsContext()
(; config_file, job_id) = CA.commandline_kwargs()
config = CA.AtmosConfig(config_file; job_id, comms_ctx)
config.parsed_args["topography"] = "Earth";
config.parsed_args["topo_smoothing"] = false;
config.parsed_args["mesh_warp_type"] = "Linear";
(; parsed_args) = config
# Create meshes and spaces
h_elem = 16
nh_poly = 3
z_max = 42e3
z_elem = 33
dz_bottom = 300.0
radius = 6.371229e6
quad = Quadratures.GLL{nh_poly + 1}()
horizontal_mesh = CA.cubed_sphere_mesh(; radius, h_elem)
h_space = CA.make_horizontal_space(horizontal_mesh, quad, comms_ctx, false)
z_stretch = Meshes.HyperbolicTangentStretching(dz_bottom)
center_space, face_space =
CA.make_hybrid_spaces(h_space, z_max, z_elem, z_stretch; parsed_args)
ᶜlocal_geometry = Fields.local_geometry_field(center_space)
ᶠlocal_geometry = Fields.local_geometry_field(face_space)
# create Y
Yc = map(ᶜlocal_geometry) do lg
return (; ρ = FT(1.0), u_phy = FT(0), v_phy = FT(0), T = FT(0), qt = FT(0))
end
Yf = map(ᶠlocal_geometry) do lg
return (; u₃ = Geometry.Covariant3Vector(FT(0), lg))
end
Y = Fields.FieldVector(c = Yc, f = Yf)
ᶜz = Fields.coordinate_field(Y.c).z
z_level = similar(Fields.level(ᶜz, 1), FT)
# Prints true
print("Axes check on the CPU: ")
println(Fields.level(axes(ᶜz), 1) == axes(z_level))
A = ClimaCore.to_device(ClimaComms.CUDADevice(), ᶜz)
B = ClimaCore.to_device(ClimaComms.CUDADevice(), z_level)
# Should also print true, but prints false.
# This implies that column_accumulate!
# and column_reduce! do not work on inputs with
# mixed VIJFH and IJFH fields.
print("Axes check on the GPU: ")
println(Fields.level(axes(A), 1) == axes(B))
# Let's look at what is wrong...
C = Fields.level(axes(A), 1)
D = axes(B)
# The underlying arrays hold equal values, so this prints true as it should...
print("C == D: ")
println(getfield(C.grid.full_grid.face_local_geometry, :array) == getfield(D.grid.full_grid.face_local_geometry, :array))
# ... but they are not the same object, so the identity check (===) prints false
print("C === D: ")
println(getfield(C.grid.full_grid.face_local_geometry, :array) === getfield(D.grid.full_grid.face_local_geometry, :array))
Project
I am using the .buildkite project:
# Project.toml
[deps]
ArgParse = "c7e460c6-2fb9-53a9-8c5b-16f535851c63"
ArtifactWrappers = "a14bc488-3040-4b00-9dc1-f6467924858a"
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
ClimaAnalysis = "29b5916a-a76c-4e73-9657-3c8fd22e65e6"
ClimaAtmos = "b2c96348-7fb7-4fe0-8da9-78d88439e717"
ClimaComms = "3a4d1b5c-c61d-41fd-a00a-5873ba7a1b0d"
ClimaCore = "d414da3d-4745-48bb-8d80-42e94e092884"
ClimaCoreSpectra = "c2caaa1d-32ae-4754-ba0d-80e7561362e9"
ClimaCoreTempestRemap = "d934ef94-cdd4-4710-83d6-720549644b70"
ClimaDiagnostics = "1ecacbb8-0713-4841-9a07-eb5aa8a2d53f"
ClimaReproducibilityTests = "e0c89595-00ba-42a9-9f9b-061ef3dc23a1"
ClimaTimeSteppers = "595c0a79-7f3d-439a-bc5a-b232dc3bde79"
ClimaUtilities = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
Debugger = "31a5f54b-26ea-5ae9-a837-f05ce5417438"
DiffEqBase = "2b5f629d-d688-5b77-993f-72d75c75574e"
HDF5 = "f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f"
Infiltrator = "5903a43b-9cc3-4c30-8d17-598619ec4e9b"
Interpolations = "a98d9a8b-a2ab-59e6-89dd-64a1c18fca59"
IntervalSets = "8197267c-284f-5f27-9208-e0e47529a953"
JET = "c3a54625-cd67-489e-a8e7-0a5a0ff4e31b"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
Krylov = "ba0b0d4f-ebba-5204-a429-3ac8c609bfb7"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
NCDatasets = "85f8d34a-cbdd-5861-8df4-14fed0d494ab"
NullBroadcasts = "0d71be07-595a-4f89-9529-4065a4ab43a6"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
Poppler_jll = "9c32591e-4766-534b-9725-b71a8799265b"
PrecompileCI = "76d61242-8ec2-4c91-8455-3234246697a2"
PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
Profile = "9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"
ProfileCanvas = "efd6af41-a80b-495e-886c-e51b0c7d77a3"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Revise = "295af30f-e4ad-537b-8983-00126c2a3abe"
SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
SnoopCompile = "aa65fe97-06da-5843-b5b1-d5d13cad87d2"
SnoopCompileCore = "e2b509da-e806-4183-be48-004708413034"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
YAML = "ddb6d928-2868-570f-bddf-ab3f9cf99eb6"
[compat]
CairoMakie = "0.11, 0.12"
ClimaCoreSpectra = "0.1"
ClimaCoreTempestRemap = "0.3"
JET = "0.9"
PrettyTables = "2"
ProfileCanvas = "0.1"
SnoopCompileCore = "3"
julia = "1.10"
The Manifest.toml is too long. Including it as a `.txt` attachment.
[Manifest-v1.11.txt](https://github.com/user-attachments/files/20474817/Manifest-v1.11.txt)
System details
Any relevant system information:
- Julia version:
╰─$ julia --version
julia version 1.11.5
- operating system:
Description: Manjaro Linux
Release: 25.0.3
- modules loaded on cluster (module list):
╰─$ module list
No Modulefiles Currently Loaded.
Related issues / PRs
This appears to be related to #2312 and #2260.
I found the to_device function here: https://clima.github.io/ClimaCore.jl/dev/faq/#Moving-objects-between-the-CPU-and-GPU.