
to_device desynchronizes spatial data between CPU and GPU versions of array #2339

@ray-chew

Describe the bug

When arrays are initialized on the CPU and passed to the GPU via ClimaCore.to_device, the spatial data they reference (e.g., grid, grid.full_grid, etc.) appears to be copied independently, so the copies lose their internal consistency. Consequently, comparisons between data layouts that should be equivalent evaluate to false, leading to bugs in subsequent operations or comparisons.

In particular, this affects layout-dependent operations such as column_reduce! and column_accumulate! when they are called with the GPU versions of the input arrays.
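
The suspected mechanism can be illustrated without ClimaCore or a GPU. The following is a minimal sketch in plain Julia, not ClimaCore's implementation: ToyGrid and ToySpace are hypothetical stand-ins, the identity-based == is an assumption made for the illustration, and Base.deepcopy plays the role of a device transfer. Two separate copies produce grids with equal values but different identities, whereas one joint copy keeps the shared grid shared.

```julia
# Hypothetical stand-ins for illustration only (not ClimaCore types).
mutable struct ToyGrid
    local_geometry::Vector{Float64}
end

struct ToySpace
    grid::ToyGrid
end

# Assumption for this sketch: two spaces are equal only if they refer to
# the very same grid object.
Base.:(==)(a::ToySpace, b::ToySpace) = a.grid === b.grid

grid = ToyGrid([1.0, 2.0, 3.0])
space_a = ToySpace(grid)
space_b = ToySpace(grid)

# Both spaces share one grid object, so they compare equal.
same_before = space_a == space_b

# Two separate copies: each call allocates its own grid.
copy_a = deepcopy(space_a)
copy_b = deepcopy(space_b)
same_values = copy_a.grid.local_geometry == copy_b.grid.local_geometry
same_after_separate = copy_a == copy_b

# One joint copy: deepcopy tracks objects it has already copied within a
# single call, so the shared grid is copied once and stays shared.
joint_a, joint_b = deepcopy((space_a, space_b))
same_after_joint = joint_a == joint_b

println("before copying:      ", same_before)
println("values still equal:  ", same_values)
println("two separate copies: ", same_after_separate)
println("one joint copy:      ", same_after_joint)
```

Whether ClimaCore.to_device behaves analogously when several objects are passed in a single call is not confirmed here; the sketch only shows why independently copied spatial data can hold equal values and still fail an identity-based comparison.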

To Reproduce

The following code reproduces the issue and shows how the spatial data becomes inconsistent after calling to_device. In this minimal example, Fields.level(axes(...), 1) == axes(...) evaluates to true on the CPU but to false after the transfer to the GPU; both comparisons should evaluate to true.

I could also push this as a unit test on ClimaAtmos, if that is how I should do it.

using ClimaCore:
    Fields, Geometry, Meshes, Quadratures

using ClimaCore
using ClimaCore.CommonSpaces
import ClimaAtmos as CA
import ClimaComms

using CUDA

const FT = Float64

comms_ctx = ClimaComms.SingletonCommsContext()

(; config_file, job_id) = CA.commandline_kwargs()
config = CA.AtmosConfig(config_file; job_id, comms_ctx)

config.parsed_args["topography"] = "Earth";
config.parsed_args["topo_smoothing"] = false;
config.parsed_args["mesh_warp_type"] = "Linear";
(; parsed_args) = config

# Create meshes and spaces
h_elem = 16
nh_poly = 3
z_max = 42e3
z_elem = 33
dz_bottom = 300.0
radius = 6.371229e6

quad = Quadratures.GLL{nh_poly + 1}()
horizontal_mesh = CA.cubed_sphere_mesh(; radius, h_elem)
h_space = CA.make_horizontal_space(horizontal_mesh, quad, comms_ctx, false)
z_stretch = Meshes.HyperbolicTangentStretching(dz_bottom)
center_space, face_space =
    CA.make_hybrid_spaces(h_space, z_max, z_elem, z_stretch; parsed_args)

ᶜlocal_geometry = Fields.local_geometry_field(center_space)
ᶠlocal_geometry = Fields.local_geometry_field(face_space)

# create Y
Yc = map(ᶜlocal_geometry) do lg
    return (; ρ = FT(1.0), u_phy = FT(0), v_phy = FT(0), T = FT(0), qt = FT(0))
end
Yf = map(ᶠlocal_geometry) do lg
    return (; u₃ = Geometry.Covariant3Vector(FT(0), lg))
end
Y = Fields.FieldVector(c = Yc, f = Yf)

ᶜz = Fields.coordinate_field(Y.c).z

z_level = similar(Fields.level(ᶜz, 1), FT)

# Prints true
print("Axes check on the CPU: ")
println(Fields.level(axes(ᶜz), 1) == axes(z_level))

A = ClimaCore.to_device(ClimaComms.CUDADevice(), ᶜz)
B = ClimaCore.to_device(ClimaComms.CUDADevice(), z_level)

# Should also print true, but prints false.
# This implies that column_accumulate! and
# column_reduce! do not work on inputs that
# mix VIJFH and IJFH fields.
print("Axes check on the GPU: ")
println(Fields.level(axes(A), 1) == axes(B))

# Let's look at what is wrong...
C = Fields.level(axes(A), 1)
D = axes(B)

# The underlying arrays hold equal values, so this prints true as it should...
print("C == D: ")
println(getfield(C.grid.full_grid.face_local_geometry, :array) == getfield(D.grid.full_grid.face_local_geometry, :array))

print("C === D: ")
# ... but they are not the same object, so the identity check (===) prints false
println(getfield(C.grid.full_grid.face_local_geometry, :array) === getfield(D.grid.full_grid.face_local_geometry, :array))
Project

I am using the .buildkite project:

# Project.toml
[deps]
ArgParse = "c7e460c6-2fb9-53a9-8c5b-16f535851c63"
ArtifactWrappers = "a14bc488-3040-4b00-9dc1-f6467924858a"
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
ClimaAnalysis = "29b5916a-a76c-4e73-9657-3c8fd22e65e6"
ClimaAtmos = "b2c96348-7fb7-4fe0-8da9-78d88439e717"
ClimaComms = "3a4d1b5c-c61d-41fd-a00a-5873ba7a1b0d"
ClimaCore = "d414da3d-4745-48bb-8d80-42e94e092884"
ClimaCoreSpectra = "c2caaa1d-32ae-4754-ba0d-80e7561362e9"
ClimaCoreTempestRemap = "d934ef94-cdd4-4710-83d6-720549644b70"
ClimaDiagnostics = "1ecacbb8-0713-4841-9a07-eb5aa8a2d53f"
ClimaReproducibilityTests = "e0c89595-00ba-42a9-9f9b-061ef3dc23a1"
ClimaTimeSteppers = "595c0a79-7f3d-439a-bc5a-b232dc3bde79"
ClimaUtilities = "b3f4f4ca-9299-4f7f-bd9b-81e1242a7513"
Debugger = "31a5f54b-26ea-5ae9-a837-f05ce5417438"
DiffEqBase = "2b5f629d-d688-5b77-993f-72d75c75574e"
HDF5 = "f67ccb44-e63f-5c2f-98bd-6dc0ccc4ba2f"
Infiltrator = "5903a43b-9cc3-4c30-8d17-598619ec4e9b"
Interpolations = "a98d9a8b-a2ab-59e6-89dd-64a1c18fca59"
IntervalSets = "8197267c-284f-5f27-9208-e0e47529a953"
JET = "c3a54625-cd67-489e-a8e7-0a5a0ff4e31b"
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
Krylov = "ba0b0d4f-ebba-5204-a429-3ac8c609bfb7"
MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195"
NCDatasets = "85f8d34a-cbdd-5861-8df4-14fed0d494ab"
NullBroadcasts = "0d71be07-595a-4f89-9529-4065a4ab43a6"
OrderedCollections = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
Poppler_jll = "9c32591e-4766-534b-9725-b71a8799265b"
PrecompileCI = "76d61242-8ec2-4c91-8455-3234246697a2"
PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
Profile = "9abbd945-dff8-562f-b5e8-e1ebf5ef1b79"
ProfileCanvas = "efd6af41-a80b-495e-886c-e51b0c7d77a3"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Revise = "295af30f-e4ad-537b-8983-00126c2a3abe"
SciMLBase = "0bca4576-84f4-4d90-8ffe-ffa030f20462"
SnoopCompile = "aa65fe97-06da-5843-b5b1-d5d13cad87d2"
SnoopCompileCore = "e2b509da-e806-4183-be48-004708413034"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
YAML = "ddb6d928-2868-570f-bddf-ab3f9cf99eb6"

[compat]
CairoMakie = "0.11, 0.12"
ClimaCoreSpectra = "0.1"
ClimaCoreTempestRemap = "0.3"
JET = "0.9"
PrettyTables = "2"
ProfileCanvas = "0.1"
SnoopCompileCore = "3"
julia = "1.10"

The Manifest.toml is too long to include inline, so it is attached as a `.txt` file:
[Manifest-v1.11.txt](https://github.com/user-attachments/files/20474817/Manifest-v1.11.txt)

System details

Any relevant system information:

  • Julia version: 1.11.5 (output of julia --version)
  • operating system: Manjaro Linux, release 25.0.3
  • modules loaded on cluster (module list): No Modulefiles Currently Loaded.

Related issues / PRs

Appears to be related to #2312 and #2260.

I found the to_device function here: https://clima.github.io/ClimaCore.jl/dev/faq/#Moving-objects-between-the-CPU-and-GPU.
