Added generic fallback method to to_device #2362

Closed
wants to merge 1 commit into from

Member

@ray-chew ray-chew commented Aug 1, 2025

This allows for to_device to work with generic data structures.

Specifically, this extension allows for the following use case:

gpu_params = ClimaCore.to_device(ClimaComms.CUDADevice(), cpu_params)

where params is a parameter struct.

@ray-chew ray-chew changed the title Added generic fallback method to to_device Added generic fallback method to to_device Aug 1, 2025
@ray-chew ray-chew requested a review from dennisYatunin August 1, 2025 22:57
@akshaysridhar akshaysridhar requested a review from ph-kev August 5, 2025 18:14
Comment on lines +29 to +32
# Generic fallback for other types that might need device adaptation
function to_device(device::ClimaComms.AbstractDevice, x)
return Adapt.adapt(ClimaComms.array_type(device), x)
end
Member

The code for this is identical to what is written above, so it doesn't make sense to add this if everything goes through the same thing anyway.

Also, I don't think this fallback should be added, since there could be a correctness issue if Adapt.adapt doesn't throw an error but the object isn't meant to be put onto the GPU.
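
The concern above can be illustrated with a minimal sketch in plain Julia. Everything here is hypothetical: `ToyDeviceArray`, `toy_adapt`, `strict_to_device`, and `OrographyData` are stand-ins invented for illustration, not ClimaCore, ClimaComms, or Adapt.jl names. The sketch assumes a conversion function that passes through any type it has no rule for, and contrasts it with a method restricted to known types.

```julia
# Hypothetical stand-in for a device array; NOT a real GPU array type.
struct ToyDeviceArray{T}
    data::Vector{T}
end

# Toy conversion: plain Vectors are wrapped, anything else is returned unchanged.
# The second method plays the role of a silent generic fallback (an assumption
# made for this illustration).
toy_adapt(x::Vector) = ToyDeviceArray(copy(x))
toy_adapt(x) = x

# A restricted entry point that only accepts types it knows how to move.
strict_to_device(x::Vector) = toy_adapt(x)

# A struct that holds host memory but has no conversion rule of its own.
struct OrographyData
    elevation::Vector{Float64}
end

host = OrographyData([100.0, 250.0])

# The generic path "succeeds", yet the array inside is still a host Vector.
moved = toy_adapt(host)
still_on_host = moved.elevation isa Vector

# The restricted path fails loudly with a MethodError instead.
strict_rejects = try
    strict_to_device(host)
    false
catch err
    err isa MethodError
end
```

In this toy setting the generic path returns without complaint while the data never leaves the host, whereas the restricted method surfaces the problem at the call site.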

Member Author

@ray-chew ray-chew Aug 12, 2025

> The code for this is identical to what is written above, so it doesn't make sense to add this if everything goes through the same thing anyway.

Well, not really. What's written above limits to_device to a union of certain ClimaCore data structures. We could either extend to_device via a generic fallback (as in my PR), or remove the restrictions, if we want to_device to be more versatile.

> Also, I don't think this fallback should be added, since there could be a correctness issue if Adapt.adapt doesn't throw an error but the object isn't meant to be put onto the GPU.

I suppose you are correct to be more cautious, but I submitted this PR due to an issue I faced. Specifically, the full orographic gravity wave pipeline requires loading in an external orography file and doing some preprocessing analysis on this dataset. These steps are done on the CPU.

In one of the tests (see the link below), I explicitly move the arrays initialised on the CPU to the GPU via to_device. These GPU arrays are then used in the GPU orographic gravity wave parameterization. One obstacle was that the existing to_device could not move instances of ThermodynamicsParameters to the GPU, and this PR resolves that issue.

I understand that my integral test is not very idiomatic Clima, but that is because I wanted to integrate CPU preprocessing with GPU ClimaAtmos computations into one integral test. But if you have a better solution to this problem, please let me know! :)

https://github.com/ray-chew/ClimaAtmos.jl/blob/2def4f0003c8326e80126ee9db29eb09d6e2b06a/test/parameterized_tendencies/gravity_wave/orographic_gravity_wave/ogwd_3d_gpu_integral.jl#L216C2-L219C1

Member

@akshaysridhar akshaysridhar Aug 12, 2025

(Aside: @ray-chew We should drop support for the Fields.bycolumn usage in the linked test above; I also think we may be able to replace interp_latlong2cg with the SpaceVaryingInput utility.)

Member Author

> (Aside: @ray-chew We should drop support for the Fields.bycolumn usage in the linked test above; I also think we may be able to replace interp_latlong2cg with the SpaceVaryingInput utility.)

Yes, the whole bycolumn, parent, etc. part of the test is the CPU part I mentioned in my reply to Kevin, which is why I move between host and device with to_device.

In ClimaAtmos #3867 point 3, I mentioned that we should move all of this to GPU-friendly code. However, if we can already use the existing machinery in a preprocessing step, the cost of refactoring it right now is too high for too little benefit.

Member

@ph-kev: Given @ray-chew's workflow in ClimaAtmos, is there a reasonable alternative that avoids this generic method addition?

Member

@ray-chew
@ph-kev and I are a bit confused by this. The thermo_params shouldn't need to be moved to the device because it is already isbits.
For example:

julia> function bar(x, s)
       x +  s.grav
       end
bar (generic function with 1 method)

julia> params = ClimaLand.Parameters.LandParameters(Float32).thermo_params
Thermodynamics.Parameters.ThermodynamicsParameters{Float32}(273.16f0, 101325.0f0, 100000.0f0, 1859.0f0, 4181.0f0, 2100.0f0, 2.5008f6, 2.8344f6, 611.657f0, 273.16f0, 273.15f0, 1.0f0, 1000.0f0, 150.0f0, 298.15f0, 6864.8f0, 10513.6f0, 0.2857143f0, 8.31446f0, 0.02897f0, 0.01801528f0, 290.0f0, 220.0f0, 9.81f0, 233.0f0, 1.0f0)

julia> c = CUDA.cu([1.0,2.0])
2-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 1.0
 2.0

julia> bar.(c, params)
2-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 10.81
 11.81
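
The isbits point above can be checked directly with Base Julia. The sketch below uses two hypothetical structs, `ToyThermoParams` and `ToyTableParams`, which are illustrative stand-ins and not the real Thermodynamics.jl types. It assumes only that a parameter struct made of plain floating-point fields behaves like the first one.

```julia
# Hypothetical parameter struct holding only plain numeric fields.
struct ToyThermoParams{FT}
    grav::FT
    T_freeze::FT
end

# Hypothetical parameter struct that also holds a heap-allocated array.
struct ToyTableParams{FT}
    grav::FT
    table::Vector{FT}
end

p = ToyThermoParams{Float32}(9.81f0, 273.15f0)
q = ToyTableParams{Float32}(9.81f0, Float32[1, 2, 3])

# An immutable struct whose fields are all plain bits is itself isbits.
p_is_bits = isbits(p)

# A Vector field is a reference to host memory, so this struct is not isbits.
q_is_bits = isbits(q)

# The same check at the type level.
p_type_is_bits = isbitstype(ToyThermoParams{Float32})
```

Here `p_is_bits` is `true` and `q_is_bits` is `false`. Only a struct of the second kind would need its array field converted before use on a device, which is consistent with the observation that thermo_params needs no explicit move.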

Member

@ph-kev @imreddyTeja: Confirming that I can replicate your results where c is a ClimaCore.Fields.Field (VIJFH layout) and params has the same type as above. @ray-chew and I looked over the test setup where this issue popped up. It turned out to be related to the Broadcast space mismatch error: the test problem computes subgrid variables on the CPU on a lat-long grid from a source dataset and then moves them to the GPU, and the inconsistent spaces meant that to_device was being used as a somewhat hacky workaround. A better solution seems to be to use Fields.Field(Fields.field_values(x), S) and to ensure that the target space S is always identical (thermo_params don't need additional manipulation). His test case now runs on the GPU following this change. We can discuss this further, but I'm closing this PR for now.
Thanks @ray-chew @ph-kev @imreddyTeja.
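
The rewrapping idea described above can be sketched with toy types in plain Julia. `ToySpace`, `ToyField`, `field_values`, and `add_fields` are hypothetical stand-ins and do not reproduce ClimaCore's actual types or its broadcast checks. The sketch simply assumes that combining two fields requires both to refer to the same space object, and shows the values of one field being rewrapped on the shared target space `S`.

```julia
# Toy stand-ins; these are NOT ClimaCore types.
mutable struct ToySpace
    npoints::Int
end

struct ToyField{V <: AbstractVector}
    values::V
    space::ToySpace
end

# Toy analogue of extracting the raw values from a field.
field_values(f::ToyField) = f.values

# Combining two fields requires the very same space object (a simplifying
# assumption for this sketch).
function add_fields(a::ToyField, b::ToyField)
    a.space === b.space || error("toy broadcast space mismatch")
    return ToyField(a.values .+ b.values, a.space)
end

S = ToySpace(3)                    # target space used by the model
preprocessing_space = ToySpace(3)  # same shape, but a different object

model_field = ToyField([1.0, 2.0, 3.0], S)
orography = ToyField([10.0, 20.0, 30.0], preprocessing_space)

# Combining fields on different spaces fails.
mismatch_detected = try
    add_fields(model_field, orography)
    false
catch
    true
end

# Rewrap the preprocessed values on the shared target space, then combine.
orography_on_S = ToyField(field_values(orography), S)
total = add_fields(model_field, orography_on_S)
```

In the toy model, the mismatch disappears once both fields share `S`, without any generic device-conversion fallback.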

@imreddyTeja imreddyTeja self-requested a review August 15, 2025 17:09
@akshaysridhar akshaysridhar mentioned this pull request Aug 15, 2025