Clarification on GPU execution semantics in AMReX #4850

chaitanya2596 · 2025-12-14T13:39:51Z

Hi AMReX team,

I’m trying to understand the default execution and data movement semantics when AMReX is built and run in GPU mode. I’d appreciate clarification on the following points:

1. When using MFIter with ParallelFor on GPU, are GPU kernels launched per box, or does AMReX ever combine multiple boxes into a single kernel launch?
2. Are GPU kernels launched by ParallelFor asynchronous with respect to the host by default, or are there implicit synchronization points users should be aware of?
3. Besides explicit calls to Gpu::synchronize(), are there implicit synchronizations (e.g., at MFIter boundaries, at the end of FillBoundary(), or before returning to host-only code)?

Any pointers to relevant documentation or source locations would be very helpful.

Thanks in advance!

WeiqunZhang (Maintainer) · 2025-12-14T22:01:42Z

1. When ParallelFor is used inside an MFIter loop, each call usually launches a GPU kernel on a single Box. There are other versions of ParallelFor that process multiple boxes in a single kernel launch. See https://amrex-codes.github.io/amrex/doxygen/namespaceamrex_1_1experimental.html#ac79f243c0680723c0e561f4479ca6539
2. Yes, GPU kernels launched by ParallelFor are asynchronous with respect to the host (https://amrex-codes.github.io/amrex/docs_html/GPU.html?highlight=synchronicity#stream-and-synchronization). The only exception in the ParallelFor family is ParallelForRNG, which is used when the GPU kernel generates random numbers. It contains an implicit stream synchronization to avoid potential race conditions.
3. Yes, there are implicit synchronizations in a number of situations. Some of them are necessary. For example, MultiFab::Dot returns a value on the host, so it must synchronize internally. Another example is FillBoundary: MPI has no concept of GPU streams, so we must synchronize before passing our data to MPI functions. Other synchronizations are there for safety. The most notable example in this category is that there are synchronizations at both the start and the end of MFIter. However, one can disable the synchronization in MFIter with MFItInfo (e.g., MFIter mfi(mf, MFItInfo{}.DisableDeviceSync())). One can also use NoSyncRegion (https://amrex-codes.github.io/amrex/doxygen/structamrex_1_1Gpu_1_1NoSyncRegion.html#a145be1ff4432aa6033daa6e1bed33e89) to disable GPU synchronization in the current C++ scope.
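
To make the three points above concrete, here is a small host-only toy model in standard C++. It is not AMReX code and does not reflect AMReX internals: `ToyStream`, `fill_per_box`, `fill_fused`, and `sum` are hypothetical names invented for this sketch, and each "box" is just a `std::vector<double>`. The model assumes the worst case an asynchronous stream permits, namely that queued kernels do not run until `synchronize()` is called, so the difference between "launched" and "finished" is deterministic and easy to see.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy stand-in for a GPU stream (NOT AMReX code): launch() only enqueues
// work, and nothing is guaranteed to have run until synchronize() returns.
class ToyStream {
public:
    void launch(std::function<void()> kernel) {
        pending_.push(std::move(kernel));
        ++launches_;
    }
    void synchronize() {
        while (!pending_.empty()) {
            pending_.front()();  // run kernels in submission order
            pending_.pop();
        }
    }
    std::size_t launches() const { return launches_; }

private:
    std::queue<std::function<void()>> pending_;
    std::size_t launches_ = 0;
};

// Each inner vector plays the role of the data on one box.
using Boxes = std::vector<std::vector<double>>;

// One launch per box, analogous to calling ParallelFor inside an MFIter loop.
void fill_per_box(ToyStream& s, Boxes& boxes, double value) {
    for (auto& box : boxes) {
        s.launch([&box, value] { for (double& x : box) { x = value; } });
    }
}

// One launch covering every box, in the spirit of the multi-box ParallelFor.
void fill_fused(ToyStream& s, Boxes& boxes, double value) {
    s.launch([&boxes, value] {
        for (auto& box : boxes) {
            for (double& x : box) { x = value; }
        }
    });
}

// A result returned to the host forces a synchronization before returning,
// which is the reason given above for the implicit sync in MultiFab::Dot.
double sum(ToyStream& s, const Boxes& boxes) {
    assert(!boxes.empty());
    double total = 0.0;  // stands in for a device-side accumulator
    s.launch([&boxes, &total] {
        for (const auto& box : boxes) {
            for (double x : box) { total += x; }
        }
    });
    s.synchronize();  // without this, total would still be 0.0 here
    return total;
}
```

In this model, calling `fill_per_box` on four boxes records four launches, and the data is still unchanged on the host until `synchronize()` is called. `fill_fused` does the same work with a single launch, and `sum` has to synchronize internally because its return value is consumed on the host.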

1 reply

chaitanya2596 Dec 15, 2025
Author

Thank you for the detailed and very clear explanations. This clarifies the kernel launch granularity, asynchrony guarantees, and the locations of implicit synchronizations in AMReX.
