The high-level view of the design is:
(different operators require different arguments, and therefore different
types and amounts of shmem).
- Recursively fill the shmem for all `StencilBroadcasted`. This is done
  by reading the argument data from `getidx`. See the discussion in the
  section below for more details.
- The destination field is filled with the result of `getidx` (as it is without
  shmem), except that we overload `getidx` (for supported `StencilBroadcasted`
  types) to retrieve the result of `getidx` via `fd_operator_evaluate`, which
  retrieves the result from shmem instead of global memory. A simplified sketch
  of this dispatch is shown just after this list.
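
Below is a minimal, self-contained sketch of this dispatch idea. It is not
ClimaCore's actual implementation: `ToyStencilBroadcasted`, the style types, and
`toy_getidx` are hypothetical stand-ins for `StencilBroadcasted`, `getidx`, and
`fd_operator_evaluate`, and the "operator" is reduced to a sum over its
arguments.

```julia
# Hypothetical stand-ins for the real broadcast styles and types; none of
# these are ClimaCore's actual definitions.
struct ToyGlobalStyle end
struct ToyShmemStyle end

struct ToyStencilBroadcasted{Style}
    args::Tuple                          # operator arguments (stand-ins for fields)
    arg_shmem::Vector{Vector{Float64}}   # shared-memory copies of the arguments
end

# Without shmem: evaluate the toy "operator" (a sum) by reading the argument
# data directly; on the GPU these would be global-memory reads.
toy_getidx(bc::ToyStencilBroadcasted{ToyGlobalStyle}, idx) =
    sum(arg[idx] for arg in bc.args)

# With shmem: the overloaded method evaluates the same toy operator from the
# shared-memory copies instead, analogous to calling `fd_operator_evaluate`.
toy_getidx(bc::ToyStencilBroadcasted{ToyShmemStyle}, idx) =
    sum(sh[idx] for sh in bc.arg_shmem)

# Usage sketch:
# a, b = rand(4), rand(4)
# bc = ToyStencilBroadcasted{ToyShmemStyle}((a, b), [copy(a), copy(b)])
# toy_getidx(bc, 2)   # reads the shmem copies, not `a`/`b`
```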
### Populating shared memory and memory access safety

We use tail-recursion when filling the shared memory of the broadcast
expressions. That is, we visit the leaves of the broadcast expression first,
then work our way up. It's important to note that `StencilBroadcasted` and
`Broadcasted` nodes can be interleaved.

Let's take `DivergenceF2C()(f*GradientC2F()(a*b))` as an example (depicted in
the image below).
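
To make the nesting concrete, the sketch below spells out how the stencil and
pointwise nodes of this expression interleave. The node types are hypothetical
stand-ins, not ClimaCore's actual `StencilBroadcasted`/`Broadcasted`.

```julia
# Hypothetical node types standing in for the lazy broadcast tree.
struct StencilNode{Op, Args}      # stand-in for `StencilBroadcasted`
    op::Op
    args::Args
end
struct PointwiseNode{F, Args}     # stand-in for `Broadcasted`
    f::F
    args::Args
end
struct Leaf                       # stand-in for a field read from global memory
    name::Symbol
end

# Assumed nesting for `DivergenceF2C()(f*GradientC2F()(a*b))`:
expr = StencilNode(:DivergenceF2C, (
    PointwiseNode(*, (
        Leaf(:f),
        StencilNode(:GradientC2F, (
            PointwiseNode(*, (Leaf(:a), Leaf(:b))),
        )),
    )),
))
```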
Recursion must go through the entire expression in order to ensure that we've
reached all of the leaves of the `StencilBroadcasted` objects (otherwise, we
could introduce race conditions with memory access). The leaves of the
`StencilBroadcasted` will call `getidx`, below which there are (by definition)
no more `StencilBroadcasted`, and those `getidx` calls will read from global
memory. All subsequent reads will be from shmem (as they will be caught by the
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
method defined in the `ClimaCoreCUDAExt` module).
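
The sketch below reuses the hypothetical node types from the previous example
to illustrate the traversal order: we recurse into children (including plain
pointwise nodes) before handling a stencil node, so inner operators are filled
before outer ones. It is a simplified stand-in, not ClimaCore's actual
recursion.

```julia
# Leaves need no shmem fill; their data is read from global memory by `getidx`.
fill_node_shmem!(::Leaf) = nothing

# Plain pointwise nodes have no shmem of their own, but we still recurse
# through them to reach any stencil nodes nested underneath.
fill_node_shmem!(node::PointwiseNode) = foreach(fill_node_shmem!, node.args)

function fill_node_shmem!(node::StencilNode)
    # Visit children first, so inner stencil operators are handled before
    # this one (bottom-up, as in the diagram below).
    foreach(fill_node_shmem!, node.args)
    # In the real code, this is where the operator's shmem would be filled
    # (cf. `fd_operator_fill_shmem!`): one global-memory read per point, after
    # which `getidx` reads of this node come from shmem instead.
    return nothing
end

fill_node_shmem!(expr)   # handles GradientC2F before DivergenceF2C
```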
In the diagram below, we traverse and fill the yellow highlighted sections
(bottom first and top last). The algorithmic impact of using shared memory is
that the duplicate global memory reads (highlighted in red circles) become one
global memory read (performed in `fd_operator_fill_shmem!`).

Finally, it's important to note that threads must be synchronized after each
node in the tree is filled, to avoid race conditions for subsequent
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
calls (whose results are retrieved from shmem).
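
The hypothetical CUDA.jl kernel fragment below (not ClimaCore's actual kernel)
shows why: each thread writes one shmem entry for a node, and neighboring
threads read that entry afterwards, so `sync_threads()` must separate the
writes from the reads.

```julia
using CUDA

# Hypothetical kernel fragment. Each thread writes one shmem entry, then
# neighboring threads read it, so the block must synchronize in between to
# avoid a read-before-write race.
function toy_column_kernel!(out, a, b)
    i = threadIdx().x
    shmem = CuStaticSharedArray(Float64, 256)    # assumes length(a) <= 256
    if i <= length(a)
        @inbounds shmem[i] = a[i] * b[i]         # fill this node's shmem
    end
    sync_threads()                               # writes must be visible before neighbor reads
    if 1 < i < length(a)
        @inbounds out[i] = shmem[i + 1] - shmem[i - 1]   # stencil reads hit shmem
    end
    return nothing
end

# Usage sketch:
# a, b = CUDA.rand(Float64, 256), CUDA.rand(Float64, 256)
# out = CUDA.zeros(Float64, 256)
# @cuda threads=256 toy_column_kernel!(out, a, b)
```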
![](shmem_diagram_example.png)