66
77-->
88
9- # ` DO CONCURENT ` mapping to OpenMP
9+ # ` DO CONCURRENT ` mapping to OpenMP
1010
1111``` {contents}
1212---
@@ -17,267 +17,52 @@ local:
1717This document seeks to describe the effort to parallelize ` do concurrent ` loops
1818by mapping them to OpenMP worksharing constructs. The goals of this document
1919are:
20- * Describing how to instruct ` flang-new ` to map ` DO CONCURENT ` loops to OpenMP
20+ * Describing how to instruct ` flang ` to map ` DO CONCURRENT ` loops to OpenMP
2121 constructs.
2222* Tracking the current status of such mapping.
23- * Describing the limitations of the current implmenentation .
23+ * Describing the limitations of the current implementation .
2424* Describing next steps.
25+ * Tracking the current upstreaming status (from the AMD ROCm fork).
2526
2627## Usage
2728
28- In order to enable ` do concurrent ` to OpenMP mapping, ` flang-new ` adds a new
29- compiler flag: ` -fdo-concurrent-parallel ` . This flags has 3 possible values:
30- 1 . ` host ` : this maps ` do concurent ` loops to run in parallel on the host CPU.
29+ In order to enable ` do concurrent ` to OpenMP mapping, ` flang ` adds a new
30+ compiler flag: ` -fdo-concurrent-to-openmp ` . This flag has 3 possible values:
31+ 1 . ` host ` : this maps ` do concurrent ` loops to run in parallel on the host CPU.
3132 This maps such loops to the equivalent of ` omp parallel do ` .
32- 2 . ` device ` : this maps ` do concurent ` loops to run in parallel on a device
33- (GPU). This maps such loops to the equivalent of `omp target teams
34- distribute parallel do`.
35- 3 . ` none ` : this disables ` do concurrent ` mapping altogether. In such case, such
33+ 2 . ` device ` : this maps ` do concurrent ` loops to run in parallel on a target device.
34+ This maps such loops to the equivalent of
35+ ` omp target teams distribute parallel do` .
36+ 3 . ` none ` : this disables ` do concurrent ` mapping altogether. In that case, such
3637 loops are emitted as sequential loops.
3738
38- The above compiler switch is currently avaialble only when OpenMP is also
39- enabled. So you need to provide the following options to flang in order to
40- enable it:
39+ The ` -fdo-concurrent-to-openmp ` compiler switch is currently available only when
40+ OpenMP is also enabled. So you need to provide the following options to flang in
41+ order to enable it:
4142```
42- flang-new ... -fopenmp -fdo-concurrent-parallel =[host|device|none] ...
43+ flang ... -fopenmp -fdo-concurrent-to-openmp =[host|device|none] ...
4344```
45+ For mapping to device, the target device architecture must be specified as well.
46+ See ` -fopenmp-targets ` and ` --offload-arch ` for more info.
4447
4548## Current status
4649
4750Under the hood, ` do concurrent ` mapping is implemented in the
4851` DoConcurrentConversionPass ` . This is still an experimental pass which means
4952that:
5053* It has been tested in a very limited way so far.
51- * It has been tested on simple synthetic inputs.
54+ * It has been tested mostly on simple synthetic inputs.
5255
53- To describe current status in more detail, following is a description of how
54- the pass currently behaves for single-range loops and then for multi-range
55- loops.
56-
57- ### Single-range loops
58-
59- Given the following loop:
60- ``` fortran
61- do concurrent(i=1:n)
62- a(i) = i * i
63- end do
64- ```
65-
66- #### Mapping to ` host `
67-
68- Mapping this loop to the ` host ` , generates MLIR operations of the following
69- structure:
70-
71- ``` mlir
72- %4 = fir.address_of(@_QFEa) ...
73- %6:2 = hlfir.declare %4 ...
74-
75- omp.parallel {
76- // Allocate private copy for `i`.
77- %19 = fir.alloca i32 {bindc_name = "i"}
78- %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
79-
80- omp.wsloop {
81- omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
82- %23 = fir.convert %arg0 : (index) -> i32
83- // Use the privatized version of `i`.
84- fir.store %23 to %20#1 : !fir.ref<i32>
85- ...
86-
87- // Use "shared" SSA value of `a`.
88- %42 = hlfir.designate %6#0
89- hlfir.assign %35 to %42
90- ...
91- omp.yield
92- }
93- omp.terminator
94- }
95- omp.terminator
96- }
97- ```
98-
99- #### Mapping to ` device `
100-
101- Mapping the same loop to the ` device ` , generates MLIR operations of the
102- following structure:
103-
104- ``` mlir
105- // Map `a` to the `target` region.
106- %29 = omp.map.info ... {name = "_QFEa"}
107- omp.target ... map_entries(..., %29 -> %arg4 ...) {
108- ...
109- %51:2 = hlfir.declare %arg4
110- ...
111- omp.teams {
112- // Allocate private copy for `i`.
113- %52 = fir.alloca i32 {bindc_name = "i"}
114- %53:2 = hlfir.declare %52
115- ...
116-
117- omp.distribute {
118- omp.parallel {
119- omp.wsloop {
120- omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
121- // Use the privatized version of `i`.
122- %56 = fir.convert %arg5 : (index) -> i32
123- fir.store %56 to %53#1
124- ...
125- // Use the mapped version of `a`.
126- ... = hlfir.designate %51#0
127- ...
128- }
129- omp.terminator
130- }
131- omp.terminator
132- }
133- omp.terminator
134- }
135- omp.terminator
136- }
137- omp.terminator
138- }
139- ```
140-
141- ### Multi-range loops
142-
143- The pass currently supports multi-range loops as well. Given the following
144- example:
145-
146- ``` fortran
147- do concurrent(i=1:n, j=1:m)
148- a(i,j) = i * j
149- end do
150- ```
151-
152- The generated ` omp.loop_nest ` operation look like:
153-
154- ``` mlir
155- omp.loop_nest (%arg0, %arg1)
156- : index = (%17, %19) to (%18, %20)
157- inclusive step (%c1_2, %c1_4) {
158- fir.store %arg0 to %private_i#1 : !fir.ref<i32>
159- fir.store %arg1 to %private_j#1 : !fir.ref<i32>
160- ...
161- omp.yield
162- }
163- ```
164-
165- It is worth noting that we have privatized versions for both iteration
166- variables: ` i ` and ` j ` . These are locally allocated inside the parallel/target
167- OpenMP region similar to what the single-range example in previous section
168- shows.
169-
170- #### Multi-range and perfectly-nested loops
171-
172- Currently, on the ` FIR ` dialect level, the following 2 loops are modelled in
173- exactly the same way:
174-
175- ``` fortran
176- do concurrent(i=1:n, j=1:m)
177- a(i,j) = i * j
178- end do
179- ```
180-
181- ``` fortran
182- do concurrent(i=1:n)
183- do concurrent(j=1:m)
184- a(i,j) = i * j
185- end do
186- end do
187- ```
188-
189- Both of the above loops are modelled as:
190-
191- ``` mlir
192- fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
193- ...
194- fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
195- ...
196- }
197- }
198- ```
199-
200- Consequently, from the ` DoConcurrentConversionPass ` ' perspective, both loops
201- are treated in the same manner. Under the hood, the pass detects
202- perfectly-nested loop nests and maps such nests as if they were multi-range
203- loops.
204-
205- #### Non-perfectly-nested loops
206-
207- One limitation that the pass currently have is that it treats any intervening
208- code in a loop nest as being disruptive to detecting that nest as a single
209- unit. For example, given the following input:
210-
211- ``` fortran
212- do concurrent(i=1:n)
213- x = 41
214- do concurrent(j=1:m)
215- a(i,j) = i * j
216- end do
217- end do
218- ```
219-
220- Since there at least one statement between the 2 loop header (i.e. ` x = 41 ` ),
221- the pass does not detect the ` i ` and ` j ` loops as a nest. Rather, the pass in
222- that case only maps the ` i ` loop to OpenMP and leaves the ` j ` loop in its
223- origianl form. In theory, in this example, we can sink the intervening code
224- into the ` j ` loop and detect the complete nest. However, such transformation is
225- still to be implemented in the future.
226-
227- The above also has the consequence that the ` j ` variable will ** not** be
228- privatized in the OpenMP parallel/target region. In other words, it will be
229- treated as if it was a ` shared ` variable. For more details about privatization,
230- see the "Data environment" section below.
231-
232- ### Data environment
233-
234- By default, variables that are used inside a ` do concurernt ` loop nest are
235- either treated as ` shared ` in case of mapping to ` host ` , or mapped into the
236- ` target ` region using a ` map ` clause in case of mapping to ` device ` . The only
237- exceptions to this are:
238- 1 . the loop's iteration variable(s) (IV) of ** perfect** loop nests. In that
239- case, for each IV, we allocate a local copy as shown the by the mapping
240- examples above.
241- 1 . any values that are from allocations outside the loop nest and used
242- exclusively inside of it. In such cases, a local privatized
243- value is created in the OpenMP region to prevent multiple teams of threads
244- from accessing and destroying the same memory block which causes runtime
245- issues. For an example of such cases, see
246- ` flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90 ` .
247-
248- #### Non-perfectly-nested loops' IVs
249-
250- For non-perfectly-nested loops, the IVs are still treated as ` shared ` or
251- ` map ` entries as pointed out above. This ** might not** be consistent with what
252- the Fortran specficiation tells us. In particular, taking the following
253- snippets from the spec (version 2023) into account:
254-
255- > § 3.35
256- > ------
257- > construct entity
258- > entity whose identifier has the scope of a construct
259-
260- > § 19.4
261- > ------
262- > A variable that appears as an index-name in a FORALL or DO CONCURRENT
263- > construct, or ... is a construct entity. A variable that has LOCAL or
264- > LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
265- > ...
266- > The name of a variable that appears as an index-name in a DO CONCURRENT
267- > construct, FORALL statement, or FORALL construct has a scope of the statement
268- > or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
269- > CONCURRENT construct has the scope of that construct.
270-
271- From the above quotes, it seems there is an equivalence between the IV of a `do
272- concurrent` loop and a variable with a ` LOCAL` locality specifier (equivalent
273- to OpenMP's ` private ` clause). Which means that we should probably
274- localize/privatize a ` do concurernt ` loop's IV even if it is not perfectly
275- nested in the nest we are parallelizing. For now, however, we ** do not** do
276- that as pointed out previously. In the near future, we propose a middle-ground
277- solution (see the Next steps section for more details).
56+ <!--
57+ More details about current status will be added along with relevant parts of the
58+ implementation in later upstreaming patches.
59+ -->
27860
27961## Next steps
28062
63+ This section describes some of the open questions/issues that are not tackled yet
64+ even in the downstream implementation.
65+
28166### Delayed privatization
28267
28368So far, we emit the privatization logic for IVs inline in the parallel/target
@@ -296,25 +81,46 @@ loop nest in a more fine-grained way. Implementing these specifiers on the
29681` FIR ` dialect level is needed in order to support this in the
29782` DoConcurrentConversionPass ` .
29883
299- Such specified will also unlock a potential solution to the
84+ Such specifiers will also unlock a potential solution to the
30085non-perfectly-nested loops' IVs issue described above. In particular, for a
30186non-perfectly nested loop, one middle-ground proposal/solution would be to:
30287* Emit the loop's IV as shared/mapped just like we do currently.
30388* Emit a warning that the IV of the loop is emitted as shared/mapped.
30489* Given support for ` LOCAL ` , we can recommend the user to explicitly
30590 localize/privatize the loop's IV if they choose to.
30691
92+ #### Sharing TableGen clause records from the OpenMP dialect
93+
94+ At the moment, the FIR dialect does not have a way to model locality specifiers
95+ on the IR level. Instead, something similar to early/eager privatization in OpenMP
96+ is done for the locality specifiers in ` fir.do_loop ` ops. Having locality specifiers
97+ modelled in a way similar to delayed privatization (i.e. the ` omp.private ` op) and
98+ reductions (i.e. the ` omp.declare_reduction ` op) can make mapping ` do concurrent `
99+ to OpenMP (and other parallel programming models) much easier.
100+
101+ Therefore, one way to approach this problem is to extract the TableGen records
102+ for relevant OpenMP clauses in a shared dialect for "data environment management"
103+ and use these shared records for OpenMP, ` do concurrent ` , and possibly OpenACC
104+ as well.
105+
106+ #### Supporting reductions
107+
108+ Similar to locality specifiers, mapping reductions from ` do concurrent ` to OpenMP
109+ is also still an open TODO. We can potentially extend the MLIR infrastructure
110+ proposed in the previous section to share reduction records among the different
111+ relevant dialects as well.
112+
307113### More advanced detection of loop nests
308114
309115As pointed out earlier, any intervening code between the headers of 2 nested
310- ` do concurrent ` loops prevents us currently from detecting this as a loop nest.
311- In some cases this is overly conservative. Therefore, a more flexible detection
312- logic of loop nests needs to be implemented.
116+ ` do concurrent ` loops prevents us from detecting this as a loop nest. In some
117+ cases this is overly conservative. Therefore, a more flexible detection logic
118+ of loop nests needs to be implemented.
313119
314120### Data-dependence analysis
315121
316122Right now, we map loop nests without analysing whether such mapping is safe to
317- do or not. We probalby need to at least warn the use of unsafe loop nests due
123+ do or not. We probably need to at least warn the user of unsafe loop nests due
318124to loop-carried dependencies.
319125
320126### Non-rectangular loop nests
@@ -330,3 +136,21 @@ end do
330136```
331137We defer this to the (hopefully) near future when we get the conversion in a
332138good shape for the samples/projects at hand.
139+
140+ ### Generalizing the pass to other parallel programming models
141+
142+ Once we have a stable and capable ` do concurrent ` to OpenMP mapping, we can take
143+ this in a more generalized direction and allow the pass to target other models;
144+ e.g. OpenACC. This goal should be kept in mind from the get-go even while only
145+ targeting OpenMP.
146+
147+
148+ ## Upstreaming status
149+
150+ - [x] Command line options for ` flang ` and ` bbc ` .
151+ - [x] Conversion pass skeleton (no transformations happen yet).
152+ - [x] Status description and tracking document (this document).
153+ - [ ] Basic host/CPU mapping support.
154+ - [ ] Basic device/GPU mapping support.
155+ - [ ] More advanced host and device support (expanded to multiple items as needed).
156+