
Commit 34c6452

jhuber6 authored and memfrob committed
[OpenMP] Add remark documentation to the OpenMP webpage
This patch begins adding documentation for each remark emitted by
`openmp-opt`. This builds on the IDs introduced in D105939 so that users can
more easily identify each remark in the webpage.

Depends on D105939.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106018

18 files changed: +996, -38 lines

openmp/docs/remarks/OMP100.rst

Potentially unknown OpenMP target region caller `[OMP100]`
==========================================================

.. _omp100:
.. _omp_no_external_caller_in_target_region:

A function remark that indicates the function, when compiled for a GPU, is
potentially called from outside the translation unit. Note that a remark is
only issued if we tried to perform an optimization which would require us to
know all callers on the GPU.

To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
which the code that makes up the body of a parallel region is shared with the
threads in the team. Generally we use the address of the outlined parallel
region to identify the code that needs to be executed. If we know all target
regions that reach the parallel region we can avoid this function pointer
passing scheme and often improve the register usage on the GPU. However, if a
parallel region on the GPU is in a function with external linkage we may not
know all callers statically. If there are outside callers within target
regions, this remark is to be ignored. If there are no such callers, users can
modify the linkage and thereby help optimization with a `static` or
`__attribute__((internal_linkage))` function annotation. If changing the
linkage is impossible, e.g., because there are outside callers on the host,
one can split the function into an externally visible interface which is not
compiled for the target and an internal implementation which is compiled for
the target and should be called from within the target region, as sketched
below.
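
A minimal sketch of that splitting, assuming a hypothetical function
``compute`` that must stay callable from the host (all names here are
illustrative, not part of the remark):

.. code-block:: c++

  // Internal implementation, compiled for the target. Its internal linkage
  // lets the optimizer see every GPU caller of the parallel region.
  static void compute_impl(int *data, int n) {
  #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      data[i] += 1;
  }

  // Externally visible interface for outside callers on the host; it is not
  // called from any target region and hence not compiled for the target.
  void compute(int *data, int n) { compute_impl(data, n); }

  void launch(int *data, int n) {
  #pragma omp target map(tofrom : data[0:n])
    compute_impl(data, n); // only the internal version runs on the target
  }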

openmp/docs/remarks/OMP101.rst

Parallel region is used in unknown / unexpected ways. Will not attempt to rewrite the state machine. [OMP101]
=============================================================================================================

.. _omp101:

An analysis remark that indicates that a parallel region has unknown calls.
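
As an analysis remark, it is printed with ``-Rpass-analysis=openmp-opt``. A
purely illustrative sketch of a situation consistent with the description
above (file and function names are hypothetical): a call from inside the
parallel region to a function the optimizer cannot see leaves the region with
an unknown call.

.. code-block:: c++

  extern void work(); // defined in another translation unit

  int main() {
  #pragma omp target parallel
    work(); // an unknown call inside the parallel region
  }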

openmp/docs/remarks/OMP102.rst

Parallel region is not called from a unique kernel. Will not attempt to rewrite the state machine. [OMP102]
===========================================================================================================

.. _omp102:

This analysis remark indicates that a given parallel region is called by
multiple kernels. This prevents the compiler from optimizing it for a single
kernel and rewriting the state machine.
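
As an illustrative sketch (not taken from the patch), a parallel region
reachable from two different target regions has no unique kernel caller,
which matches the situation this remark describes:

.. code-block:: c++

  void foo() {
  #pragma omp parallel
    { /* parallel region shared by both kernels */ }
  }

  int main() {
  #pragma omp target // first kernel
    foo();
  #pragma omp target // second kernel
    foo();
  }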

openmp/docs/remarks/OMP110.rst

Moving globalized variable to the stack. [OMP110]
=================================================

.. _omp110:

This optimization remark indicates that a globalized variable was moved back to
thread-local stack memory on the device. This occurs when the optimization pass
can determine that a globalized variable cannot possibly be shared between
threads and globalization was unnecessary. Using stack memory is the best-case
scenario for data globalization as the variable can now be stored in fast
register files on the device. This optimization requires full visibility of each
variable.

Globalization typically occurs when a pointer to a thread-local variable escapes
the current scope. The compiler needs to be pessimistic and assume that the
pointer could be shared between multiple threads according to the OpenMP
standard. This is expensive on target offloading devices that do not allow
threads to share data by default. Instead, this data must be moved to memory
that can be shared, such as shared or global memory. This optimization moves the
data back from shared or global memory to thread-local stack memory if the data
is not actually shared between the threads.

Examples
--------

A trivial example of globalization can be seen below. The compiler sees that a
pointer to the thread-local variable ``x`` escapes the current scope and must
globalize it even though it is not actually necessary. Fortunately, this
optimization can undo the globalization by looking at the variable's usage.

.. code-block:: c++

  void use(int *x) { }

  void foo() {
    int x;
    use(&x);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
  omp110.cpp:6:7: remark: Moving globalized variable to the stack. [OMP110]
    int x;
        ^

A less trivial example can be seen using C++'s complex numbers. In this case the
overloaded arithmetic operators cause pointers to the complex numbers to escape
the current scope, but they can again be removed once the usage is visible.

.. code-block:: c++

  #include <complex>

  using complex = std::complex<double>;

  void zaxpy(complex *X, complex *Y, const complex D, int N) {
  #pragma omp target teams distribute parallel for firstprivate(D)
    for (int i = 0; i < N; ++i)
      Y[i] = D * X[i] + Y[i];
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
  In file included from omp110.cpp:1:
  In file included from /usr/bin/clang/lib/clang/13.0.0/include/openmp_wrappers/complex:27:
  /usr/include/c++/8/complex:328:20: remark: Moving globalized variable to the stack. [OMP110]
        complex<_Tp> __r = __x;
                     ^
  /usr/include/c++/8/complex:388:20: remark: Moving globalized variable to the stack. [OMP110]
        complex<_Tp> __r = __x;
                     ^

Diagnostic Scope
----------------

OpenMP target offloading optimization remark.

openmp/docs/remarks/OMP111.rst

Replaced globalized variable with X bytes of shared memory. [OMP111]
====================================================================

.. _omp111:

This optimization occurs when a globalized variable's data is shared between
multiple threads, but requires a static amount of memory that can be determined
at compile time. This is the case when only a single thread creates the memory
and then shares it with every thread. The memory can then be pushed to a
static buffer of shared memory on the device. This optimization allows users to
declare shared memory on the device without using OpenMP's allocators.

Globalization normally occurs when a pointer to a thread-local variable escapes
the current scope. If a single thread is responsible for creating and sharing
the data it can instead be mapped directly to shared memory on the target
device. Checking if only a single thread can execute an instruction requires
that the parent functions have internal linkage. Otherwise, an external caller
could invalidate this analysis by having multiple threads call that function.
The optimization pass can automatically make internal copies of each function,
but it is still recommended to mark them as internal using keywords like
``static`` whenever possible.

Example
-------

This optimization should apply to any variable declared in an OpenMP target
region that is then shared with every thread in a parallel region. This allows
the user to declare shared memory without using custom allocators. A simple
stencil calculation shows how this can be used.

.. code-block:: c++

  // MC, NC, and dX are assumed to be compile-time constants here; 32 x 32
  // doubles is consistent with the 8192 bytes reported in the remark below.
  #define MC 32
  #define NC 32
  constexpr double dX = 1.0;

  void stencil(int M, int N, double *X, double *Y) {
  #pragma omp target teams distribute collapse(2) \
      map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
    for (int i0 = 0; i0 < M; i0 += MC) {
      for (int j0 = 0; j0 < N; j0 += NC) {
        double sX[MC][NC];

  #pragma omp parallel for collapse(2) default(firstprivate)
        for (int i1 = 0; i1 < MC; ++i1)
          for (int j1 = 0; j1 < NC; ++j1)
            sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];

  #pragma omp parallel for collapse(2) default(firstprivate)
        for (int i1 = 1; i1 < MC - 1; ++i1)
          for (int j1 = 1; j1 < NC - 1; ++j1)
            Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
                                            sX[i1][j1 + 1] + sX[i1][j1 - 1] +
                                            -4.0 * sX[i1][j1]) / (dX * dX);
      }
    }
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
  omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
          double sX[MC][NC];
                 ^

Diagnostic Scope
----------------

OpenMP target offloading optimization remark.

openmp/docs/remarks/OMP112.rst

Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112]
=====================================================================================================

.. _omp112:

This missed remark indicates that a globalized value was found on the target
device that was neither replaced with stack memory by :ref:`OMP110 <omp110>`
nor with shared memory by :ref:`OMP111 <omp111>`. Globalization that has not
been removed will need to be handled by the runtime and will significantly
hurt performance.

The OpenMP standard expects that threads can always share their data with
each other. However, this is not true by default when offloading to a target
device such as a GPU. Threads on a GPU cannot share their data unless it is
first placed in global or shared memory. In order to create standards
compliant code, the Clang compiler will globalize any variables that could
potentially be shared between the threads. In the majority of cases,
globalized variables can either be returned to a thread-local stack or pushed
to shared memory. However, in a few cases globalization is necessary and will
cause a performance penalty.

Examples
--------

This example shows legitimate data sharing on the device. It is a convoluted
example, but is completely compliant with the OpenMP standard. If
globalization were not added, this code would produce different results on
different target devices.

.. code-block:: c++

  #include <omp.h>
  #include <cstdio>

  #pragma omp declare target
  static int *p;
  #pragma omp end declare target

  void foo() {
    int x = omp_get_thread_num();
    if (omp_get_thread_num() == 1)
      p = &x;

  #pragma omp barrier

    printf("Thread %d: %d\n", omp_get_thread_num(), *p);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
  omp112.cpp:9:7: remark: Found thread data sharing on the GPU. Expect degraded performance
  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
    int x = omp_get_thread_num();
        ^

A less convoluted example of globalization that cannot be removed occurs when
calling functions that aren't visible in the current translation unit.

.. code-block:: c++

  extern void use(int *x);

  void foo() {
    int x;
    use(&x);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
  omp112.cpp:4:7: remark: Found thread data sharing on the GPU. Expect degraded performance
  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
    int x;
        ^

Diagnostic Scope
----------------

OpenMP target offloading missed remark.

openmp/docs/remarks/OMP113.rst

Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as `__attribute__((noescape))` to override. [OMP113]
==============================================================================================================================================================

.. _omp113:

This missed remark indicates that a globalized value could not be moved to the
stack because it is potentially captured by a call to a function we cannot
analyze. In order for a globalized variable to be moved to the stack, no
copies of its pointer may be stored. Otherwise it is considered captured and
could potentially be shared between the threads. This can be overridden using
a parameter-level attribute as suggested in the remark text.

Globalization will occur when a pointer to a thread-local variable escapes
the current scope. In most cases it can be determined that the variable cannot
be shared if a copy of its pointer is never made. However, this remark
indicates that a copy of the pointer either is present, or is possible because
the variable is used outside the current translation unit.

Examples
--------

If a pointer to a thread-local variable is passed to a function not visible in
the current translation unit we need to assume a copy is made of it that can
be shared between the threads. This prevents :ref:`OMP110 <omp110>` from
triggering, which will result in a performance penalty when executing on the
target device.

.. code-block:: c++

  extern void use(int *x);

  void foo() {
    int x;
    use(&x);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-missed=openmp-opt omp113.cpp
  omp113.cpp:4:7: remark: Could not move globalized variable to the stack. Variable is
  potentially captured in call. Mark parameter as `__attribute__((noescape))` to
  override. [OMP113]
    int x;
        ^

As the remark suggests, this behaviour can be overridden using the ``noescape``
attribute. This tells the compiler that no reference to the object the pointer
points to that is derived from the parameter value will survive after the
function returns. The user is responsible for verifying that this assertion is
correct.

.. code-block:: c++

  extern void use(__attribute__((noescape)) int *x);

  void foo() {
    int x;
    use(&x);
  }

  int main() {
  #pragma omp target parallel
    foo();
  }

.. code-block:: console

  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp113.cpp
  omp113.cpp:4:7: remark: Moving globalized variable to the stack. [OMP110]
    int x;
        ^

Diagnostic Scope
----------------

OpenMP target offloading missed remark.
