@@ -481,35 +481,34 @@ better job when optimising the code.
481 481 Example 14: OpenACC
482 482 ^^^^^^^^^^^^^^^^^^^
483 483
484- Example of adding OpenACC directives in the dynamo0.3 API. This is a
485- work in progress so the generated code may not work as
486- expected. However it is never-the-less useful as a starting
487- point. Three scripts are provided.
488-
489- The first script (``acc_kernels.py``) shows how to add OpenACC Kernels
490- directives to the PSy-layer. This example only works with distributed
491- memory switched off as the OpenACC Kernels transformation does not yet
492- support halo exchanges within an OpenACC Kernels region.
493-
494- The second script (``acc_parallel.py``) shows how to add OpenACC Loop,
495- Parallel and Enter Data directives to the PSy-layer. Again this
496- example only works with distributed memory switched off as the OpenACC
497- Parallel transformation does not support halo exchanges within an
498- OpenACC Parallel region.
499-
500- The third script (``acc_parallel_dm.py``) is the same as the second
501- except that it does support distributed memory being switched on by
502- placing an OpenACC Parallel directive around each OpenACC Loop
503- directive, rather than having one for the whole invoke. This approach
504- avoids having halo exchanges within an OpenACC Parallel region.
505-
506- The generated code has a number of problems including 1) it does not
507- modify the kernels to include the OpenACC Routine directive, 2) a
508- loop's upper bound is computed via a derived type (this should be
509- computed beforehand) 3) set_dirty and set_clean calls are placed
510- within an OpenACC Parallel directive and 4) there are no checks on
511- whether loops are parallel or not, it is just assumed they are -
512- i.e. support for colouring or locking is not yet implemented.
484+ Example of adding OpenACC directives in the dynamo0.3 API.
485+ A single transformation script (``acc_parallel_dm.py``) is provided
486+ which demonstrates how to add OpenACC Loop, Parallel and Enter Data
487+ directives to the PSy-layer. It supports distributed memory being
488+ switched on by placing an OpenACC Parallel directive around each
489+ OpenACC Loop directive, rather than having one for the whole invoke.
490+ This approach avoids having halo exchanges within an OpenACC Parallel
491+ region. The script also uses :ref:`ACCRoutineTrans <available_kernel_trans>`
492+ to transform the one user-supplied kernel through
493+ the addition of an ``!$acc routine`` directive. This ensures that the
494+ compiler builds a version suitable for execution on the accelerator (GPU).
495+
496+ The generated code has two problems:
497+
498+ 1. There are no checks on whether loops are safe to parallelise or not,
499+ it is just assumed they are - i.e. support for colouring or locking
500+ is not yet implemented.
501+ 2. Although the user-supplied kernel is transformed so as to have the
502+ necessary ``!$acc routine`` directive, the associated (but unnecessary)
503+ ``use`` statement in the transformed Algorithm layer still uses the
504+ name of the original, untransformed kernel (issue #1724).
505+
506+ Since no colouring is required in this case, the generated Alg layer
507+ may be fixed by hand (by simply deleting the offending ``use`` statement)
508+ and the resulting code compiled and run on GPU. However, performance will
509+ be very poor as, with the limited optimisations and directives currently
510+ applied, the NVIDIA compiler refuses to run the user-supplied kernel in
511+ parallel.
513 512
514 513 Example 15: CPU Optimisation of Matvec
515 514 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^