@@ -1431,14 +1431,13 @@ Fast density calculation (for developers)
 
 :Author: Jacek Dziedzic, University of Southampton
 
-This section describes the "fast density" approach introduced in ONETEP 7.1.8 in January 2024,
-and extended in ONETEP 7.1.50 in July 2024.
+This section describes the "fast density" approach introduced in ONETEP 7.1.8 in January 2024.
 This is developer-oriented material -- for a user manual, see :ref:`user_fast_density`.
-This documentation pertains to ONETEP 7.1.50 and later.
+This documentation pertains to ONETEP 7.3.50 and later. If you are using a version of
+ONETEP older than 7.3.50, please update -- not everything you read here will be applicable otherwise.
 
-There are three slightly different methods for "fast density", selected via
-``fast_density_method 1``, ``fast_density_method 2``, and ``fast_density_method 3``.
-The first one is the default.
+There are two slightly different methods for "fast density", selected via
+``fast_density_method 1`` and ``fast_density_method 2``, with the latter now the default.
 
 We focus on the calculation on the double grid. If ``fine_grid_scale`` is different from 2.0,
 the density gets interpolated from the double to the fine grid, regardless of the approach
@@ -1500,7 +1499,7 @@ We address (2) and (3) by first interpolating only ``\phi_Aa``, and then
 communicating them to where they are needed (and where they become ``\phi_Bb``).
 We use ``remote_mod`` for that, which separates the comms from the FFTs.
 In ``fast_density_method 1`` we communicate ``\phi_Bb`` in the form of trimmed boxes.
-In ``fast_density_method 2`` and ``3`` we communicate ``\phi_Bb`` in PPDs on the coarse grid,
+In ``fast_density_method 2`` we communicate ``\phi_Bb`` in PPDs on the coarse grid,
 and process them at destination. Of course, when communicating trimmed boxes, we only communicate the relevant points,
 not the entire double FFT-boxes.
 
@@ -1515,14 +1514,7 @@ we need memory to store the trimmed NGWFs, we have to communicate trimmed NGWFs
 and we need to do ``rowsum_Aa = \sum_Bb K^{Aa,Bb} \phi_Bb`` on the new representation somehow. If the latter can be done efficiently,
 we are addressing (4) above, too.
 
-In ``fast_density_method 2`` we do the FFTs in the inner loop, but only for the rowsums,
-as trimmed ``\phi_Aa`` are stored for the entire duration of the inner loop.
-We use *bursts* (described earlier) to calculate products between ``\phi_Aa`` and
-``rowsum_Aa``, which is not very efficient. This method is more FFT-heavy, but
-does less comms, as we communicate NGWFs on the coarse grid. Memory footprint
-is still rather high, because the bursts are memory-hungry.
-
-In ``fast_density_method 3`` we similarly do the FFTs in the inner loop, but only for the rowsums,
+In ``fast_density_method 2`` we similarly do the FFTs in the inner loop, but only for the rowsums,
 as trimmed ``\phi_Aa`` are stored for the entire duration of the inner loop.
 However, we are smart and realize that we do not need the entire ``rowsum_Aa``,
 but only its part that overlaps with ``\phi_Aa`` -- as these are multiplied
@@ -1531,62 +1523,67 @@ we only keep a ``\phi_Aa``-shaped fragment. This process is called *moulding* (d
 -- we take data from a double FFT-box and mould it to a shape of a previously
 trimmed NGWF. In so doing, we avoid bursts altogether -- multiplying two
 trimmed quantities with the same shape ("mask") is a simple pointwise job.
-This method has a vastly smaller memory footprint, is as FFT-heavy as ``fast_density_method 2``,
-and is light on comms, because it only transmits NGWFs in PPDs on the coarse grid.
+This method has a vastly smaller memory footprint, and is light on comms, because it only
+transmits NGWFs in PPDs on the coarse grid.
 Finally, this method GPU-ports well.
 
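The trimming/moulding idea can be sketched with boolean masks in NumPy. This is an illustrative toy, not ONETEP code: once ``rowsum_Aa`` has been moulded to the mask of the trimmed ``\phi_Aa``, their product is a flat pointwise multiply.

```python
import numpy as np

# Hypothetical illustration (not ONETEP code) of trimming and moulding.
rng = np.random.default_rng(0)
fft_box = rng.standard_normal((8, 8, 8))     # stand-in for a double FFT-box holding phi_Aa
rowsum_box = rng.standard_normal((8, 8, 8))  # stand-in for the FFT-box holding rowsum_Aa
mask = rng.random((8, 8, 8)) < 0.2           # the shape ("mask") of the trimmed NGWF

phi_Aa = fft_box[mask]             # trimming: keep only the masked points, as a flat array
rowsum_moulded = rowsum_box[mask]  # moulding: same mask, so the shapes agree

contribution = phi_Aa * rowsum_moulded  # simple pointwise job, no bursts needed
```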
 The fast density approach thus proceeds in two stages -- one that is performed every time NGWFs change, and one that is performed
 in the inner loop. The details of the stages depend on ``fast_density_method``.
 
-The following table summarizes the main differences between the three methods.
+The following table summarizes the main differences between the two methods.
 
-+--------------------------+--------------------+-------------------+-------------------------+
-| Detail                   | method 1           | method 2          | method 3                |
-+==========================+====================+===================+=========================+
-| FFTs done for            | ``\phi_Aa``        | ``\phi_Aa`` in outer loop                   |
-|                          |                    |                                             |
-|                          | in outer loop only | ``rowsum_Aa`` in inner loop                 |
-+--------------------------+--------------------+-------------------+-------------------------+
-| FFT load                 | minimal            | ~half of original | ~half of original       |
-|                          |                    |                   |                         |
-|                          |                    |                   | can be done on GPU      |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Communicated NGWFs       | all required ``\phi_Bb``                                         |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Communicated how         | as trimmed boxes   | in PPDs on coarse                           |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Comms load               | significant        | minimal                                     |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Trimmed storage          | ``\phi_Aa``        | ``\phi_Aa``       | ``\phi_Aa``             |
-|                          |                    |                   |                         |
-|                          | ``\phi_Bb``        | ``rowsum_Aa``     | ``rowsum_Aa`` (moulded) |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Bursts                   | yes, many          | yes, few          | no                      |
-|                          |                    |                   |                         |
-|                          | (pairs *Aa-Bb*)    | (pairs *Aa-Aa*)   |                         |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Memory load              | high               | moderate          | low                     |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Expected CPU performance | very good          | poor              | good                    |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Expected GPU performance | very good          | poor              | excellent               |
-+--------------------------+--------------------+-------------------+-------------------------+
-
-Typical speed-ups obtained using fast density range from 2x to 6x for the total time spent
++--------------------------+--------------------+-----------------------------+
+| Detail                   | method 1           | method 2                    |
++==========================+====================+=============================+
+| FFTs done for            | ``\phi_Aa``        | ``\phi_Aa`` in outer loop   |
+|                          |                    |                             |
+|                          | in outer loop only | ``rowsum_Aa`` in inner loop |
++--------------------------+--------------------+-----------------------------+
+| FFT load                 | minimal            | ~half of original           |
+|                          |                    |                             |
+|                          |                    | can be done on GPU          |
++--------------------------+--------------------+-----------------------------+
+| Communicated NGWFs       | all required ``\phi_Bb``                         |
++--------------------------+--------------------+-----------------------------+
+| Communicated how         | as trimmed boxes   | in PPDs on coarse           |
++--------------------------+--------------------+-----------------------------+
+| Comms load               | significant        | minimal                     |
++--------------------------+--------------------+-----------------------------+
+| Trimmed storage          | ``\phi_Aa``        | ``\phi_Aa``                 |
+|                          |                    |                             |
+|                          | ``\phi_Bb``        | ``rowsum_Aa`` (moulded)     |
++--------------------------+--------------------+-----------------------------+
+| Bursts                   | yes, many          | no                          |
+|                          |                    |                             |
+|                          | (pairs *Aa-Bb*)    |                             |
++--------------------------+--------------------+-----------------------------+
+| Memory load              | high               | low                         |
++--------------------------+--------------------+-----------------------------+
+| Expected CPU performance | very good          | good                        |
++--------------------------+--------------------+-----------------------------+
+| Expected GPU performance | very good          | excellent                   |
++--------------------------+--------------------+-----------------------------+
+
+On a CPU, typical speed-ups obtained using fast density range from 2x to 6x for the total time spent
 calculating the density, and between 10% and 50% can be shaved off the total calculation walltime.
 
+On a GPU, typical speed-ups would be between 10x and 16x for the total time spent
+calculating the density, and between 50% and 70% shaved off the total calculation walltime.
+This is measured in a *fair comparison* -- a CPU-only compute node vs. the same compute node
+with a reasonable GPU (e.g. an NVIDIA A100).
+
 Cost
 ----
 
 The main drawback of fast density is increased memory consumption. There are two main components:
   (A) The trimmed NGWF data itself, which is, to a large extent, replicated.
       In ``fast_density_method 1`` a single trimmed
       NGWF can be needed on many processes, because it could be a ``\phi_Bb`` to many NGWFs Aa.
-      The same holds for NGWFs in PPDs for ``fast_density_method 2`` and ``3``.
+      The same holds for NGWFs in PPDs for ``fast_density_method 2``.
       Moreover, this memory requirement does not scale inverse-linearly with the number of processes.
       That is, increasing the node count by a factor of two doesn't reduce the memory requirement
       by a factor of two, because there is more replication.
-  (B) The burst data (in ``fast_density_method 1``, and, to a smaller extent, in ``fast_density_method 2``).
+  (B) The burst data (in ``fast_density_method 1``).
 
 Both (A) and (B) depend on the trimming threshold, and the shape of the NGWFs. Both tend to increase
 during the NGWF optimisation as the NGWFs delocalise somewhat.
@@ -1680,7 +1677,7 @@ with respect to one node. It is clear that the fast approach is quite a bit fast
 approach, although it does not scale that well to high core counts. Keep in mind that we pushed this
 system quite far by running 701 atoms on over 1600 CPU cores.
 
-More detailed benchmarks of ``fast_density_method 2`` and ``fast_density_method 3`` will follow soon.
+More detailed benchmarks of ``fast_density_method 2`` will follow at some point.
 
 Keywords
 --------
@@ -1700,11 +1697,9 @@ This approach could be improved in a number of ways:
    for more control over determinism vs efficiency. Currently we use ``SCHEDULE(STATIC)`` to get
    more deterministic results, but ``SCHEDULE(DYNAMIC)`` offers better efficiency. Toggling this
    at runtime is not trivial (``omp_set_schedule()``).
-4. Having a dynamic ``trimmed_boxes_threshold`` -- we could probably start the NGWF optimisation
-   with a cruder approximation, tightening it as we go along.
-5. Dynamically selecting ``MAX_TNGWF_SIZE``. It's currently a constant, and ``persistent_packed_tngwf``
+4. Dynamically selecting ``MAX_TNGWF_SIZE``. It's currently a constant, and ``persistent_packed_tngwf``
    is not an allocatable.
-6. A smarter way to flatten the computed density. Currently each process has their own density that
+5. A smarter way to flatten the computed density. Currently each process has their own density that
    spans the entire cell and only contains contributions from the *Aa* NGWF it owns. We flatten it
    by a series of reduce operations over all nodes. This is the main killer of parallel performance.
 
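For context, the flattening criticised in the last item amounts to an elementwise reduction of cell-sized partial densities. A toy NumPy sketch of the idea (illustrative only; the real code reduces across MPI ranks, and all names here are hypothetical):

```python
import numpy as np

# Each "process" holds a cell-sized array containing only the contributions
# from the Aa NGWFs it owns; the final density is the elementwise sum.
n_procs, n_cell_points = 4, 100
rng = np.random.default_rng(2)
partial = [np.zeros(n_cell_points) for _ in range(n_procs)]
for rank, p in enumerate(partial):
    owned = rng.choice(n_cell_points, size=10, replace=False)  # points this rank touches
    p[owned] = rng.random(10)

density = np.sum(partial, axis=0)  # stands in for the reduce over all nodes
```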
@@ -1720,7 +1715,8 @@ Fast local potential integrals (for developers)
 
 This section describes the "fast locpot int" approach introduced in ONETEP 7.1.50 in July 2024.
 This is developer-oriented material -- for a user manual, see :ref:`user_fast_locpot_int`.
-This documentation pertains to ONETEP 7.1.50 and later.
+This documentation pertains to ONETEP 7.3.50 and later. If you are using a version of
+ONETEP older than 7.3.50, please update -- not everything you read here will be applicable otherwise.
 
 We focus on the calculation on the double grid. If ``fine_grid_scale`` is different from 2.0,
 the local potential first gets filtered from the fine to the double grid, regardless of the approach
@@ -1789,12 +1785,12 @@ We proceed as follows:
 
 When using CPUs only, much of the time is spent in the Fourier filtering.
 With a GPU, this becomes much faster. Copyin is avoided at all times. Copyout
-is avoided when ``fast_ngwfs T`` is in use.
+is avoided when ``fast_locpot_int_fast_ngwfs T`` is in use.
 
 Performance
 -----------
 
-Two testcases were benchmark so far -- a ~2600-atom lysozyme protein with LNV,
+Two testcases were benchmarked so far -- a ~2600-atom lysozyme protein with LNV,
 and a 353-atom Pt cluster with EDFT. Only the time for the calculation of the
 local potential integrals was measured. Measurements were done on a 48-core
 node with and without an A100 GPU.
@@ -1816,7 +1812,9 @@ Fast NGWFs (for developers)
 This section describes the "fast ngwfs" approach introduced in ONETEP 7.3.26
 in December 2024. This is developer-oriented material -- for a user manual,
 see :ref:`user_fast_ngwfs`.
-This documentation pertains to ONETEP 7.3.26 and later.
+This documentation pertains to ONETEP 7.3.50 and later. If you are using a version of
+ONETEP older than 7.3.50, please update -- not everything you read here will be
+applicable otherwise.
 
 
 Rationale
@@ -1844,7 +1842,13 @@ is actually a periodic image and needs to be unwrapped back from the box to the
 image. Such PPDs are sometimes termed *improper*. The limited contiguity (a PPD
 is typically only 5-7 points long) and no GPU support are further drawbacks.
 
-With ``fast_ngwfs T`` we switch to a *rod* representation for NGWFs. A *rod* is
+With ``fast_density_fast_ngwfs T`` we switch to a *rod* representation for NGWFs
+in the calculation of fast density.
+
+With ``fast_locpot_int_fast_ngwfs T`` we switch to a *rod* representation for NGWFs
+in the calculation of fast local potential integrals.
+
+A *rod* is
 oriented along the *a1* direction and spans an integer number of PPDs.
 Its width along *a2* and *a3* is one point.
 
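The rod idea can be illustrated with a toy extraction routine: scan each (a2, a3) column and store every contiguous run of masked points along a1 as one rod. This is a hypothetical sketch, not the ``rod_rep_mod`` implementation (which also handles bunches, slots, and periodicity).

```python
import numpy as np

def to_rods(grid, mask):
    """Toy rod extraction: each rod is (i2, i3, start, values), a contiguous
    run along the first (a1) axis, one point wide along a2 and a3."""
    rods = []
    n1, n2, n3 = grid.shape
    for i2 in range(n2):
        for i3 in range(n3):
            i1 = 0
            while i1 < n1:
                if mask[i1, i2, i3]:
                    start = i1
                    while i1 < n1 and mask[i1, i2, i3]:
                        i1 += 1
                    rods.append((i2, i3, start, grid[start:i1, i2, i3].copy()))
                else:
                    i1 += 1
    return rods

# Tiny demo: one rod of length 2 along a1 at (i2, i3) = (1, 2).
grid = np.arange(27, dtype=float).reshape(3, 3, 3)
mask = np.zeros((3, 3, 3), dtype=bool)
mask[0:2, 1, 2] = True
rods = to_rods(grid, mask)
```

Each rod's values are contiguous in memory, which is what makes the representation friendly to vectorisation and GPU offload.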
@@ -1856,16 +1860,17 @@ overlap. Finally, rod operations have been GPU ported.
 Details
 -------
 
-For more details, see the banner in ``rod_rep_mod.F90``, where *rods*, *bunches*,
+For more details, see the banner in ``rod_rep_mod.F90``, where *rods*, *bunches*, *slots*,
 and handling of periodicity are described.
 
 State of the art
 ----------------
 
-Currently (January 2025, v7.3.27), fast NGWFs are only use in fast local potential
-integrals (``fast_locpot_int T``). There is potential to employ them in the fast
-density calculation, and time will tell if they can beat the *rowsum booster*
-approach. The rest of ONETEP certainly does not benefit from fast NGWFs, yet.
+Currently (February 2025, v7.3.50), fast NGWFs can be used in fast local potential
+integrals (``fast_locpot_int T``), and in the fast
+density calculation (``fast_density T``).
+
+The rest of ONETEP certainly does not benefit from fast NGWFs, yet.
 
 Performance
 -----------