Commit d2f2b53

Updates to fast_density, fast_locpot_int, fast_ngwfs.
1 parent b4b1942 commit d2f2b53

File tree

2 files changed: +134 additions, -90 deletions

developer_area.rst

Lines changed: 72 additions & 67 deletions
@@ -1431,14 +1431,13 @@ Fast density calculation (for developers)
 
 :Author: Jacek Dziedzic, University of Southampton
 
-This section describes the "fast density" approach introduced in ONETEP 7.1.8 in January 2024,
-and extended in ONETEP 7.1.50 in July 2024.
+This section describes the "fast density" approach introduced in ONETEP 7.1.8 in January 2024.
 This is developer-oriented material -- for a user manual, see :ref:`user_fast_density`.
-This documentation pertains to ONETEP 7.1.50 and later.
+This documentation pertains to ONETEP 7.3.50 and later. If you are using a version of
+ONETEP older than 7.3.50, please update -- not everything you read here will be applicable otherwise.
 
-There are three slightly different methods for "fast density", selected via
-``fast_density_method 1``, ``fast_density_method 2``, and ``fast_density_method 3``.
-The first one is the default.
+There are two slightly different methods for "fast density", selected via
+``fast_density_method 1`` and ``fast_density_method 2``, with the latter now the default.
 
 We focus on the calculation on the double grid. If ``fine_grid_scale`` is different from 2.0,
 the density gets interpolated from the double to the fine grid, regardless of the approach
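As a toy illustration of the grid transfer mentioned in this hunk (plain NumPy in 1D, not ONETEP code; the function name and setting are invented for the sketch), interpolation onto a finer grid can be done by zero-padding in reciprocal space:

```python
import numpy as np

def fourier_interpolate(f, n_fine):
    # Zero-pad the spectrum of periodic samples f to land on a finer grid.
    n = len(f)
    spec = np.fft.fft(f)
    padded = np.zeros(n_fine, dtype=complex)
    half = n // 2
    padded[:half] = spec[:half]
    padded[-half:] = spec[-half:]
    # rescale so point values (not sums) are preserved
    return np.fft.ifft(padded).real * (n_fine / n)

coarse = np.cos(2 * np.pi * np.arange(8) / 8)  # one period sampled on 8 points
fine = fourier_interpolate(coarse, 16)         # same cosine sampled on 16 points
```

For band-limited data the fine grid reproduces the coarse samples exactly at the shared points; the real 3D interpolation in ONETEP works on FFT-boxes, but the principle is the same.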
@@ -1500,7 +1499,7 @@ We address (2) and (3) by first interpolating only ``\phi_Aa``, and then
 communicating them to where they are needed (and where they become ``\phi_Bb``).
 We use ``remote_mod`` for that, which separates the comms from the FFTs.
 In ``fast_density_method 1`` we communicate ``\phi_Bb`` in the form of trimmed boxes.
-In ``fast_density_method 2`` and ``3`` we communicate ``\phi_Bb`` in PPDs on the coarse grid,
+In ``fast_density_method 2`` we communicate ``\phi_Bb`` in PPDs on the coarse grid,
 and process them at destination. Of course, when communicating trimmed boxes, we only communicate the relevant points,
 not the entire double FFT-boxes.
 
@@ -1515,14 +1514,7 @@ we need memory to store the trimmed NGWFs, we have to communicate trimmed NGWFs
 and we need to do ``rowsum_Aa = \sum_Bb K^Aa,Bb \phi_Bb`` on the new representation somehow. If the latter can be done efficiently,
 we are addressing (4) above, too.
 
-In ``fast_density_method 2`` we do the FFTs in the inner loop, but only for the rowsums,
-as trimmed ``\phi_Aa`` are stored for the entire duration of the inner loop.
-We use *bursts* (described earlier) to calculate products between ``\phi_Aa`` and
-``rowsum_Aa``, which is not very efficient. This method is more FFT-heavy, but
-does less comms, as we communicate NGWFs on the coarse grid. Memory footprint
-is still rather high, because the bursts are memory-hungry.
-
-In ``fast_density_method 3`` we similarly do the FFTs in the inner loop, but only for the rowsums,
+In ``fast_density_method 2`` we similarly do the FFTs in the inner loop, but only for the rowsums,
 as trimmed ``\phi_Aa`` are stored for the entire duration of the inner loop.
 However, we are smart and realize that we do not need the entire ``rowsum_Aa``,
 but only its part that overlaps with ``\phi_Aa`` -- as these are multiplied
@@ -1531,62 +1523,67 @@ we only keep a ``\phi_Aa``-shaped fragment. This process is called *moulding* (d
 -- we take data from a double FFT-box and mould it to a shape of a previously
 trimmed NGWF. In so doing, we avoid bursts altogether -- multiplying two
 trimmed quantities with the same shape ("mask") is a simple pointwise job.
-This method has a vastly smaller memory footprint, is as FFT-heavy as ``fast_density_method 2``,
-and is light on comms, because it only transmits NGWFs in PPDs on the coarse grid.
+This method has a vastly smaller memory footprint, and is light on comms, because it only
+transmits NGWFs in PPDs on the coarse grid.
 Finally, this method GPU-ports well.
 
 The fast density approach thus proceeds in two stages -- one that is performed every time NGWFs change, and one that is performed
 in the inner loop. The details of the stages depend on ``fast_density_method``.
 
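To make the rowsum-then-pointwise-product idea in this hunk concrete, here is a minimal 1D sketch (NumPy; the masks, sizes and kernel values are invented, and the real code works on 3D trimmed boxes rather than flat arrays):

```python
import numpy as np

rng = np.random.default_rng(0)

n_grid = 64                                   # toy 1D stand-in for the double grid
masks = {"Aa": np.arange(10, 30),             # grid points kept after trimming
         "Bb": np.arange(20, 40)}
phi = {k: rng.standard_normal(v.size) for k, v in masks.items()}
K = {("Aa", "Aa"): 0.7, ("Aa", "Bb"): 0.3}    # toy density-kernel elements K^Aa,Bb

# rowsum_Aa = sum over Bb of K^Aa,Bb * phi_Bb, accumulated on the full grid
rowsum_box = np.zeros(n_grid)
for (_, b), k_val in K.items():
    rowsum_box[masks[b]] += k_val * phi[b]

# "moulding": keep only the phi_Aa-shaped fragment of the rowsum...
rowsum_moulded = rowsum_box[masks["Aa"]]

# ...so the density contribution becomes a simple pointwise product
density = np.zeros(n_grid)
density[masks["Aa"]] += phi["Aa"] * rowsum_moulded
```

Because ``phi["Aa"]`` and ``rowsum_moulded`` share the same mask, the final step needs no bursts and no index bookkeeping, which is the point of moulding.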
-The following table summarizes the main differences between the three methods.
+The following table summarizes the main differences between the two methods.
 
-+--------------------------+--------------------+-------------------+-------------------------+
-| Detail                   | method 1           | method 2          | method 3                |
-+==========================+====================+===================+=========================+
-| FFTs done for            | ``\phi_Aa``        | ``\phi_Aa`` in outer loop                   |
-|                          |                    |                                             |
-|                          | in outer loop only | ``rowsum_Aa`` in inner loop                 |
-+--------------------------+--------------------+-------------------+-------------------------+
-| FFT load                 | minimal            | ~half of original | ~half of original       |
-|                          |                    |                   |                         |
-|                          |                    |                   | can be done on GPU      |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Communicated NGWFs       | all required ``\phi_Bb``                                         |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Communicated how         | as trimmed boxes   | in PPDs on coarse                           |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Comms load               | significant        | minimal                                     |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Trimmed storage          | ``\phi_Aa``        | ``\phi_Aa``       | ``\phi_Aa``             |
-|                          |                    |                   |                         |
-|                          | ``\phi_Bb``        | ``rowsum_Aa``     | ``rowsum_Aa`` (moulded) |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Bursts                   | yes, many          | yes, few          | no                      |
-|                          |                    |                   |                         |
-|                          | (pairs *Aa-Bb*)    | (pairs *Aa-Aa*)   |                         |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Memory load              | high               | moderate          | low                     |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Expected CPU performance | very good          | poor              | good                    |
-+--------------------------+--------------------+-------------------+-------------------------+
-| Expected GPU performance | very good          | poor              | excellent               |
-+--------------------------+--------------------+-------------------+-------------------------+
-
-Typical speed-ups obtained using fast density range from 2x to 6x for the total time spent
++--------------------------+--------------------+-----------------------------+
+| Detail                   | method 1           | method 2                    |
++==========================+====================+=============================+
+| FFTs done for            | ``\phi_Aa``        | ``\phi_Aa`` in outer loop   |
+|                          |                    |                             |
+|                          | in outer loop only | ``rowsum_Aa`` in inner loop |
++--------------------------+--------------------+-----------------------------+
+| FFT load                 | minimal            | ~half of original           |
+|                          |                    |                             |
+|                          |                    | can be done on GPU          |
++--------------------------+--------------------+-----------------------------+
+| Communicated NGWFs       | all required ``\phi_Bb``                          |
++--------------------------+--------------------+-----------------------------+
+| Communicated how         | as trimmed boxes   | in PPDs on coarse           |
++--------------------------+--------------------+-----------------------------+
+| Comms load               | significant        | minimal                     |
++--------------------------+--------------------+-----------------------------+
+| Trimmed storage          | ``\phi_Aa``        | ``\phi_Aa``                 |
+|                          |                    |                             |
+|                          | ``\phi_Bb``        | ``rowsum_Aa`` (moulded)     |
++--------------------------+--------------------+-----------------------------+
+| Bursts                   | yes, many          | no                          |
+|                          |                    |                             |
+|                          | (pairs *Aa-Bb*)    |                             |
++--------------------------+--------------------+-----------------------------+
+| Memory load              | high               | low                         |
++--------------------------+--------------------+-----------------------------+
+| Expected CPU performance | very good          | good                        |
++--------------------------+--------------------+-----------------------------+
+| Expected GPU performance | very good          | excellent                   |
++--------------------------+--------------------+-----------------------------+
+
+On a CPU, typical speed-ups obtained using fast density range from 2x to 6x for the total time spent
 calculating the density, and between 10% and 50% can be shaved off the total calculation walltime.
 
+On a GPU, typical speed-ups would be between 10x and 16x for the total time spent
+calculating the density, and between 50% and 70% shaved off the total calculation walltime.
+This is measured in a *fair comparison* -- a CPU-only compute node vs. the same compute node
+with a reasonable GPU (e.g. an NVIDIA A100).
+
 Cost
 ----
 
 The main drawback of fast density is increased memory consumption. There are two main components:
 (A) The trimmed NGWF data itself, which is, to a large extent, replicated.
 In ``fast_density_method 1`` a single trimmed
 NGWF can be needed on many processes, because it could be a ``\phi_Bb`` to many NGWFs Aa.
-The same holds for NGWFs in PPDs for ``fast_density_method 2`` and ``3``.
+The same holds for NGWFs in PPDs for ``fast_density_method 2``.
 Moreover, this memory requirement does not scale inverse-linearly with the number of processes.
 That is, increasing the node count by a factor of two doesn't reduce the memory requirement
 by a factor of two, because there is more replication.
-(B) The burst data (in ``fast_density_method 1``, and, to a smaller extent, in ``fast_density_method 2``).
+(B) The burst data (in ``fast_density_method 1``).
 
 Both (A) and (B) depend on the trimming threshold, and the shape of the NGWFs. Both tend to increase
 during the NGWF optimisation as the NGWFs delocalise somewhat.
@@ -1680,7 +1677,7 @@ with respect to one node. It is clear that the fast approach is quite a bit fast
 approach, although it does not scale that well to high core counts. Keep in mind that we pushed this
 system quite far by running 701 atoms on over 1600 CPU cores.
 
-More detailed benchmarks of ``fast_density_method 2`` and ``fast_density_method 3`` will follow soon.
+More detailed benchmarks of ``fast_density_method 2`` will follow at some point.
 
 Keywords
 --------
@@ -1700,11 +1697,9 @@ This approach could be improved in a number of ways:
 for more control over determinism vs efficiency. Currently we use ``SCHEDULE(STATIC)`` to get
 more deterministic results, but ``SCHEDULE(DYNAMIC)`` offers better efficiency. Toggling this
 at runtime is not trivial (``omp_set_schedule()``).
-4. Having a dynamic ``trimmed_boxes_threshold`` -- we could probably start the NGWF optimisation
-with a cruder approximation, tightening it as we go along.
-5. Dynamically selecting ``MAX_TNGWF_SIZE``. It's currently a constant, and ``persistent_packed_tngwf``
+4. Dynamically selecting ``MAX_TNGWF_SIZE``. It's currently a constant, and ``persistent_packed_tngwf``
 is not an allocatable.
-6. A smarter way to flatten the computed density. Currently each process has their own density that
+5. A smarter way to flatten the computed density. Currently each process has their own density that
 spans the entire cell and only contains contributions from the *Aa* NGWF it owns. We flatten it
 by a series of reduce operations over all nodes. This is the main killer of parallel performance.
 
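The flattening step described in point 5 of this hunk can be pictured with a serial stand-in for the reduction (NumPy; the process count and cell size are made up): every process contributes a full-cell array, so the reduce moves all cell points per process even though each array is mostly zeros.

```python
import numpy as np

n_cell, n_proc = 1000, 4
rng = np.random.default_rng(1)

# each "process": a full-cell density holding only its own Aa contributions,
# so most entries are zero
partials = [rng.standard_normal(n_cell) * (rng.random(n_cell) < 0.1)
            for _ in range(n_proc)]

# the flatten: an elementwise sum over all processes (a reduce in the real code);
# O(n_proc * n_cell) data moves, zeros included, hence the scaling bottleneck
density = np.sum(partials, axis=0)
```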
@@ -1720,7 +1715,8 @@ Fast local potential integrals (for developers)
 
 This section describes the "fast locpot int" approach introduced in ONETEP 7.1.50 in July 2024.
 This is developer-oriented material -- for a user manual, see :ref:`user_fast_locpot_int`.
-This documentation pertains to ONETEP 7.1.50 and later.
+This documentation pertains to ONETEP 7.3.50 and later. If you are using a version of
+ONETEP older than 7.3.50, please update -- not everything you read here will be applicable otherwise.
 
 We focus on the calculation on the double grid. If ``fine_grid_scale`` is different from 2.0,
 the local potential first gets filtered from the fine to the double grid, regardless of the approach
@@ -1789,12 +1785,12 @@ We proceed as follows:
 
 When using CPUs only, much of the time is spent in the Fourier filtering.
 With a GPU, this becomes much faster. Copyin is avoided at all times. Copyout
-is avoided when ``fast_ngwfs T`` is in use.
+is avoided when ``fast_locpot_int_fast_ngwfs T`` is in use.
 
 Performance
 -----------
 
-Two testcases were benchmark so far -- a ~2600-atom lysozyme protein with LNV,
+Two testcases were benchmarked so far -- a ~2600-atom lysozyme protein with LNV,
 and a 353-atom Pt cluster with EDFT. Only the time for the calculation of the
 local potential integrals was measured. Measurements were done on a 48-core
 node with and without an A100 GPU.
@@ -1816,7 +1812,9 @@ Fast NGWFs (for developers)
 This section describes the "fast ngwfs" approach introduced in ONETEP 7.3.26
 in December 2024. This is developer-oriented material -- for a user manual,
 see :ref:`user_fast_ngwfs`.
-This documentation pertains to ONETEP 7.3.26 and later.
+This documentation pertains to ONETEP 7.3.50 and later. If you are using a version of
+ONETEP older than 7.3.50, please update -- not everything you read here will be
+applicable otherwise.
 
 
 Rationale
@@ -1844,7 +1842,13 @@ is actually a periodic image and needs to be unwrapped back from the box to the
 image. Such PPDs are sometimes termed *improper*. The limited contiguity (a PPD
 is typically only 5-7 points long) and no GPU support are further drawbacks.
 
-With ``fast_ngwfs T`` we switch to a *rod* representation for NGWFs. A *rod* is
+With ``fast_density_fast_ngwfs T`` we switch to a *rod* representation for NGWFs
+in the calculation of fast density.
+
+With ``fast_locpot_int_fast_ngwfs T`` we switch to a *rod* representation for NGWFs
+in the calculation of fast local potential integrals.
+
+A *rod* is
 oriented along the *a1* direction and spans an integer number of PPDs.
 Its width along *a2* and *a3* is one point.
 
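A miniature sketch of what a rod layout could look like (invented data structures, not the actual ``rod_rep_mod`` ones): each rod stores a start point plus one contiguous run of values along *a1*, so a trimmed NGWF becomes a few long runs instead of many short PPDs.

```python
import numpy as np

def to_rods(mask, values):
    # Compress a masked 3D array into rods: contiguous runs along the first (a1) axis.
    rods = []
    n1, n2, n3 = mask.shape
    for j in range(n2):
        for k in range(n3):
            i = 0
            while i < n1:
                if mask[i, j, k]:
                    start = i
                    while i < n1 and mask[i, j, k]:
                        i += 1
                    rods.append(((start, j, k), values[start:i, j, k].copy()))
                else:
                    i += 1
    return rods

def from_rods(rods, shape):
    # Expand rods back onto a full grid (zero outside the rods).
    out = np.zeros(shape)
    for (i, j, k), run in rods:
        out[i:i + run.size, j, k] = run
    return out

rng = np.random.default_rng(2)
vals = rng.standard_normal((12, 4, 4))
mask = np.zeros((12, 4, 4), dtype=bool)
mask[2:9, 1, 2] = True   # one rod, 7 points long
mask[0:3, 3, 0] = True   # another rod, 3 points long
rods = to_rods(mask, vals)
```

The long contiguous runs are what make pointwise operations cache-friendly and easy to offload, compared with many scattered 5-7 point PPDs.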
@@ -1856,16 +1860,17 @@ overlap. Finally, rod operations have been GPU ported.
 Details
 -------
 
-For more details, see the banner in ``rod_rep_mod.F90``, where *rods*, *bunches*,
+For more details, see the banner in ``rod_rep_mod.F90``, where *rods*, *bunches*, *slots*,
 and handling of periodicity are described.
 
 State of the art
 ----------------
 
-Currently (January 2025, v7.3.27), fast NGWFs are only use in fast local potential
-integrals (``fast_locpot_int T``). There is potential to employ them in the fast
-density calculation, and time will tell if they can beat the *rowsum booster*
-approach. The rest of ONETEP certainly does not benefit from fast NGWFs, yet.
+Currently (February 2025, v7.3.50), fast NGWFs can be used in fast local potential
+integrals (``fast_locpot_int T``), and in the fast
+density calculation (``fast_density T``).
+
+The rest of ONETEP certainly does not benefit from fast NGWFs, yet.
 
 Performance
 -----------
