Commit eae36d7
Description of fast_ngwfs. Updates to fast_locpot_int.
1 parent 605d523; 2 files changed: +126 -22 lines

developer_area.rst
Lines changed: 103 additions & 19 deletions
It multiplies the local potential with ``\phi_Bb`` in double FFT-boxes
in ``potential_apply_to_ngwf_batch()``.

Our goal is to improve on that, leveraging
``remote_mod`` for comms, and using :ref:`dev_trimmed_boxes` for operating on double-grid
quantities. The implementation, in ``integrals_fast_mod``, is remarkably lean,
and takes place almost exclusively in ``integrals_fast_locpot_dbl_grid()``.

We proceed as follows:

    by local NGWFs *Aa* (via ``trimmed_boxes_all_local_positions_in_cell()``).
    These are the points of interest, e.g. we will need the local potential only
    for these points. By establishing the union, we avoid communicating the same
    points many times, as part of multiple different NGWFs. Subsequently,
    if ``fast_ngwfs T``, the communicated NGWFs *Bb* are converted to the rod
    representation. If GPUs are in use, NGWFs *Bb* are copied to the device.
(3) We communicate the local potential, only for the points of interest, to
    whichever ranks need them. This happens in ``integrals_fast_extract_locpot()``.
(4) Using ``trimmed_boxes_mould_set_from_cell()``, the previously trimmed local
    NGWFs *Aa* are used to mould corresponding trimmed locpots from the cell,
    for each local NGWF *Aa*. This happens in an OMP loop over *Aa*. At this
(5) In an OMP loop over ``\phi_Aa``, we

    - Put the product ``\phi_Aa`` * ``locpot_Aa`` in a double FFT-box.
    - Fourier filter to a coarse FFT-box.
    - Dot with all S-overlapping ``\phi_Bb``, store in a SPAM3 matrix.
      With ``fast_ngwfs F``, this is done by dotting PPDs with an FFT-box
      in ``integrals_fast_bra_ketfftbox()``. With ``fast_ngwfs T``, we
      dot *rods* with an FFT-box in ``rod_rep_dot_with_box()``, using GPUs
      if available.

    Notably, ``\phi_Bb`` have been made available by ``remote_mod``, so no
    comms are needed. We can simply use ``basis_dot_function_with_box_fast()``
    when working with PPDs and ``rod_rep_dot_with_box()`` when working with rods.

(6) Symmetrise the SPAM3 matrix.

When using CPUs only, much of the time is spent in the Fourier filtering.
With a GPU, this becomes much faster. Copyin is avoided at all times. Copyout
is avoided when ``fast_ngwfs T`` is in use.
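The first two bullets of step (5) can be sketched in a few lines of numpy. This is an illustrative toy, not ONETEP code: the grids, the data, and the function name ``filter_to_coarse`` are invented, and the real code operates on trimmed FFT-boxes rather than full cubes. The sketch filters a double-grid product down to the coarse grid by discarding reciprocal-space components the coarse grid cannot represent.

```python
import numpy as np

# Toy sketch (not ONETEP code; grids, data and names invented) of step (5):
# form phi_Aa * locpot_Aa on the double grid, then Fourier filter the
# product down to the coarse grid.

def filter_to_coarse(f_dbl):
    """Fourier filter a (2n)^3 double-grid array down to an n^3 coarse grid."""
    n = f_dbl.shape[0] // 2
    F = np.fft.fftn(f_dbl)
    idx = np.r_[0:n // 2, 2 * n - n // 2:2 * n]   # keep frequencies -n/2 .. n/2-1
    Fc = F[np.ix_(idx, idx, idx)]
    return np.fft.ifftn(Fc).real / 8.0            # renormalise: (2n)^3 / n^3 = 8

n = 8
x = np.arange(2 * n) / (2 * n)                    # fractional coordinates along a1
phi_Aa = np.cos(2 * np.pi * x)[:, None, None] * np.ones((2 * n,) * 3)
locpot = np.ones((2 * n,) * 3)                    # a trivial local potential
coarse = filter_to_coarse(phi_Aa * locpot)        # band-limited, so exact here
```

Because the example product is band-limited, the filtered coarse-grid values coincide with the analytic function at the coarse grid points; for general data the filtering is the usual lossy truncation.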
Performance
-----------

Two testcases have been benchmarked so far -- a ~2600-atom lysozyme protein with LNV,
and a 353-atom Pt cluster with EDFT. Only the time for the calculation of the
local potential integrals was measured. Measurements were done on a 48-core
node with and without an A100 GPU.

For the LNV testcase I obtained a speed-up of 5.3x on a CPU, and a *further*
2.7x speed-up once the GPU was used, for a total speed-up of 14.3x.
For the EDFT testcase I obtained a speed-up of 3.6x on a CPU, and a *further*
3.2x speed-up once the GPU was used, for a total speed-up of 11.5x.

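As a quick sanity check, the quoted total speed-ups are simply the products of the CPU factor and the further GPU factor:

```python
# Totals as products of the two factors (numbers from the text above).
lnv_total = 5.3 * 2.7     # LNV testcase
edft_total = 3.6 * 3.2    # EDFT testcase
print(f"{lnv_total:.1f}x {edft_total:.1f}x")
```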
------

.. _dev_fast_ngwfs:

Fast NGWFs (for developers)
===========================

:Author: Jacek Dziedzic, University of Southampton

This section describes the "fast ngwfs" approach introduced in ONETEP 7.3.26
in December 2024. This is developer-oriented material -- for a user manual,
see :ref:`user_fast_ngwfs`.
This documentation pertains to ONETEP 7.3.26 and later.

Rationale
---------

The usual ("slow") method for working with NGWFs on the coarse grid uses *PPDs*
-- parallelepipeds with axes parallel to those of the simulation cell,
spanning an integer number of grid points.
In practice we use flat PPDs, as the default number of points along *a3* is 1,
unless HFx is in use (where it's more beneficial to use larger PPDs).
Any NGWF sphere can be covered fully
with a number of PPDs. The coarse grid data is then stored as points in PPDs.
Operations on PPDs are straightforward and fast -- there is data
contiguity because the data in a PPD is stored as a linear 1D array.

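The covering idea can be sketched as follows. This is a toy, not ONETEP code: all sizes are invented, and the real PPD machinery is considerably more involved. We tile the grid with flat 5 x 5 x 1 PPDs, keep every PPD that touches a spherical NGWF region, and store each kept PPD's points as one contiguous 1D array.

```python
import numpy as np

# Toy sketch (not ONETEP code; all sizes invented): cover an NGWF sphere
# with flat PPDs and store each PPD's data contiguously.

ngrid = 40                                   # grid points per cell side
ppd_shape = (5, 5, 1)                        # a flat PPD: one point along a3
centre, radius = np.array([20.0, 20.0, 20.0]), 12.0

offsets = np.indices(ppd_shape).reshape(3, -1).T   # the 25 points of one PPD
ppds = {}                                    # PPD corner -> contiguous 1D data
for i in range(0, ngrid, ppd_shape[0]):
    for j in range(0, ngrid, ppd_shape[1]):
        for k in range(0, ngrid, ppd_shape[2]):
            pts = np.array([i, j, k]) + offsets
            if np.any(np.linalg.norm(pts - centre, axis=1) <= radius):
                ppds[(i, j, k)] = np.zeros(len(offsets))   # linear storage
```

Every grid point inside the sphere then belongs to exactly one stored PPD, and the per-PPD data is a contiguous 25-element array.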
Operations that mix PPDs and FFT-boxes are less straightforward and not as fast
-- there is data contiguity for the entire length of the PPD along *a1*, but
not *a2* or *a3*. Furthermore, every time a PPD intersects with an FFT-box,
we need to establish which parts of the PPD overlap
with the FFT-box, and which ones stick out and need to be ignored. This is
further complicated by ``ppd_loc`` -- a feature for remembering whether a PPD
is actually a periodic image and needs to be unwrapped back from the box to the
image. Such PPDs are sometimes termed *improper*. The limited contiguity (a PPD
is typically only 5-7 points long) and no GPU support are further drawbacks.

With ``fast_ngwfs T`` we switch to a *rod* representation for NGWFs. A *rod* is
oriented along the *a1* direction and spans an integer number of PPDs.
Its width along *a2* and *a3* is one point.

Operations mixing rods and FFT-boxes are much faster, because they leverage
contiguity -- a rod is typically ~40 points long. There are also fewer operations
to determine which parts of a rod stick out of the FFT-box and which parts
overlap. Finally, rod operations have been GPU ported.

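The rod-with-box dot can be sketched as below. This is an illustrative toy, not ONETEP's ``rod_rep_dot_with_box()``: the function name ``dot_rod_with_box`` and all sizes are invented. The point is that after clipping away the parts of the rod that stick out of the box, the dot reduces to a single vectorised operation on contiguous data.

```python
import numpy as np

# Toy sketch (not ONETEP code; names and sizes invented) of dotting one rod
# (a contiguous 1D run of points along a1, one point wide in a2/a3) with the
# part of a cubic FFT-box it overlaps.

def dot_rod_with_box(rod_data, rod_start, box, box_origin):
    """Dot one rod with the part of a cubic FFT-box it overlaps."""
    nbox = box.shape[0]
    a1_lo, a1_hi = rod_start[0], rod_start[0] + rod_data.size
    lo = max(a1_lo, box_origin[0])            # clip: points sticking out
    hi = min(a1_hi, box_origin[0] + nbox)     # of the box are ignored
    j = rod_start[1] - box_origin[1]          # one point wide in a2 and a3
    k = rod_start[2] - box_origin[2]
    if hi <= lo or not (0 <= j < nbox and 0 <= k < nbox):
        return 0.0
    return float(np.dot(rod_data[lo - a1_lo:hi - a1_lo],
                        box[lo - box_origin[0]:hi - box_origin[0], j, k]))

# A rod of 6 points starting at grid point (8, 1, 2), against a 4^3 box
# of ones with origin (10, 0, 0): the overlap contributes 2+3+4+5 = 14.0.
overlap = dot_rod_with_box(np.arange(6.0), (8, 1, 2), np.ones((4, 4, 4)), (10, 0, 0))
```

With PPDs the same work is many short segments with per-PPD bookkeeping; with rods it is one slice per rod, which is also what makes the operation GPU-friendly.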
Details
-------

For more details, see the banner in ``rod_rep_mod.F90``, where *rods*, *bunches*,
and handling of periodicity are described.

State of the art
----------------

Currently (January 2025, v7.3.27), fast NGWFs are only used in fast local potential
integrals (``fast_locpot_int T``). There is potential to employ them in the fast
density calculation, and time will tell if they can beat the *rowsum booster*
approach. The rest of ONETEP certainly does not benefit from fast NGWFs, yet.

Performance
-----------

No detailed performance analysis is available, but here are some tentative numbers.

Dotting two 9a0 NGWFs takes about:

- 13 us when using PPDs for the bra and an FFT-box for the ket,
- 4.2 us when using PPDs for both,
- 2.0 us when using rods for the bra and an FFT-box for the ket,
- 0.6 us on a GPU when using rods for the bra and an FFT-box for the ket.

In practice you will likely see very limited gains from fast NGWFs on a CPU;
the feature is mostly meant to speed up GPU calculations.

------

performance.rst
Lines changed: 23 additions & 3 deletions
The fast approach for local potential integrals uses similar techniques as :ref:`user_fast_density`,
that is *trimming* of data in double-grid FFT-boxes, which is a well-controllable approximation,
but an approximation nevertheless. It would be prudent to read the section on :ref:`user_fast_density`,
and the part about controlling accuracy in particular. The same mechanism is
used here (``trimmed_boxes_threshold``).

**The fast approach works best for "serious" systems, it's not meant to address
scenarios with KE cutoffs below 700-800 eV or NGWFs smaller than 8.0 a0. It will
The fast locpot int approach works best when ``fast_density T`` is in use (regardless of
``fast_density_method``), as they share some of the workload and memory requirement.
You can expect good synergy when using both approaches at the same time.

There are *no* additional settings for fast local potential integrals at this point
(apart from ``trimmed_boxes_threshold``); simply turning
it on is sufficient. For pointers about settings, see the suggested settings
in :ref:`user_fast_density`, just add ``fast_locpot_int T`` to any of them.

A GPU port of fast local potential integrals is in place (starting from ONETEP 7.1.50).
It is activated automatically if you run a GPU-capable binary.

.. _user_fast_ngwfs:

Fast NGWFs (for users)
======================

This is a user-level explanation -- for developer-oriented material,
see :ref:`dev_fast_ngwfs`.

This is an experimental feature at this point (January 2025).
The PPD representation of NGWFs in ONETEP can be replaced by a faster representation
known as the *rod* representation. This can be done with ``fast_ngwfs T``.

Currently this is only used when ``fast_locpot_int T`` is in effect,
and you will see zero effect otherwise. Even with ``fast_locpot_int T``, you are
unlikely to see much benefit at this point, unless you are running on a GPU. On
a GPU you can expect modest improvements in performance.

The default is ``fast_ngwfs F``.
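For orientation, a hypothetical input fragment enabling the features discussed here
might look as follows. The keywords ``fast_density``, ``fast_locpot_int`` and
``fast_ngwfs`` are the ones described in this manual; ``task`` is a standard ONETEP
keyword shown only for context, and the values are purely illustrative::

  task               SINGLEPOINT

  fast_density       T
  fast_locpot_int    T
  fast_ngwfs         T

Accuracy of the trimming approximation is controlled via ``trimmed_boxes_threshold``,
as described above.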
