@@ -1727,7 +1727,7 @@ It multiplies the local potential with ``\phi_Bb`` in double FFT-boxes
17271727in ``potential_apply_to_ngwf_batch()``.
17281728
17291729Our goal is to improve on that, leveraging
1730- `remote_mod` for comms, and using :ref:`dev_trimmed_boxes` for operating on double-grid
1730+ ``remote_mod`` for comms, and using :ref:`dev_trimmed_boxes` for operating on double-grid
17311731quantities. The implementation, in ``integrals_fast_mod`` is remarkably lean,
17321732and takes place almost exclusively in ``integrals_fast_locpot_dbl_grid()``.
17331733
@@ -1743,12 +1743,11 @@ We proceed as follows:
17431743 by local NGWFs *Aa* (via ``trimmed_boxes_all_local_positions_in_cell()``).
17441744 These are the points of interest, i.e. we will need the local potential only
17451745 for these points. By establishing the union, we avoid communicating the same
1746- points many times, as part of multiple different NGWFs.
1747- (3) Rather naively, we communicate the entire local potential on the double grid
1748- (``potential_dbl``) to everyone. This is the replicated double-grid representation.
1749- This will soon be superseded by an approach, where only the requisite points
1750- (just determined) are communicated. This will be much faster and use less
1751- memory.
1746+ points many times, as part of multiple different NGWFs. Subsequently,
1747+ if ``fast_ngwfs T``, the communicated NGWFs *Bb* are converted to the rod
1748+ representation. If GPUs are in use, NGWFs *Bb* are copied to the device.
1749+ (3) We communicate the local potential, only for the points of interest, to
1750+ whichever ranks need them. This happens in ``integrals_fast_extract_locpot()``.
17521751(4) Using ``trimmed_boxes_mould_set_from_cell()``, the previously trimmed local
17531752 NGWFs *Aa* are used to mould corresponding trimmed locpots from the cell,
17541753 for each local NGWF *Aa*. This happens in an OMP loop over *Aa*. At this
@@ -1758,27 +1757,112 @@ We proceed as follows:
17581757(5) In an OMP loop over ``\phi_Aa``, we
17591758 - Put the product ``\phi_Aa`` * ``locpot_Aa`` in double FFT-box.
17601759 - Fourier filter to a coarse FFT-box.
1761- - Dot with all S-overlapping ``\phi_Bb`` in PPDs, store in a SPAM3 matrix.
1762- This is done by ``integrals_fast_brappd_ketfftbox()``.
1760+ - Dot with all S-overlapping ``\phi_Bb``, store in a SPAM3 matrix.
1761+ With ``fast_ngwfs F``, this is done by dotting PPDs with an FFT-box
1762+ in ``integrals_fast_bra_ketfftbox()``. With ``fast_ngwfs T``, we
1763+ dot `rods` with an FFT-box in ``rod_rep_dot_with_box()``, using GPUs
1764+ if available.
17631765
17641766 Notably, ``\phi_Bb`` have been made available by ``remote_mod``, so no
1765- comms are needed. We can simply use ``basis_dot_function_with_box()``.
1767+ comms are needed. We can simply use ``basis_dot_function_with_box_fast()``
1768+ when working with PPDs and ``rod_rep_dot_with_box()`` when working with rods.
1769+
17661770(6) Symmetrise the SPAM3 matrix.
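
The benefit of forming the union of points in step (2) can be illustrated with a minimal Python sketch. The NGWF labels, the integer point indices and the helper ``union_of_points()`` are hypothetical and are not part of ONETEP; the real code operates on trimmed boxes on the double grid.

```python
def union_of_points(ngwf_points):
    """Return the sorted union of grid-point indices over all local NGWFs."""
    union = set()
    for points in ngwf_points.values():
        union.update(points)
    return sorted(union)


# Hypothetical example: three overlapping local NGWFs, each described by
# the set of (flattened) double-grid point indices it touches.
local_ngwfs = {
    "Aa1": {0, 1, 2, 3},
    "Aa2": {2, 3, 4, 5},
    "Aa3": {3, 4, 5, 6},
}

points_of_interest = union_of_points(local_ngwfs)

# Requesting the potential separately for each NGWF would transfer every
# shared point more than once.
naive_count = sum(len(points) for points in local_ngwfs.values())
union_count = len(points_of_interest)
```

In this toy case the union holds fewer points than the per-NGWF total, and the gap grows with the overlap between neighbouring NGWFs.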
17671771
17681772
1769- Most of the time is spent in the Fourier filtering. This, however, uses the GPU
1770- if available. Currently, this is done in the simplest possible fashion, with
1771- copyin from the host to the device, and copyout from the device to the host,
1772- so it is not very efficient. Most of the cost is the copyin, as the data on
1773- the double grid is 8 times as large. This will soon be avoided, it's just a
1774- matter of putting the product in a double FFT-box directly on the device.
1773+ When using CPUs only, much of the time is spent in the Fourier filtering.
1774+ With a GPU, this becomes much faster. Copyin is avoided at all times. Copyout
1775+ is avoided when ``fast_ngwfs T`` is in use.
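
To illustrate what Fourier filtering from the double grid to the coarse grid amounts to, here is a 1D Python sketch using a naive DFT. It is a toy under stated assumptions: a periodic 1D grid, an odd number of coarse points, and hypothetical function names. ONETEP itself works with 3D FFT-boxes and optimised FFT libraries.

```python
import cmath
import math


def dft(values):
    """Naive O(M^2) forward discrete Fourier transform."""
    m = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * math.pi * j * k / m) for j in range(m))
        for k in range(m)
    ]


def fourier_filter_to_coarse(fine_values):
    """Keep only the frequencies representable on a grid half as dense.

    Assumes a periodic grid with m = 2 * n fine points and n odd, so that
    the retained frequencies are -(n-1)/2 ... (n-1)/2 (no Nyquist term).
    """
    m = len(fine_values)  # number of double-grid points
    n = m // 2            # number of coarse-grid points
    if m % 2 != 0 or n % 2 == 0:
        raise ValueError("this sketch assumes m = 2 * n with n odd")
    spectrum = dft(fine_values)
    half = (n - 1) // 2
    coarse = []
    for p in range(n):
        total = 0j
        for k in range(-half, half + 1):
            # Negative frequencies live at index k + m in the fine spectrum.
            total += spectrum[k % m] * cmath.exp(2j * math.pi * k * p / n)
        coarse.append((total / m).real)
    return coarse


n_coarse = 7
m_fine = 2 * n_coarse

# Band-limited on the coarse grid: frequencies 0, 1 and 2 only.
smooth = [
    1.0 + math.cos(2 * math.pi * j / m_fine) + 0.5 * math.sin(4 * math.pi * j / m_fine)
    for j in range(m_fine)
]
# Frequency 5 exceeds what 7 coarse points can represent.
rough = [math.cos(2 * math.pi * 5 * j / m_fine) for j in range(m_fine)]

filtered_smooth = fourier_filter_to_coarse(smooth)
filtered_rough = fourier_filter_to_coarse(rough)
```

In this sketch the band-limited function is reproduced at every second fine-grid point, while the high-frequency component is removed rather than aliased onto the coarse grid.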
1776+
1777+ Performance
1778+ -----------
1779+
1780+ Two testcases have been benchmarked so far -- a ~2600-atom lysozyme protein with LNV,
1781+ and a 353-atom Pt cluster with EDFT. Only the time for the calculation of the
1782+ local potential integrals was measured. Measurements were done on a 48-core
1783+ node with and without an A100 GPU.
1784+
1785+ For the LNV testcase I obtained a speed-up of 5.3x on a CPU, and a *further*
1786+ 2.7x speed-up once the GPU was used, for a total speed-up of 14.3x.
1787+ For the EDFT testcase I obtained a speed-up of 3.6x on a CPU, and a *further*
1788+ 3.2x speed-up once the GPU was used, for a total speed-up of 11.5x.
1789+
1790+ ------
1791+
1792+ .. _dev_fast_ngwfs:
1793+
1794+ Fast NGWFs (for developers)
1795+ ===========================
1796+
1797+ :Author: Jacek Dziedzic, University of Southampton
1798+
1799+ This section describes the "fast ngwfs" approach introduced in ONETEP 7.3.26
1800+ in December 2024. This is developer-oriented material -- for a user manual,
1801+ see :ref:`user_fast_ngwfs`.
1802+ This documentation pertains to ONETEP 7.3.26 and later.
1803+
1804+
1805+ Rationale
1806+ ---------
1807+
1808+ The usual ("slow") method for working with NGWFs on the coarse grid uses *PPDs*
1809+ -- parallelepipeds with axes parallel to those of the simulation cell,
1810+ spanning an integer number of grid points.
1811+ In practice we use flat PPDs, as the default number of points along *a3* is 1,
1812+ unless HFx is in use (where it's more beneficial to use larger PPDs).
1813+ Any NGWF sphere can be covered fully
1814+ with a number of PPDs. The coarse grid data is then stored as points in PPDs.
1815+ Operations on PPDs are straightforward and fast -- there is data
1816+ contiguity because the data in a PPD is stored as a linear 1D array.
1817+
1818+
1819+
1820+ Operations that mix PPDs and FFT-boxes are less straightforward and not as fast
1821+ -- there is data contiguity for the entire length of the PPD along *a1*, but
1822+ not *a2* or *a3*. Furthermore, every time a PPD intersects with an FFT-box,
1823+ we need to establish which parts of the PPD overlap
1824+ with the FFT-box, and which ones stick out and need to be ignored. This is
1825+ further complicated by ``ppd_loc`` -- a feature for remembering if a PPD
1826+ is actually a periodic image and needs to be unwrapped back from the box to the
1827+ image. Such PPDs are sometimes termed *improper*. The limited contiguity (a PPD
1828+ is typically only 5-7 points long) and no GPU support are further drawbacks.
1829+
1830+ With ``fast_ngwfs T`` we switch to a *rod* representation for NGWFs. A *rod* is
1831+ oriented along the *a1* direction and spans an integer number of PPDs.
1832+ Its width along *a2* and *a3* is one point.
1833+
1834+ Operations mixing rods and FFT-boxes are much faster, because they leverage
1835+ contiguity -- a rod is typically ~40 points long. There are also fewer operations
1836+ to determine which parts of a rod stick out of the FFT-box and which parts
1837+ overlap. Finally, rod operations have been GPU ported.
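
The contiguity argument can be sketched in Python for a single rod and the FFT-box row that shares its (*a2*, *a3*) position. This is a simplified illustration, not the ONETEP implementation: the function name and argument layout are hypothetical, and periodic images are ignored.

```python
def dot_rod_with_box_row(rod_start, rod_values, box_row):
    """Dot one rod with one FFT-box row along a1.

    rod_start  -- a1 index of the first rod point in box coordinates;
                  may be negative if the rod starts before the box.
    rod_values -- contiguous NGWF values along the rod.
    box_row    -- contiguous FFT-box values along a1.
    """
    # Clip once per rod: the overlap is a single contiguous index range.
    lo = max(rod_start, 0)
    hi = min(rod_start + len(rod_values), len(box_row))
    if lo >= hi:
        return 0.0  # the rod lies entirely outside the box
    rod_part = rod_values[lo - rod_start:hi - rod_start]
    box_part = box_row[lo:hi]
    return sum(r * b for r, b in zip(rod_part, box_part))


box_row = [1.0, 2.0, 3.0, 4.0, 5.0]

# Rod sticking out on the left: only its last three points overlap the box.
left = dot_rod_with_box_row(-2, [10.0, 10.0, 1.0, 1.0, 1.0], box_row)

# Rod sticking out on the right: only its first two points overlap the box.
right = dot_rod_with_box_row(3, [1.0, 2.0, 3.0, 4.0], box_row)

# Rod entirely outside the box.
outside = dot_rod_with_box_row(10, [1.0, 1.0], box_row)
```

The clipping is done once per rod and the remaining work is a dot product over one contiguous range, so longer rods mean fewer range computations per point than short PPD segments.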
1838+
1839+ Details
1840+ -------
1841+
1842+ For more details, see the banner in ``rod_rep_mod.F90``, where *rods*, *bunches*,
1843+ and handling of periodicity are described.
1844+
1845+ State of the art
1846+ ----------------
1847+
1848+ Currently (January 2025, v7.3.27), fast NGWFs are only used in fast local potential
1849+ integrals (``fast_locpot_int T``). There is potential to employ them in the fast
1850+ density calculation, and time will tell if they can beat the *rowsum booster*
1851+ approach. The rest of ONETEP certainly does not benefit from fast NGWFs, yet.
17751852
17761853Performance
17771854-----------
17781855
1779- A detailed performance analysis is not available yet, but is expected before
1780- the end of 2024. Preliminary testing reveals a speed-up of 2.2x on a CPU, and
1781- 3.6x with a GPU, even with the naive things we do in points 3 and 5 above.
1856+ No detailed performance analysis is available, but here are some tentative numbers.
1857+
1858+ Dotting two 9a0 NGWFs takes about:
1859+ - 13 us when using PPDs for the bra and an FFT-box for the ket,
1860+ - 4.2 us when using PPDs for both,
1861+ - 2.0 us when using rods for the bra and an FFT-box for the ket,
1862+ - 0.6 us on a GPU when using rods for the bra and an FFT-box for the ket.
1863+
1864+ In practice you will likely see very limited gains from fast NGWFs on a CPU;
1865+ they are mostly meant to speed up GPU calculations.
17821866
17831867------
17841868