flatironinstitute
diff --git a/CHANGELOG b/CHANGELOG
Lines changed: 29 additions & 26 deletions
diff --git a/cmake/setupCPM.cmake b/cmake/setupCPM.cmake
Lines changed: 11 additions & 8 deletions
diff --git a/cmake/setupXSIMD.cmake b/cmake/setupXSIMD.cmake
Lines changed: 27 additions & 19 deletions
diff --git a/docs/devnotes.rst b/docs/devnotes.rst
Lines changed: 6 additions & 4 deletions
@@ -1,22 +1,25 @@
 List of features / changes made / release notes, in reverse chronological order.
 If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).
 
-V 2.3.0beta (7/24/24)
+V 2.3.0-rc1 (8/6/24)
 
-* python build modernized to pyproject.toml (both CPU and GPU).
-  PRs 507 (Anden, Lu, Barbone)
-* switchable FFT: either FFTW or DUCC0 (latter need no plan stage; also it is
+* Switched C++ standards from C++14 to C++17, allowing various templating
+  improvements (Barbone).
+* Python build modernized to pyproject.toml (for both CPU and GPU).
+  PR 507 (Anden, Lu, Barbone). Compiles from source for the local build.
+* Switchable FFT: either FFTW or DUCC0 (latter needs no plan stage; also it is
   used to exploit sparsity pattern to achieve FFT speedups 1-3x in 2D and 3D).
-  PR463, Martin Reinecke.
+  PR463, Martin Reinecke. Both CMake and makefile include this DUCC0 option
+  (makefile PR511 by Barnett; CMake by Barbone).
 * ES kernel rescaled to max value 1, reduced poly degrees for upsampfac=1.25,
   cleaner Horner coefficient generation PR499 (fixes fp32 overflow issue #454).
 * Major manual acceleration of spread/interp kernels via XSIMD header-only lib,
   kernel evaluation, templating by ns with AVX-width-dependent decisions.
   Up to 80% faster, dep on compiler. (Marco Barbone with help from Libin Lu).
-  PRs 459, 471, 502.
-  NOTE: introduces new dependency (XSIMD), added to cMake and makefile.
+  A large chunk of work: PRs 459, 471, 502.
+  NOTE: introduces new dependency (XSIMD), added to CMake and makefile.
 * Exploiting even/odd symmetry for 10% faster xsimd-accel kernel poly eval
-  Libin Lu based on idea of Martin Reinecke (PR477,492,493).
+  (Libin Lu based on idea of Martin Reinecke; PR477,492,493).
 * new test/finufft3dkernel_test checks kerevalmeth=0 and 1 agree to tolerance
   PR 473 (M Barbone).
 * new perftest/compare_spreads.jl compares two spreadinterp libs (A Barnett).
@@ -47,24 +50,24 @@ V 2.3.0beta (7/24/24)
   any 32-bit integers to 64-bit when calling cufinufft(f)_setpts. Note that
   internally, 32-bit integers are still used, so calling cufinufft with more
   than 2e9 points will fail. This restriction may be lifted in the future.
-* cmake build system revamped completely, more modern practices.
-  It auto selects compiler flags based on the supported ones on all operating systems.
-  Added support for Windows (llvm, msvc), Linux (llvm, gcc) and MacOS (llvm, gcc).
-* cmake support for both ducc0 and fftw
-* cmake adding nvcc and msvc optimization flags
-* cmake supports sphinx
-* updated install docs
-* cuFINUFFT binsize is now a function of the shared memory available where
-  possible.
-* cuFINUFFT GM 1D sorts using thrust::sort instead of bin-sort.
-* cuFINUFFT using the new normalized Horner coefficients and added support
-  for 1.25.
-* cuFINUFFT new compile flags for extra-vectorization, flushing single
-  precision denormals to 0 and using fma where possible.
-* cuFINUFFT using intrinsics in foldrescale and other places to increase
-  performance
-* cuFINUFFT using SM90 float2 vector atomicAdd where supported
-* cuFINUFFT making default binsize = 0
+* CMake build system revamped completely, using more modern practices (Barbone).
+  It now auto-selects compiler flags based on those supported on all OSes, and
+  has support for Windows (llvm, msvc), Linux (llvm, gcc) and MacOS (llvm, gcc).
+* CMake added nvcc and msvc optimization flags.
+* Sphinx local doc build now also uses CMake (Barbone).
+* updated install docs, including for DUCC0 FFT and new python build.
+* updated install docs (Barnett)
+* Major acceleration effort for the GPU library cufinufft (M Barbone, PR488):
+  - binsize is now a function of the shared memory available where possible.
+  - GM 1D sorts using thrust::sort instead of bin-sort.
+  - uses the new normalized Horner coefficients, and adds support for
+    upsampfac=1.25 on GPU for the first time.
+  - new compile flags for extra vectorization, flushing single-precision
+    denormals to 0, and using FMA where possible.
+  - uses intrinsics (e.g. FMA) in foldrescale and other places to increase
+    performance.
+  - uses SM90 float2 vector atomicAdd where supported.
+  - makes default binsize = 0.
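Note on the even/odd symmetry item in the 2.3.0 list above: the following is a generic Python sketch of why symmetry can save work in polynomial evaluation. It is not the XSIMD C++ code from PRs 477, 492 and 493, which this diff does not show. Splitting p(x) into even and odd parts, p(x) = E(x^2) + x*O(x^2), gives p(-x) = E(x^2) - x*O(x^2), so a polynomial and its mirror image can share the same two Horner evaluations. The names `horner` and `eval_mirror_pair` are hypothetical.

```python
def horner(coeffs_high_first, x):
    """Evaluate a polynomial by Horner's rule; coefficients are highest degree first."""
    acc = 0.0
    for c in coeffs_high_first:
        acc = acc * x + c
    return acc


def eval_mirror_pair(coeffs_low_first, x):
    """Return (p(x), p(-x)) from one even-part and one odd-part evaluation.

    coeffs_low_first[k] is the coefficient of x**k.
    """
    even = coeffs_low_first[0::2]  # coefficients of x^0, x^2, x^4, ...
    odd = coeffs_low_first[1::2]   # coefficients of x^1, x^3, x^5, ...
    x2 = x * x
    e = horner(even[::-1], x2)     # E(x^2)
    o = horner(odd[::-1], x2)      # O(x^2)
    return e + x * o, e - x * o


# p(x) = 1 + 2x + 3x^2 + 4x^3 evaluated at x = 0.5 and x = -0.5
print(eval_mirror_pair([1.0, 2.0, 3.0, 4.0], 0.5))  # (3.25, 0.25)
```

Each of the two Horner recurrences here is about half as long as the full one, which is where a saving can come from when both p(x) and p(-x) are needed.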
 
 V 2.2.0 (12/12/23)
 
 
@@ -1,18 +1,21 @@
 # USING CPM TO HANDLE DEPENDENCIES
 if(CPM_SOURCE_CACHE)
-    set(CPM_DOWNLOAD_LOCATION "${CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
+  set(CPM_DOWNLOAD_LOCATION
+      "${CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
 elseif(DEFINED ENV{CPM_SOURCE_CACHE})
-    set(CPM_DOWNLOAD_LOCATION "$ENV{CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
+  set(CPM_DOWNLOAD_LOCATION
+      "$ENV{CPM_SOURCE_CACHE}/cpm/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
 else()
-    set(CPM_DOWNLOAD_LOCATION "${CMAKE_BINARY_DIR}/cmake/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
+  set(CPM_DOWNLOAD_LOCATION
+      "${CMAKE_BINARY_DIR}/cmake/CPM_${CPM_DOWNLOAD_VERSION}.cmake")
 endif()
 
 if(NOT (EXISTS ${CPM_DOWNLOAD_LOCATION}))
-    message(STATUS "Downloading CPM.cmake to ${CPM_DOWNLOAD_LOCATION}")
-    file(DOWNLOAD
-        https://github.com/cpm-cmake/CPM.cmake/releases/download/v${CPM_DOWNLOAD_VERSION}/CPM.cmake
-        ${CPM_DOWNLOAD_LOCATION}
-    )
+  message(STATUS "Downloading CPM.cmake to ${CPM_DOWNLOAD_LOCATION}")
+  file(
+    DOWNLOAD
+    https://github.com/cpm-cmake/CPM.cmake/releases/download/v${CPM_DOWNLOAD_VERSION}/CPM.cmake
+    ${CPM_DOWNLOAD_LOCATION})
 endif()
 
 include(${CPM_DOWNLOAD_LOCATION})
@@ -1,20 +1,28 @@
-CPMAddPackage(
-        NAME xtl
-        GIT_REPOSITORY "https://github.com/xtensor-stack/xtl.git"
-        GIT_TAG ${XTL_VERSION}
-        EXCLUDE_FROM_ALL YES
-        GIT_SHALLOW YES
-        OPTIONS "XTL_DISABLE_EXCEPTIONS YES"
-)
-
-CPMAddPackage(
-        NAME xsimd
-        GIT_REPOSITORY "https://github.com/xtensor-stack/xsimd.git"
-        GIT_TAG ${XSIMD_VERSION}
-        EXCLUDE_FROM_ALL YES
-        GIT_SHALLOW YES
-        OPTIONS
-            "XSIMD_SKIP_INSTALL YES"
-            "XSIMD_ENABLE_XTL_COMPLEX YES"
-)
+cpmaddpackage(
+  NAME
+  xtl
+  GIT_REPOSITORY
+  "https://github.com/xtensor-stack/xtl.git"
+  GIT_TAG
+  ${XTL_VERSION}
+  EXCLUDE_FROM_ALL
+  YES
+  GIT_SHALLOW
+  YES
+  OPTIONS
+  "XTL_DISABLE_EXCEPTIONS YES")
 
+cpmaddpackage(
+  NAME
+  xsimd
+  GIT_REPOSITORY
+  "https://github.com/xtensor-stack/xsimd.git"
+  GIT_TAG
+  ${XSIMD_VERSION}
+  EXCLUDE_FROM_ALL
+  YES
+  GIT_SHALLOW
+  YES
+  OPTIONS
+  "XSIMD_SKIP_INSTALL YES"
+  "XSIMD_ENABLE_XTL_COMPLEX YES")
@@ -27,11 +27,11 @@ Developer notes
 
 * The kernel function in spreadinterp is evaluated via piecewise-polynomial approximation (Horner's rule). The code for this is auto-generated in MATLAB, for all upsampling factors. There are two versions supported:
 
-  - 2018--2024 vintage: no explicit SIMD vectorization, C code is generated code for the Horner evaluation loop, by running from MATLAB `gen_all_horner_C_code.m`
+  - 2018--2024 vintage: no explicit SIMD vectorization, C code is generated code for the Horner evaluation loop, by running from MATLAB ``gen_all_horner_C_code.m``
 
-  - post-2024 vintage: explicit SIMD and many other acceleration tricks, and the generated code is a static C++ array of coefficients, and their sizes (`nc` or number of coefficients) for each width `w`. Run from MATLAB `gen_ker_horner_loop_cpp_code.m`
+  - post-2024 vintage: explicit SIMD and many other acceleration tricks, and the generated code is a static C++ array of coefficients, and their sizes (``nc`` or number of coefficients) for each width ``w``. Run from MATLAB ``gen_ker_horner_loop_cpp_code.m``
 
-  See `devel/README` for more details. The ES kernel coefficient and poly approx degree for both of the above are defined in a single location, `devel/get_degree_and_beta.m`, which must match the C++ `setup_spreader()` function.
+  See ``devel/README`` for more details. The ES kernel coefficient and poly approx degree for both of the above are defined in a single location, ``devel/get_degree_and_beta.m``, which must match the C++ ``setup_spreader()`` function.
 
 * Continuous Integration (CI). See files for this in ``.github/workflows/``. It currently tests the default ``makefile`` settings in linux, and three other ``make.inc.*`` files covering OSX and Windows (MinGW). CI does not test build the variant OMP=OFF. The dev should test these locally. Likewise, the Julia wrapper is separate and thus not tested in CI. We have added ``JenkinsFile`` for the GPU CI via python wrappers.
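The piecewise-polynomial Horner evaluation described in the kernel bullet earlier in these notes can be sketched as follows. This is a toy under stated assumptions: the coefficient table is that of a quadratic B-spline of width w = 3, chosen only because its pieces are exact polynomials. It does not contain the ES kernel coefficients generated by the MATLAB scripts, and the names `COEFFS`, `horner` and `kernel_values` are hypothetical.

```python
# One row per piece j = 0..w-1, coefficients highest degree first,
# in the local coordinate t in [0, 1) of that piece.
COEFFS = [
    [0.5, 0.0, 0.0],    # piece 0:  t^2 / 2
    [-1.0, 1.0, 0.5],   # piece 1: -t^2 + t + 1/2
    [0.5, -1.0, 0.5],   # piece 2:  (1 - t)^2 / 2
]


def horner(coeffs_high_first, t):
    """Evaluate one polynomial piece by Horner's rule."""
    acc = 0.0
    for c in coeffs_high_first:
        acc = acc * t + c
    return acc


def kernel_values(t):
    """Evaluate all w pieces at the same local offset t, one value per piece."""
    return [horner(row, t) for row in COEFFS]


vals = kernel_values(0.25)
print(vals)       # [0.03125, 0.6875, 0.28125]
print(sum(vals))  # 1.0, since B-spline pieces form a partition of unity
```

Storing the coefficients as a static table indexed by piece, as in the post-2024 description above, lets the same short loop serve every width; only the table changes.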
 
@@ -49,7 +49,9 @@ Developer notes
 
 * The cufinufft Python wheels are generated using Docker based on the manylinux2014 image. For instructions, see ``tools/cufinufft/distribution_helper.sh``. These are binary wheels that are built using CUDA 11 (or optionally CUDA 12, but these are not distributed on PyPI) and bundled with the necessary libraries.
 
-* Testing cufinufft (for FI, mostly)
+* CMake compiling on linux at Flatiron Institute (Rusty cluster): We have had a report that to use LLVM you need to ``module load llvm/16.0.3``; otherwise the default ``llvm/14.0.6`` does not find ``OpenMP_CXX``.
+
+* Testing cufinufft (for FI, mostly):
 
 .. code-block:: sh