1 | 1 | List of features / changes made / release notes, in reverse chronological order. |
2 | 2 | If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately). |
3 | 3 |
4 | | -V 2.3.0beta (7/24/24) |
| 4 | +V 2.3.0-rc1 (8/6/24) |
5 | 5 |
6 | | -* python build modernized to pyproject.toml (both CPU and GPU). |
7 | | - PRs 507 (Anden, Lu, Barbone) |
8 | | -* switchable FFT: either FFTW or DUCC0 (latter need no plan stage; also it is |
| 6 | +* Switched C++ standards from C++14 to C++17, allowing various templating |
| 7 | + improvements (Barbone). |
| 8 | +* Python build modernized to pyproject.toml (for both CPU and GPU). |
| 9 | + PR 507 (Anden, Lu, Barbone). Compiles from source for the local build. |
| 10 | +* Switchable FFT: either FFTW or DUCC0 (latter needs no plan stage; also it is |
9 | 11 | used to exploit sparsity pattern to achieve FFT speedups 1-3x in 2D and 3D). |
10 | | - PR463, Martin Reinecke. |
| 12 | + PR463, Martin Reinecke. Both CMake and makefile include this DUCC0 option |
| 13 | + (makefile PR511 by Barnett; CMake by Barbone). |
11 | 14 | * ES kernel rescaled to max value 1, reduced poly degrees for upsampfac=1.25, |
12 | 15 | cleaner Horner coefficient generation PR499 (fixes fp32 overflow issue #454). |
13 | 16 | * Major manual acceleration of spread/interp kernels via XSIMD header-only lib, |
14 | 17 | kernel evaluation, templating by ns with AVX-width-dependent decisions. |
15 | 18 | Up to 80% faster, depending on compiler. (Marco Barbone with help from Libin Lu). |
16 | | - PRs 459, 471, 502. |
17 | | - NOTE: introduces new dependency (XSIMD), added to cMake and makefile. |
| 19 | + A large chunk of work: PRs 459, 471, 502. |
| 20 | + NOTE: introduces new dependency (XSIMD), added to CMake and makefile. |
18 | 21 | * Exploiting even/odd symmetry for 10% faster xsimd-accel kernel poly eval |
19 | | - Libin Lu based on idea of Martin Reinecke (PR477,492,493). |
| 22 | + (Libin Lu based on idea of Martin Reinecke; PR477,492,493). |
20 | 23 | * new test/finufft3dkernel_test checks kerevalmeth=0 and 1 agree to tolerance |
21 | 24 | PR 473 (M Barbone). |
22 | 25 | * new perftest/compare_spreads.jl compares two spreadinterp libs (A Barnett). |
@@ -47,24 +50,24 @@ V 2.3.0beta (7/24/24) |
47 | 50 | any 32-bit integers to 64-bit when calling cufinufft(f)_setpts. Note that |
48 | 51 | internally, 32-bit integers are still used, so calling cufinufft with more |
49 | 52 | than 2e9 points will fail. This restriction may be lifted in the future. |
50 | | -* cmake build system revamped completely, more modern practices. |
51 | | - It auto selects compiler flags based on the supported ones on all operating systems. |
52 | | - Added support for Windows (llvm, msvc), Linux (llvm, gcc) and MacOS (llvm, gcc). |
53 | | -* cmake support for both ducc0 and fftw |
54 | | -* cmake adding nvcc and msvc optimization flags |
55 | | -* cmake supports sphinx |
56 | | -* updated install docs |
57 | | -* cuFINUFFT binsize is now a function of the shared memory available where |
58 | | - possible. |
59 | | -* cuFINUFFT GM 1D sorts using thrust::sort instead of bin-sort. |
60 | | -* cuFINUFFT using the new normalized Horner coefficients and added support |
61 | | - for 1.25. |
62 | | -* cuFINUFFT new compile flags for extra-vectorization, flushing single |
63 | | - precision denormals to 0 and using fma where possible. |
64 | | -* cuFINUFFT using intrinsics in foldrescale and other places to increase |
65 | | - performance |
66 | | -* cuFINUFFT using SM90 float2 vector atomicAdd where supported |
67 | | -* cuFINUFFT making default binsize = 0 |
| 53 | +* CMake build system revamped completely, using more modern practices (Barbone). |
| 54 | + It now auto-selects compiler flags based on those supported on all OSes, and |
| 55 | + has support for Windows (llvm, msvc), Linux (llvm, gcc) and MacOS (llvm, gcc). |
| 56 | +* CMake added nvcc and msvc optimization flags. |
| 57 | +* Sphinx local doc build now also uses CMake (Barbone). |
| 58 | +* updated install docs, including for DUCC0 FFT and new python build. |
| 59 | +* updated install docs (Barnett) |
| 60 | +* Major acceleration effort for the GPU library cufinufft (M Barbone, PR488): |
| 61 | + - binsize is now a function of the shared memory available where possible. |
| 62 | + - GM 1D sorts using thrust::sort instead of bin-sort. |
| 63 | + - uses the new normalized Horner coefficients and added support for |
| 64 | + upsampfac=1.25 on GPU, for the first time. |
| 65 | + - new compile flags for extra-vectorization, flushing single |
| 66 | + precision denormals to 0 and using fma where possible. |
| 67 | + - using intrinsics (e.g. FMA) in foldrescale and other places to increase |
| 68 | + performance |
| 69 | + - using SM90 float2 vector atomicAdd where supported |
| 70 | + - makes default binsize = 0 |
68 | 71 |
69 | 72 | V 2.2.0 (12/12/23) |
70 | 73 |