|
1 | 1 | List of features / changes made / release notes, in reverse chronological order. |
2 | 2 | If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately). |
3 | 3 |
|
4 | | -V 2.3.0-rc1 (8/2/24) |
| 4 | +V 2.3.0-rc1 (8/6/24) |
5 | 5 |
|
6 | 6 | * Switched C++ standards from C++14 to C++17, allowing various templating |
7 | 7 | improvements (Barbone). |
8 | | -* python build modernized to pyproject.toml (for both CPU and GPU). |
9 | | - PR 507 (Anden, Lu, Barbone) |
10 | | -* switchable FFT: either FFTW or DUCC0 (latter needs no plan stage; also it is |
| 8 | +* Python build modernized to pyproject.toml (for both CPU and GPU). |
| 9 | + PR 507 (Anden, Lu, Barbone). Compiles from source for the local build. |
| 10 | +* Switchable FFT: either FFTW or DUCC0 (latter needs no plan stage; also it is |
11 | 11 | used to exploit sparsity pattern to achieve FFT speedups 1-3x in 2D and 3D). |
12 | 12 | PR463, Martin Reinecke. Both CMake and makefile includes this DUCC0 option |
13 | 13 | (makefile PR511 by Barnett; CMake by Barbone). |
@@ -54,8 +54,20 @@ V 2.3.0-rc1 (8/2/24) |
54 | 54 | It now auto-selects compiler flags based on those supported on all OSes, and |
55 | 55 | has support for Windows (llvm, msvc), Linux (llvm, gcc) and MacOS (llvm, gcc). |
56 | 56 | * CMake added nvcc and msvc optimization flags. |
57 | | -* sphinx local doc build also using CMake. |
58 | | -* updated install docs, including for DUCC0 FFT. |
| 57 | +* sphinx local doc build also using CMake. (Barbone) |
| 58 | +* updated install docs, including for DUCC0 FFT and new python build. |
| 59 | +* updated install docs (Barnett) |
| 60 | +* Major acceleration effort for the GPU library cufinufft (M Barbone, PR488): |
| 61 | + - binsize is now a function of the shared memory available where possible. |
| 62 | + - GM 1D sorts using thrust::sort instead of bin-sort. |
| 63 | + - uses the new normalized Horner coefficients and added support for |
| 64 | + upsampfac=1.25 on GPU, for first time. |
| 65 | + - new compile flags for extra-vectorization, flushing single |
| 66 | + precision denormals to 0 and using fma where possible. |
| 67 | + - using intrinsics (eg FMA) in foldrescale and other places to increase |
| 68 | + performance |
| 69 | + - using SM90 float2 vector atomicAdd where supported |
| 70 | + - make default binsize = 0 |
59 | 71 |
|
60 | 72 | V 2.2.0 (12/12/23) |
61 | 73 |