
Commit ca9c243

Merge branch 'flatironinstitute:master' into interp-vectorization
2 parents 53851ca + 98637bc commit ca9c243

File tree

10 files changed, +95 −40 lines changed


.github/workflows/python_build_win.ps1

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Add-Content -Path make.inc -Value "CXXFLAGS+= -march=x86-64"
 Set-Variable libvcruntime140_a -Value ([IO.Path]::Combine((Split-Path -Path $PYTHON), "libs", 'libvcruntime140.a'))
 Copy-Item -Path .\.github\workflows\libvcruntime140.a -Destination $libvcruntime140_a

-python -m pip install --upgrade setuptools wheel numpy pip
+python -m pip install --upgrade setuptools==70.1.1 wheel numpy pip
 if (-not $?) {throw "Failed pip install"}

 # mingw gcc compiler pacth to work with python

docs/c_gpu.rst

Lines changed: 58 additions & 6 deletions
@@ -5,7 +5,7 @@ There are four steps needed to call cuFINUFFT from C: 1) making a plan, 2) setti
 The simplest case is to call them in order, i.e., 1234.
 However, it is possible to repeat 3 with new strength or coefficient data, or to repeat 2 to choose new nonuniform points in order to do one or more step 3's again, before destroying.
 For instance, 123334 and 1232334 are allowed.
-If non-standard algorithm options are desired, an extra function is needed before making the plan (see bottom of this page).
+If non-standard algorithm options are desired, an extra function is needed before making the plan; see bottom of this page for options.

 This API matches very closely that of the plan interface to FINUFFT (in turn modeled on those of FFTW and NFFT).
 Here is the full documentation for these functions.
@@ -289,8 +289,8 @@ This deallocates all arrays inside the ``plan`` struct, freeing all internal mem
 Note: the plan (being just a pointer to the plan struct) is not actually "destroyed"; rather, its internal struct is destroyed.
 There is no need for further deallocation of the plan.

-Non-standard options
-~~~~~~~~~~~~~~~~~~~~
+Options for GPU code
+--------------------

 The last argument in the above plan stage accepts a pointer to an options structure, which is the same in both single and double precision.
 To create such a structure, use:
@@ -300,7 +300,59 @@ To create such a structure, use:
   cufinufft_opts opts;
   cufinufft_default_opts(&opts);

-Then you may change fields of ``opts`` by hand, finally pass ``&opts`` in as the last argument to ``cufinufft_makeplan`` or ``cufinufftf_makeplan``.
-The options fields are currently only documented in the ``include/cufinufft_opts.h``.
+Then you may change fields of ``opts`` by hand, and finally pass ``&opts`` in as the last argument to ``cufinufft_makeplan`` or ``cufinufftf_makeplan``. Here are the options, with the important user-controllable ones documented. For their default values, see below.

-For examples of this advanced usage, see ``test/cuda/cufinufft*.cu``
+Data handling options
+~~~~~~~~~~~~~~~~~~~~~
+
+**modeord**: Fourier coefficient frequency index ordering; see the CPU option of the same name :ref:`modeord<modeord>`.
+As a reminder, ``modeord=0`` selects increasing frequencies (negative through positive) in each dimension,
+while ``modeord=1`` selects FFT-style ordering starting at zero and wrapping over to negative frequencies half way through.
+
+**gpu_device_id**: Sets the GPU device ID. Leave at default unless you know what you're doing. [To be documented]
+
+Diagnostic options
+~~~~~~~~~~~~~~~~~~
+
+**gpu_spreadinterponly**: if ``0``, do the NUFFT as intended. If ``1``, omit the FFT and kernel FT deconvolution steps and return garbage answers.
+A nonzero value is *only* to be used to aid timing tests (although currently there are no timing codes that exploit this option), and will give wrong or undefined answers for the NUFFT transforms!
+
+
+Algorithm performance options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**gpu_method**: Spreader/interpolator algorithm.
+
+* ``gpu_method=0`` : makes an automatic choice of one of the below methods, based on our heuristics.
+
+* ``gpu_method=1`` : uses a nonuniform-point-driven method, either unsorted (referred to as GM in our paper) or sorted (GM-sort in our paper), depending on the option ``gpu_sort`` below.
+
+* ``gpu_method=2`` : for spreading only, i.e., type 1 transforms, uses a shared-memory output-block-driven method, referred to as SM in our paper. Has no effect for interpolation (type 2 transforms).
+
+* ``gpu_method>2`` : (various unsupported experimental methods due to Melody Shih, not for regular users. E.g., ``3`` tests an idea of Paul Springer's to group NU points when spreading, and ``4`` is a block gather method of possible interest.)
+
+**gpu_sort**: ``0`` do not sort nonuniform points, ``1`` do sort nonuniform points. Only has an effect when ``gpu_method=1`` (or if this method has been internally chosen when ``gpu_method=0``). Unlike the CPU code, there is no auto-choice, since in our experience sorting is fast and always helps. However, for highly structured NU point inputs it is possible that ``gpu_sort=0`` is faster.
+
+**gpu_kerevalmeth**: ``0`` use direct (reference) kernel evaluation, which is not recommended for speed (however, it allows a nonstandard ``opts.upsampfac`` to be used). ``1`` use Horner piecewise polynomial evaluation (recommended, and enforces ``upsampfac=2.0``).
+
+**upsampfac**: set upsampling factor. For the recommended ``kerevalmeth=1`` you must choose the standard ``upsampfac=2.0``. If you are willing to risk a slower kernel evaluation, you may set any ``upsampfac>1.0``, but this is experimental and unsupported.
+
+**gpu_maxsubprobsize**: maximum number of NU points to be handled in a single subproblem in the spreading SM method (``gpu_method=2`` only).
+
+**gpu_{o}binsize{x,y,z}**: various bin sizes for the sorting (GM-sort) or SM subproblem methods. Values of ``-1`` trigger the heuristically set default values. Leave at default unless you know what you're doing. [To be documented]
+
+**gpu_maxbatchsize**: ``0`` use a heuristically defined batch size for the vectorized (many-transforms with same NU points) interface, else set this batch size.
+
+**gpu_stream**: CUDA stream to use. Leave at default unless you know what you're doing. [To be documented]
+
+
+For all GPU option default values we refer to the source code in
+``src/cuda/cufinufft.cu:cufinufft_default_opts``:
+
+.. literalinclude:: ../src/cuda/cufinufft.cu
+   :start-after: @gpu_defopts_start
+   :end-before: @gpu_defopts_end
+
+For examples of advanced options-switching usage, see ``test/cuda/cufinufft*.cu`` and ``perftest/cuda/cuperftest.cu``.
+
+You may notice a lack of debugging/timing options in the GPU code. This is to avoid CUDA writing to stdout. Please help us out by adding some of these.
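
As a minimal illustrative sketch of the options workflow in this hunk (not code from this commit; the helper name ``set_gpu_opts_example`` is invented), one starts from the library defaults and overrides a few user-controllable fields before planning:

    #include <cufinufft.h>

    /* Illustrative sketch: start from the defaults, then override a few
       of the user-controllable fields documented above. */
    static void set_gpu_opts_example(cufinufft_opts *opts) {
      cufinufft_default_opts(opts); // always initialize from the defaults first
      opts->gpu_method = 1;         // NU-point-driven spreader (GM / GM-sort)
      opts->gpu_sort = 1;           // sort NU points; helps gpu_method=1
      opts->modeord = 1;            // FFT-style frequency ordering
      // A pointer to this struct is then passed as the last argument to
      // cufinufft_makeplan (or cufinufftf_makeplan in single precision).
    }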

docs/julia.rst

Lines changed: 5 additions & 4 deletions
@@ -1,9 +1,10 @@
-Julia interfaces
-================
+Julia interfaces (CPU and GPU)
+==============================

-Ludvig af Klinteberg, Libin Lu, and others, have built `FINUFFT.jl <https://github.com/ludvigak/FINUFFT.jl>`_, an interface from the `Julia <https://julialang.org/>`_ language. This package supports 32-bit and 64-bit precision, and automatically downloads and runs pre-built binaries of the FINUFFT library for Linux, macOS, Windows and FreeBSD (for a full list see `finufft_jll <https://github.com/JuliaBinaryWrappers/finufft_jll.jl>`_).
+Principal author Ludvig af Klinteberg and others have built and maintain `FINUFFT.jl <https://github.com/ludvigak/FINUFFT.jl>`_, an interface from the `Julia <https://julialang.org/>`_ language. This official Julia package supports 32-bit and 64-bit precision, now on both CPU and GPU (via `CUDA.jl`), through a common interface.
+The Julia package installation automatically downloads pre-built CPU binaries of the FINUFFT library for Linux, macOS, Windows and FreeBSD (for a full list see `finufft_jll <https://github.com/JuliaBinaryWrappers/finufft_jll.jl>`_), and the GPU binary for Linux (see `cufinufft_jll <https://github.com/JuliaBinaryWrappers/cufinufft_jll.jl>`_).

-`FINUFFT.jl` has now (in 2022) itself been wrapped as part of `NFFT.jl <https://juliamath.github.io/NFFT.jl/dev/performance/>`_, which contains an "abstract" interface
+`FINUFFT.jl` has itself been wrapped as part of `NFFT.jl <https://juliamath.github.io/NFFT.jl/dev/performance/>`_, which contains an "abstract" interface
 to any NUFFT in Julia, with FINUFFT as an example.
 Their
 `performance comparison page <https://juliamath.github.io/NFFT.jl/dev/performance/>`_

docs/opts.rst

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
 .. _opts:

-Options parameters
-==================
+Options parameters (CPU)
+========================

 Aside from the mandatory inputs (dimension, type,
 nonuniform points, strengths or coefficients, and, in C++/C/Fortran/MATLAB,
@@ -140,7 +140,7 @@ automatically from call to call in the same executable (incidentally, also in th
 * ``spread_sort=2`` : uses a heuristic to decide whether to sort or not.

 The heuristic bakes in empirical findings such as: generally it is not worth sorting in 1D type 2 transforms, or when the number of nonuniform points is small.
-Do not change this from its default unless you obsever.
+Feel free to try experimenting here; if you have highly-structured nonuniform point ordering (such as coming from polar-grid or propeller-type MRI k-points) it may be advantageous not to sort.

 **spread_kerevalmeth**: Kernel evaluation method in spreader/interpolator.
 This should not be changed from its default value, unless you are an
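
As a minimal illustrative sketch of the ``spread_sort`` advice above (not part of this commit; the helper name ``no_sort_opts_example`` is invented), disabling sorting for highly structured point orderings looks like:

    #include <finufft.h>

    /* Illustrative sketch: turn off NU-point sorting, e.g. for polar-grid or
       propeller-type MRI k-points where sorting may not pay off. */
    static void no_sort_opts_example(finufft_opts *opts) {
      finufft_default_opts(opts); // spread_sort defaults to 2 (heuristic)
      opts->spread_sort = 0;      // 0 = never sort the nonuniform points
    }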

makefile

Lines changed: 2 additions & 2 deletions
@@ -471,7 +471,7 @@ ifneq ($(MINGW),ON)
     rm -f $(STATICLIB) $(DYNLIB)
     rm -f matlab/*.mex*
     rm -f $(TESTS) test/results/*.out perftest/results/*.out
-    rm -f $(EXAMPLES) $(FE) $(ST) $(STF) $(GTT) $(GTTF)
+    rm -f $(EXAMPLES) $(FE) $(ST) $(STF) $(STA) $(STAF) $(GTT) $(GTTF)
     rm -f perftest/manysmallprobs
     rm -f examples/core test/core perftest/core $(FE_DIR)/core
 else
@@ -480,7 +480,7 @@ else
     del matlab\*.mex*
     for %%f in ($(subst /,\, $(TESTS))) do ((if exist %%f del %%f) & (if exist %%f.exe del %%f.exe))
     del test\results\*.out perftest\results\*.out
-    for %%f in ($(subst /,\, $(EXAMPLES)), $(subst /,\,$(FE)), $(subst /,\,$(ST)), $(subst /,\,$(STF)), $(subst /,\,$(GTT)), $(subst /,\,$(GTTF))) do ((if exist %%f del %%f) & (if exist %%f.exe del %%f.exe))
+    for %%f in ($(subst /,\, $(EXAMPLES)), $(subst /,\,$(FE)), $(subst /,\,$(ST)), $(subst /,\,$(STF)), $(subst /,\,$(STA)), $(subst /,\,$(STAF)), $(subst /,\,$(GTT)), $(subst /,\,$(GTTF))) do ((if exist %%f del %%f) & (if exist %%f.exe del %%f.exe))
     del perftest\manysmallprobs
     del examples\core, test\core, perftest\core, $(subst /,\, $(FE_DIR))\core
 endif

perftest/compare_spreads.jl

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ using CairoMakie
 using JLD2 # for load/save arrays to file
 using UnPack

-fnam = "results/master-vs-svec2_gcc114_5700U_nthr1" # outfile head
+fnam = "/home/alex/numerics/finufft/perftest/results/7-1-24-vs-ivec_gcc114_5700U_nthr1" # outfile head
 # locations of pair of FINUFFT repos to compare...
 repo1 = "/home/alex/numerics/finufft"
 repo2 = "/home/alex/numerics/nufft/finufft-svec2"

src/cuda/cufinufft.cu

Lines changed: 18 additions & 21 deletions
@@ -92,39 +92,36 @@ void cufinufft_default_opts(cufinufft_opts *opts)
 Options with prefix "gpu_" are used for gpu code.

 Notes:
-Values set in this function for different type and dimensions are preferable
-based on experiments. User can experiement with different settings by
-replacing them after calling this function.
+1) Values set in this function for different type and dimensions are preferable
+based on experiments. User can experiment with different settings by
+changing them after calling this function.
+2) Sphinx sucks the below code block into the web docs, hence keep it clean.

-Melody Shih 07/25/19; Barnett 2/5/21.
+Melody Shih 07/25/19; Barnett 2/5/21, tidied for sphinx 7/2/24.
 */
 {
-  opts->upsampfac = 2.0;
+  // sphinx tag (don't remove): @gpu_defopts_start
+  // data handling opts...
+  opts->modeord = 0;
+  opts->gpu_device_id = 0;

-  /* following options are for gpu */
-  opts->gpu_sort = 1; // access nupts in an ordered way for nupts driven method
+  // diagnostic opts...
+  opts->gpu_spreadinterponly = 0;

+  // algorithm performance opts...
+  opts->gpu_method = 0;
+  opts->gpu_sort = 1;
+  opts->gpu_kerevalmeth = 1;
+  opts->upsampfac = 2.0;
   opts->gpu_maxsubprobsize = 1024;
   opts->gpu_obinsizex = -1;
   opts->gpu_obinsizey = -1;
   opts->gpu_obinsizez = -1;
-
   opts->gpu_binsizex = -1;
   opts->gpu_binsizey = -1;
   opts->gpu_binsizez = -1;
-
-  opts->gpu_spreadinterponly = 0; // default to do the whole nufft
-
-  opts->gpu_maxbatchsize = 0; // Heuristically set
+  opts->gpu_maxbatchsize = 0;
   opts->gpu_stream = cudaStreamDefault;
-
-  opts->gpu_kerevalmeth = 1; // Horner
-
-  opts->gpu_method = 0; // Auto method (2 for type 1, 2 for type 2).
-
-  // By default, only use device 0
-  opts->gpu_device_id = 0;
-
-  opts->modeord = 0;
+  // sphinx tag (don't remove): @gpu_defopts_end
 }
 }

src/finufft.cpp

Lines changed: 1 addition & 1 deletion
@@ -550,7 +550,7 @@ void FINUFFT_DEFAULT_OPTS(finufft_opts *o)
   o->showwarn = 1;

   o->nthreads = 0;
-  o->fftw = FFTW_ESTIMATE; //
+  o->fftw = FFTW_ESTIMATE;
   o->spread_sort = 2;
   o->spread_kerevalmeth = 1;
   o->spread_kerpad = 1;

src/spreadinterp.cpp

Lines changed: 6 additions & 0 deletions
@@ -752,6 +752,12 @@ void evaluate_kernel_vector(FLT *ker, FLT *args, const finufft_spread_opts &opts
     if (!(opts.flags & TF_OMIT_EVALUATE_EXPONENTIAL))
       for (int i = 0; i < Npad; i++) // Loop 2: Compute exponentials
         ker[i] = exp(ker[i]);
+    if (opts.kerpad) {
+      // The padded part must be zero: spread_subproblem_nd_kernels performs
+      // out-of-bound writes of these padded entries to the trg arrays.
+      for (int i = N; i < Npad; ++i)
+        ker[i] = 0.0;
+    }
   } else {
     for (int i = 0; i < N; i++) // dummy for timing only
       ker[i] = 1.0;
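
The zero-padding added above can be illustrated by a standalone toy in plain C (an illustrative sketch only, not FINUFFT's evaluate_kernel_vector; the kernel formula and function name are placeholders): the kernel vector is evaluated over a length Npad rounded up to a multiple of 4 so the loops vectorize, and the exponential pass fills the padded tail with nonzero values, so the tail must be zeroed before those entries reach the target arrays.

    #include <math.h>

    /* Toy sketch: ker must have room for 4*((N+3)/4) entries. */
    static void eval_kernel_padded_sketch(double *ker, const double *args,
                                          int N, double beta) {
      const int Npad = 4 * ((N + 3) / 4); // round N up to a multiple of 4
      for (int i = 0; i < N; ++i)
        ker[i] = beta * args[i];          // placeholder kernel argument
      for (int i = N; i < Npad; ++i)
        ker[i] = 0.0;                     // give the pad defined values
      for (int i = 0; i < Npad; ++i)      // full-width, vectorizable pass
        ker[i] = exp(ker[i]);             // pad now holds exp(0) = 1 ...
      for (int i = N; i < Npad; ++i)
        ker[i] = 0.0;                     // ... so zero it, as in the fix above
    }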

tools/finufft/build-wheels-linux.sh

Lines changed: 0 additions & 1 deletion
@@ -25,7 +25,6 @@ versions=("cp36-cp36m"
           "cp310-cp310"
           "cp311-cp311"
           "cp312-cp312"
-          "pp38-pypy38_pp73"
           "pp39-pypy39_pp73")

 pys=()
