-
Notifications
You must be signed in to change notification settings - Fork 12
Change FPO to support only p=1 in configuration space and exploit equivalence of nodal to modal operations for p=1 #782
Description
Currently the work of @jmrodman which added the full Fokker Planck operator in pure advection-diffusion form,

where Γ is

and
At some point, mysteriously, the combination of the gk-g0-app merge with @jmrodman's work made the overall Gkeyll system sufficiently large that nvcc just gives up on the compilation, see my comment on the PR #592
While none of the FPO kernels are particularly large (O(100-400 KBs)) and the lines in these files are not longer than any other kernels present (lines which are at most O(5000) characters), the current likely culprit for the FPO being a problem is that there are a lot of kernels. To understand why there are so many kernels, see the following note in @jmrodman's design review (#591):
"The diagonal elements of the diffusion tensor are computed in an effectively identical manner to the drag coefficient, with 1-cell recovery and a 3-region domain stencil. However the off-diagonal elements utilize two integrations by parts, yielding two surface terms: one that uses 6-cell recovery and one that uses 2-cell recovery. The off-diagonal elements thus require a 9-cell stencil, and the domain indexing now requires 9 regions (interior, 4 edges, 4 corners). To determine integer index of the cell's location in the domain, a 2-D version of idx_to_inloup_ker is used. Relative offsets for this larger stencil are also stored in the gkyl_fpo_vlasov_coeff_recovery struct, and the kernels themselves are responsible for utilizing the correct elements of the stencil array."
my own emphasis added. While we would like to get rid of (or at least optionally not use) 6-cell recovery based on the work of @pcagas and @ammarhakim that suggests 6-cell recovery can suffer from lack of robustness, I emphasize the number of kernels is unlikely to change because even if you only do 2-cell recovery, you still have to distinguish in the two integration by parts between edges and corners in each of the three velocity directions.
I thus propose the following solution which will both reduce the number of kernels, and also make the final kernels smaller:
The nonlinearity in configuration space is quadratic; consider the volume term in the drag term

where I have split the basis functions into their configuration space and velocity space dependence (assuming a tensor basis, which all our p=1 and hybrid p=1, p>1 velocity space are, so the full basis is a tensor product of the configuration space basis with the velocity space basis).
If we are only p=1 in configuration space, the total order of the polynomials we have to integrate for the configuration space variation of the collision operator is p=3. Fortunately, the Guass-Legendre p=1 basis involves two quadrature points and integrates cubics exactly. We can thus update the collision operator per configuration space quadrature point under the assumption of p=1 variation in configuration space.
Thus, independent of whether we are 1x, 2x, or 3x, we just need to perform the collision operator update in the 3V velocity space at each configuration space quadrature point, and then reconstruct the final phase space update after computing both volume and surface integrals.
Advantages of this approach:
- There are a single set of kernels in 3V that are called per quadrature point (instead of separate kernels for 1x, 2x, and 3x).
- The kernels will be smaller because they will not include the configuration space nonlinearity (in 2x, even p=1 leads to Nc^2 more terms from the couplings, so these kernels should be a factor of a few smaller than the current kernels by treating the components as point-wise in configuration space).
- This approach naturally couples to the full Rosenbluth solves, since we can solve for the potentials per configuration space quadrature point (and then the Poisson solves are 3D solves instead of 4D, 5D, or 6D solves).
- All of these updaters can be parallelized over configuration space quadrature points on GPUs, increasing the parallelism by a factor of 2^cdim.
To do:
- Re-create the kernels for drag and diffusion coefficient construction with recovery to assume you are operating in a 3V velocity space (so an inner loop over velocity space and an outer loop over configuration space quadrature points)
- Create new moment calculation kernels which understand how to take moments per configuration space quadrature points (so you can compute the corrections to momentum and energy per configuration space quadrature point).
- Create a new hyper_dg operator that can correctly interpret an outer loop over configuration space quadrature points and an inner loop over velocity space. Create the FPO kernels for this operation per quadrature point.
- Create the kernels to convert the per configuration space quadrature point velocity space expansion into the final RHS increment.