Skip to content

Change FPO to support only p=1 in configuration space and exploit equivalence of nodal to modal operations for p=1 #782

@JunoRavin

Description

@JunoRavin

Currently the work of @jmrodman which added the full Fokker Planck operator in pure advection-diffusion form,
Image
where Γ is
Image
and $\mathbf{a}$ and $\mathbf{D}$ are general drag and diffusion terms which can be obtain in full generality from the Rosenbluth potentials does not compile on GPUs due to the overall size of the Gkeyll system.

At some point, mysteriously, the combination of the gk-g0-app merge with @jmrodman's work made the overall Gkeyll system sufficiently large that nvcc just gives up on the compilation, see my comment on the PR #592

While none of the FPO kernels are particularly large (O(100-400 KBs)) and the lines in these files are not longer than any other kernels present (lines which are at most O(5000) characters), the current likely culprit for the FPO being a problem is that there are a lot of kernels. To understand why there are so many kernels, see the following note in @jmrodman's design review (#591):

"The diagonal elements of the diffusion tensor are computed in an effectively identical manner to the drag coefficient, with 1-cell recovery and a 3-region domain stencil. However the off-diagonal elements utilize two integrations by parts, yielding two surface terms: one that uses 6-cell recovery and one that uses 2-cell recovery. The off-diagonal elements thus require a 9-cell stencil, and the domain indexing now requires 9 regions (interior, 4 edges, 4 corners). To determine integer index of the cell's location in the domain, a 2-D version of idx_to_inloup_ker is used. Relative offsets for this larger stencil are also stored in the gkyl_fpo_vlasov_coeff_recovery struct, and the kernels themselves are responsible for utilizing the correct elements of the stencil array."

my own emphasis added. While we would like to get rid of (or at least optionally not use) 6-cell recovery based on the work of @pcagas and @ammarhakim that suggests 6-cell recovery can suffer from lack of robustness, I emphasize the number of kernels is unlikely to change because even if you only do 2-cell recovery, you still have to distinguish in the two integration by parts between edges and corners in each of the three velocity directions.

I thus propose the following solution which will both reduce the number of kernels, and also make the final kernels smaller:

The nonlinearity in configuration space is quadratic; consider the volume term in the drag term
Image
where I have split the basis functions into their configuration space and velocity space dependence (assuming a tensor basis, which all our p=1 and hybrid p=1, p>1 velocity space are, so the full basis is a tensor product of the configuration space basis with the velocity space basis).

If we are only p=1 in configuration space, the total order of the polynomials we have to integrate for the configuration space variation of the collision operator is p=3. Fortunately, the Guass-Legendre p=1 basis involves two quadrature points and integrates cubics exactly. We can thus update the collision operator per configuration space quadrature point under the assumption of p=1 variation in configuration space.

Thus, independent of whether we are 1x, 2x, or 3x, we just need to perform the collision operator update in the 3V velocity space at each configuration space quadrature point, and then reconstruct the final phase space update after computing both volume and surface integrals.

Advantages of this approach:

  1. There are a single set of kernels in 3V that are called per quadrature point (instead of separate kernels for 1x, 2x, and 3x).
  2. The kernels will be smaller because they will not include the configuration space nonlinearity (in 2x, even p=1 leads to Nc^2 more terms from the couplings, so these kernels should be a factor of a few smaller than the current kernels by treating the components as point-wise in configuration space).
  3. This approach naturally couples to the full Rosenbluth solves, since we can solve for the potentials per configuration space quadrature point (and then the Poisson solves are 3D solves instead of 4D, 5D, or 6D solves).
  4. All of these updaters can be parallelized over configuration space quadrature points on GPUs, increasing the parallelism by a factor of 2^cdim.

To do:

  1. Re-create the kernels for drag and diffusion coefficient construction with recovery to assume you are operating in a 3V velocity space (so an inner loop over velocity space and an outer loop over configuration space quadrature points)
  2. Create new moment calculation kernels which understand how to take moments per configuration space quadrature points (so you can compute the corrections to momentum and energy per configuration space quadrature point).
  3. Create a new hyper_dg operator that can correctly interpret an outer loop over configuration space quadrature points and an inner loop over velocity space. Create the FPO kernels for this operation per quadrature point.
  4. Create the kernels to convert the per configuration space quadrature point velocity space expansion into the final RHS increment.

Metadata

Metadata

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions