Skip to content

Commit dc3286a

Browse files
Merge pull request #166 from mwarusz/mwarusz/omega-tridiagonal-design
Add tridiagonal solver design document
2 parents cc4eb05 + 30bbafc commit dc3286a

File tree

2 files changed

+288
-0
lines changed

2 files changed

+288
-0
lines changed
Lines changed: 287 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,287 @@
1+
# Tridiagonal Solver
2+
3+
<!--- use table of contents if desired for longer documents -->
4+
**Table of Contents**
5+
1. [Overview](#1-overview)
6+
2. [Requirements](#2-requirements)
7+
3. [Algorithmic Formulation](#3-algorithmic-formulation)
8+
4. [Design](#4-design)
9+
5. [Verification and Testing](#5-verification-and-testing)
10+
11+
## 1 Overview
12+
13+
In geophysical models, implicit time integration of vertical terms often requires solution to tridiagonal systems of equations.
14+
Typically, a separate system needs to be solved in each vertical column, which requires an efficient batched tridiagonal solver.
15+
One common situation where a tridiagonal system arises is implicit treatment of vertical diffusion/mixing.
16+
In principle, this problem results in a symmetric diagonally-dominant tridiagonal system.
17+
However, in ocean modeling, the coefficients of vertical mixing can vary by orders of magnitude and
18+
can become very large in some layers.
19+
This requires specialized algorithms that can handle this situation stably.
20+
21+
## 2 Requirements
22+
23+
### 2.1 Requirement: Modularity
24+
25+
There should be one module that provides tridiagonal solvers that are sufficiently general to handle every need in Omega.
26+
27+
### 2.2 Requirement: Stability for vertical mixing problems
28+
29+
A specialized solver that is stable for vertical mixing problems with large variations in mixing coefficients must be
30+
provided.
31+
32+
### 2.3 Requirement: Top and bottom boundary conditions
33+
34+
When applied to the implicit vertical mixing of momentum, the specialized diffusion solver should be
35+
sufficiently general to be able to incorporate various bottom drag and wind stress formulations.
36+
37+
### 2.4 Requirement: Performance
38+
39+
All solvers must be performant on CPU and GPU architectures. On CPUs this means supporting vectorization.
40+
41+
### 2.5 Desired: Ability to fuse kernels
42+
43+
Implicit vertical mixing will require some pre-processing (e.g. setup of the system) and post-processing work.
44+
It is desirable to handle all of that in one computational kernel. This requires the ability to call the
45+
solvers inside a `parallelFor`.
46+
47+
## 3 Algorithmic Formulation
48+
A general tridiagonal system has the form:
49+
$$
50+
a_i x_{i - 1} + b_i x_i + c_i x_{i + 1} = y_i.
51+
$$
52+
The standard second-order discretization of a diffusion problem leads to a system of the form
53+
$$
54+
-g_{i - 1} x_{i - 1} + (g_{i - 1} + g_i + h_i) x_i - g_{i} x_{i + 1} = y_i,
55+
$$
56+
where $g_i$, $h_i$ are positive and $g_i$ is proportional to the mixing coefficient
57+
and can be much greater than $h_i$, which is the layer thickness.
58+
59+
60+
### 3.1 Thomas algorithm
61+
The Thomas algorithm is a simplified form of Gaussian elimination for tridiagonal systems of equations.
62+
In its typical implementation, the forward elimination phase proceeds as follows
63+
$$
64+
\begin{aligned}
65+
c'_i &= \frac{c_i}{b_i - a_i c'_{i - 1}}, \\
66+
y'_i &= \frac{y_i - a_i y'_{i - 1}}{b_i - a_i c'_{i - 1}}.
67+
\end{aligned}
68+
$$
69+
When applied to the diffusion system this leads to
70+
$$
71+
\begin{aligned}
72+
c'_i &= \frac{-g_i}{h_i + g_{i - 1} (1 + c'_{i - 1}) + g_{i}}, \\
73+
y'_i &= \frac{y_i + g_{i - 1} y'_{i - 1}}{h_i + g_{i - 1} (1 + c'_{i - 1}) + g_i}.
74+
\end{aligned}
75+
$$
76+
Let's consider what happens when the mixing coefficient, and hence $g_i$, abruptly changes.
77+
Suppose that $h_i$ is small, $g_{i - 1}$ is small, but $g_i$ is very large.
78+
Then $c'_i \approx -1$ and in the next iteration the term $g_i (1 + c'_i)$ in
79+
the denominator of both expression will multiply a very small number by a very large number.
80+
81+
Following [Appendix E in Schopf and Loughe](https://journals.ametsoc.org/view/journals/mwre/123/9/1520-0493_1995_123_2839_argiom_2_0_co_2.xml), to remedy that we can introduce
82+
$$
83+
\alpha'_i = g_i (1 + c'_i),
84+
$$
85+
which satisfies the following recursion relation
86+
$$
87+
\alpha'_i = \frac{g_i (h_i + \alpha'_{i - 1})}{h_i + \alpha'_{i - 1} + g_i}.
88+
$$
89+
The above equation together with
90+
$$
91+
\begin{aligned}
92+
c'_i &= \frac{-g_i}{h_i + \alpha'_{i - 1} + g_{i}}, \\
93+
y'_i &= \frac{y_i + g_{i - 1} y'_{i - 1}}{h_i + \alpha'_{i - 1} + g_i},
94+
\end{aligned}
95+
$$
96+
forms the modifed stable algorithm.
97+
98+
### 3.2 (Parallel) cyclic reduction
99+
100+
The Thomas algorithm is work-efficient, but inherently serial.
101+
While systems in different columns can
102+
be solved in parallel, this might not expose enough parallelism on modern GPUs.
103+
There are parallel tridiagonal algorithms that perform better on modern GPUs,
104+
see [Zhang, Cohen, and Owens](https://doi.org/10.1145/1837853.1693472).
105+
The two algorithms best suited for small systems are cyclic reduction and parallel cyclic reduction.
106+
107+
The basic idea of both cyclic reduction algorithms is as follows. Let's consider three consecutive equations corresponding to $y_{i- 1}$, $y_i$, and $y_{i+1}$
108+
$$
109+
\begin{aligned}
110+
a_{i - 1} x_{i - 2} + b_{i - 1} x_{i - 1} + c_{i - 1} x_i &= y_{i - 1}, \\
111+
a_i x_{i - 1} + b_i x_i + c_i x_{i + 1} &= y_i, \\
112+
a_{i + 1} x_{i} + b_{i + 1} x_{i + 1} + c_{i + 1} x_{i + 2} &= y_{i + 1}.
113+
\end{aligned}
114+
$$
115+
We can eliminate $x_{i - 1}$ and $x_{i + 1}$ to obtain a system of the form
116+
$$
117+
\hat{a}_i x_{i - 2} + \hat{b}_i x_i + \hat{c}_i x_{i + 2} = \hat{y}_i,
118+
$$
119+
where the modified coefficients $\hat{a}_i$, $\hat{b}_i$, $\hat{c}_i$ and the modified rhs $\hat{y}_i$ are
120+
$$
121+
\begin{aligned}
122+
\hat{a}_i &= -\frac{a_{i - 1} a_i}{b_{i - 1}}, \\
123+
\hat{b}_i &= b_i - \frac{c_{i - 1} a_i}{b_{i - 1}} - \frac{c_{i} a_{i + 1}}{b_{i + 1}}, \\
124+
\hat{c}_i &= -\frac{c_i c_{i + 1}}{b_{i + 1}}, \\
125+
\hat{y}_i &= y_i - \frac{a_i y_{i - 1}}{b_{i - 1}} - \frac{c_i y_{i + 1}}{b_{i + 1}}.
126+
\end{aligned}
127+
$$
128+
The resulting system of equations for $x_{2j}$ is still tridiagonal and has roughly half the size of the original.
129+
130+
The cyclic reduction algorithm has two phases.
131+
In the first phase, the above elimination step is iterated until the system is reduced to either one or two equations, which can then be directly solved.
132+
In each iteration the computation of modified coefficients can be done in parallel.
133+
The second phase involves finding the rest of the solution by using the final coefficients.
134+
The second phase is also iterative, where at each iteration the number of know solution values increases by a factor of two. A drawback of this algorithm is that the amount of parallel computations available at each iteration is not constant.
135+
136+
The parallel cyclic reduction is based on the same idea, but has only one phase. In the first iteration it reduces the
137+
original system to two systems of half the size. The second iteration reduces it to four systems of quarter the size
138+
and so on. In the final iteration systems of size one or two are solved to obtain the whole solution at once.
139+
In contrast to the cyclic reduction, this algorithm has constant amount of parallelism available at the cost of performing
140+
more redundant work.
141+
142+
A naive application of the cyclic reduction to the diffusion system would result
143+
in the following equation for the modified main diagonal
144+
$$
145+
\hat{b}_i = h_i
146+
+ g_{i - 1} - \frac{g_{i-1}^2}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
147+
+ g_i - \frac{g_i^2}{h_{i + 1} + g_{i} + g_{i + 1}}.
148+
$$
149+
Using this expression can potentially result in catastrophic cancellation errors and overflows if
150+
$g_{i - 1}$ or $g_{i + 1}$ are very large.
151+
To improve its stability, this expression can be rewritten as
152+
$$
153+
\hat{b}_i = h_i
154+
+ h_{i - 1} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
155+
+ h_{i + 1} \frac{g_i}{h_{i + 1} + g_i + g_{i + 1}}
156+
+ g_{i - 1} \frac{g_{i - 2}}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
157+
+ g_{i + 1} \frac{g_i}{h_{i + 1} + g_i + g_{i + 1}}.
158+
$$
159+
This equation can be shown to be in the form
160+
$$
161+
\hat{b}_i = \hat{h}_i + \hat{g}_{i - 2} + \hat{g}_i,
162+
$$
163+
where
164+
$$
165+
\begin{aligned}
166+
\hat{h}_i &= h_i
167+
+ h_{i - 1} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
168+
+ h_{i + 1} \frac{g_i}{h_{i + 1} + g_i + g_{i + 1}}, \\
169+
\hat{g}_i &= g_{i + 1} \frac{g_i}{h_{i + 1} + g_i + g_{i + 1}}.
170+
\end{aligned}
171+
$$
172+
These two equations form the basis of stable (parallel) cyclic reduction for diffusion problems.
173+
174+
175+
## 4 Design
176+
177+
Four different algorithms will be implemented:
178+
- Thomas algorithm for general tridiagonal systems on CPUs
179+
- PCR algorithm for general tridiagonal systems on GPUs
180+
- Thomas algorithm for diffusion systems with improved stability properties on CPUs
181+
- PCR algorithm for diffusion systems with improved stability properties on GPUs
182+
183+
The algorithms will be designed to work within Kokkos team policies, with each team of threads
184+
solving one column system on GPUs and `VecLength` column systems on CPUs.
185+
The user interface for CPU and GPU solvers will be the same. There will be
186+
two type aliases `TriDiagSolver` and `TriDiagDiffSolver` that will resolve to the
187+
optimal solver class based on the chosen architecture.
188+
189+
### 4.1 Data types and parameters
190+
191+
#### 4.1.1 Parameters
192+
193+
No parameters are required.
194+
195+
#### 4.1.2 Class/structs/data types
196+
197+
There will be a solver struct for each algorithm. The solvers inputs and outputs will be encapsulated in
198+
two scratch data structs, one for the general solver and one for the diffusion solver.
199+
200+
#### 4.1.2.1 Scratch data structs
201+
202+
To facilitate constructing the systems on the fly and for performance reasons the solver inputs and outputs will be using the Kokkos scratch memory space. The scratch data for the general tridiagonal solver will be encapsulated in a struct called `TriDiagScratch`
203+
```c++
204+
struct TriDiagScratch {
205+
ScratchArray2DReal DL;
206+
ScratchArray2DReal D;
207+
ScratchArray2DReal DU;
208+
ScratchArray2DReal X;
209+
};
210+
```
211+
where `DL`, `D`, `DU`, `X` are views of size (`NRow`, `VecLength`) in the scratch memory space. The views
212+
`DL`, `D`, `DU` are inputs denoting the lower, main, and upper diagonal, respectively. The view `X` should contain
213+
the rhs at input and will be overwritten with the solution after `solve` is called.
214+
215+
The scratch data for the specialized diffusion tridiagonal solver will be encapsulated in a struct called `TriDiagDiffScratch`
216+
```c++
217+
struct TriDiagDiffScratch {
218+
ScratchArray2DReal G;
219+
ScratchArray2DReal H;
220+
ScratchArray2DReal X;
221+
ScratchArray2DReal Alpha;
222+
};
223+
```
224+
where `G`, `H`, `X`, `Alpha` are views of size (`NRow`, `VecLength`) in the scratch memory space. The views
225+
`G` and `H` are inputs corresponding to the variables `g` and `h` introduced in [Section 3](#3-algorithmic-formulation).
226+
The view `X` has the same meaning as in the general case.
227+
The view `Alpha` is an internal workspace used by the algorithm.
228+
229+
#### 4.1.2.2 Solver structs
230+
231+
The four solver algorithms will be implemented as four structs `ThomasSolver`, `PCRSolver`, `ThomasDiffusionSolver`, and `PCRDiffusionSolver`. Currently, there is no plan for those structs to have any data members and they will only provide static methods, acting essentially as namespaces.
232+
233+
### 4.2 Methods
234+
235+
#### 4.2.1 Scratch Constructors
236+
The constructors of scratch spaces take a team member and system size (`NRow`)
237+
```c++
238+
TriDiagScratch(const TeamMember &Member, int NRow);
239+
TriDiagDiffScratch(const TeamMember &Member, int NRow);
240+
```
241+
242+
#### 4.2.2 Policy creation
243+
Every solver will provide a static method
244+
```c++
245+
static TeamPolicy makeTeamPolicy(int NBatch, int NRow);
246+
```
247+
that creates an appropriate team policy for solving `NBatch` systems of size `NRow`.
248+
249+
#### 4.2.3 Solve Methods
250+
The general solvers `ThomasSolver` and `PCRSolver` will have a static solve method
251+
```c++
252+
static void solve(const TeamMember &Member, const TriDiagScratch &Scratch);
253+
```
254+
that takes a team member and an initialized general scratch space. After calling this method `Scratch.X`
255+
will contain the solution. There will also be a convenience method
256+
```c++
257+
static void solve(const TeamMember &Member,
258+
const Array2DReal &DL, const Array2DReal &D, const Array2DReal &DU, const Array2DReal &X);
259+
```
260+
that loads the inputs from global arrays.
261+
262+
The diffusion solvers `ThomasDiffusionSolver` and `PCRDiffusionSolver` will provide a similar method
263+
```c++
264+
static void solve(const TeamMember &Member, const TriDiagDiffScratch &Scratch);
265+
```
266+
differing only in the type of scratch space. Similarly, there will be a convenience method
267+
```c++
268+
static void solve(const TeamMember &Member,
269+
const Array2DReal &G, const Array2DReal &H, const Array2DReal &X);
270+
```
271+
which loads the inputs from global arrays.
272+
273+
## 5 Verification and Testing
274+
275+
### 5.1 Test solvers correctness using prescribed matrix
276+
277+
Given analytically prescribed matrix `A` and vector `y` the solution to the problem `A x = z` with `z = A y` will be
278+
checked to see if the resulting `x` is equal to the prescribed vector `y`. This will be done for all solvers for a variety of (batch size, system size) combinations.
279+
280+
### 5.2 Test diffusion solvers convergence using manufactured solution
281+
282+
The convergence of diffusion solvers will be tested using a manufactured solution.
283+
284+
### 5.3 Test stability
285+
286+
The diffusion solvers stability will be tested on an idealized vertical mixing problem with abrupt changes in
287+
the diffusion coefficient.

components/omega/doc/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -110,6 +110,7 @@ design/TimeMgr
110110
design/Timers
111111
design/TimeStepping
112112
design/Tracers
113+
design/TridiagonalSolver
113114
114115
design/Template
115116
```

0 commit comments

Comments
 (0)