Skip to content

Commit 14ab501

Browse files
committed
Add tridiagonal solver design document
1 parent bb8be10 commit 14ab501

File tree

2 files changed

+188
-0
lines changed

2 files changed

+188
-0
lines changed
Lines changed: 187 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,187 @@
1+
# Tridiagonal Solver
2+
3+
<!--- use table of contents if desired for longer documents -->
4+
**Table of Contents**
5+
1. [Overview](#1-overview)
6+
2. [Requirements](#2-requirements)
7+
3. [Algorithmic Formulation](#3-algorithmic-formulation)
8+
4. [Design](#4-design)
9+
5. [Verification and Testing](#5-verification-and-testing)
10+
11+
## 1 Overview
12+
13+
In geophysical models, implicit time integration of vertical terms often requires solution to tridiagonal systems of equations.
14+
Typically, a separate system needs to be solved in each vertical column, which requires an efficient batched tridiagonal solver.
15+
One common situation where a tridiagonal system arises is implicit treatment of vertical diffusion/mixing.
16+
In principle, this problem results in a symmetric diagonally-dominant tridiagonal system.
17+
However, in ocean modeling, the coefficients of vertical mixing can vary by orders of magnitude and
18+
can become very large in some layers.
19+
This requires specialized algorithms that can handle this situation stably.
20+
21+
## 2 Requirements
22+
23+
### 2.1 Requirement: Modularity
24+
25+
There should be one implementation of batched tridiagonal solver that can be called every time a solution to such a system is needed.
26+
27+
### 2.2 Requirement: Stability for vertical mixing problems
28+
29+
The solver must be stable for vertical mixing problems with large variations in mixing coefficients.
30+
31+
### 2.3 Requirement: Bottom boundary conditions
32+
33+
When applied to the implicit vertical mixing of momentum, the solver should be
34+
sufficiently general to be able to incorporate various bottom drag formulations.
35+
36+
### 2.4 Requirement: Performance
37+
38+
The solver must be performant on CPU and GPU architectures.
39+
40+
### 2.5 Desired: Ability to fuse kernels
41+
42+
Implicit vertical mixing will require some pre-processing (e.g. setup of the system) and post-processing work.
43+
It is desirable to handle all of that in one computational kernel. This requires the ability to call the
44+
solver inside a `parallelFor`.
45+
46+
## 3 Algorithmic Formulation
47+
A general tridiagonal system has the form:
48+
$$
49+
a_i x_{i - 1} + b_i x_i + c_i x_{i + 1} = y_i.
50+
$$
51+
The standard second-order discretization of a diffusion problem leads to a system of the form
52+
$$
53+
-g_{i - 1} x_{i - 1} + (g_{i - 1} + g_i + h_i) x_i - g_{i} x_{i + 1} = y_i,
54+
$$
55+
where $g_i$, $h_i$ are positive and $g_i$ is proportional to the mixing coefficient
56+
and can be much greater than $h_i$, which is the layer thickness.
57+
58+
59+
### 3.1 Thomas algorithm
60+
The Thomas algorithm is a simplified form of Gaussian elimination for tridiagonal systems of equations.
61+
In its typical implementation, the forward elimination phase proceeds as follows
62+
$$
63+
\begin{aligned}
64+
c'_i &= \frac{c_i}{b_i - a_i c'_{i - 1}}, \\
65+
y'_i &= \frac{y_i - a_i y'_{i - 1}}{b_i - a_i c'_{i - 1}}.
66+
\end{aligned}
67+
$$
68+
When applied to the diffusion system this leads to
69+
$$
70+
\begin{aligned}
71+
c'_i &= \frac{-g_i}{h_i + g_{i - 1} (1 + c'_{i - 1}) + g_{i}}, \\
72+
y'_i &= \frac{y_i + g_{i - 1} y'_{i - 1}}{h_i + g_{i - 1} (1 + c'_{i - 1}) + g_i}.
73+
\end{aligned}
74+
$$
75+
Let's consider what happens when the mixing coefficient, and hence $g_i$, abruptly changes.
76+
Suppose that $h_i$ is small, $g_{i - 1}$ is small, but $g_i$ is very large.
77+
Then $c'_i \approx -1$ and in the next iteration the term $g_i (1 + c'_i)$ in
78+
the denominator of both expression will multiply a very small number by a very large number.
79+
80+
Following [Appendix E in Schopf and Loughe](https://journals.ametsoc.org/view/journals/mwre/123/9/1520-0493_1995_123_2839_argiom_2_0_co_2.xml), to remedy that we can introduce
81+
$$
82+
\alpha'_i = g_i (1 + c'_i),
83+
$$
84+
which satisfies the following recursion relation
85+
$$
86+
\alpha'_i = \frac{g_i (h_i + \alpha'_{i - 1})}{h_i + \alpha'_{i - 1} + g_i}.
87+
$$
88+
The above equation together with
89+
$$
90+
\begin{aligned}
91+
c'_i &= \frac{-g_i}{h_i + \alpha'_{i - 1} + g_{i}}, \\
92+
y'_i &= \frac{y_i + g_{i - 1} y'_{i - 1}}{h_i + \alpha'_{i - 1} + g_i},
93+
\end{aligned}
94+
$$
95+
forms the modifed stable algorithm.
96+
97+
### 3.2 (Parallel) cyclic reduction
98+
99+
The Thomas algorithm is work-efficient, but inherently serial.
100+
While systems in different columns can
101+
be solved in parallel, this might not expose enough parallelism on modern GPUs.
102+
There are parallel tridiagonal algorithms that perform better on modern GPUs,
103+
see [Zhang, Cohen, and Owens](https://doi.org/10.1145/1837853.1693472).
104+
The two algorithms best suited for small systems are cyclic reduction and parallel cyclic reduction.
105+
106+
The basic idea of both cyclic reduction algorithms is as follows. Let's consider three consecutive equations corresponding to $y_{i- 1}$, $y_i$, and $y_{i+1}$
107+
$$
108+
\begin{aligned}
109+
a_{i - 1} x_{i - 2} + b_{i - 1} x_{i - 1} + c_{i - 1} x_i &= y_{i - 1}, \\
110+
a_i x_{i - 1} + b_i x_i + c_i x_{i + 1} &= y_i, \\
111+
a_{i + 1} x_{i} + b_{i + 1} x_{i + 1} + c_{i + 1} x_{i + 2} &= y_{i + 1}.
112+
\end{aligned}
113+
$$
114+
We can eliminate $x_{i - 1}$ and $x_{i + 1}$ to obtain a system of the form
115+
$$
116+
\hat{a}_i x_{i - 2} + \hat{b}_i x_i + \hat{c}_i x_{i + 2} = \hat{y}_i,
117+
$$
118+
where the modified coefficients $\hat{a}_i$, $\hat{b}_i$, $\hat{c}_i$ and the modified rhs $\hat{y}_i$ are
119+
$$
120+
\begin{aligned}
121+
\hat{a}_i &= -\frac{a_{i - 1} a_i}{b_{i - 1}}, \\
122+
\hat{b}_i &= b_i - \frac{c_{i - 1} a_i}{b_{i - 1}} - \frac{c_{i} a_{i + 1}}{b_{i + 1}}, \\
123+
\hat{c}_i &= -\frac{c_i c_{i + 1}}{b_{i + 1}}, \\
124+
\hat{y}_i &= y_i - \frac{a_i y_{i - 1}}{b_{i - 1}} - \frac{c_i y_{i + 1}}{b_{i + 1}}.
125+
\end{aligned}
126+
$$
127+
The resulting system of equations for $x_{2j}$ is still tridiagonal and has roughly half the size of the original.
128+
129+
The cyclic reduction algorithm has two phases.
130+
In the first phase, the above elimination step is iterated until the system is reduced to either one or two equations, which can then be directly solved.
131+
In each iteration the computation of modified coefficients can be done in parallel.
132+
The second phase involves finding the rest of the solution by using the final coefficients.
133+
The second phase is also iterative, where at each iteration the number of know solution values increases by a factor of two. A drawback of this algorithm is that the amount of parallel computations available at each iteration is not constant.
134+
135+
The parallel cyclic reduction is based on the same idea, but has only one phase. In the first iteration it reduces the
136+
original system to two systems of half the size. The second iteration reduces it to four systems of quarter the size
137+
and so on. In the final iteration systems of size one or two are solved to obtain the whole solution at once.
138+
In contrast to the cyclic reduction, this algorithm has constant amount of parallelism available at the cost of performing
139+
more redundant work.
140+
141+
A naive application of the cyclic reduction to the diffusion system would result
142+
in the following equation for the modified main diagonal
143+
$$
144+
\hat{b}_i = h_i
145+
+ g_{i - 1} - \frac{g_{i-1}^2}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
146+
+ g_{i + 1} - \frac{g_{i+1}^2}{h_{i + 1} + g_{i} + g_{i + 1}}.
147+
$$
148+
Using this expression can potentially result in catastrophic cancellation errors and overflows if
149+
$g_{i - 1}$ or $g_{i + 1}$ are very large.
150+
To improve its stability, this expression can be rewritten as
151+
$$
152+
\hat{b}_i = h_i
153+
+ h_{i - 1} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
154+
+ h_{i + 1} \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}}
155+
+ g_{i - 2} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
156+
+ g_i \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}}.
157+
$$
158+
This equation can be shown to be in the form
159+
$$
160+
\hat{b}_i = \hat{h}_i + \hat{g}_{i - 2} + \hat{g}_i,
161+
$$
162+
where
163+
$$
164+
\begin{aligned}
165+
\hat{h}_i &= h_i
166+
+ h_{i - 1} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}}
167+
+ h_{i + 1} \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}}, \\
168+
\hat{g}_i &= g_i \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}}.
169+
\end{aligned}
170+
$$
171+
These two equations form the basis of stable (parallel) cyclic reduction for diffusion problems.
172+
173+
174+
## 4 Design
175+
176+
TODO
177+
178+
## 5 Verification and Testing
179+
180+
### 5.1 Test correctness
181+
182+
The solver will be compared against exact solution for a variety of (batch size, system size) combinations.
183+
184+
### 5.2 Test stability
185+
186+
The solver stability will be tested on an idealized vertical mixing problem with abrupt changes in
187+
the diffusion coefficient.

components/omega/doc/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -110,6 +110,7 @@ design/TimeMgr
110110
design/Timers
111111
design/TimeStepping
112112
design/Tracers
113+
design/TridiagonalSolver
113114
114115
design/Template
115116
```

0 commit comments

Comments
 (0)