|
| 1 | +# Tridiagonal Solver |
| 2 | + |
| 3 | +<!--- use table of contents if desired for longer documents --> |
| 4 | +**Table of Contents** |
| 5 | +1. [Overview](#1-overview) |
| 6 | +2. [Requirements](#2-requirements) |
| 7 | +3. [Algorithmic Formulation](#3-algorithmic-formulation) |
| 8 | +4. [Design](#4-design) |
| 9 | +5. [Verification and Testing](#5-verification-and-testing) |
| 10 | + |
| 11 | +## 1 Overview |
| 12 | + |
| 13 | +In geophysical models, implicit time integration of vertical terms often requires solution to tridiagonal systems of equations. |
| 14 | +Typically, a separate system needs to be solved in each vertical column, which requires an efficient batched tridiagonal solver. |
| 15 | +One common situation where a tridiagonal system arises is implicit treatment of vertical diffusion/mixing. |
| 16 | +In principle, this problem results in a symmetric diagonally-dominant tridiagonal system. |
| 17 | +However, in ocean modeling, the coefficients of vertical mixing can vary by orders of magnitude and |
| 18 | +can become very large in some layers. |
| 19 | +This requires specialized algorithms that can handle this situation stably. |
| 20 | + |
| 21 | +## 2 Requirements |
| 22 | + |
| 23 | +### 2.1 Requirement: Modularity |
| 24 | + |
| 25 | +There should be one implementation of batched tridiagonal solver that can be called every time a solution to such a system is needed. |
| 26 | + |
| 27 | +### 2.2 Requirement: Stability for vertical mixing problems |
| 28 | + |
| 29 | +The solver must be stable for vertical mixing problems with large variations in mixing coefficients. |
| 30 | + |
| 31 | +### 2.3 Requirement: Bottom boundary conditions |
| 32 | + |
| 33 | +When applied to the implicit vertical mixing of momentum, the solver should be |
| 34 | +sufficiently general to be able to incorporate various bottom drag formulations. |
| 35 | + |
| 36 | +### 2.4 Requirement: Performance |
| 37 | + |
| 38 | +The solver must be performant on CPU and GPU architectures. |
| 39 | + |
| 40 | +### 2.5 Desired: Ability to fuse kernels |
| 41 | + |
| 42 | +Implicit vertical mixing will require some pre-processing (e.g. setup of the system) and post-processing work. |
| 43 | +It is desirable to handle all of that in one computational kernel. This requires the ability to call the |
| 44 | +solver inside a `parallelFor`. |
| 45 | + |
| 46 | +## 3 Algorithmic Formulation |
| 47 | +A general tridiagonal system has the form: |
| 48 | +$$ |
| 49 | +a_i x_{i - 1} + b_i x_i + c_i x_{i + 1} = y_i. |
| 50 | +$$ |
| 51 | +The standard second-order discretization of a diffusion problem leads to a system of the form |
| 52 | +$$ |
| 53 | +-g_{i - 1} x_{i - 1} + (g_{i - 1} + g_i + h_i) x_i - g_{i} x_{i + 1} = y_i, |
| 54 | +$$ |
| 55 | +where $g_i$, $h_i$ are positive and $g_i$ is proportional to the mixing coefficient |
| 56 | +and can be much greater than $h_i$, which is the layer thickness. |
| 57 | + |
| 58 | + |
| 59 | +### 3.1 Thomas algorithm |
| 60 | +The Thomas algorithm is a simplified form of Gaussian elimination for tridiagonal systems of equations. |
| 61 | +In its typical implementation, the forward elimination phase proceeds as follows |
| 62 | +$$ |
| 63 | +\begin{aligned} |
| 64 | +c'_i &= \frac{c_i}{b_i - a_i c'_{i - 1}}, \\ |
| 65 | +y'_i &= \frac{y_i - a_i y'_{i - 1}}{b_i - a_i c'_{i - 1}}. |
| 66 | +\end{aligned} |
| 67 | +$$ |
| 68 | +When applied to the diffusion system this leads to |
| 69 | +$$ |
| 70 | +\begin{aligned} |
| 71 | +c'_i &= \frac{-g_i}{h_i + g_{i - 1} (1 + c'_{i - 1}) + g_{i}}, \\ |
| 72 | +y'_i &= \frac{y_i + g_{i - 1} y'_{i - 1}}{h_i + g_{i - 1} (1 + c'_{i - 1}) + g_i}. |
| 73 | +\end{aligned} |
| 74 | +$$ |
| 75 | +Let's consider what happens when the mixing coefficient, and hence $g_i$, abruptly changes. |
| 76 | +Suppose that $h_i$ is small, $g_{i - 1}$ is small, but $g_i$ is very large. |
| 77 | +Then $c'_i \approx -1$ and in the next iteration the term $g_i (1 + c'_i)$ in |
| 78 | +the denominator of both expression will multiply a very small number by a very large number. |
| 79 | + |
| 80 | +Following [Appendix E in Schopf and Loughe](https://journals.ametsoc.org/view/journals/mwre/123/9/1520-0493_1995_123_2839_argiom_2_0_co_2.xml), to remedy that we can introduce |
| 81 | +$$ |
| 82 | +\alpha'_i = g_i (1 + c'_i), |
| 83 | +$$ |
| 84 | +which satisfies the following recursion relation |
| 85 | +$$ |
| 86 | +\alpha'_i = \frac{g_i (h_i + \alpha'_{i - 1})}{h_i + \alpha'_{i - 1} + g_i}. |
| 87 | +$$ |
| 88 | +The above equation together with |
| 89 | +$$ |
| 90 | +\begin{aligned} |
| 91 | +c'_i &= \frac{-g_i}{h_i + \alpha'_{i - 1} + g_{i}}, \\ |
| 92 | +y'_i &= \frac{y_i + g_{i - 1} y'_{i - 1}}{h_i + \alpha'_{i - 1} + g_i}, |
| 93 | +\end{aligned} |
| 94 | +$$ |
| 95 | +forms the modifed stable algorithm. |
| 96 | + |
| 97 | +### 3.2 (Parallel) cyclic reduction |
| 98 | + |
| 99 | +The Thomas algorithm is work-efficient, but inherently serial. |
| 100 | +While systems in different columns can |
| 101 | +be solved in parallel, this might not expose enough parallelism on modern GPUs. |
| 102 | +There are parallel tridiagonal algorithms that perform better on modern GPUs, |
| 103 | +see [Zhang, Cohen, and Owens](https://doi.org/10.1145/1837853.1693472). |
| 104 | +The two algorithms best suited for small systems are cyclic reduction and parallel cyclic reduction. |
| 105 | + |
| 106 | +The basic idea of both cyclic reduction algorithms is as follows. Let's consider three consecutive equations corresponding to $y_{i- 1}$, $y_i$, and $y_{i+1}$ |
| 107 | +$$ |
| 108 | +\begin{aligned} |
| 109 | +a_{i - 1} x_{i - 2} + b_{i - 1} x_{i - 1} + c_{i - 1} x_i &= y_{i - 1}, \\ |
| 110 | +a_i x_{i - 1} + b_i x_i + c_i x_{i + 1} &= y_i, \\ |
| 111 | +a_{i + 1} x_{i} + b_{i + 1} x_{i + 1} + c_{i + 1} x_{i + 2} &= y_{i + 1}. |
| 112 | +\end{aligned} |
| 113 | +$$ |
| 114 | +We can eliminate $x_{i - 1}$ and $x_{i + 1}$ to obtain a system of the form |
| 115 | +$$ |
| 116 | +\hat{a}_i x_{i - 2} + \hat{b}_i x_i + \hat{c}_i x_{i + 2} = \hat{y}_i, |
| 117 | +$$ |
| 118 | +where the modified coefficients $\hat{a}_i$, $\hat{b}_i$, $\hat{c}_i$ and the modified rhs $\hat{y}_i$ are |
| 119 | +$$ |
| 120 | +\begin{aligned} |
| 121 | +\hat{a}_i &= -\frac{a_{i - 1} a_i}{b_{i - 1}}, \\ |
| 122 | +\hat{b}_i &= b_i - \frac{c_{i - 1} a_i}{b_{i - 1}} - \frac{c_{i} a_{i + 1}}{b_{i + 1}}, \\ |
| 123 | +\hat{c}_i &= -\frac{c_i c_{i + 1}}{b_{i + 1}}, \\ |
| 124 | +\hat{y}_i &= y_i - \frac{a_i y_{i - 1}}{b_{i - 1}} - \frac{c_i y_{i + 1}}{b_{i + 1}}. |
| 125 | +\end{aligned} |
| 126 | +$$ |
| 127 | +The resulting system of equations for $x_{2j}$ is still tridiagonal and has roughly half the size of the original. |
| 128 | + |
| 129 | +The cyclic reduction algorithm has two phases. |
| 130 | +In the first phase, the above elimination step is iterated until the system is reduced to either one or two equations, which can then be directly solved. |
| 131 | +In each iteration the computation of modified coefficients can be done in parallel. |
| 132 | +The second phase involves finding the rest of the solution by using the final coefficients. |
| 133 | +The second phase is also iterative, where at each iteration the number of know solution values increases by a factor of two. A drawback of this algorithm is that the amount of parallel computations available at each iteration is not constant. |
| 134 | + |
| 135 | +The parallel cyclic reduction is based on the same idea, but has only one phase. In the first iteration it reduces the |
| 136 | +original system to two systems of half the size. The second iteration reduces it to four systems of quarter the size |
| 137 | +and so on. In the final iteration systems of size one or two are solved to obtain the whole solution at once. |
| 138 | +In contrast to the cyclic reduction, this algorithm has constant amount of parallelism available at the cost of performing |
| 139 | +more redundant work. |
| 140 | + |
| 141 | +A naive application of the cyclic reduction to the diffusion system would result |
| 142 | +in the following equation for the modified main diagonal |
| 143 | +$$ |
| 144 | +\hat{b}_i = h_i |
| 145 | + + g_{i - 1} - \frac{g_{i-1}^2}{h_{i - 1} + g_{i - 2} + g_{i - 1}} |
| 146 | + + g_{i + 1} - \frac{g_{i+1}^2}{h_{i + 1} + g_{i} + g_{i + 1}}. |
| 147 | +$$ |
| 148 | +Using this expression can potentially result in catastrophic cancellation errors and overflows if |
| 149 | +$g_{i - 1}$ or $g_{i + 1}$ are very large. |
| 150 | +To improve its stability, this expression can be rewritten as |
| 151 | +$$ |
| 152 | +\hat{b}_i = h_i |
| 153 | + + h_{i - 1} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}} |
| 154 | + + h_{i + 1} \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}} |
| 155 | + + g_{i - 2} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}} |
| 156 | + + g_i \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}}. |
| 157 | +$$ |
| 158 | +This equation can be shown to be in the form |
| 159 | +$$ |
| 160 | +\hat{b}_i = \hat{h}_i + \hat{g}_{i - 2} + \hat{g}_i, |
| 161 | +$$ |
| 162 | +where |
| 163 | +$$ |
| 164 | +\begin{aligned} |
| 165 | +\hat{h}_i &= h_i |
| 166 | + + h_{i - 1} \frac{g_{i - 1}}{h_{i - 1} + g_{i - 2} + g_{i - 1}} |
| 167 | + + h_{i + 1} \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}}, \\ |
| 168 | +\hat{g}_i &= g_i \frac{g_{i + 1}}{h_{i + 1} + g_i + g_{i + 1}}. |
| 169 | +\end{aligned} |
| 170 | +$$ |
| 171 | +These two equations form the basis of stable (parallel) cyclic reduction for diffusion problems. |
| 172 | + |
| 173 | + |
| 174 | +## 4 Design |
| 175 | + |
| 176 | +TODO |
| 177 | + |
| 178 | +## 5 Verification and Testing |
| 179 | + |
| 180 | +### 5.1 Test correctness |
| 181 | + |
| 182 | +The solver will be compared against exact solution for a variety of (batch size, system size) combinations. |
| 183 | + |
| 184 | +### 5.2 Test stability |
| 185 | + |
| 186 | +The solver stability will be tested on an idealized vertical mixing problem with abrupt changes in |
| 187 | +the diffusion coefficient. |
0 commit comments