On UT representation in PaTH paper and its code implementation #567

SeepingFragranceLock · 2025-08-15T17:05:37Z


Update: I should point out that the UT transform in the picture is correct only under the convention $\prod_{t=0}^{L-1}H_t \equiv H_{L-1}H_{L-2}\cdots H_0$, but that convention contradicts the PaTH code and the block matrix form in sections 3.2 and 3.3.

We can set up a linear system for the product of Householder-like matrices and obtain a compact form for this operator:
$$I - W^TDT_o^{-1}W$$
$$T_o^{-1} = (I+\text{strictLower}(WW^TD))^{-1}$$

However, the positions of both $D$ factors differ from those in the paper and in the source code here. Any thoughts?

For clarity, let's go through the compute steps.
[Figure: overview]

First, compute the scaled $WW^T$.
The paper has the inner $D$ in the wrong position, which I believe is a typo; following chunk_scaled_dot_kkt, it should read $\text{strictLower}(DWW^T)$.

Then solve_tril computes
$$T_f^{-1} = (I+\text{strictLower}(DWW^T))^{-1}$$

Finally, the complete UT representation is assembled in intra_chunk_preprocess_fwd:
$$I - W^TT_f^{-1}DW$$

It turns out that the two forms agree, because
$$T_f^{-1}D = DT_o^{-1}$$
$$(I+\text{strictLower}(DWW^T))^{-1}D = D(I+\text{strictLower}(WW^TD))^{-1}$$
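The identity follows from $D$ being diagonal: $\text{strictLower}(DWW^T)\,D = D\,\text{strictLower}(WW^TD)$, hence $T_fD = DT_o$. A minimal NumPy sketch to check it numerically, together with both compact forms against the explicit product $H_{L-1}\cdots H_0$ with $H_t = I - \beta_t w_t w_t^\top$ (column-vector convention). The sizes, the unit-norm rows of `W`, and the range of `beta` are assumptions made only for this check; none of this is taken from the FLA code.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4                                     # chunk length, head dim (arbitrary)
W = rng.standard_normal((L, d))                 # row t is w_t
W /= np.linalg.norm(W, axis=1, keepdims=True)   # assumption: unit-norm w_t
beta = rng.uniform(0.1, 1.9, L)                 # assumption: beta_t in (0, 2)
D = np.diag(beta)
I_L, I_d = np.eye(L), np.eye(d)

# Reference: P = H_{L-1} ... H_1 H_0, built by left-multiplying each new H_t.
P_ref = I_d.copy()
for t in range(L):
    H_t = I_d - beta[t] * np.outer(W[t], W[t])
    P_ref = H_t @ P_ref

T_o = I_L + np.tril(W @ W.T @ D, k=-1)          # I + strictLower(W W^T D)
T_f = I_L + np.tril(D @ W @ W.T, k=-1)          # I + strictLower(D W W^T)

P_o = I_d - W.T @ D @ np.linalg.solve(T_o, W)   # I - W^T D T_o^{-1} W
P_f = I_d - W.T @ np.linalg.solve(T_f, D @ W)   # I - W^T T_f^{-1} D W

lhs = np.linalg.solve(T_f, D)                   # T_f^{-1} D
rhs = D @ np.linalg.inv(T_o)                    # D T_o^{-1}

print(np.allclose(P_ref, P_o), np.allclose(P_ref, P_f), np.allclose(lhs, rhs))
```

Both forms cost the same mathematically, so this check only shows they are interchangeable; it does not explain the choice made in the code.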

So why does FLA choose the other way to compute $T^{-1}$?

SeepingFragranceLock · 2025-09-02T07:51:10Z


Having gone through a lot of trial and error while studying PaTH, both the paper and the code, I would like to add my two cents on this elegant algorithm, especially on its math.

First, let me answer why we get alternative positions for the diagonal $D$.
It simply arises from the way we set up the equations of the linear system. If we divide both sides by $\beta_t$, we get the form used in the paper.

$$\displaylines{\beta_t (v_tu_t^\top) + \beta_t \sum_{j=0}^{t-1} \beta_j v_j u_j^\top(u_j u_t^\top) = \beta_t (xu_t^\top)\\\ \mathbf{P} = \mathbf{I} - \mathbf{W}^\top \left(\mathbf{I} + \text{triu}(\mathbf{D} \mathbf{W}\mathbf{W}^\top, 1)\right)^{-1} \mathbf{D} \mathbf{W}}$$

Note that all vectors here are row vectors. Define $H_t = I - \beta_t u^\top_t u_t$ and $xP = xH_0H_1 \cdots H_{L-1}$, which yields an upper-triangular system.
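A small NumPy sketch of this row-vector setup, assuming $v_t$ in the equation above denotes the running row vector $xH_0\cdots H_{t-1}$ (my reading, not stated explicitly). Solving the system as written (scaled by $\beta_t$) puts $D$ to the left of the triangular inverse, while dividing by $\beta_t$ puts it to the right; both should reproduce the sequential product. The sizes, unit-norm `U`, and the `beta` range are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 5, 3
U = rng.standard_normal((L, d))                 # row t is u_t
U /= np.linalg.norm(U, axis=1, keepdims=True)   # assumption: unit-norm u_t
beta = rng.uniform(0.1, 1.9, L)
D = np.diag(beta)
x = rng.standard_normal(d)                      # a row vector

# Sequential reference: x H_0 H_1 ... H_{L-1}, with H_t = I - beta_t u_t^T u_t.
v = x.copy()
for t in range(L):
    v = v - beta[t] * (v @ U[t]) * U[t]

G = U @ U.T                                     # Gram matrix W W^T
I_L, I_d = np.eye(L), np.eye(d)

# System divided by beta_t: P = I - W^T (I + triu(D W W^T, 1))^{-1} D W
P_paper = I_d - U.T @ np.linalg.solve(I_L + np.triu(D @ G, k=1), D @ U)

# System kept scaled by beta_t: P = I - W^T D (I + triu(W W^T D, 1))^{-1} W
P_scaled = I_d - U.T @ D @ np.linalg.solve(I_L + np.triu(G @ D, k=1), U)

print(np.allclose(x @ P_paper, v), np.allclose(x @ P_scaled, v))
```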

When reading section 3.3 of the paper, I struggled to get a consistent structure for the boundary-adjusted equations. How do we get the matrix form from the vector form? The paper suggests following the derivation in section 3.2, but that still felt hand-wavy to me. I realized there is no shortcut. By checking the handedness and ordering of the operator $P$, I found:

$$\begin{array}{rcl} (\overleftarrow{\mathbf{Q}}_{[i]})_t &:=& \left( \prod_{m=iB+1}^{iB+t} \mathbf{H}_m \right) \mathbf{q}_{iB+t} \ \ \text{(UpperTriangle)} \\ &\neq& \mathbf{q}_{iB+t} - \mathbf{W}_{[i]}^\top \mathbf{T}_{[i]}^{-1} (\mathbf{W}_{[i]} \odot \mathbf{M}_t^R) \mathbf{q}_{iB+t}\ \ \text{(Break causality)} \\\ \overleftarrow{\mathbf{Q}}_{[i]} &=& \mathbf{Q}_{[i]} - \text{lower}(\mathbf{Q}_{[i]} \mathbf{W}_{[i]}^\top) \mathbf{T}_{[i]}^{-1} \mathbf{W}_{[i]}\\\ \newline (\overrightarrow{\mathbf{K}}_{[i]})_s &:=& \left( \left( \prod_{m=iB+s+1}^{(i+1)B} \mathbf{H}_m \right)^\top \mathbf{k}_{iB+s} \right)\ \ \text{(LowerTriangle)} \\ &\neq& \mathbf{k}_{iB+s} - (\mathbf{T}_{[i]}^{-1} \mathbf{W}_{[i]})^\top (\mathbf{W}_{[i]} \odot \mathbf{M}_s^L) \mathbf{k}_{iB+s}\ \ \text{(Break causality)} \\\ \overrightarrow{\mathbf{K}}_{[i]} &=& \mathbf{K}_{[i]} - \left( \mathbf{T}_{[i]}^{-1} \text{strictLower}(\mathbf{W}_{[i]} \mathbf{K}_{[i]}^\top) \right)^\top \mathbf{W}_{[i]} \end{array}$$

The reason is simple: $T$ is triangular. We cannot simply change a system $xP$ into $Px^\top$ without changing its causality; $P$ must be transposed at the same time! This is clearer with the pairwise attention score in section 3.2. To preserve the inner product, the UT form must be transposed when going from the vector form to the matrix form, and vice versa.

$$\begin{array}{rcl} \tilde{A}_{ij} &:=& \mathbf{k}_j^\top \left( \prod_{t=j+1}^i \mathbf{H}_t \right) \mathbf{q}_i \ \ (\text{UpperTriangle}) \\ &\neq& \mathbf{k}_j^\top \mathbf{q}_i - \mathbf{k}_j^\top (\mathbf{W} \odot \mathbf{M}_{j+1}^L)^\top \mathbf{T}^{-1} (\mathbf{W} \odot \mathbf{M}_i^R) \mathbf{q}_i \ \ \text{(break causality)} \\ \tilde{\mathbf{A}} &=& \text{lower}(\mathbf{Q}\mathbf{K}^\top) - \text{lower}(\mathbf{Q}\mathbf{W}^\top) \mathbf{T}^{-1} \text{strictLower}(\mathbf{W}\mathbf{K}^\top)\ \ \text{(LowerTriangle)} \end{array}$$
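The matrix form of the score can be checked numerically against the element-wise definition. The sketch below assumes that the product is read in ascending order, $\prod_{t=j+1}^{i}H_t = H_{j+1}\cdots H_i$, and that $T^{-1}$ absorbs $D$ as $(I+\text{strictLower}(DWW^\top))^{-1}D$; both are my reading of the convention, and the sizes and random inputs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 6, 4
Q = rng.standard_normal((L, d))                 # row i is q_i
K = rng.standard_normal((L, d))                 # row j is k_j
W = rng.standard_normal((L, d))                 # row t is w_t
W /= np.linalg.norm(W, axis=1, keepdims=True)   # assumption: unit-norm w_t
beta = rng.uniform(0.1, 1.9, L)
D = np.diag(beta)
H = [np.eye(d) - beta[t] * np.outer(W[t], W[t]) for t in range(L)]

# Element-wise reference: A_ij = k_j^T (H_{j+1} ... H_i) q_i for i >= j.
A_ref = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        M = np.eye(d)
        for t in range(j + 1, i + 1):
            M = M @ H[t]                        # append H_t on the right
        A_ref[i, j] = K[j] @ M @ Q[i]

# Assumed definition: T^{-1} = (I + strictLower(D W W^T))^{-1} D
T_inv = np.linalg.solve(np.eye(L) + np.tril(D @ W @ W.T, k=-1), D)

# Matrix form: lower(Q K^T) - lower(Q W^T) T^{-1} strictLower(W K^T)
A_mat = np.tril(Q @ K.T) - np.tril(Q @ W.T) @ T_inv @ np.tril(W @ K.T, k=-1)

print(np.allclose(A_ref, A_mat))
```

The two triangular masks, inclusive on the query side and strict on the key side, are what restrict the sum to indices $j < r \le s \le i$, so no per-entry masks $\mathbf{M}^L$, $\mathbf{M}^R$ are needed in the matrix form.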

The PaTH code is implemented using the block matrix form as its blueprint. That said, the masked UT transform proposed in section 3.1 is meant for conceptual understanding, which I am still trying to grasp.
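As a sanity check on the boundary-adjusted block forms for $\overleftarrow{\mathbf{Q}}$ and $\overrightarrow{\mathbf{K}}$ given earlier, here is a NumPy sketch for a single chunk with 0-based indices. It assumes $T^{-1} = (I+\text{strictLower}(DWW^\top))^{-1}D$ and ascending product order; it is an illustration of the formulas, not the FLA kernel.

```python
import numpy as np

rng = np.random.default_rng(3)
B, d = 5, 4                                     # chunk size, head dim (arbitrary)
Q = rng.standard_normal((B, d))
K = rng.standard_normal((B, d))
W = rng.standard_normal((B, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # assumption: unit-norm w_t
beta = rng.uniform(0.1, 1.9, B)
D = np.diag(beta)
H = [np.eye(d) - beta[t] * np.outer(W[t], W[t]) for t in range(B)]

# Reference for Q: (H_0 H_1 ... H_t) q_t, from chunk start up to t inclusive.
Q_ref = np.zeros_like(Q)
for t in range(B):
    vec = Q[t]
    for m in range(t, -1, -1):                  # apply H_t first, H_0 last
        vec = H[m] @ vec
    Q_ref[t] = vec

# Reference for K: (H_{s+1} ... H_{B-1})^T k_s, from s+1 up to chunk end.
K_ref = np.zeros_like(K)
for s in range(B):
    vec = K[s]
    for m in range(s + 1, B):                   # apply H_{s+1} first, H_{B-1} last
        vec = H[m] @ vec
    K_ref[s] = vec

T_inv = np.linalg.solve(np.eye(B) + np.tril(D @ W @ W.T, k=-1), D)

Q_mat = Q - np.tril(Q @ W.T) @ T_inv @ W
K_mat = K - (T_inv @ np.tril(W @ K.T, k=-1)).T @ W

print(np.allclose(Q_ref, Q_mat), np.allclose(K_ref, K_mat))
```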
