On UT representation in PaTH paper and its code implementation #567
After a lot of trial and error studying PaTH, both the paper and the code, I would like to add my 2 cents on this elegant algorithm, especially its math theory.

First, to answer why we get the alternative positions of the diagonal $D$: note that all vectors are row vectors, define […]

When reading section 3.3 of the paper, I struggled to get a consistent structure for the boundary-adjusted equations. How can we get the matrix form from its vector form? The paper suggests following the derivations in section 3.2, but that is still hand-waving to me. I realized there is no shortcut. By checking handedness and order for the operator […]

The reason is simple: the PaTH code is implemented using the block matrix form as its blueprint. That said, the masked UT transform proposed in section 3.1 is for conceptual understanding, which I still grasp.
Update: I should point out that the UT transform in the picture is only conditionally correct. It holds with $\prod_{t=0}^{L-1}H_t \equiv H_{L-1}H_{L-2}\cdots H_0$, but that convention contradicts the PaTH code and the block matrix form in sections 3.2 and 3.3.
We can derive a linear system for products of Householder-like matrices, and get a compact form for this operator as
$$I - W^TDT_o^{-1}W$$
$$T_o^{-1} = (I+\text{strictLower}(WW^TD))^{-1}$$
However, the positions of both $D$s differ from the paper and from the source code here. Any thoughts?
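A minimal NumPy sketch can sanity-check the compact form and the product order. It rests on assumptions not stated above: $H_t = I - \beta_t w_t^T w_t$ with $w_t$ the $t$-th row of $W$ (unit-normalized here), $D = \text{diag}(\beta_0, \dots, \beta_{L-1})$, and arbitrary sizes and seed.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 4                                    # arbitrary sequence length and dimension
W = rng.standard_normal((L, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # assumed: unit-norm rows w_t
beta = rng.uniform(0.0, 2.0, size=L)           # assumed: D = diag(beta)
D = np.diag(beta)

def householder_like(t):
    """Assumed form H_t = I - beta_t * w_t^T w_t, with w_t a row vector."""
    return np.eye(d) - beta[t] * np.outer(W[t], W[t])

# Descending order: H_{L-1} H_{L-2} ... H_0
P_desc = np.eye(d)
for t in range(L):
    P_desc = householder_like(t) @ P_desc

# Ascending order: H_0 H_1 ... H_{L-1}
P_asc = np.eye(d)
for t in range(L):
    P_asc = P_asc @ householder_like(t)

# Compact form I - W^T D T_o^{-1} W with T_o = I + strictLower(W W^T D)
T_o = np.eye(L) + np.tril(W @ W.T @ D, k=-1)
compact = np.eye(d) - W.T @ D @ np.linalg.inv(T_o) @ W

matches_desc = np.allclose(compact, P_desc)
matches_asc = np.allclose(compact, P_asc)
print("matches H_{L-1}...H_0:", matches_desc)
print("matches H_0...H_{L-1}:", matches_asc)
```

Under these assumptions the lower-triangular compact form should agree with the descending product only. Since each $H_t$ is symmetric, the ascending product is its transpose, which would pair with an upper-triangular $T$ instead.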
For clarity, let's go through the computation steps.
[Figure: overview]
First, compute the scaled $WW^T$. The paper gets the wrong position for the inner $D$; I believe it is a typo, and going by chunk_scaled_dot_kkt it should be reordered as $\text{strictLower}(DWW^T)$.
Then, we go to solve_tril to compute
$$T_f^{-1} = (I+\text{strictLower}(DWW^T))^{-1}$$
Finally, we get the complete UT representation in intra_chunk_preprocess_fwd:
$$I - W^TT_f^{-1}DW$$
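The three steps above can be mirrored in plain NumPy as a sanity check. This is a sketch and not the FLA kernels: there is no chunking, `unit_lower_inverse` is a hypothetical helper standing in for the triangular inverse, and $D = \text{diag}(\beta)$ with $H_t = I - \beta_t w_t^T w_t$ (unit-norm rows $w_t$) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 6, 4                                    # arbitrary chunk length and dimension
W = rng.standard_normal((L, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)  # assumed: unit-norm rows w_t
beta = rng.uniform(0.0, 2.0, size=L)           # assumed: D = diag(beta)

# Step 1: strictLower(D W W^T); scaling rows by beta is left-multiplication by D.
A = np.tril(beta[:, None] * (W @ W.T), k=-1)

# Step 2: T_f^{-1} = (I + A)^{-1} by forward substitution, row by row.
def unit_lower_inverse(A):
    n = A.shape[0]
    T_inv = np.eye(n)
    for i in range(1, n):
        # Row i of (I + A) X = I gives X[i, :i] = -A[i, :i] @ X[:i, :i].
        T_inv[i, :i] = -A[i, :i] @ T_inv[:i, :i]
    return T_inv

T_f_inv = unit_lower_inverse(A)

# Step 3: the UT form I - W^T T_f^{-1} D W.
ut_operator = np.eye(d) - W.T @ T_f_inv @ (beta[:, None] * W)

# Reference: explicit product H_{L-1} ... H_0 with H_t = I - beta_t w_t^T w_t.
reference = np.eye(d)
for t in range(L):
    reference = (np.eye(d) - beta[t] * np.outer(W[t], W[t])) @ reference

print(np.allclose(T_f_inv, np.linalg.inv(np.eye(L) + A)))
print(np.allclose(ut_operator, reference))
```

Forward substitution is used here because $I + A$ is unit lower triangular, so each row of the inverse depends only on rows already computed.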
It turns out
$$T_f^{-1}D = DT_o^{-1}$$
$$(I+\text{strictLower}(DWW^T))^{-1}D = D(I+\text{strictLower}(WW^TD))^{-1}$$
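A short sketch of why this holds, assuming $D$ is diagonal and writing $A = \text{strictLower}(WW^T)$ (a name introduced here only for convenience): scaling rows or columns by a diagonal matrix does not move entries across the diagonal, so $\text{strictLower}(DWW^T) = DA$ and $\text{strictLower}(WW^TD) = AD$. Then
$$T_fD = (I + DA)D = D + DAD = D(I + AD) = DT_o$$
Both $T_f$ and $T_o$ are unit lower triangular, hence invertible, so multiplying by $T_f^{-1}$ on the left and by $T_o^{-1}$ on the right gives
$$DT_o^{-1} = T_f^{-1}D$$
Note that this does not require $D$ itself to be invertible.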
So why does FLA choose the other way to compute $T^{-1}$?