.. _maths_cox_datafit:

==================================
Mathematics behind the Cox datafit
==================================

This tutorial presents the mathematics behind the Cox datafit, using both the Breslow and the Efron estimates.


Problem setup
=============

Let's consider a typical survival analysis setup with

- :math:`\mathbf{X} \in \mathbb{R}^{n \times p}` a matrix of :math:`p` predictors and :math:`n` samples :math:`x_i \in \mathbb{R}^p`,
- :math:`y \in \mathbb{R}^n` a vector recording the times of event occurrence,
- :math:`s \in \{ 0, 1 \}^n` a binary vector (observation censorship) where :math:`1` means *event occurred*,

where we are interested in estimating the coefficients :math:`\beta \in \mathbb{R}^p` of the linear combination of predictors.
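
For concreteness, here is a small NumPy sketch of such data; the shapes and the random generation are purely illustrative assumptions.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5

    X = rng.standard_normal((n, p))   # matrix of predictors
    y = rng.exponential(size=n)       # times of event occurrence
    s = rng.integers(0, 2, size=n)    # 1 = event occurred, 0 = censored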

We will start by establishing the Cox datafit using the Breslow estimate, then extend it to the case of tied observations through the Efron estimate.


Breslow estimate
================

Datafit expression
------------------

To get the expression of the Cox datafit, we refer to the expression of the negative log-likelihood according to the Breslow estimate [`1`_, Section 2]

.. math::
    :label: breslow-estimate

    l(\beta) = \frac{1}{n} \sum_{i=1}^{n} -s_i \langle x_i, \beta \rangle + s_i \log(\sum_{y_j \geq y_i} e^{\langle x_j, \beta \rangle})
    .

To get a compact expression, we introduce the matrix :math:`\mathbf{B} \in \mathbb{R}^{n \times n}` defined as :math:`\mathbf{B}_{i, j} = \mathbb{1}_{y_j \geq y_i}`, that is, :math:`1` if :math:`y_j \geq y_i` and :math:`0` otherwise.
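
As an illustration, :math:`\mathbf{B}` can be formed with a single broadcast comparison; this is a minimal NumPy sketch, assuming ``y`` is the array of observed times from the setup above.

.. code-block:: python

    import numpy as np

    # B[i, j] = 1 if y[j] >= y[i] and 0 otherwise:
    # row i flags the samples still "at risk" at time y[i]
    B = (y[None, :] >= y[:, None]).astype(float)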

We notice that the first term in :eq:`breslow-estimate` can be rewritten as

.. math::

    \sum_{i=1}^{n} -s_i \langle x_i, \beta \rangle = -\langle s, \mathbf{X}\beta \rangle
    ,

whereas the second term reads

.. math::

    \sum_{i=1}^n s_i \log(\sum_{y_j \geq y_i} e^{\langle x_j, \beta \rangle}) = \sum_{i=1}^n s_i \log(\sum_{j=1}^n \mathbb{1}_{y_j \geq y_i} e^{\langle x_j, \beta \rangle}) = \langle s, \log(\mathbf{B}e^{\mathbf{X}\beta}) \rangle
    ,

where the :math:`\log` is applied element-wise. Therefore, we deduce the expression of the Cox datafit

.. math::
    :label: vectorized-cox-breslow

    l(\beta) = -\frac{1}{n} \langle s, \mathbf{X}\beta \rangle + \frac{1}{n} \langle s, \log(\mathbf{B}e^{\mathbf{X}\beta}) \rangle
    .

We observe from this vectorized reformulation that the Cox datafit depends on :math:`\beta` only through :math:`\mathbf{X}\beta`. Hence, we can focus on the function

.. math::
    :label: simple-function

    F(u) = -\frac{1}{n} \langle s, u \rangle + \frac{1}{n} \langle s, \log(\mathbf{B}e^u) \rangle
    ,

to derive the gradient and Hessian.
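
Equation :eq:`simple-function` translates directly into NumPy; the following is a naive sketch meant for illustration rather than an optimized implementation, with ``s`` and ``B`` as defined above.

.. code-block:: python

    import numpy as np

    def F(u, s, B):
        """Evaluate the Breslow Cox datafit at u = X @ beta."""
        n = len(u)
        return (-s @ u + s @ np.log(B @ np.exp(u))) / n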


Gradient and Hessian
--------------------

We apply the chain rule to derive the gradient and Hessian of the function :math:`F`.
For :math:`u \in \mathbb{R}^n`, the gradient is

.. math::
    :label: raw-gradient

    \nabla F(u) = -\frac{1}{n} s + \frac{1}{n} [\text{diag}(e^u)\mathbf{B}^\top] \frac{s}{\mathbf{B}e^u}
    ,

and the Hessian reads

.. math::
    :label: raw-hessian

    \nabla^2 F(u) = \frac{1}{n} \text{diag}(e^u) \text{diag}(\mathbf{B}^\top \frac{s}{\mathbf{B}e^u}) - \frac{1}{n} \text{diag}(e^u) \mathbf{B}^\top \text{diag}(\frac{s}{(\mathbf{B}e^u)^2})\mathbf{B}\text{diag}(e^u)
    ,

where the division and the square operations are performed element-wise.
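
In the same naive style as before, :eq:`raw-gradient` can be sketched as follows, with multiplication by :math:`\text{diag}(e^u)` implemented as an element-wise product.

.. code-block:: python

    import numpy as np

    def grad_F(u, s, B):
        """Evaluate the gradient of the Breslow Cox datafit at u = X @ beta."""
        n = len(u)
        exp_u = np.exp(u)
        return (-s + exp_u * (B.T @ (s / (B @ exp_u)))) / n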

The Hessian, as it is, is costly to evaluate because of its right-hand side term.
In particular, the latter involves :math:`\mathcal{O}(n^3)` operations. We overcome this limitation by using a diagonal upper bound on the Hessian.

We construct such an upper bound by noticing that

#. the function :math:`F` is convex, and hence :math:`\nabla^2 F(u)` is positive semi-definite,
#. the second term is positive semi-definite.

Therefore, we have

.. math::
    :label: diagonal-upper-bound

    \nabla^2 F(u) \preceq \frac{1}{n} \text{diag}(e^u) \text{diag}(\mathbf{B}^\top \frac{s}{\mathbf{B}e^u})
    ,

where the inequality is the Loewner order on positive semi-definite matrices.

.. note::

    Using a diagonal Hessian reduces the cost of evaluating the Hessian to :math:`\mathcal{O}(n)` instead of :math:`\mathcal{O}(n^3)`.
    It also reduces the cost of Hessian-vector products to :math:`\mathcal{O}(n)` instead of :math:`\mathcal{O}(n^2)`.
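
A sketch of the diagonal bound :eq:`diagonal-upper-bound`, in the same spirit as the snippets above; only the :math:`n` diagonal entries are computed and stored.

.. code-block:: python

    import numpy as np

    def diag_hessian_bound(u, s, B):
        """Diagonal entries of the upper bound on the Hessian at u = X @ beta."""
        n = len(u)
        exp_u = np.exp(u)
        return exp_u * (B.T @ (s / (B @ exp_u))) / n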


Efron estimate
==============

Datafit expression
------------------

The Efron estimate refines the Breslow estimate by handling tied observations (observations with identical occurrence times).
Denoting :math:`y_{i_1}, \ldots, y_{i_m}` the unique occurrence times, assumed to be :math:`m` in total, we define

.. math::
    :label: def-H

    H_{i_l} = \{ i \ | \ s_i = 1 \ ; \ y_i = y_{i_l} \}
    ,

the set of uncensored observations sharing the same time :math:`y_{i_l}`.

Again, we refer to the expression of the negative log-likelihood according to the Efron estimate [`2`_, Section 6, equation (6.7)] to get the datafit formula

.. math::
    :label: efron-estimate

    l(\beta) = \frac{1}{n} \sum_{l=1}^{m} \Big(
    \sum_{i \in H_{i_l}} - \langle x_i, \beta \rangle
    + \sum_{i \in H_{i_l}} \log \big( \sum_{y_j \geq y_{i_l}} e^{\langle x_j, \beta \rangle} - \frac{\#(i) - 1}{| H_{i_l} |} \sum_{j \in H_{i_l}} e^{\langle x_j, \beta \rangle} \big) \Big)
    ,

where :math:`| H_{i_l} |` stands for the cardinality of :math:`H_{i_l}`, and :math:`\#(i)` for the index of observation :math:`i` within :math:`H_{i_l}`.

Ideally, we would like to rewrite this expression in the same form as :eq:`vectorized-cox-breslow` to leverage the results already established for the gradient and Hessian. A closer look reveals that what distinguishes the two expressions is the double sum and the second term inside the :math:`\log`.

First, we can observe that :math:`\cup_{l=1}^{m} H_{i_l} = \{ i \ | \ s_i = 1 \}`, which enables fusing the two sums; for instance,

.. math::

    \sum_{l=1}^{m}\sum_{i \in H_{i_l}} - \langle x_i, \beta \rangle = \sum_{i: s_i = 1} - \langle x_i, \beta \rangle = \sum_{i=1}^n -s_i \langle x_i, \beta \rangle = -\langle s, \mathbf{X}\beta \rangle
    .

On the other hand, the subtracted term within the :math:`\log` can be rewritten as a linear term in :math:`e^{\mathbf{X}\beta}`

.. math::

    - \frac{\#(i) - 1}{| H_{i_l} |}\sum_{j \in H_{i_l}} e^{\langle x_j, \beta \rangle}
    = - \sum_{j=1}^{n} \frac{\#(i) - 1}{| H_{i_l} |} \ \mathbb{1}_{j \in H_{i_l}} \ e^{\langle x_j, \beta \rangle}
    = - \sum_{j=1}^n a_{i,j} e^{\langle x_j, \beta \rangle}
    = - \langle a_i, e^{\mathbf{X}\beta} \rangle
    ,

where :math:`a_i` is a vector in :math:`\mathbb{R}^n` with nonnegative entries :math:`a_{i,j} = \frac{\#(i) - 1}{| H_{i_l} |} \mathbb{1}_{j \in H_{i_l}}`, chosen to perform the linear operation.

By defining the matrix :math:`\mathbf{A}` with rows :math:`(a_i)_{i \in [n]}`, we deduce the final expression

.. math::
    :label: vectorized-cox-efron

    l(\beta) = -\frac{1}{n} \langle s, \mathbf{X}\beta \rangle + \frac{1}{n} \langle s, \log(\mathbf{B}e^{\mathbf{X}\beta} - \mathbf{A}e^{\mathbf{X}\beta}) \rangle
    .

Algorithm 1 provides an efficient procedure to evaluate :math:`\mathbf{A}v` for some :math:`v` in :math:`\mathbb{R}^n`.

.. image:: /_static/images/cox-tutorial/A_dot_v.png
    :width: 400
    :align: center
    :alt: Algorithm 1 to evaluate A dot v
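
The figure above gives the algorithm itself; as a complement, here is a Python sketch consistent with the definition of :math:`\mathbf{A}`. It assumes a hypothetical helper ``groups``, a list holding, for each tie set :math:`H_{i_l}`, the ordered indices of its observations (so that position ``k`` corresponds to :math:`\#(i) - 1 = k`); this data layout is an assumption for illustration, not part of the original algorithm.

.. code-block:: python

    import numpy as np

    def A_dot_v(v, groups):
        """Evaluate A @ v in O(n) by exploiting the structure of A."""
        out = np.zeros(len(v))
        for idx in groups:               # idx: indices of one tie set H
            size = len(idx)
            group_sum = v[idx].sum()     # shared by all rows of the block
            for k, i in enumerate(idx):  # k plays the role of #(i) - 1
                out[i] = k * group_sum / size
        return out

With such a routine, the Efron datafit :eq:`vectorized-cox-efron` evaluates as ``(-s @ u + s @ np.log(B @ np.exp(u) - A_dot_v(np.exp(u), groups))) / n`` with ``u = X @ beta``.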


Gradient and Hessian
--------------------

Now that we have cast the Efron estimate in a form similar to :eq:`vectorized-cox-breslow`, evaluating the gradient and the diagonal upper bound of the Hessian reduces to subtracting a linear term.
Algorithm 2 provides an efficient procedure to evaluate :math:`\mathbf{A}^\top v` for some :math:`v` in :math:`\mathbb{R}^n`.

.. image:: /_static/images/cox-tutorial/A_transpose_dot_v.png
    :width: 400
    :align: center
    :alt: Algorithm 2 to evaluate A transpose dot v
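
Under the same hypothetical ``groups`` layout as above, :math:`\mathbf{A}^\top v` admits a similar sketch.

.. code-block:: python

    import numpy as np

    def A_transpose_dot_v(v, groups):
        """Evaluate A.T @ v in O(n) by exploiting the structure of A."""
        out = np.zeros(len(v))
        for idx in groups:
            size = len(idx)
            # all columns of a block are equal, so the entries of A.T @ v
            # coincide within the block
            weighted_sum = sum(k * v[i] for k, i in enumerate(idx)) / size
            out[idx] = weighted_sum
        return out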

.. note::

    We notice that the complexity of both algorithms is :math:`\mathcal{O}(n)`, despite involving a matrix multiplication.
    This is due to the special structure of :math:`\mathbf{A}`, which, in the case of sorted observations, has a block-diagonal structure
    with equal columns within each block.

    Here is an illustration with sorted observations, where the groups of identical occurrence times have sizes :math:`3, 2, 1, 3` respectively

    .. image:: /_static/images/cox-tutorial/structure_matrix_A.png
        :width: 300
        :align: center
        :alt: Illustration of the structure of A when observations are sorted


References
==========

.. _1:
[1] D. Y. Lin. On the Breslow estimator. Lifetime Data Analysis, 13:471–480, 2007.

.. _2:
[2] Bradley Efron. The efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association, 72(359):557–565, 1977.