
Commit 9bce414

DOC - Add tutorial for Cox datafit (#163)
1 parent 7a570a6 commit 9bce414

File tree: 5 files changed (+225 −1 lines changed)

doc/tutorials/cox_datafit.rst

Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
.. _maths_cox_datafit:

==============================
Mathematics behind Cox datafit
==============================

This tutorial presents the mathematics behind the Cox datafit, using both the Breslow and the Efron estimates.


Problem setup
=============

Let's consider a typical survival analysis setup with

- :math:`\mathbf{X} \in \mathbb{R}^{n \times p}` a matrix of :math:`p` predictors and :math:`n` samples :math:`x_i \in \mathbb{R}^p`
- :math:`y \in \mathbb{R}^n` a vector recording the times of event occurrence
- :math:`s \in \{ 0, 1 \}^n` a binary vector (observation censorship) where :math:`1` means *event occurred*

where we are interested in estimating the coefficients :math:`\beta \in \mathbb{R}^p` of the linear combination of predictors.

We will start by establishing the Cox datafit using the Breslow estimate, then extend it to the case of tied observations through the Efron estimate.
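
Throughout the tutorial, short NumPy snippets illustrate the formulas on a toy dataset. They are illustrative sketches only: the variable names ``X``, ``y``, ``s`` and ``beta`` simply mirror the notation above and are not part of the ``skglm`` API.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 7, 3

    X = rng.standard_normal((n, p))                # design matrix
    y = rng.integers(1, 5, size=n).astype(float)   # occurrence times (ties are possible)
    s = rng.integers(0, 2, size=n).astype(float)   # censoring vector: 1 = event occurred
    beta = rng.standard_normal(p)                  # coefficients of the linear combination
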


Breslow estimate
================

Datafit expression
------------------

To get the expression of the Cox datafit, we refer to the negative log-likelihood according to the Breslow estimate [`1`_, Section 2]

.. math::
    :label: breslow-estimate

    l(\beta) = \frac{1}{n} \sum_{i=1}^{n} -s_i \langle x_i, \beta \rangle + s_i \log(\sum_{y_j \geq y_i} e^{\langle x_j, \beta \rangle})
    .

To get a compact expression, we introduce the matrix :math:`\mathbf{B} \in \mathbb{R}^{n \times n}` defined as :math:`\mathbf{B}_{i, j} = \mathbb{1}_{y_j \geq y_i} = 1` if :math:`y_j \geq y_i` and :math:`0` otherwise.

We notice that the first term in :eq:`breslow-estimate` can be rewritten as

.. math::

    \sum_{i=1}^{n} -s_i \langle x_i, \beta \rangle = -\langle s, \mathbf{X}\beta \rangle
    ,

whereas the second term reads

.. math::

    \sum_{i=1}^n s_i \log(\sum_{y_j \geq y_i} e^{\langle x_j, \beta \rangle}) = \sum_{i=1}^n s_i \log(\sum_{j=1}^n \mathbb{1}_{y_j \geq y_i} e^{\langle x_j, \beta \rangle}) = \langle s, \log(\mathbf{B}e^{\mathbf{X}\beta}) \rangle
    ,

where the :math:`\log` is applied element-wise. Therefore we deduce the expression of the Cox datafit

.. math::
    :label: vectorized-cox-breslow

    l(\beta) = -\frac{1}{n} \langle s, \mathbf{X}\beta \rangle + \frac{1}{n} \langle s, \log(\mathbf{B}e^{\mathbf{X}\beta}) \rangle
    .

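
As a sanity check, here is a NumPy sketch that evaluates :eq:`vectorized-cox-breslow` directly from its definition; it is an illustrative implementation, not the one used inside ``skglm``.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 7, 3
    X = rng.standard_normal((n, p))
    y = rng.integers(1, 5, size=n).astype(float)   # occurrence times
    s = rng.integers(0, 2, size=n).astype(float)   # censoring vector
    beta = rng.standard_normal(p)

    # B[i, j] = 1 if y_j >= y_i, 0 otherwise
    B = (y[None, :] >= y[:, None]).astype(float)

    Xb = X @ beta
    # Breslow negative log-likelihood, cf. :eq:`vectorized-cox-breslow`
    datafit = -(s @ Xb) / n + (s @ np.log(B @ np.exp(Xb))) / n
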

We observe from this vectorized reformulation that the Cox datafit only depends on :math:`\mathbf{X}\beta`. Hence, we can focus on the function

.. math::
    :label: simple-function

    F(u) = -\frac{1}{n} \langle s, u \rangle + \frac{1}{n} \langle s, \log(\mathbf{B}e^u) \rangle
    ,

to derive the gradient and Hessian.


Gradient and Hessian
--------------------

We apply the chain rule to derive the gradient and Hessian of the function :math:`F`.
For :math:`u \in \mathbb{R}^n`, the gradient is

.. math::
    :label: raw-gradient

    \nabla F(u) = -\frac{1}{n} s + \frac{1}{n} [\text{diag}(e^u)\mathbf{B}^\top] \frac{s}{\mathbf{B}e^u}
    ,

and the Hessian reads

.. math::
    :label: raw-hessian

    \nabla^2 F(u) = \frac{1}{n} \text{diag}(e^u) \text{diag}(\mathbf{B}^\top \frac{s}{\mathbf{B}e^u}) - \frac{1}{n} \text{diag}(e^u) \mathbf{B}^\top \text{diag}(\frac{s}{(\mathbf{B}e^u)^2})\mathbf{B}\text{diag}(e^u)
    ,

where the division and the square operations are performed element-wise.

The Hessian, as it is, is costly to evaluate because of its second term,
which involves :math:`\mathcal{O}(n^3)` operations. We overcome this limitation by using a diagonal upper bound on the Hessian.

We construct such an upper bound by noticing that

#. the function :math:`F` is convex and hence :math:`\nabla^2 F(u)` is positive semi-definite
#. the second term is positive semi-definite

Therefore, we have

.. math::
    :label: diagonal-upper-bound

    \nabla^2 F(u) \preceq \frac{1}{n} \text{diag}(e^u) \text{diag}(\mathbf{B}^\top \frac{s}{\mathbf{B}e^u})
    ,

where the inequality is the Loewner order on positive semi-definite matrices.

.. note::

    Having a diagonal Hessian reduces the Hessian computational cost to :math:`\mathcal{O}(n)` instead of :math:`\mathcal{O}(n^3)`.
    It also reduces the cost of Hessian-vector products to :math:`\mathcal{O}(n)` instead of :math:`\mathcal{O}(n^2)`.

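
In NumPy, the gradient :eq:`raw-gradient` and the diagonal upper bound :eq:`diagonal-upper-bound` translate into a few lines. The sketch below is illustrative only and uses toy data in the spirit of the previous snippets.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n = 7
    y = rng.integers(1, 5, size=n).astype(float)   # occurrence times
    s = rng.integers(0, 2, size=n).astype(float)   # censoring vector
    u = rng.standard_normal(n)                     # plays the role of X @ beta

    B = (y[None, :] >= y[:, None]).astype(float)
    exp_u = np.exp(u)

    # gradient of F, cf. :eq:`raw-gradient`
    grad = -s / n + exp_u * (B.T @ (s / (B @ exp_u))) / n

    # diagonal upper bound on the Hessian, cf. :eq:`diagonal-upper-bound`
    hess_diag_bound = exp_u * (B.T @ (s / (B @ exp_u))) / n

    # a Hessian-vector product with the diagonal bound is an element-wise product: O(n)
    v = rng.standard_normal(n)
    hvp_bound = hess_diag_bound * v
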


Efron estimate
==============

Datafit expression
------------------

The Efron estimate refines the Breslow estimate by handling tied observations (observations with identical occurrence times).
We define :math:`y_{i_1}, \ldots, y_{i_m}` as the distinct occurrence times, assumed to be :math:`m` in total, and

.. math::
    :label: def-H

    H_{y_{i_l}} = \{ i \ | \ s_i = 1 \ ;\ y_i = y_{i_l} \}
    ,

the set of uncensored observations sharing the occurrence time :math:`y_{i_l}`; for brevity, we write :math:`H_{i_l}` for :math:`H_{y_{i_l}}` in what follows.
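
In code, these sets amount to grouping the indices of uncensored observations by their occurrence time. The short sketch below (illustrative only) builds them with a plain dictionary.

.. code-block:: python

    import numpy as np
    from collections import defaultdict

    y = np.array([2., 2., 3., 1., 3., 3., 4.])   # occurrence times
    s = np.array([1., 0., 1., 1., 1., 0., 1.])   # censoring vector

    # H[t] = indices of uncensored observations with occurrence time t
    H = defaultdict(list)
    for i in range(len(y)):
        if s[i] == 1:
            H[y[i]].append(i)

    # here H == {2.0: [0], 3.0: [2, 4], 1.0: [3], 4.0: [6]}
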

Again, we refer to the negative log-likelihood according to the Efron estimate [`2`_, Section 6, equation (6.7)] to get the datafit formula

.. math::
    :label: efron-estimate

    l(\beta) = \frac{1}{n} \sum_{l=1}^{m} \Big(
        \sum_{i \in H_{i_l}} - \langle x_i, \beta \rangle
        + \sum_{i \in H_{i_l}} \log \big( \sum_{y_j \geq y_{i_l}} e^{\langle x_j, \beta \rangle} - \frac{\#(i) - 1}{| H_{i_l} |} \sum_{j \in H_{i_l}} e^{\langle x_j, \beta \rangle} \big) \Big)
    ,

where :math:`| H_{i_l} |` stands for the cardinality of :math:`H_{i_l}`, and :math:`\#(i)` for the index of observation :math:`i` within :math:`H_{i_l}`.

Ideally, we would like to rewrite this expression like :eq:`vectorized-cox-breslow` to leverage the established results about the gradient and Hessian. A closer look reveals that what distinguishes the two expressions is the presence of a double sum and of a second term inside the :math:`\log`.

First, we can observe that :math:`\cup_{l=1}^{m} H_{i_l} = \{ i \ | \ s_i = 1 \}`, which enables fusing the two sums, for instance

.. math::

    \sum_{l=1}^{m}\sum_{i \in H_{i_l}} - \langle x_i, \beta \rangle = \sum_{i: s_i = 1} - \langle x_i, \beta \rangle = \sum_{i=1}^n -s_i \langle x_i, \beta \rangle = -\langle s, \mathbf{X}\beta \rangle
    .

On the other hand, the subtracted term inside the :math:`\log` can be rewritten as a linear term in :math:`e^{\mathbf{X}\beta}`

.. math::

    \frac{\#(i) - 1}{| H_{i_l} |}\sum_{j \in H_{i_l}} e^{\langle x_j, \beta \rangle}
    = \sum_{j=1}^{n} \frac{\#(i) - 1}{| H_{i_l} |} \ \mathbb{1}_{j \in H_{i_l}} \ e^{\langle x_j, \beta \rangle}
    = \sum_{j=1}^n a_{i,j} e^{\langle x_j, \beta \rangle}
    = \langle a_i, e^{\mathbf{X}\beta} \rangle
    ,

where :math:`a_i` is a vector in :math:`\mathbb{R}^n` with nonnegative entries chosen accordingly to perform the linear operation.

By defining the matrix :math:`\mathbf{A}` with rows :math:`(a_i)_{i \in [n]}`, we deduce the final expression

.. math::
    :label: vectorized-cox-efron

    l(\beta) = -\frac{1}{n} \langle s, \mathbf{X}\beta \rangle + \frac{1}{n} \langle s, \log(\mathbf{B}e^{\mathbf{X}\beta} - \mathbf{A}e^{\mathbf{X}\beta}) \rangle
    .

Algorithm 1 provides an efficient procedure to evaluate :math:`\mathbf{A}v` for some :math:`v` in :math:`\mathbb{R}^n`.

.. image:: /_static/images/cox-tutorial/A_dot_v.png
    :width: 400
    :align: center
    :alt: Algorithm 1 to evaluate A dot v
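
The figure above describes Algorithm 1. As a rough NumPy counterpart, the sketch below computes :math:`\mathbf{A}v` in :math:`\mathcal{O}(n)`. It is an illustrative reconstruction (not the exact algorithm of the figure) and assumes that observations are sorted by occurrence time, that the hypothetical ``group_sizes`` lists the sizes of the successive groups of tied uncensored observations, and that the rows of :math:`\mathbf{A}` attached to censored observations can be ignored since they are multiplied by :math:`s_i = 0` in :eq:`vectorized-cox-efron`.

.. code-block:: python

    import numpy as np

    def A_dot_v(v, group_sizes):
        """Illustrative O(n) evaluation of A @ v for sorted observations."""
        out = np.zeros_like(v, dtype=float)
        start = 0
        for size in group_sizes:
            block = slice(start, start + size)
            group_sum = v[block].sum()      # shared by every row of the block
            ranks = np.arange(size)         # #(i) - 1 for each row of the block
            out[block] = ranks / size * group_sum
            start += size
        return out

    v = np.arange(9, dtype=float)
    print(A_dot_v(v, group_sizes=[3, 2, 1, 3]))

With ``B`` and such a routine at hand, the Efron datafit :eq:`vectorized-cox-efron` can be evaluated without ever forming :math:`\mathbf{A}` explicitly.
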


Gradient and Hessian
--------------------

Now that we have cast the Efron estimate in a form similar to :eq:`vectorized-cox-breslow`, evaluating the gradient and the diagonal upper bound of the Hessian reduces to subtracting a linear term.
Algorithm 2 provides an efficient procedure to evaluate :math:`\mathbf{A}^\top v` for some :math:`v` in :math:`\mathbb{R}^n`.

.. image:: /_static/images/cox-tutorial/A_transpose_dot_v.png
    :width: 400
    :align: center
    :alt: Algorithm 2 to evaluate A transpose dot v
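
Under the same assumptions (sorted observations and a hypothetical ``group_sizes``), an :math:`\mathcal{O}(n)` evaluation of :math:`\mathbf{A}^\top v` could be sketched as follows; again, this is an illustrative reconstruction rather than the exact algorithm of the figure.

.. code-block:: python

    import numpy as np

    def A_transpose_dot_v(v, group_sizes):
        """Illustrative O(n) evaluation of A.T @ v for sorted observations."""
        out = np.zeros_like(v, dtype=float)
        start = 0
        for size in group_sizes:
            block = slice(start, start + size)
            ranks = np.arange(size)    # #(i) - 1 for each row
            # all columns of A are equal within a block, hence a single shared value
            out[block] = (ranks / size * v[block]).sum()
            start += size
        return out

    v = np.arange(9, dtype=float)
    print(A_transpose_dot_v(v, group_sizes=[3, 2, 1, 3]))
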

.. note::

    We notice that the complexity of both algorithms is :math:`\mathcal{O}(n)` despite involving a matrix-vector product.
    This is due to the special structure of :math:`\mathbf{A}`, which in the case of sorted observations has a block diagonal structure
    with each block having equal columns.

Here is an illustration with sorted observations, where the groups of identical occurrence times have sizes :math:`3, 2, 1, 3` respectively

.. image:: /_static/images/cox-tutorial/structure_matrix_A.png
    :width: 300
    :align: center
    :alt: Illustration of the structure of A when observations are sorted
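
To make this structure concrete, the sketch below builds a dense version of :math:`\mathbf{A}` for these group sizes (for exposition only; the whole point of Algorithms 1 and 2 is to never form this matrix).

.. code-block:: python

    import numpy as np

    group_sizes = [3, 2, 1, 3]     # sizes of the groups of tied observations
    n = sum(group_sizes)

    A = np.zeros((n, n))
    start = 0
    for size in group_sizes:
        block = slice(start, start + size)
        ranks = np.arange(size)    # #(i) - 1 within the group
        # inside a block, every column equals (#(i) - 1) / |H|
        A[block, block] = (ranks / size)[:, None] * np.ones(size)
        start += size

    print(A)   # block diagonal, with identical columns inside each block
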


References
==========

.. _1:

[1] D. Y. Lin. "On the Breslow estimator." Lifetime Data Analysis, 13:471–480, 2007.

.. _2:

[2] Bradley Efron. "The efficiency of Cox's likelihood function for censored data." Journal of the American Statistical Association, 72(359):557–565, 1977.

doc/tutorials/tutorials.rst

Lines changed: 7 additions & 1 deletion
@@ -21,4 +21,10 @@ Learn to add a custom penalty by implementing the :math:`\ell_1` penalty.
 :ref:`Computation of the intercept <maths_unpenalized_intercept>`
 -----------------------------------------------------------------

-Explore how ``skglm`` fits unpenalized intercept.
+Explore how ``skglm`` fits an unpenalized intercept.
+
+
+:ref:`Mathematics behind Cox datafit <maths_cox_datafit>`
+-----------------------------------------------------------------
+
+Get details about the Cox datafit equations.