diff --git a/book/chapters/chapter13/beyond_regression_and_classification.qmd b/book/chapters/chapter13/beyond_regression_and_classification.qmd
index f9dcaeaef..dd3b7c3b3 100644
--- a/book/chapters/chapter13/beyond_regression_and_classification.qmd
+++ b/book/chapters/chapter13/beyond_regression_and_classification.qmd
@@ -129,18 +129,21 @@ By using `po("learner_cv")` for internal resampling and `po("tunethreshold")` to
 ## Survival Analysis {#sec-survival}
 
 `r index("Survival analysis")` is a field of statistics concerned with trying to predict/estimate the time until an event takes place.
-This predictive problem is unique as survival models are trained and tested on data that may include 'censoring', which occurs when the event of interest does *not* take place.
+This predictive problem is unique because survival models are trained and tested on data that may include 'censoring', which occurs when the exact event time is *not* observed for some subjects.
+The most common type of censoring is 'right censoring', which happens when the event of interest has not yet occurred by the time observation ends, either due to a fixed study cutoff (*administrative censoring*) or because individuals are lost to follow-up (*random censoring*).
 Survival analysis can be hard to explain in the abstract, so as a working example consider a marathon runner in a race.
 Here the 'survival problem' is trying to predict the time when the marathon runner finishes the race.
-However, if the event of interest does not take place (e.g., the marathon runner gives up and does not finish the race), they are said to be censored.
-Instead of throwing away information about censored events, survival analysis datasets include a status variable that provides information about the 'status' of an observation.
-So in our example, we might write the runner's outcome as $(4, 1)$ if they finish the race at four hours, otherwise, if they give up at two hours we would write $(2, 0)$.
+However, not all finish times may be observed.
+For example, if the organizers stop recording finish times after a certain point, then any runner still running beyond that time will be *administratively* censored.
+Alternatively, a runner might drop out of observation unexpectedly, for instance if their tracking chip malfunctions or if they accidentally leave the course and are no longer followed, resulting in *random* censoring.
+Instead of discarding such incomplete observations, survival analysis incorporates a status variable to reflect whether the event was observed.
+In our example, we might record a runner's outcome as $(3, 1)$ if they finish the race in three hours and we observe it, as $(4, 0)$ if they are still running at four hours when observation ends (administrative censoring), or as $(2.5, 0)$ if their tracking device fails and we lose contact at 2.5 hours (random censoring).
 
 The key to modeling in survival analysis is that we assume there exists a hypothetical time the marathon runner would have finished if they had not been censored, it is then the job of a survival learner to estimate what the true survival time would have been for a similar runner, assuming they are *not* censored (see @fig-censoring).
 Mathematically, this is represented by the hypothetical event time, $Y$, the hypothetical censoring time, $C$, the observed outcome time, $T = \min(Y, C)$, the event indicator $\Delta := (T = Y)$, and as usual some features, $X$.
 Learners are trained on $(T, \Delta)$ but, critically, make predictions of $Y$ from previously unseen features.
 This means that unlike classification and regression, learners are trained on two variables, $(T, \Delta)$, which, in R, is often captured in a `r ref("survival::Surv")` object.
-Relating to our example above, the runner's outcome would then be $(T = 4, \Delta = 1)$ or $(T = 2, \Delta = 0)$.
+Relating to our example above, the runner's outcome would then be represented as $(T = 3, \Delta = 1)$ if they finish in three hours, as $(T = 4, \Delta = 0)$ if they are still running when observation ends at four hours, or as $(T = 2.5, \Delta = 0)$ if we lose contact with them at 2.5 hours.
 Another example is in the code below, where we randomly generate six survival times and six event indicators, an outcome with a `+` indicates the outcome is censored, otherwise, the event of interest occurred.
 
 ```{r beyond_regression_and_classification-006}