Skip to content

Commit a5a6c0b

Browse files
committed
update the disease transmission case study
1 parent e6e8fbd commit a5a6c0b

File tree

1 file changed

+34
-29
lines changed

1 file changed

+34
-29
lines changed

knitr/disease_transmission/boarding_school_case_study.Rmd

Lines changed: 34 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,12 @@
11
---
22
title: "Bayesian workflow for disease transmission modeling in Stan"
3-
author: "Léo Grinsztajn^[[email protected]], Elizaveta Semenova, Charles C. Margossian, Julien Riou"
3+
author: "Léo Grinsztajn^[École polytechnique, Palaiseau, France, [email protected]], Elizaveta Semenova^[Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK], Charles C. Margossian^[Department of Statistics, Columbia University, New York, NY, USA], Julien Riou^[Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland]"
44
link-citations: true
55
output:
66
html_document:
7-
toc: true
8-
toc_depth: 5
9-
toc_float: true
7+
toc: true
8+
toc_depth: 5
9+
toc_float: true
1010
bibliography: biblio.bib
1111
biblio-style: imsmart-nameyear
1212
abstract: This tutorial shows how to build, fit, and criticize disease transmission models in Stan, and should be useful to researchers interested in modeling the COVID-19 outbreak and doing Bayesian inference. Bayesian modeling provides a principled way to quantify uncertainty and incorporate prior knowledge into the model. What is more, Stan's main inference engine, Hamiltonian Monte Carlo sampling, is amiable to diagnostics, which means we can verify whether our inference is reliable. Stan is an expressive probabilistic programing language that abstracts the inference and allows users to focus on the modeling. The resulting code is readable and easily extensible, which makes the modeler's work more transparent and flexible. In this tutorial, we demonstrate with a simple Susceptible-Infected-Recovered (SIR) model how to formulate, fit, and diagnose a compartmental model in Stan. We also introduce more advanced topics which can help practitioners fit sophisticated models; notably, how to use simulations to probe our model and our priors, and computational techniques to scale ODE-based models.
@@ -96,7 +96,7 @@ and scale up ODEs in Stan.
9696
Throughout the tutorial, we use R as a scripting language^[Note that Stan can also be used with other langages such as Python or Julia, see [here](https://mc-stan.org/users/interfaces/) for the list of Stan interfaces],
9797
and, while we review some elementary concepts,
9898
assume the reader has basic familiarity with Bayesian inference
99-
and Stan.
99+
and Stan. The source code of this case study can be found on Github [here](https://github.com/stan-dev/example-models/tree/master/knitr/disease_transmission).
100100

101101

102102
# 1 Simple SIR
@@ -163,7 +163,6 @@ Let's give some intuition behind these ODEs. The proportion of infected people a
163163
The above model holds under several assumptions:
164164

165165
* births and deaths are not contributing to the dynamics and the total population $N=S+I+R$ remains constant,
166-
167166
* recovered individuals do not become susceptible again over time,
168167

169168
* the infection rate $\beta$ and recovery rate $\gamma$ are constant,
@@ -538,7 +537,8 @@ Here we see that the model gives a satisfying fit to the data,
538537
and that the model uncertainty is able to capture the variation of the data.
539538

540539
```{r}
541-
smr_pred <- cbind(as.data.frame(summary(fit_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), t, cases)
540+
smr_pred <- cbind(as.data.frame(summary(
541+
fit_sir_negbin, pars = "pred_cases", probs = c(0.05, 0.5, 0.95))$summary), t, cases)
542542
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
543543
544544
ggplot(smr_pred, mapping = aes(x = t)) +
@@ -551,7 +551,8 @@ ggplot(smr_pred, mapping = aes(x = t)) +
551551
Maybe we also want to access the true number of infected people at each time, and not just the number of students in bed. This is a latent variable for which we have an estimation.
552552
```{r}
553553
params <- lapply(t, function(i){sprintf("y[%s,2]", i)}) #number of infected for each day
554-
smr_y <- as.data.frame(summary(fit_sir_negbin, pars = params, probs = c(0.05, 0.5, 0.95))$summary)
554+
smr_y <- as.data.frame(summary(fit_sir_negbin,
555+
pars = params, probs = c(0.05, 0.5, 0.95))$summary)
555556
colnames(smr_y) <- make.names(colnames(smr_y)) # to remove % in the col names
556557
557558
ggplot(smr_y, mapping = aes(x = t)) +
@@ -686,7 +687,7 @@ ggplot(data = df_test) +
686687
scale_x_log10()
687688
```
688689

689-
We can do the same thing for R0 (again, on the log-scale), the loose bounds being 0.3 and 30.
690+
We can do the same thing for $R_0$ (again, on the log-scale), the loose bounds being 0.3 and 30.
690691
```{r}
691692
df_test <- tibble(r = s_prior$R0)
692693
ggplot(data = df_test) +
@@ -698,38 +699,37 @@ ggplot(data = df_test) +
698699
```
699700

700701
We thus see that these distributions are coherent with domain knowledge. See [here](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) for more recommendations on prior choice.
701-
^[Previoulsy, we fitted the data with these priors and found a posteriori R0 ~ 3 and a recovery time of ~ 2 days. This is quite unexpected from our basic domain knowledge, but can probably be explained (R0 bigger among students? isolated students counts for recovered in the model? etc). This shows that the prior should not be too constraining in order to incorporate both prior knowledge and unexpected knowledge from the data.]
702+
^[Previoulsy, we fitted the data with these priors and found a posteriori $R_0$ ~ 3 and a recovery time of ~ 2 days. This is quite unexpected from our basic domain knowledge, but can probably be explained ($R_0$ bigger among students? isolated students counts for recovered in the model? etc). This shows that the prior should not be too constraining in order to incorporate both prior knowledge and unexpected knowledge from the data.]
702703

703704
We can also plot trajectories of infection according to the prior,
704705
that is the number of infected people at each time accoring to prior distributions of parameters.
705-
```{r}
706-
plot(1, type = "n", main = "Prior Predictive Infection Samples", xlim = c(t[1], t[length(t)]), xlab="y", ylim = c(0, 1000), ylab = "")
707-
708-
for (r in 1:1000){
709-
lines(t, s_prior$y[,,2][r,],lw = 0.3, col = rgb(0, 0, 0, alpha = 0.6))
710-
}
711-
text(x = 1.8, y = 550, label = "Population size", col = rgb(1,0,0), cex = 0.8)
712-
abline(a = 577, b = 0, col = "red")
706+
```{r, warning=FALSE}
707+
n_draws <- 1000
708+
draws <- as_tibble(t(s_prior$y[,,2][1:n_draws,])) %>% add_column(t=t)
709+
draws <- pivot_longer(draws, c(1:1000) , names_to = "draw")
710+
draws %>%
711+
ggplot() +
712+
geom_line(mapping = aes(x = t, y=value, group = draw), alpha = 0.6, size=0.1) +
713+
geom_hline(yintercept=763, color="red") +
714+
geom_text(x=1.8, y=747, label="Population size", color="red") +
715+
labs(x = "Day", y="Number of infected students")
713716
```
714717

715718
And the median (black line) and 90% interval of the *a priori* number of student in bed (i.e the observed number of infected students).
716719
```{r}
717-
smr_pred <- cbind(as.data.frame(summary(fit_sir_prior, pars="pred_cases", probs=c(0.05, 0.5, 0.95))$summary), t)
720+
smr_pred <- cbind(as.data.frame(summary(fit_sir_prior, pars="pred_cases",
721+
probs=c(0.05, 0.5, 0.95))$summary), t)
718722
colnames(smr_pred) <- make.names(colnames(smr_pred)) # to remove % in the col names
719723
720724
ggplot(smr_pred, mapping=aes(x=t)) +
721725
geom_ribbon(aes(ymin = X5., ymax = X95.), fill = "orange", alpha = 0.6) +
722726
geom_line(mapping=aes(x=t, y=X50.)) +
723-
geom_abline(intercept=577, color="red" ) +
724-
geom_text(x=1.8, y=560, label="Population size", color="red") +
727+
geom_hline(yintercept=763, color="red" ) +
728+
geom_text(x=1.8, y=747, label="Population size", color="red") +
725729
labs(x = "Day", y="Number of students in bed")
726730
```
727731

728-
It seems that most trajectories are reasonable
729-
(the number of infected stays below the total number of people depicted in red,
730-
we see the number growing then decreasing etc.),
731-
and quite diverse. Still, some of the curves look a little bit funky
732-
and suggest we could refine our priors and make them more informative, although it may not be needed here.
732+
It seems that most trajectories are reasonable and quite diverse. Still, some of the curves look a little bit funky and suggest we could refine our priors and make them more informative, although it may not be needed here.
733733

734734
Typically, we can get away with priors that do not capture all our *a priori* knowledge,
735735
provided the data is informative enough.
@@ -799,8 +799,10 @@ These are all questions this simple test can help us tackle.
799799

800800
We take one arbitrary draw from the prior distribution
801801
```{r}
802-
draw <- 12 # one arbitrary draw from the prior distribution
803-
cases_simu <- s_prior$pred_cases[draw,] # the number of predicted cases sampled from the prior distribution, which we will use as data
802+
# one arbitrary draw from the prior distribution
803+
draw <- 12
804+
# the number of predicted cases sampled from the prior distribution, which we will use as data
805+
cases_simu <- s_prior$pred_cases[draw,]
804806
```
805807

806808
And use it as data which we fit to our model.
@@ -814,7 +816,8 @@ We can then examine the estimated posterior distribution.
814816

815817
```{r}
816818
params = c("beta", "gamma", "phi")
817-
paste("true beta :", toString(s_prior$beta[draw]), ", true gamma :", toString(s_prior$gamma[draw]), ", true phi :", toString(s_prior$phi[draw]))
819+
paste("true beta :", toString(s_prior$beta[draw]),
820+
", true gamma :", toString(s_prior$gamma[draw]), ", true phi :", toString(s_prior$phi[draw]))
818821
print(fit_simu, pars = params)
819822
```
820823
We plot the posterior density (in red) to check if it matches the true value of the parameter (black line).
@@ -1156,7 +1159,9 @@ where $y$ is the 4-entry state of the system, and $\phi_\text{death}$ is the dea
11561159

11571160
7. Spatial heterogeneity could be modelled either via metapopulation models or models capturing neighbouring structure explicitly, such as CAR models.
11581161

1162+
# Acknowledgments
11591163

1164+
We thank Ben Bales and Andrew Gelman for their helpful comments on this case study.
11601165

11611166
# References
11621167

0 commit comments

Comments
 (0)