Skip to content

Commit ab499fe

Browse files
committed
notebook for BYM2 islands
1 parent 5dcadbc commit ab499fe

16 files changed

+1027
-0
lines changed
Lines changed: 210 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,210 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"## Freni-Sterrantino et al 2017 - BYM2 connected, disconnected for Scotland Lip Cancer Dataset"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"The BYM2 model for areal data adds to components to a GLM: an ICAR component which accounts for the spatial structure of the data, and a random effects component. See the Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) for details on the ICAR, BYM, and BYM2 models. This implementation assumes that the spatial structure is a single, fully connected component, i.e., a graph where any node in the graph can be reached from any other node.\n",
15+
"\n",
16+
"In [A note on intrinsic Conditional Autoregressive models for disconnected graphs](https://arxiv.org/abs/1705.04854), Freni-Sterrantino et.al. show how to implement this model for disconnected graphs. In this notebook, we present that Stan implementation of this proposal."
17+
]
18+
},
19+
{
20+
"cell_type": "markdown",
21+
"metadata": {},
22+
"source": [
23+
"### Areal data: the counties in Scotland, circa 1980"
24+
]
25+
},
26+
{
27+
"cell_type": "markdown",
28+
"metadata": {},
29+
"source": [
30+
"The canonical dataset used to test and compare different parameterizations of ICAR models is a study on the incidence of lip cancer in Scotland in the 1970s and 1980s. The data, including the names and coordinates for the counties of Scotland are available from R package [SpatialEpi](https://cran.r-project.org/web/packages/SpatialEpi/SpatialEpi.pdf), dataset `scotland`.\n",
31+
"\n",
32+
"3 of these counties are islands: the Outer Hebrides (western.isles), Shetland, and Orkney. In the canonical datasets, these islands are conntected to the mainland, so that the adjacency graph consists of a single, fully connected component. However, different maps are possible: a map with 4 components, the mainland and the 3 islands; or a map with 3 components: the mainland, a component consisting of Shetland and Orkney, and a singleton consisting of the Hebrides. The following plots demonstrate the differences:"
33+
]
34+
},
35+
{
36+
"cell_type": "code",
37+
"execution_count": null,
38+
"metadata": {},
39+
"outputs": [],
40+
"source": [
41+
"import matplotlib.pyplot as plt\n",
42+
"import matplotlib.image as mpimg\n",
43+
"from matplotlib import rcParams\n",
44+
"\n",
45+
"%matplotlib inline\n",
46+
"\n",
47+
"# figure size in inches optional\n",
48+
"rcParams['figure.figsize'] = 11 ,8\n",
49+
"\n",
50+
"# read images\n",
51+
"img_A = mpimg.imread('scot_connected.png')\n",
52+
"img_B = mpimg.imread('scot_3_comp.png')\n",
53+
"img_C = mpimg.imread('scot_islands.png')\n",
54+
"\n",
55+
"\n",
56+
"# display images\n",
57+
"fig, ax = plt.subplots(1,3)\n",
58+
"ax[0].imshow(img_A);\n",
59+
"ax[1].imshow(img_B);\n",
60+
"ax[2].imshow(img_C);"
61+
]
62+
},
63+
{
64+
"cell_type": "markdown",
65+
"metadata": {},
66+
"source": [
67+
"### Areal data munging: from spatial polygon to 2D array of edges\n",
68+
"\n",
69+
"Inputs to the Stan model must match the set of variables declared in the `data` block.\n",
70+
"\n",
71+
"The Stan implementation of the ICAR model computes with a 2D array of size 2 $\\times$ J where J is the number of edges in the graph. Each column entry in this array represents one undirected edge in the graph, where for each edge i, entries [i,1] and [i,2] index the nodes connected by that edge. Treating these are parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n",
72+
"\n",
73+
"The `scotland` data is a set of spatial polygons, i.e., a description of the shape of each county in terms of its lat,lon coordinates. The R package [spdep](https://r-spatial.github.io/spdep/index.html) extracts the adjacency relations as a `nb` object.\n",
74+
"We have written a set of helper functions which take the `nb` objects for each graph into the set of data structures needed by the Stan models, these are in file `bym2_helpers.R`. \n",
75+
"The three versions of the Scotland spatial structure are in files `scotland_nbs.data.R`, `scotland_3_comp_nbs.data.R`, and `scotland_islands_nbs.data.R`.\n",
76+
"The file `munge_scotland.R` munges the data, and it has been saved as JSON data files."
77+
]
78+
},
79+
{
80+
"cell_type": "markdown",
81+
"metadata": {},
82+
"source": [
83+
"## Fit connected graph on Scotland Lip cancer dataset with BYM2 model implemented in Stan."
84+
]
85+
},
86+
{
87+
"cell_type": "code",
88+
"execution_count": null,
89+
"metadata": {},
90+
"outputs": [],
91+
"source": [
92+
"from cmdstanpy import cmdstan_path, CmdStanModel, install_cmdstan\n",
93+
"# install_cmdstan() # as needed - will install latest release (as needed)"
94+
]
95+
},
96+
{
97+
"cell_type": "markdown",
98+
"metadata": {},
99+
"source": [
100+
"The dataset `scot_connected.data.json` contains the cancer dataset together with the spatial structure. The cancer study data is:\n",
101+
"\n",
102+
"- `y`: observed outcome - number of cases of lip cancer\n",
103+
"- `x`: single predictor - percent of population working in agriculture, forestry, or fisheries.\n",
104+
"- `E`: population\n",
105+
"\n",
106+
"The spatial structure is comprised of:\n",
107+
"\n",
108+
"- I: `int<lower = 0> I; // number of nodes`\n",
109+
"- J: `int<lower = 0> J; // number of edges`\n",
110+
"- edges: `int<lower = 1, upper = I> edges[2, J]; // node[1, j] adjacent to node[2, j]`\n",
111+
"- tau: `real tau; // scaling factor`"
112+
]
113+
},
114+
{
115+
"cell_type": "code",
116+
"execution_count": null,
117+
"metadata": {},
118+
"outputs": [],
119+
"source": [
120+
"from cmdstanpy import cmdstan_path, CmdStanModel\n",
121+
"bym2_model = CmdStanModel(stan_file='bym2.stan')\n",
122+
"bym2_fit = bym2_model.sample(data='scot_connected.data.json')"
123+
]
124+
},
125+
{
126+
"cell_type": "code",
127+
"execution_count": null,
128+
"metadata": {},
129+
"outputs": [],
130+
"source": [
131+
"bym2_fit.summary()"
132+
]
133+
},
134+
{
135+
"cell_type": "markdown",
136+
"metadata": {},
137+
"source": [
138+
"## Fit disconnected graphs on Scotland Lip cancer dataset with BYM2 model implemented in Stan, following Freni-Sterrantino\n"
139+
]
140+
},
141+
{
142+
"cell_type": "code",
143+
"execution_count": null,
144+
"metadata": {},
145+
"outputs": [],
146+
"source": [
147+
"from cmdstanpy import cmdstan_path, CmdStanModel\n",
148+
"bym2_model = CmdStanModel(stan_file='bym2_islands.stan')\n",
149+
"print(bym2_model.code())"
150+
]
151+
},
152+
{
153+
"cell_type": "code",
154+
"execution_count": null,
155+
"metadata": {},
156+
"outputs": [],
157+
"source": [
158+
"import json\n",
159+
"with open('scot_3_comp.data.json') as fd:\n",
160+
" scot_data = json.load(fd)\n",
161+
"print(scot_data)"
162+
]
163+
},
164+
{
165+
"cell_type": "code",
166+
"execution_count": null,
167+
"metadata": {},
168+
"outputs": [],
169+
"source": [
170+
"bym2_fit = bym2_model.sample(data=scot_data)\n",
171+
"bym2_fit.summary()"
172+
]
173+
},
174+
{
175+
"cell_type": "code",
176+
"execution_count": null,
177+
"metadata": {},
178+
"outputs": [],
179+
"source": []
180+
},
181+
{
182+
"cell_type": "code",
183+
"execution_count": null,
184+
"metadata": {},
185+
"outputs": [],
186+
"source": []
187+
}
188+
],
189+
"metadata": {
190+
"kernelspec": {
191+
"display_name": "Python 3",
192+
"language": "python",
193+
"name": "python3"
194+
},
195+
"language_info": {
196+
"codemirror_mode": {
197+
"name": "ipython",
198+
"version": 3
199+
},
200+
"file_extension": ".py",
201+
"mimetype": "text/x-python",
202+
"name": "python",
203+
"nbconvert_exporter": "python",
204+
"pygments_lexer": "ipython3",
205+
"version": "3.8.5"
206+
}
207+
},
208+
"nbformat": 4,
209+
"nbformat_minor": 4
210+
}
Lines changed: 100 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
1+
functions {
2+
/**
3+
* Return the log probability density for the vector of coefficients
4+
* under the ICAR model with unit variance, where the second two
5+
* arguments are parallel vectors of coefficients for adjacent
6+
* regions. For example, a series of three coeffs phi[1:3] making
7+
* up a time series would have b1 = [phi[1], phi[2]]' and b2 =
8+
* [phi[2], phi[3]]'.
9+
*
10+
* @param phi vector of varying effects
11+
* @param b1 parallel vector of elements of phi
12+
* @param b2 second parallel vector of adjacent elemens of phi
13+
* @return ICAR log density
14+
* @reject if b1 and b2 are not the same size
15+
*/
16+
real soft_ctr_std_icar_lpdf(vector phi, vector b1, vector b2) {
17+
return -0.5 * dot_self(b1 - b2) // equiv normal_lpdf(b1 | b2, 1)
18+
+ normal_lpdf(sum(phi) | 0, 0.001 * rows(phi));
19+
}
20+
21+
/**
22+
* Return the log probability density of the specified vector of
23+
* coefficients under the ICAR model with unit variance, where
24+
* adjacency is determined by the adjacency array. The adjacency
25+
* array contains two parallel arrays of adjacent element indexes.
26+
* For example, a series of four coefficients phi[1:4] making up a
27+
* time series would have adjacency array {{1, 2, 3}, {2, 3, 4}},
28+
* signaling that coefficient 1 is adjacent to coefficient 2, 2
29+
* adjacent to 3, and 3 adjacent to 4.
30+
*
31+
* @param phi vector of varying effects
32+
* @param adjacency parallel arrays of indexes of adjacent elements of phi
33+
* @return ICAR log probability density
34+
* @reject if the the adjacency matrix does not have two rows
35+
*/
36+
real standard_icar_lpdf(vector phi, int[ , ] adjacency) {
37+
if (size(adjacency) != 2)
38+
reject("require 2rows for adjacency array;",
39+
" found rows = ", size(adjacency));
40+
return soft_ctr_std_icar_lpdf(phi | phi[adjacency[1]], phi[adjacency[2]]);
41+
}
42+
}
43+
data {
44+
// spatial structure
45+
int<lower = 0> I; // number of nodes
46+
int<lower = 0> J; // number of edges
47+
int<lower = 1, upper = I> edges[2, J]; // node[1, j] adjacent to node[2, j]
48+
49+
real tau; // scaling factor
50+
51+
int<lower=0> y[I]; // count outcomes
52+
vector<lower=0>[I] E; // exposure
53+
vector[I] x; // predictor
54+
}
55+
transformed data {
56+
vector[I] log_E = log(E);
57+
}
58+
parameters {
59+
real alpha; // intercept
60+
real beta; // covariate
61+
62+
// spatial effects
63+
real logit_rho; // proportion of spatial effect that's spatially smoothed
64+
real<lower = 0> sigma; // scale of spatial effects
65+
vector[I] theta; // standardized heterogeneous spatial effects
66+
vector[I] phi; // standardized spatially smoothed spatial effects
67+
}
68+
transformed parameters {
69+
real<lower=0, upper=1> rho = inv_logit(logit_rho);
70+
// spatial effects (combine heterogeneous and spatially smoothed)
71+
vector[I] gamma = (sqrt(1 - rho) * theta + sqrt(rho) * sqrt(1 / tau) * phi) * sigma;
72+
}
73+
model {
74+
y ~ poisson_log(log_E + alpha + x * beta + gamma * sigma); // co-variates
75+
76+
alpha ~ normal(0, 1);
77+
beta ~ normal(0, 1);
78+
79+
// spatial hyperpriors and priors
80+
sigma ~ normal(0, 1);
81+
logit_rho ~ normal(0, 1);
82+
theta ~ normal(0, 1);
83+
phi ~ standard_icar(edges);
84+
}
85+
generated quantities {
86+
vector[I] eta = log_E + alpha + x * beta + gamma * sigma;
87+
vector[I] y_prime = exp(eta);
88+
int y_rep[I,10];
89+
for (j in 1:10) {
90+
if (max(eta) > 20) {
91+
// avoid overflow in poisson_log_rng
92+
print("max eta too big: ", max(eta));
93+
for (i in 1:I)
94+
y_rep[i,j] = -1;
95+
} else {
96+
for (i in 1:I)
97+
y_rep[i,j] = poisson_log_rng(eta[i]);
98+
}
99+
}
100+
}

0 commit comments

Comments
 (0)