Skip to content

Commit 885bd18

Browse files
committed
checkpointing update bym2 islands notebook, R helpers
1 parent 071ce3a commit 885bd18

File tree

3 files changed

+140
-46
lines changed

3 files changed

+140
-46
lines changed

knitr/car-iar-poisson/update_2021_02/BYM2 islands.ipynb

Lines changed: 139 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -23,9 +23,23 @@
2323
"### Disease mapping: computing relative risk over a map of geographical regions\n",
2424
"\n",
2525
"Disease mapping concerns the study of disease risk over a map of geographical regions.\n",
26-
"For an areal map of $I$ regions, the outcome $y_i$ is the number of cases of a given disease in region $i$. For a rare disease, a Poisson model is assumed, $y_i | {\\theta}_i \\sim Po({\\theta}_i)$ with mean ${\\theta}_i = E_i * r_i$, where $E_i$ is the expected cases count for the disease and $r_i$ is the relative risk. Relative risk values above 1 indicate higher risk associated living a region. The relative risk can be modelled in terms of the effect of a covariates X as $log(r_i) = \\alpha + \\beta * x_i + {re}_i$; $\\alpha$ is the baseline log risk, $\\beta$ is the effect of the covariates, and ${re}_i$ is a random effect capturing extra Poisson variability possibly due to unobserved risk factors.\n",
27-
"\n",
28-
"In these models, areal maps are represented as a graph where the nodes in the graph are areal regions and the undirected edges in the graph represent the symmetric neighbor relationship."
26+
"For an areal map of $N$ regions, the data consists of\n",
27+
"outcome $y_i$, the number of cases of a given disease in region $i$,\n",
28+
"and may possibly include observed values of region-specific covariates $x_i$.\n",
29+
"The counts $y_i$ are modeled as either Poisson or binomial random variables in generalized linear models,\n",
30+
"using a log or logit link function, respectively.\n",
31+
"For rare diseases the binomial probability is small and the Poisson model\n",
32+
"is used as an approximation.\n",
33+
"\n",
34+
"Counts of rare events in small-population regions are noisy;\n",
35+
"removing this noise allows the underlying phenomena of interest to be seen more clearly.\n",
36+
"Conditional autoregressive (CAR) models smooth noisy estimates\n",
37+
"by pooling information from neighboring regions.\n",
38+
"Given an areal map, the binary _neighbor_ relationship\n",
39+
"(written $i \\sim j$ where $i \\neq j$)\n",
40+
"is $1$ if regions $n_i$ and $n_j$ are neighbors and is otherwise $0$.\n",
41+
"For CAR models, the neighbor relationship is symmetric but not reflexive;\n",
42+
"if $i \\sim j$ then $j \\sim i$, but a region is not its own neighbor."
2943
]
3044
},
3145
{
@@ -34,29 +48,29 @@
3448
"source": [
3549
"### The BYM2 model\n",
3650
"\n",
37-
"The BYM2 model is a disease mapping model presented in [Riebler et al. 2016](https://arxiv.org/abs/1601.0118).\n",
38-
"For the above disease mapping regression model, the random effects component is parameterized in terms of:\n",
51+
"The BYM2 model is a disease mapping model presented in [Riebler et al. 2016](https://arxiv.org/abs/1601.0118)\n",
52+
"with a spatial random effects component parameterized in terms of:\n",
3953
"\n",
40-
"- $\\phi$, an ICAR component which accounts for the spatial structure of the data.\n",
41-
"- $\\theta$, an ordinary random effects component which accounts for non-spatial heterogeneity.\n",
54+
"- $\\theta$, an ICAR component which accounts for the spatial structure of the data.\n",
55+
"- $\\phi$, an ordinary random effects component which accounts for non-spatial heterogeneity.\n",
4256
"- $\\rho$, a mixing parameter which accounts for the amount of spatial/non-spatial variation.\n",
4357
"- $\\sigma$, a precision (scale) parameter placed on the combined ICAR and ordinary random effects components.\n",
4458
"\n",
59+
"The mixing parameter $\\rho$ is the fraction of spatial variance; the amount of non-spatial variance is $1-\\rho$.\n",
4560
"In order for $\\sigma$ to legitimately be the standard deviation of the combined components,\n",
46-
"it is critical that for each $i$, $\\operatorname{Var}(\\phi_i) \\approx \\operatorname{Var}(\\theta_i) \\approx 1$. therfore, the BYM2 model introduces a scaling factor $\\tau$ to the model.\n",
47-
"Riebler recommends scaling the ICAR component $\\phi$ so the geometric mean of the average marginal variance of its elements is 1. \n",
48-
"By dividing the spatial component $\\phi$ by $\\sqrt{\\tau}$, the variances of these components are on the same scale.\n",
49-
"For irregular areal maps, where individual regions have varying number of neighbors, the scaling factor $\\tau$ necessarily comes into the model as data.\n",
61+
"it is critical that for each areal unit $i$, the spatial and heterogenious random effects are on the same scale so that $\\operatorname{Var}(\\theta_i) \\approx \\operatorname{Var}(\\phi_i) \\approx 1$. Because $\\phi$ is a vector of standard Gaussians, $\\operatorname{Var}(\\phi_i) \\approx 1$ by construction. To scale $\\rho$ for the ICAR component $\\theta$, the BYM2 model introduces variable $\\tau$.\n",
62+
"The combined spatial, non-spatial random effects are:\n",
63+
"$$\\sigma ((\\sqrt{\\rho/\\tau}\\,\\theta + \\sqrt{1-\\rho}\\,\\phi)$$\n",
5064
"\n",
51-
"The combined random effects component for the BYM2 model is: \n",
52-
"$$\\sigma (\\sqrt{1-\\rho}\\,\\theta^* + (\\sqrt{\\rho/\\tau}\\,\\phi^* )$$\n",
65+
"Riebler recommends using the geometric mean of the average marginal variance of the elements of $\\theta$ as the value of $\\tau$.\n",
66+
"As the structure of $\\theta$ is map-specific, i.e., dependent on the number of regions and number of neighbors of each region, the scaling factor $\\tau$ is data-dependent.\n",
5367
"\n",
5468
"The recommended priors are:\n",
5569
"\n",
5670
"- A standard prior on the standard deviation $\\sigma$; we use a half-normal, also possible are half-t or an exponential.\n",
5771
"- A beta(1/2,1/2) prior on $\\rho$.\n",
5872
"\n",
59-
"The Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) provides the background and derivations for the ICAR and the BYM2 model."
73+
"The Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) provides the background and derivations for the ICAR and the BYM2 model.\n"
6074
]
6175
},
6276
{
@@ -65,7 +79,7 @@
6579
"source": [
6680
"### Stan implementation of the BYM2 model for a fully connected spatial structure\n",
6781
"\n",
68-
"For the Stan implementation of the ICAR component, we compute the per-node spatial variance by representing the spatial structrue of the map as an _edgelist_; a 2D array of size 2 × J where J is the number of edges in the graph. Each column entry in this array represents one undirected edge in the graph, where for each edge j, entries [j,1] and [j,2] index the nodes connected by that edge. Treating these are parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n",
82+
"For the Stan implementation of the ICAR component, we compute the per-node spatial variance by representing the spatial structure of the map as an _edgelist_; a 2D array of size 2 × J where J is the number of edges in the graph. Each column entry in this array represents one undirected edge in the graph, where for each edge j, entries [j,1] and [j,2] index the nodes connected by that edge. Treating these are parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n",
6983
"\n",
7084
"\n",
7185
"When the areal map is a single, fully connected component, i.e., a graph where any node in the graph can be reached from any other node, the BYM2 model is implemented as follows.\n",
@@ -97,8 +111,8 @@
97111
"parameters {\n",
98112
" real<lower=0, upper=1> rho; // proportion of spatial effect that's spatially smoothed\n",
99113
" real<lower = 0> sigma; // scale of spatial effects\n",
100-
" vector[I] theta; // standardized heterogeneous spatial effects\n",
101-
" vector[I] phi; // standardized spatially smoothed spatial effects"
114+
" vector[I] theta; // standardized spatially smoothed spatial effects\n",
115+
" vector[I] phi; // standardized heterogeneous spatial effects"
102116
]
103117
},
104118
{
@@ -113,37 +127,37 @@
113127
"metadata": {},
114128
"source": [
115129
"transformed parameters {\n",
116-
" vector[I] gamma = (sqrt(1 - rho) * theta + sqrt(rho / tau) * phi) * sigma;"
130+
" vector[I] gamma = (sqrt(rho / tau) * theta + sqrt(1 - rho) * phi) * sigma;"
117131
]
118132
},
119133
{
120134
"cell_type": "markdown",
121135
"metadata": {},
122136
"source": [
123-
"The spatial effects parameters `phi` has an ICAR prior. We implement this by defining a log probability density function to compute the ICAR model via the pairwise difference formula. Because this is an improper prior, we must add a soft sum-to-zero constraint:"
137+
"The spatial effects parameters `theta` has an ICAR prior. We implement this by defining a log probability density function to compute the ICAR model via the pairwise difference formula. Because this is an improper prior, we must add a soft sum-to-zero constraint:"
124138
]
125139
},
126140
{
127141
"cell_type": "raw",
128142
"metadata": {},
129143
"source": [
130-
"real standard_icar_lpdf(vector phi, int[ , ] adjacency) {\n",
131-
" return 0.5 * dot_self(phi[adjacency[1,]] - phi[adjacency[2]])\n",
132-
"\t + normal_lpdf(sum(phi) | 0, 0.001 * rows(phi));\n",
144+
"real standard_icar_lpdf(vector theta, int[ , ] adjacency) {\n",
145+
" return 0.5 * dot_self(theta[adjacency[1,]] - theta[adjacency[2]])\n",
146+
"\t + normal_lpdf(sum(theta) | 0, 0.001 * rows(theta));\n",
133147
"}"
134148
]
135149
},
136150
{
137151
"cell_type": "markdown",
138152
"metadata": {},
139153
"source": [
140-
"### Freni_Sterrantino recommendations for BYM2 model for disconnected graphs\n",
154+
"### BYM2 model for disconnected graphs\n",
141155
"\n",
142-
"Freni-Sterrantino et al show how to adjust the scaling factors when the areal map is not fully connected but has at least one connected multi-node component. For a map with $K$ components:\n",
156+
"Freni-Sterrantino et al show how to adjust the scaling factors when the areal map is not fully connected but has at least one connected multi-node component. For a map with $C$ components:\n",
143157
"\n",
144-
"- Each connected component of size > 1 is scaled independently with scaling factor ${\\tau}_k$ and a sum-to-zero constraint is imposed on that component.\n",
158+
"- Each connected component of size > 1 is scaled independently with scaling factor ${\\tau}_c$ and a sum-to-zero constraint is imposed on that component\n",
145159
"\n",
146-
"- Components of size 1 are drawn from a standard distribution; i.e., they are treated as having random i.i.d. spatial variance."
160+
"- Components of size 1 are drawn from a standard distribution; i.e., they are treated as having random i.i.d. spatial variance\n"
147161
]
148162
},
149163
{
@@ -152,10 +166,9 @@
152166
"source": [
153167
"### Stan implementation of the BYM2 model for disconnected graphs\n",
154168
"\n",
155-
"To extend the BYM2 model to these areal maps, we agument this model with a series of per-component masks into the node and edgelists and use Stan's multi-index operator and vectorized operations for efficient computation.\n",
156-
"\n",
157-
"The spatial structure includes a set of arrays describing component-wise node, edgesets.\n",
158-
"The `_cts` arrays record the size of the node and edgelists for each component, the `_idx` arrays provide the indices of the members of each component."
169+
"To extend the BYM2 model to these areal maps, we add a set of indexes for the node and edgelists which we can use to pick out the subgraphs for each component.\n",
170+
"The `_cts` arrays record the size of the node and edgelists for each component, the `_idx` arrays provide the indices of the members of each component.\n",
171+
"The Stan language's multi-index expression allows for vectorized operations given an array of indices."
159172
]
160173
},
161174
{
@@ -191,9 +204,9 @@
191204
" vector[I] gamma;\n",
192205
" for (k in 1:K)\n",
193206
" gamma[K_node_idxs[k, 1:K_node_cts[k]]] = \n",
194-
" (sqrt(1 - rho) * theta[K_node_idxs[k, 1:K_node_cts[k]]]\n",
207+
" (sqrt(rho / tau) * theta[K_node_idxs[k, 1:K_node_cts[k]]]\n",
195208
" +\n",
196-
" sqrt(rho / tau) * phi[K_node_idxs[k, 1:K_node_cts[k]]])\n",
209+
" sqrt(1 - rho) * phi[K_node_idxs[k, 1:K_node_cts[k]]])\n",
197210
" * sigma;"
198211
]
199212
},
@@ -209,7 +222,7 @@
209222
"cell_type": "raw",
210223
"metadata": {},
211224
"source": [
212-
"real standard_icar_disconnected_lpdf(vector phi,\n",
225+
"real standard_icar_disconnected_lpdf(vector theta,\n",
213226
"\t\t\t\t int[ , ] adjacency,\n",
214227
"\t\t\t\t int[ ] node_cts,\n",
215228
"\t\t\t\t int[ ] edge_cts,\n",
@@ -218,17 +231,77 @@
218231
" real total = 0;\n",
219232
" for (n in 1:size(node_cts)) {\n",
220233
" if (node_cts[n] > 1)\n",
221-
" total += -0.5 * dot_self(phi[adjacency[1, edge_idxs[n, 1:edge_cts[n]]]] -\n",
222-
" phi[adjacency[2, edge_idxs[n, 1:edge_cts[n]]]])\n",
223-
" + normal_lpdf(sum(phi[node_idxs[n, 1:node_cts[n]]]) |\n",
234+
" total += -0.5 * dot_self(theta[adjacency[1, edge_idxs[n, 1:edge_cts[n]]]] -\n",
235+
" theta[adjacency[2, edge_idxs[n, 1:edge_cts[n]]]])\n",
236+
" + normal_lpdf(sum(theta[node_idxs[n, 1:node_cts[n]]]) |\n",
224237
" 0, 0.001 * node_cts[n]);\n",
225238
" else\n",
226-
" total += normal_lpdf(phi[n] | 0, 1); // iid spatial variance\n",
239+
" total += normal_lpdf(theta[n] | 0, 1); // iid spatial variance\n",
227240
" }\n",
228241
" return total;\n",
229242
"}"
230243
]
231244
},
245+
{
246+
"cell_type": "markdown",
247+
"metadata": {},
248+
"source": [
249+
"### Stan program data inputs `I`, `J`, `edges`, `tau`\n",
250+
"\n",
251+
"A certain amount of data munging is necessary to assemble the data inputs to the Stan program.\n",
252+
"\n",
253+
"Areal maps, as coded in a modern geographical information systems (GIS), consist of \n",
254+
"a set of regions, where each region is described by polygons over latitude/longitude coordinates. These are distributed as a structured bundle of tables of information and geolocational coordinates called [shapefiles](https://en.wikipedia.org/wiki/Shapefile). Spatial weights are an abstraction over the geographical relationships in these maps.\n",
255+
"\n",
256+
"- The R package `sf` provides the tools to read in and edit GIS shapefils.\n",
257+
"(_Note: the Python library `geopandas` provides similar functionality_).\n",
258+
"\n",
259+
"- The R package `spdep` (_Python library `pysal`_) provides functions to create and manipulate spatial weights\n",
260+
"as graphs and/or matrices\n",
261+
"\n",
262+
"- The [igraph](https://igraph.org) libraries for R, Python, and C++ provide functions for graph and network analysis.\n",
263+
"\n",
264+
"- The R package `Matrix` provides functions for sparse matrices."
265+
]
266+
},
267+
{
268+
"cell_type": "markdown",
269+
"metadata": {},
270+
"source": [
271+
"#### Computing `I`, `J`, `edges`: the number of nodes, edges, and edgelist\n",
272+
"\n",
273+
"The areal graph is described in terms of integers `I` and`J` the number of nodes and edges, respectively, and `edges` the $2 \\times J$ integer array of edges, where entries `[1,j]` and `[2,j]` specify neighboring regions. Given a set of areal shapefiles as inputs, the munging required using the above R libraries is:\n",
274+
"\n",
275+
"1. Read in the shapefiles - R function `sf::st_read`\n",
276+
"2. Transform to neighbors list `spdep::poly2nb` - list length is `I`\n",
277+
"3. Transform list to adjacency matrix `spdep::nb2mat`, then extract edgelist - R function `igraph::graph_from_adjacency_matrix` - num rows is `J`.\n",
278+
"4. Transpose edgelist from $\\mathrm{J} \\times 2$ matrix to $2 \\times \\mathrm{J}$ - result is `edges`."
279+
]
280+
},
281+
{
282+
"cell_type": "markdown",
283+
"metadata": {},
284+
"source": [
285+
"#### Computing `tau`, the ICAR scaling factor\n",
286+
"\n",
287+
"Riebler recommends scaling the ICAR component $\\theta$ so the geometric mean of the average marginal variance of the areal units is $1$, in order to make mixing parameter $\\rho$ properly interpretable. The steps required for this computation are:\n",
288+
"\n",
289+
"1. compute Q, the precision matrix corresponding to the neighborhood structure of the areal map. In R, we use packages `Matrix` and `R-INLA` to compute `tau` by creating a `Matrix::sparseMatrix` object from the edgelist `edges` - this is Q.\n",
290+
"2. compute Q_pert by adding a small amount of noise to the diagonal elements of Q\n",
291+
"3. compute Q_inv, the covariances between all neighbors - **computationally expensive**\n",
292+
"4. compute the geometric mean of the diagonal elements of Q_inv, `exp(mean(log(x)))`, this is $\\tau$"
293+
]
294+
},
295+
{
296+
"cell_type": "markdown",
297+
"metadata": {},
298+
"source": [
299+
"### Data inputs for disconnected graphs\n",
300+
"\n",
301+
"\n",
302+
"Using the R `spdep` package functions `n.comp.id` and `subset.nb`, we first create per-component subgraphs and then compute the scaling factor as outlined above."
303+
]
304+
},
232305
{
233306
"cell_type": "markdown",
234307
"metadata": {},
@@ -458,7 +531,8 @@
458531
"source": [
459532
"names = list(islands_summary.index)\n",
460533
"phi_rows = [names.index(name) for name in names if name.startswith('phi[')]\n",
461-
"eta_rows = [names.index(name) for name in names if name.startswith('eta[')]"
534+
"theta_rows = [names.index(name) for name in names if name.startswith('theta[')]\n",
535+
"relrisk_rows = [names.index(name) for name in names if name.startswith('y_prime[')]\n"
462536
]
463537
},
464538
{
@@ -467,9 +541,11 @@
467541
"metadata": {},
468542
"outputs": [],
469543
"source": [
470-
"print('spatial effects:\\n{}\\n\\nlog relative risk:\\n{}\\n'.format(\n",
544+
"print('observed:\\n{}\\nspatial effects:\\n{}\\nextra_spatial_effect:\\n{}\\nlog relative risk:\\n{}\\n'.format(\n",
545+
" islands_data['y'][0:11],\n",
471546
" islands_summary.iloc[phi_rows,:][0:11], \n",
472-
" islands_summary.iloc[eta_rows,:][0:11]))"
547+
" islands_summary.iloc[theta_rows,:][0:11], \n",
548+
" islands_summary.iloc[relrisk_rows,:][0:11]))"
473549
]
474550
},
475551
{
@@ -571,14 +647,35 @@
571647
"scrolled": false
572648
},
573649
"outputs": [],
650+
"source": [
651+
"az.rcParams.update({'plot.max_subplots': 12})\n",
652+
"\n",
653+
"az.plot_density(\n",
654+
" [connected_az, islands_az], \n",
655+
" data_labels=['connected', 'islands'], \n",
656+
" var_names=[\"y_prime\"],\n",
657+
" shade=0.1,\n",
658+
" textsize=32,\n",
659+
" hdi_prob=0.999999\n",
660+
")\n",
661+
"plt.show()"
662+
]
663+
},
664+
{
665+
"cell_type": "code",
666+
"execution_count": null,
667+
"metadata": {},
668+
"outputs": [],
574669
"source": [
575670
"az.rcParams.update({'plot.max_subplots': 56})\n",
576671
"\n",
577672
"az.plot_density(\n",
578673
" [connected_az, islands_az], \n",
579674
" data_labels=['connected', 'islands'], \n",
580-
" var_names=[\"eta\"],\n",
581-
" shade=0.1\n",
675+
" var_names=[\"phi\"],\n",
676+
" shade=0.1,\n",
677+
" textsize=48,\n",
678+
" hdi_prob=.999999\n",
582679
")\n",
583680
"plt.show()"
584681
]
-679 KB
Binary file not shown.

knitr/car-iar-poisson/update_2021_02/bym2_helpers.R

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -23,11 +23,9 @@ nb_to_edge_array <- function(nb_obj) {
2323
t(as_edgelist(graph_from_adjacency_matrix(adj_matrix, mode="undirected")))
2424
}
2525

26-
2726
# compute geometric mean of a vector
2827
geometric_mean <- function(x) exp(mean(log(x)))
2928

30-
3129
# compute scaling factor for a fully connected areal map
3230
# accounts for differences in spatial connectivity
3331
scaling_factor <- function(edge_array) {
@@ -93,7 +91,7 @@ index_components <- function(nb_obj) {
9391
} else {
9492
scaling_factors[k] = 1.0
9593
}
96-
}
94+
}
9795

9896
return(list("K"=num_comps,
9997
"K_node_cts"=as.vector(table(comp_idxs)),
@@ -102,4 +100,3 @@ index_components <- function(nb_obj) {
102100
"K_edge_idxs"=comp_edge_idxs,
103101
"tau"=scaling_factors))
104102
}
105-

0 commit comments

Comments
 (0)