|
23 | 23 | "### Disease mapping: computing relative risk over a map of geographical regions\n", |
24 | 24 | "\n", |
25 | 25 | "Disease mapping concerns the study of disease risk over a map of geographical regions.\n", |
26 | | - "For an areal map of $I$ regions, the outcome $y_i$ is the number of cases of a given disease in region $i$. For a rare disease, a Poisson model is assumed, $y_i | {\\theta}_i \\sim Po({\\theta}_i)$ with mean ${\\theta}_i = E_i * r_i$, where $E_i$ is the expected cases count for the disease and $r_i$ is the relative risk. Relative risk values above 1 indicate higher risk associated living a region. The relative risk can be modelled in terms of the effect of a covariates X as $log(r_i) = \\alpha + \\beta * x_i + {re}_i$; $\\alpha$ is the baseline log risk, $\\beta$ is the effect of the covariates, and ${re}_i$ is a random effect capturing extra Poisson variability possibly due to unobserved risk factors.\n", |
27 | | - "\n", |
28 | | - "In these models, areal maps are represented as a graph where the nodes in the graph are areal regions and the undirected edges in the graph represent the symmetric neighbor relationship." |
| 26 | + "For an areal map of $N$ regions, the data consists of\n", |
| 27 | + "outcome $y_i$, the number of cases of a given disease in region $i$,\n", |
| 28 | + "and may possibly include observed values of region-specific covariates $x_i$.\n", |
| 29 | + "The counts $y_i$ are modeled as either Poisson or binomial random variables in generalized linear models,\n", |
| 30 | + "using a log or logit link function, respectively.\n", |
| 31 | + "For rare diseases the binomial probability is small and the Poisson model\n", |
| 32 | + "is used as an approximation.\n", |
| 33 | + "\n", |
| 34 | + "Counts of rare events in small-population regions are noisy;\n", |
| 35 | + "removing this noise allows the underlying phenomena of interest to be seen more clearly.\n", |
| 36 | + "Conditional autoregressive (CAR) models smooth noisy estimates\n", |
| 37 | + "by pooling information from neighboring regions.\n", |
| 38 | + "Given an areal map, the binary _neighbor_ relationship\n", |
| 39 | + "(written $i \\sim j$ where $i \\neq j$)\n", |
| 40 | + "is $1$ if regions $n_i$ and $n_j$ are neighbors and is otherwise $0$.\n", |
| 41 | + "For CAR models, the neighbor relationship is symmetric but not reflexive;\n", |
| 42 | + "if $i \\sim j$ then $j \\sim i$, but a region is not its own neighbor." |
29 | 43 | ] |
30 | 44 | }, |
31 | 45 | { |
|
34 | 48 | "source": [ |
35 | 49 | "### The BYM2 model\n", |
36 | 50 | "\n", |
37 | | - "The BYM2 model is a disease mapping model presented in [Riebler et al. 2016](https://arxiv.org/abs/1601.0118).\n", |
38 | | - "For the above disease mapping regression model, the random effects component is parameterized in terms of:\n", |
| 51 | + "The BYM2 model is a disease mapping model presented in [Riebler et al. 2016](https://arxiv.org/abs/1601.0118)\n", |
| 52 | + "with a spatial random effects component parameterized in terms of:\n", |
39 | 53 | "\n", |
40 | | - "- $\\phi$, an ICAR component which accounts for the spatial structure of the data.\n", |
41 | | - "- $\\theta$, an ordinary random effects component which accounts for non-spatial heterogeneity.\n", |
| 54 | + "- $\\theta$, an ICAR component which accounts for the spatial structure of the data.\n", |
| 55 | + "- $\\phi$, an ordinary random effects component which accounts for non-spatial heterogeneity.\n", |
42 | 56 | "- $\\rho$, a mixing parameter which accounts for the amount of spatial/non-spatial variation.\n", |
43 | 57 | "- $\\sigma$, a precision (scale) parameter placed on the combined ICAR and ordinary random effects components.\n", |
44 | 58 | "\n", |
| 59 | + "The mixing parameter $\\rho$ is the fraction of spatial variance; the amount of non-spatial variance is $1-\\rho$.\n", |
45 | 60 | "In order for $\\sigma$ to legitimately be the standard deviation of the combined components,\n", |
46 | | - "it is critical that for each $i$, $\\operatorname{Var}(\\phi_i) \\approx \\operatorname{Var}(\\theta_i) \\approx 1$. therfore, the BYM2 model introduces a scaling factor $\\tau$ to the model.\n", |
47 | | - "Riebler recommends scaling the ICAR component $\\phi$ so the geometric mean of the average marginal variance of its elements is 1. \n", |
48 | | - "By dividing the spatial component $\\phi$ by $\\sqrt{\\tau}$, the variances of these components are on the same scale.\n", |
49 | | - "For irregular areal maps, where individual regions have varying number of neighbors, the scaling factor $\\tau$ necessarily comes into the model as data.\n", |
| 61 | + "it is critical that for each areal unit $i$, the spatial and heterogenious random effects are on the same scale so that $\\operatorname{Var}(\\theta_i) \\approx \\operatorname{Var}(\\phi_i) \\approx 1$. Because $\\phi$ is a vector of standard Gaussians, $\\operatorname{Var}(\\phi_i) \\approx 1$ by construction. To scale $\\rho$ for the ICAR component $\\theta$, the BYM2 model introduces variable $\\tau$.\n", |
| 62 | + "The combined spatial, non-spatial random effects are:\n", |
| 63 | + "$$\\sigma ((\\sqrt{\\rho/\\tau}\\,\\theta + \\sqrt{1-\\rho}\\,\\phi)$$\n", |
50 | 64 | "\n", |
51 | | - "The combined random effects component for the BYM2 model is: \n", |
52 | | - "$$\\sigma (\\sqrt{1-\\rho}\\,\\theta^* + (\\sqrt{\\rho/\\tau}\\,\\phi^* )$$\n", |
| 65 | + "Riebler recommends using the geometric mean of the average marginal variance of the elements of $\\theta$ as the value of $\\tau$.\n", |
| 66 | + "As the structure of $\\theta$ is map-specific, i.e., dependent on the number of regions and number of neighbors of each region, the scaling factor $\\tau$ is data-dependent.\n", |
53 | 67 | "\n", |
54 | 68 | "The recommended priors are:\n", |
55 | 69 | "\n", |
56 | 70 | "- A standard prior on the standard deviation $\\sigma$; we use a half-normal, also possible are half-t or an exponential.\n", |
57 | 71 | "- A beta(1/2,1/2) prior on $\\rho$.\n", |
58 | 72 | "\n", |
59 | | - "The Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) provides the background and derivations for the ICAR and the BYM2 model." |
| 73 | + "The Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) provides the background and derivations for the ICAR and the BYM2 model.\n" |
60 | 74 | ] |
61 | 75 | }, |
62 | 76 | { |
|
65 | 79 | "source": [ |
66 | 80 | "### Stan implementation of the BYM2 model for a fully connected spatial structure\n", |
67 | 81 | "\n", |
68 | | - "For the Stan implementation of the ICAR component, we compute the per-node spatial variance by representing the spatial structrue of the map as an _edgelist_; a 2D array of size 2 × J where J is the number of edges in the graph. Each column entry in this array represents one undirected edge in the graph, where for each edge j, entries [j,1] and [j,2] index the nodes connected by that edge. Treating these are parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n", |
| 82 | + "For the Stan implementation of the ICAR component, we compute the per-node spatial variance by representing the spatial structure of the map as an _edgelist_; a 2D array of size 2 × J where J is the number of edges in the graph. Each column entry in this array represents one undirected edge in the graph, where for each edge j, entries [j,1] and [j,2] index the nodes connected by that edge. Treating these are parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n", |
69 | 83 | "\n", |
70 | 84 | "\n", |
71 | 85 | "When the areal map is a single, fully connected component, i.e., a graph where any node in the graph can be reached from any other node, the BYM2 model is implemented as follows.\n", |
|
97 | 111 | "parameters {\n", |
98 | 112 | " real<lower=0, upper=1> rho; // proportion of spatial effect that's spatially smoothed\n", |
99 | 113 | " real<lower = 0> sigma; // scale of spatial effects\n", |
100 | | - " vector[I] theta; // standardized heterogeneous spatial effects\n", |
101 | | - " vector[I] phi; // standardized spatially smoothed spatial effects" |
| 114 | + " vector[I] theta; // standardized spatially smoothed spatial effects\n", |
| 115 | + " vector[I] phi; // standardized heterogeneous spatial effects" |
102 | 116 | ] |
103 | 117 | }, |
104 | 118 | { |
|
113 | 127 | "metadata": {}, |
114 | 128 | "source": [ |
115 | 129 | "transformed parameters {\n", |
116 | | - " vector[I] gamma = (sqrt(1 - rho) * theta + sqrt(rho / tau) * phi) * sigma;" |
| 130 | + " vector[I] gamma = (sqrt(rho / tau) * theta + sqrt(1 - rho) * phi) * sigma;" |
117 | 131 | ] |
118 | 132 | }, |
119 | 133 | { |
120 | 134 | "cell_type": "markdown", |
121 | 135 | "metadata": {}, |
122 | 136 | "source": [ |
123 | | - "The spatial effects parameters `phi` has an ICAR prior. We implement this by defining a log probability density function to compute the ICAR model via the pairwise difference formula. Because this is an improper prior, we must add a soft sum-to-zero constraint:" |
| 137 | + "The spatial effects parameters `theta` has an ICAR prior. We implement this by defining a log probability density function to compute the ICAR model via the pairwise difference formula. Because this is an improper prior, we must add a soft sum-to-zero constraint:" |
124 | 138 | ] |
125 | 139 | }, |
126 | 140 | { |
127 | 141 | "cell_type": "raw", |
128 | 142 | "metadata": {}, |
129 | 143 | "source": [ |
130 | | - "real standard_icar_lpdf(vector phi, int[ , ] adjacency) {\n", |
131 | | - " return 0.5 * dot_self(phi[adjacency[1,]] - phi[adjacency[2]])\n", |
132 | | - "\t + normal_lpdf(sum(phi) | 0, 0.001 * rows(phi));\n", |
| 144 | + "real standard_icar_lpdf(vector theta, int[ , ] adjacency) {\n", |
| 145 | + " return 0.5 * dot_self(theta[adjacency[1,]] - theta[adjacency[2]])\n", |
| 146 | + "\t + normal_lpdf(sum(theta) | 0, 0.001 * rows(theta));\n", |
133 | 147 | "}" |
134 | 148 | ] |
135 | 149 | }, |
136 | 150 | { |
137 | 151 | "cell_type": "markdown", |
138 | 152 | "metadata": {}, |
139 | 153 | "source": [ |
140 | | - "### Freni_Sterrantino recommendations for BYM2 model for disconnected graphs\n", |
| 154 | + "### BYM2 model for disconnected graphs\n", |
141 | 155 | "\n", |
142 | | - "Freni-Sterrantino et al show how to adjust the scaling factors when the areal map is not fully connected but has at least one connected multi-node component. For a map with $K$ components:\n", |
| 156 | + "Freni-Sterrantino et al show how to adjust the scaling factors when the areal map is not fully connected but has at least one connected multi-node component. For a map with $C$ components:\n", |
143 | 157 | "\n", |
144 | | - "- Each connected component of size > 1 is scaled independently with scaling factor ${\\tau}_k$ and a sum-to-zero constraint is imposed on that component.\n", |
| 158 | + "- Each connected component of size > 1 is scaled independently with scaling factor ${\\tau}_c$ and a sum-to-zero constraint is imposed on that component\n", |
145 | 159 | "\n", |
146 | | - "- Components of size 1 are drawn from a standard distribution; i.e., they are treated as having random i.i.d. spatial variance." |
| 160 | + "- Components of size 1 are drawn from a standard distribution; i.e., they are treated as having random i.i.d. spatial variance\n" |
147 | 161 | ] |
148 | 162 | }, |
149 | 163 | { |
|
152 | 166 | "source": [ |
153 | 167 | "### Stan implementation of the BYM2 model for disconnected graphs\n", |
154 | 168 | "\n", |
155 | | - "To extend the BYM2 model to these areal maps, we agument this model with a series of per-component masks into the node and edgelists and use Stan's multi-index operator and vectorized operations for efficient computation.\n", |
156 | | - "\n", |
157 | | - "The spatial structure includes a set of arrays describing component-wise node, edgesets.\n", |
158 | | - "The `_cts` arrays record the size of the node and edgelists for each component, the `_idx` arrays provide the indices of the members of each component." |
| 169 | + "To extend the BYM2 model to these areal maps, we add a set of indexes for the node and edgelists which we can use to pick out the subgraphs for each component.\n", |
| 170 | + "The `_cts` arrays record the size of the node and edgelists for each component, the `_idx` arrays provide the indices of the members of each component.\n", |
| 171 | + "The Stan language's multi-index expression allows for vectorized operations given an array of indices." |
159 | 172 | ] |
160 | 173 | }, |
161 | 174 | { |
|
191 | 204 | " vector[I] gamma;\n", |
192 | 205 | " for (k in 1:K)\n", |
193 | 206 | " gamma[K_node_idxs[k, 1:K_node_cts[k]]] = \n", |
194 | | - " (sqrt(1 - rho) * theta[K_node_idxs[k, 1:K_node_cts[k]]]\n", |
| 207 | + " (sqrt(rho / tau) * theta[K_node_idxs[k, 1:K_node_cts[k]]]\n", |
195 | 208 | " +\n", |
196 | | - " sqrt(rho / tau) * phi[K_node_idxs[k, 1:K_node_cts[k]]])\n", |
| 209 | + " sqrt(1 - rho) * phi[K_node_idxs[k, 1:K_node_cts[k]]])\n", |
197 | 210 | " * sigma;" |
198 | 211 | ] |
199 | 212 | }, |
|
209 | 222 | "cell_type": "raw", |
210 | 223 | "metadata": {}, |
211 | 224 | "source": [ |
212 | | - "real standard_icar_disconnected_lpdf(vector phi,\n", |
| 225 | + "real standard_icar_disconnected_lpdf(vector theta,\n", |
213 | 226 | "\t\t\t\t int[ , ] adjacency,\n", |
214 | 227 | "\t\t\t\t int[ ] node_cts,\n", |
215 | 228 | "\t\t\t\t int[ ] edge_cts,\n", |
|
218 | 231 | " real total = 0;\n", |
219 | 232 | " for (n in 1:size(node_cts)) {\n", |
220 | 233 | " if (node_cts[n] > 1)\n", |
221 | | - " total += -0.5 * dot_self(phi[adjacency[1, edge_idxs[n, 1:edge_cts[n]]]] -\n", |
222 | | - " phi[adjacency[2, edge_idxs[n, 1:edge_cts[n]]]])\n", |
223 | | - " + normal_lpdf(sum(phi[node_idxs[n, 1:node_cts[n]]]) |\n", |
| 234 | + " total += -0.5 * dot_self(theta[adjacency[1, edge_idxs[n, 1:edge_cts[n]]]] -\n", |
| 235 | + " theta[adjacency[2, edge_idxs[n, 1:edge_cts[n]]]])\n", |
| 236 | + " + normal_lpdf(sum(theta[node_idxs[n, 1:node_cts[n]]]) |\n", |
224 | 237 | " 0, 0.001 * node_cts[n]);\n", |
225 | 238 | " else\n", |
226 | | - " total += normal_lpdf(phi[n] | 0, 1); // iid spatial variance\n", |
| 239 | + " total += normal_lpdf(theta[n] | 0, 1); // iid spatial variance\n", |
227 | 240 | " }\n", |
228 | 241 | " return total;\n", |
229 | 242 | "}" |
230 | 243 | ] |
231 | 244 | }, |
| 245 | + { |
| 246 | + "cell_type": "markdown", |
| 247 | + "metadata": {}, |
| 248 | + "source": [ |
| 249 | + "### Stan program data inputs `I`, `J`, `edges`, `tau`\n", |
| 250 | + "\n", |
| 251 | + "A certain amount of data munging is necessary to assemble the data inputs to the Stan program.\n", |
| 252 | + "\n", |
| 253 | + "Areal maps, as coded in a modern geographical information systems (GIS), consist of \n", |
| 254 | + "a set of regions, where each region is described by polygons over latitude/longitude coordinates. These are distributed as a structured bundle of tables of information and geolocational coordinates called [shapefiles](https://en.wikipedia.org/wiki/Shapefile). Spatial weights are an abstraction over the geographical relationships in these maps.\n", |
| 255 | + "\n", |
| 256 | + "- The R package `sf` provides the tools to read in and edit GIS shapefils.\n", |
| 257 | + "(_Note: the Python library `geopandas` provides similar functionality_).\n", |
| 258 | + "\n", |
| 259 | + "- The R package `spdep` (_Python library `pysal`_) provides functions to create and manipulate spatial weights\n", |
| 260 | + "as graphs and/or matrices\n", |
| 261 | + "\n", |
| 262 | + "- The [igraph](https://igraph.org) libraries for R, Python, and C++ provide functions for graph and network analysis.\n", |
| 263 | + "\n", |
| 264 | + "- The R package `Matrix` provides functions for sparse matrices." |
| 265 | + ] |
| 266 | + }, |
| 267 | + { |
| 268 | + "cell_type": "markdown", |
| 269 | + "metadata": {}, |
| 270 | + "source": [ |
| 271 | + "#### Computing `I`, `J`, `edges`: the number of nodes, edges, and edgelist\n", |
| 272 | + "\n", |
| 273 | + "The areal graph is described in terms of integers `I` and`J` the number of nodes and edges, respectively, and `edges` the $2 \\times J$ integer array of edges, where entries `[1,j]` and `[2,j]` specify neighboring regions. Given a set of areal shapefiles as inputs, the munging required using the above R libraries is:\n", |
| 274 | + "\n", |
| 275 | + "1. Read in the shapefiles - R function `sf::st_read`\n", |
| 276 | + "2. Transform to neighbors list `spdep::poly2nb` - list length is `I`\n", |
| 277 | + "3. Transform list to adjacency matrix `spdep::nb2mat`, then extract edgelist - R function `igraph::graph_from_adjacency_matrix` - num rows is `J`.\n", |
| 278 | + "4. Transpose edgelist from $\\mathrm{J} \\times 2$ matrix to $2 \\times \\mathrm{J}$ - result is `edges`." |
| 279 | + ] |
| 280 | + }, |
| 281 | + { |
| 282 | + "cell_type": "markdown", |
| 283 | + "metadata": {}, |
| 284 | + "source": [ |
| 285 | + "#### Computing `tau`, the ICAR scaling factor\n", |
| 286 | + "\n", |
| 287 | + "Riebler recommends scaling the ICAR component $\\theta$ so the geometric mean of the average marginal variance of the areal units is $1$, in order to make mixing parameter $\\rho$ properly interpretable. The steps required for this computation are:\n", |
| 288 | + "\n", |
| 289 | + "1. compute Q, the precision matrix corresponding to the neighborhood structure of the areal map. In R, we use packages `Matrix` and `R-INLA` to compute `tau` by creating a `Matrix::sparseMatrix` object from the edgelist `edges` - this is Q.\n", |
| 290 | + "2. compute Q_pert by adding a small amount of noise to the diagonal elements of Q\n", |
| 291 | + "3. compute Q_inv, the covariances between all neighbors - **computationally expensive**\n", |
| 292 | + "4. compute the geometric mean of the diagonal elements of Q_inv, `exp(mean(log(x)))`, this is $\\tau$" |
| 293 | + ] |
| 294 | + }, |
| 295 | + { |
| 296 | + "cell_type": "markdown", |
| 297 | + "metadata": {}, |
| 298 | + "source": [ |
| 299 | + "### Data inputs for disconnected graphs\n", |
| 300 | + "\n", |
| 301 | + "\n", |
| 302 | + "Using the R `spdep` package functions `n.comp.id` and `subset.nb`, we first create per-component subgraphs and then compute the scaling factor as outlined above." |
| 303 | + ] |
| 304 | + }, |
232 | 305 | { |
233 | 306 | "cell_type": "markdown", |
234 | 307 | "metadata": {}, |
|
458 | 531 | "source": [ |
459 | 532 | "names = list(islands_summary.index)\n", |
460 | 533 | "phi_rows = [names.index(name) for name in names if name.startswith('phi[')]\n", |
461 | | - "eta_rows = [names.index(name) for name in names if name.startswith('eta[')]" |
| 534 | + "theta_rows = [names.index(name) for name in names if name.startswith('theta[')]\n", |
| 535 | + "relrisk_rows = [names.index(name) for name in names if name.startswith('y_prime[')]\n" |
462 | 536 | ] |
463 | 537 | }, |
464 | 538 | { |
|
467 | 541 | "metadata": {}, |
468 | 542 | "outputs": [], |
469 | 543 | "source": [ |
470 | | - "print('spatial effects:\\n{}\\n\\nlog relative risk:\\n{}\\n'.format(\n", |
| 544 | + "print('observed:\\n{}\\nspatial effects:\\n{}\\nextra_spatial_effect:\\n{}\\nlog relative risk:\\n{}\\n'.format(\n", |
| 545 | + " islands_data['y'][0:11],\n", |
471 | 546 | " islands_summary.iloc[phi_rows,:][0:11], \n", |
472 | | - " islands_summary.iloc[eta_rows,:][0:11]))" |
| 547 | + " islands_summary.iloc[theta_rows,:][0:11], \n", |
| 548 | + " islands_summary.iloc[relrisk_rows,:][0:11]))" |
473 | 549 | ] |
474 | 550 | }, |
475 | 551 | { |
|
571 | 647 | "scrolled": false |
572 | 648 | }, |
573 | 649 | "outputs": [], |
| 650 | + "source": [ |
| 651 | + "az.rcParams.update({'plot.max_subplots': 12})\n", |
| 652 | + "\n", |
| 653 | + "az.plot_density(\n", |
| 654 | + " [connected_az, islands_az], \n", |
| 655 | + " data_labels=['connected', 'islands'], \n", |
| 656 | + " var_names=[\"y_prime\"],\n", |
| 657 | + " shade=0.1,\n", |
| 658 | + " textsize=32,\n", |
| 659 | + " hdi_prob=0.999999\n", |
| 660 | + ")\n", |
| 661 | + "plt.show()" |
| 662 | + ] |
| 663 | + }, |
| 664 | + { |
| 665 | + "cell_type": "code", |
| 666 | + "execution_count": null, |
| 667 | + "metadata": {}, |
| 668 | + "outputs": [], |
574 | 669 | "source": [ |
575 | 670 | "az.rcParams.update({'plot.max_subplots': 56})\n", |
576 | 671 | "\n", |
577 | 672 | "az.plot_density(\n", |
578 | 673 | " [connected_az, islands_az], \n", |
579 | 674 | " data_labels=['connected', 'islands'], \n", |
580 | | - " var_names=[\"eta\"],\n", |
581 | | - " shade=0.1\n", |
| 675 | + " var_names=[\"phi\"],\n", |
| 676 | + " shade=0.1,\n", |
| 677 | + " textsize=48,\n", |
| 678 | + " hdi_prob=.999999\n", |
582 | 679 | ")\n", |
583 | 680 | "plt.show()" |
584 | 681 | ] |
|
0 commit comments