checkpointing update bym2 islands notebook, R helpers

mitzimorris · mitzimorris · commit 885bd18e93fd · 2021-05-14T08:48:17.000-04:00
diff --git a/knitr/car-iar-poisson/update_2021_02/BYM2 islands.ipynb b/knitr/car-iar-poisson/update_2021_02/BYM2 islands.ipynb
@@ -23,9 +23,23 @@
     "### Disease mapping: computing relative risk over a map of geographical regions\n",
     "\n",
     "Disease mapping concerns the study of disease risk over a map of geographical regions.\n",
-    "For an areal map of $I$ regions, the outcome $y_i$ is the number of cases of a given disease in region $i$. For a rare disease, a Poisson model is assumed, $y_i | {\\theta}_i \\sim Po({\\theta}_i)$ with mean ${\\theta}_i = E_i * r_i$, where $E_i$ is the expected cases count for the disease and $r_i$ is the relative risk.  Relative risk values above 1 indicate higher risk associated living a region. The relative risk can be modelled in terms of the effect of a covariates X as $log(r_i) = \\alpha + \\beta * x_i + {re}_i$; $\\alpha$ is the baseline log risk, $\\beta$ is the effect of the covariates, and ${re}_i$ is a random effect capturing extra Poisson variability possibly due to unobserved risk factors.\n",
-    "\n",
-    "In these models, areal maps are represented as a graph where the nodes in the graph are areal regions and the undirected edges in the graph represent the symmetric neighbor relationship."
+    "For an areal map of $N$ regions, the data consists of\n",
+    "outcome $y_i$, the number of cases of a given disease in region $i$,\n",
+    "and may include observed values of region-specific covariates $x_i$.\n",
+    "The counts $y_i$ are modeled as either Poisson or binomial random variables in generalized linear models,\n",
+    "using a log or logit link function, respectively.\n",
+    "For rare diseases the binomial probability is small and the Poisson model\n",
+    "is used as an approximation.\n",
+    "\n",
+    "Counts of rare events in small-population regions are noisy;\n",
+    "removing this noise allows the underlying phenomena of interest to be seen more clearly.\n",
+    "Conditional autoregressive (CAR) models smooth noisy estimates\n",
+    "by pooling information from neighboring regions.\n",
+    "Given an areal map, the binary _neighbor_ relationship\n",
+    "(written $i \\sim j$ where $i \\neq j$)\n",
+    "is $1$ if regions $n_i$ and $n_j$ are neighbors and is otherwise $0$.\n",
+    "For CAR models, the neighbor relationship is symmetric but not reflexive;\n",
+    "if $i \\sim j$ then $j \\sim i$, but a region is not its own neighbor."
    ]
   },
   {
@@ -34,29 +48,29 @@
    "source": [
     "### The BYM2 model\n",
     "\n",
-    "The BYM2 model is a disease mapping model presented in [Riebler et al. 2016](https://arxiv.org/abs/1601.0118).\n",
-    "For the above disease mapping regression model, the random effects component is parameterized in terms of:\n",
+    "The BYM2 model is a disease mapping model presented in [Riebler et al. 2016](https://arxiv.org/abs/1601.01180)\n",
+    "with a spatial random effects component parameterized in terms of:\n",
     "\n",
-    "- $\\phi$, an ICAR component which accounts for the spatial structure of the data.\n",
-    "- $\\theta$, an ordinary random effects component which accounts for non-spatial heterogeneity.\n",
+    "- $\\theta$, an ICAR component which accounts for the spatial structure of the data.\n",
+    "- $\\phi$, an ordinary random effects component which accounts for non-spatial heterogeneity.\n",
     "- $\\rho$,  a mixing parameter which accounts for the amount of spatial/non-spatial variation.\n",
     "- $\\sigma$, a precision (scale) parameter placed on the combined ICAR and ordinary random effects components.\n",
     "\n",
+    "The mixing parameter $\\rho$ is the fraction of spatial variance; the amount of non-spatial variance is $1-\\rho$.\n",
     "In order for $\\sigma$ to legitimately be the standard deviation of the combined components,\n",
-    "it is critical that for each $i$, $\\operatorname{Var}(\\phi_i) \\approx \\operatorname{Var}(\\theta_i) \\approx 1$. therfore, the BYM2 model introduces a scaling factor $\\tau$ to the model.\n",
-    "Riebler recommends scaling the ICAR component $\\phi$ so the geometric mean of the average marginal variance of its elements is 1. \n",
-    "By dividing the spatial component $\\phi$ by $\\sqrt{\\tau}$, the variances of these components are on the same scale.\n",
-    "For irregular areal maps, where individual regions have varying number of neighbors, the scaling factor $\\tau$ necessarily comes into the model as data.\n",
+    "it is critical that for each areal unit $i$, the spatial and heterogeneous random effects are on the same scale so that $\\operatorname{Var}(\\theta_i) \\approx \\operatorname{Var}(\\phi_i) \\approx 1$. Because $\\phi$ is a vector of standard Gaussians, $\\operatorname{Var}(\\phi_i) \\approx 1$ by construction. To put the ICAR component $\\theta$ on the same scale, the BYM2 model introduces the scaling factor $\\tau$.\n",
+    "The combined spatial and non-spatial random effects are:\n",
+    "$$\\sigma (\\sqrt{\\rho/\\tau}\\,\\theta + \\sqrt{1-\\rho}\\,\\phi)$$\n",
     "\n",
-    "The combined random effects component for the BYM2 model is: \n",
-    "$$\\sigma (\\sqrt{1-\\rho}\\,\\theta^* + (\\sqrt{\\rho/\\tau}\\,\\phi^* )$$\n",
+    "Riebler et al. recommend using the geometric mean of the marginal variances of the elements of $\\theta$ as the value of $\\tau$.\n",
+    "As the structure of $\\theta$ is map-specific, i.e., dependent on the number of regions and number of neighbors of each region, the scaling factor $\\tau$ is data-dependent.\n",
     "\n",
     "The recommended priors are:\n",
     "\n",
     "- A standard prior on the standard deviation $\\sigma$; we use a half-normal, also possible are half-t or an exponential.\n",
     "- A beta(1/2,1/2) prior on $\\rho$.\n",
     "\n",
-    "The Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) provides the background and derivations for the ICAR and the BYM2 model."
+    "The Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) provides the background and derivations for the ICAR and the BYM2 model.\n"
    ]
   },
   {
@@ -65,7 +79,7 @@
    "source": [
     "### Stan implementation of the BYM2 model for a fully connected spatial structure\n",
     "\n",
-    "For the Stan implementation of the ICAR component, we compute the per-node spatial variance by representing the spatial structrue of the map as an _edgelist_; a 2D array of size 2 × J where J is the number of edges in the graph. Each column entry in this array represents one undirected edge in the graph, where for each edge j, entries [j,1] and [j,2] index the nodes connected by that edge. Treating these are parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n",
+    "For the Stan implementation of the ICAR component, we compute the per-node spatial variance by representing the spatial structure of the map as an _edgelist_: a 2D array of size 2 × J, where J is the number of edges in the graph. Each column in this array represents one undirected edge in the graph, where for each edge j, entries [1,j] and [2,j] index the nodes connected by that edge. Treating the two rows as parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n",
     "\n",
     "\n",
     "When the areal map is a single, fully connected component, i.e., a graph where any node in the graph can be reached from any other node, the BYM2 model is implemented as follows.\n",
@@ -97,8 +111,8 @@
     "parameters {\n",
     "  real<lower=0, upper=1> rho; // proportion of spatial effect that's spatially smoothed\n",
     "  real<lower = 0> sigma;  // scale of spatial effects\n",
-    "  vector[I] theta;  // standardized heterogeneous spatial effects\n",
-    "  vector[I] phi;  // standardized spatially smoothed spatial effects"
+    "  vector[I] theta;  // standardized spatially smoothed spatial effects\n",
+    "  vector[I] phi;  // standardized heterogeneous spatial effects"
    ]
   },
   {
@@ -113,37 +127,37 @@
    "metadata": {},
    "source": [
     "transformed parameters {\n",
-    "  vector[I] gamma = (sqrt(1 - rho) * theta + sqrt(rho / tau) * phi) * sigma;"
+    "  vector[I] gamma = (sqrt(rho / tau) * theta + sqrt(1 - rho) * phi) * sigma;"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The spatial effects parameters `phi` has an ICAR prior.  We implement this by defining a log probability density function to compute the ICAR model via the pairwise difference formula.  Because this is an improper prior, we must add a soft sum-to-zero constraint:"
+    "The spatial effects parameter `theta` has an ICAR prior.  We implement this by defining a log probability density function to compute the ICAR model via the pairwise difference formula.  Because this is an improper prior, we must add a soft sum-to-zero constraint:"
    ]
   },
   {
    "cell_type": "raw",
    "metadata": {},
    "source": [
-    "real standard_icar_lpdf(vector phi, int[ , ] adjacency) {\n",
-    "    return 0.5 * dot_self(phi[adjacency[1,]] - phi[adjacency[2]])\n",
-    "\t  + normal_lpdf(sum(phi) | 0, 0.001 * rows(phi));\n",
+    "real standard_icar_lpdf(vector theta, int[ , ] adjacency) {\n",
+    "    return -0.5 * dot_self(theta[adjacency[1]] - theta[adjacency[2]])\n",
+    "\t  + normal_lpdf(sum(theta) | 0, 0.001 * rows(theta));\n",
     "}"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Freni_Sterrantino recommendations for BYM2 model for disconnected graphs\n",
+    "### BYM2 model for disconnected graphs\n",
     "\n",
-    "Freni-Sterrantino et al show how to adjust the scaling factors when the areal map is not fully connected but has at least one connected multi-node component.  For a map with $K$ components:\n",
+    "Freni-Sterrantino et al. show how to adjust the scaling factors when the areal map is not fully connected but has at least one connected multi-node component.  For a map with $C$ components:\n",
     "\n",
-    "- Each connected component of size > 1 is scaled independently with scaling factor ${\\tau}_k$ and a sum-to-zero constraint is imposed on that component.\n",
+    "- Each connected component of size > 1 is scaled independently with scaling factor ${\\tau}_c$ and a sum-to-zero constraint is imposed on that component\n",
     "\n",
-    "- Components of size 1 are drawn from a standard distribution; i.e., they are treated as having random i.i.d. spatial variance."
+    "- Components of size 1 are drawn from a standard normal distribution; i.e., they are treated as having random i.i.d. spatial variance\n"
    ]
   },
   {
@@ -152,10 +166,9 @@
    "source": [
     "### Stan implementation of the BYM2 model for disconnected graphs\n",
     "\n",
-    "To extend the BYM2 model to these areal maps, we agument this model with a series of per-component masks into the node and edgelists and use Stan's multi-index operator and vectorized operations for efficient computation.\n",
-    "\n",
-    "The spatial structure includes a set of arrays describing component-wise node, edgesets.\n",
-    "The `_cts` arrays record the size of the node and edgelists for each component, the `_idx` arrays provide the indices of the members of each component."
+    "To extend the BYM2 model to these areal maps, we add a set of indexes into the node list and edgelist, which we use to pick out the subgraph for each component.\n",
+    "The `_cts` arrays record the sizes of the node list and edgelist for each component; the `_idx` arrays provide the indices of the members of each component.\n",
+    "The Stan language's multi-index expression allows for vectorized operations given an array of indices."
    ]
   },
   {
@@ -191,9 +204,9 @@
     "  vector[I] gamma;\n",
     "  for (k in 1:K)\n",
     "    gamma[K_node_idxs[k, 1:K_node_cts[k]]] = \n",
-    "            (sqrt(1 - rho) * theta[K_node_idxs[k, 1:K_node_cts[k]]]\n",
+    "            (sqrt(rho / tau[k]) * theta[K_node_idxs[k, 1:K_node_cts[k]]]\n",
     "             +\n",
-    "             sqrt(rho / tau) * phi[K_node_idxs[k, 1:K_node_cts[k]]])\n",
+    "             sqrt(1 - rho) * phi[K_node_idxs[k, 1:K_node_cts[k]]])\n",
     "            * sigma;"
    ]
   },
@@ -209,7 +222,7 @@
    "cell_type": "raw",
    "metadata": {},
    "source": [
-    "real standard_icar_disconnected_lpdf(vector phi,\n",
+    "real standard_icar_disconnected_lpdf(vector theta,\n",
     "\t\t\t\t       int[ , ] adjacency,\n",
     "\t\t\t\t       int[ ] node_cts,\n",
     "\t\t\t\t       int[ ] edge_cts,\n",
@@ -218,17 +231,77 @@
     "    real total = 0;\n",
     "    for (n in 1:size(node_cts)) {\n",
     "      if (node_cts[n] > 1)\n",
-    "        total += -0.5 * dot_self(phi[adjacency[1, edge_idxs[n, 1:edge_cts[n]]]] -\n",
-    "                                 phi[adjacency[2, edge_idxs[n, 1:edge_cts[n]]]])\n",
-    "                  + normal_lpdf(sum(phi[node_idxs[n, 1:node_cts[n]]]) |\n",
+    "        total += -0.5 * dot_self(theta[adjacency[1, edge_idxs[n, 1:edge_cts[n]]]] -\n",
+    "                                 theta[adjacency[2, edge_idxs[n, 1:edge_cts[n]]]])\n",
+    "                  + normal_lpdf(sum(theta[node_idxs[n, 1:node_cts[n]]]) |\n",
     "                                      0, 0.001 * node_cts[n]);\n",
     "      else\n",
-    "          total += normal_lpdf(phi[n] | 0, 1);  // iid spatial variance\n",
+    "          total += normal_lpdf(theta[node_idxs[n, 1]] | 0, 1);  // iid spatial variance\n",
     "    }\n",
     "    return total;\n",
     "}"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Stan program data inputs `I`, `J`, `edges`, `tau`\n",
+    "\n",
+    "A certain amount of data munging is necessary to assemble the data inputs to the Stan program.\n",
+    "\n",
+    "Areal maps, as coded in modern geographical information systems (GIS), consist of \n",
+    "a set of regions, where each region is described by polygons over latitude/longitude coordinates.  These are distributed as a structured bundle of tables of information and geolocational coordinates called [shapefiles](https://en.wikipedia.org/wiki/Shapefile). Spatial weights are an abstraction over the geographical relationships in these maps.\n",
+    "\n",
+    "- The R package `sf` provides the tools to read in and edit GIS shapefiles.\n",
+    "(_Note: the Python library `geopandas` provides similar functionality_).\n",
+    "\n",
+    "- The R package `spdep` (_Python library `pysal`_) provides functions to create and manipulate spatial weights\n",
+    "as graphs and/or matrices.\n",
+    "\n",
+    "- The [igraph](https://igraph.org) libraries for R, Python, and C++ provide functions for graph and network analysis.\n",
+    "\n",
+    "- The R package `Matrix` provides functions for sparse matrices."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Computing `I`, `J`, `edges`: the number of nodes, edges, and edgelist\n",
+    "\n",
+    "The areal graph is described in terms of the integers `I` and `J`, the number of nodes and edges, respectively, and `edges`, the $2 \\times J$ integer array of edges, where entries `[1,j]` and `[2,j]` specify neighboring regions.  Given a set of areal shapefiles as inputs, the munging required using the above R libraries is:\n",
+    "\n",
+    "1. Read in the shapefiles - R function `sf::st_read`\n",
+    "2. Transform to neighbors list `spdep::poly2nb` - list length is `I`\n",
+    "3. Transform list to adjacency matrix `spdep::nb2mat`, then extract edgelist - R function `igraph::graph_from_adjacency_matrix` - num rows is `J`.\n",
+    "4. Transpose edgelist from $\\mathrm{J} \\times 2$ matrix to $2 \\times \\mathrm{J}$ - result is `edges`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Computing `tau`, the ICAR scaling factor\n",
+    "\n",
+    "Riebler et al. recommend scaling the ICAR component $\\theta$ so that the geometric mean of the marginal variances of the areal units is $1$, which makes the mixing parameter $\\rho$ properly interpretable. The steps required for this computation are:\n",
+    "\n",
+    "1. compute Q, the precision matrix corresponding to the neighborhood structure of the areal map.  In R, we use packages `Matrix` and `R-INLA` to compute `tau` by creating a `Matrix::sparseMatrix` object from the edgelist `edges` - this is Q.\n",
+    "2. compute Q_pert by adding a small amount of noise to the diagonal elements of Q\n",
+    "3. compute Q_inv, the inverse of Q_pert, which holds the covariances between all pairs of areal units - **computationally expensive**\n",
+    "4. compute the geometric mean of the diagonal elements of Q_inv, `exp(mean(log(x)))`, this is $\\tau$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data inputs for disconnected graphs\n",
+    "\n",
+    "\n",
+    "Using the R `spdep` package functions `n.comp.id` and `subset.nb`, we first create per-component subgraphs and then compute the scaling factor as outlined above."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -458,7 +531,8 @@
    "source": [
     "names = list(islands_summary.index)\n",
     "phi_rows = [names.index(name) for name in names if name.startswith('phi[')]\n",
-    "eta_rows = [names.index(name) for name in names if name.startswith('eta[')]"
+    "theta_rows = [names.index(name) for name in names if name.startswith('theta[')]\n",
+    "relrisk_rows = [names.index(name) for name in names if name.startswith('y_prime[')]\n"
    ]
   },
   {
@@ -467,9 +541,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "print('spatial effects:\\n{}\\n\\nlog relative risk:\\n{}\\n'.format(\n",
+    "print('observed:\\n{}\\nheterogeneous effects (phi):\\n{}\\nspatial effects (theta):\\n{}\\nlog relative risk:\\n{}\\n'.format(\n",
+    "    islands_data['y'][0:11],\n",
     "    islands_summary.iloc[phi_rows,:][0:11], \n",
-    "    islands_summary.iloc[eta_rows,:][0:11]))"
+    "    islands_summary.iloc[theta_rows,:][0:11], \n",
+    "    islands_summary.iloc[relrisk_rows,:][0:11]))"
    ]
   },
   {
@@ -571,14 +647,35 @@
     "scrolled": false
    },
    "outputs": [],
+   "source": [
+    "az.rcParams.update({'plot.max_subplots': 12})\n",
+    "\n",
+    "az.plot_density(\n",
+    "    [connected_az, islands_az], \n",
+    "    data_labels=['connected', 'islands'], \n",
+    "    var_names=[\"y_prime\"],\n",
+    "    shade=0.1,\n",
+    "    textsize=32,\n",
+    "    hdi_prob=0.999999\n",
+    ")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
    "source": [
     "az.rcParams.update({'plot.max_subplots': 56})\n",
     "\n",
     "az.plot_density(\n",
     "    [connected_az, islands_az], \n",
     "    data_labels=['connected', 'islands'], \n",
-    "    var_names=[\"eta\"],\n",
-    "    shade=0.1\n",
+    "    var_names=[\"phi\"],\n",
+    "    shade=0.1,\n",
+    "    textsize=48,\n",
+    "    hdi_prob=.999999\n",
     ")\n",
     "plt.show()"
    ]
diff --git a/knitr/car-iar-poisson/update_2021_02/BYM2_islands_notebook.pdf b/knitr/car-iar-poisson/update_2021_02/BYM2_islands_notebook.pdf
diff --git a/knitr/car-iar-poisson/update_2021_02/bym2_helpers.R b/knitr/car-iar-poisson/update_2021_02/bym2_helpers.R
@@ -23,11 +23,9 @@ nb_to_edge_array <- function(nb_obj) {
     t(as_edgelist(graph_from_adjacency_matrix(adj_matrix, mode="undirected")))
 }
 
-
 # compute geometric mean of a vector
 geometric_mean <- function(x) exp(mean(log(x))) 
 
-
 # compute scaling factor for a fully connected areal map
 # accounts for differences in spatial connectivity
 scaling_factor <- function(edge_array) {
@@ -93,7 +91,7 @@ index_components <- function(nb_obj) {
         } else {
             scaling_factors[k] = 1.0
         }
-    }   
+    }
 
     return(list("K"=num_comps,
                 "K_node_cts"=as.vector(table(comp_idxs)),
@@ -102,4 +100,3 @@ index_components <- function(nb_obj) {
                 "K_edge_idxs"=comp_edge_idxs,
                 "tau"=scaling_factors))
 }
-
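
Note on the "Computing `tau`" cell added in this commit: the scaling-factor steps can be sketched in Python for a small, fully connected map. This is a minimal illustration and not the `scaling_factor` helper in `bym2_helpers.R`. The function names and the `num_nodes` argument are hypothetical, the edgelist is assumed to be the 1-based `2 x J` array described in the notebook, and the Moore-Penrose pseudo-inverse stands in for the perturbed inverse computed with `R-INLA`, which is only practical for small dense matrices.

```python
import numpy as np


def geometric_mean(x):
    """Geometric mean of a vector of positive values: exp(mean(log(x)))."""
    return float(np.exp(np.mean(np.log(x))))


def scaling_factor(edges, num_nodes):
    """Sketch of the ICAR scaling factor tau for a connected graph.

    edges: 2 x J array-like of 1-based node indices (assumed layout).
    num_nodes: number of regions in the map (hypothetical argument name).
    """
    edges = np.asarray(edges, dtype=int) - 1  # convert to 0-based indices
    adj = np.zeros((num_nodes, num_nodes))
    adj[edges[0], edges[1]] = 1.0
    adj[edges[1], edges[0]] = 1.0  # neighbor relationship is symmetric
    # ICAR precision matrix: number of neighbors on the diagonal,
    # -1 for each pair of neighbors.
    Q = np.diag(adj.sum(axis=1)) - adj
    # Q is singular, so a generalized inverse is used here (assumption:
    # this replaces the jittered inverse with a sum-to-zero constraint).
    Q_inv = np.linalg.pinv(Q)
    # The diagonal of Q_inv holds the marginal variances.
    return geometric_mean(np.diag(Q_inv))


# Path graph 1 - 2 - 3 - 4 as a 2 x J edgelist.
edges = [[1, 2, 3], [2, 3, 4]]
tau = scaling_factor(edges, 4)
print(tau)
```

For maps with more than a few hundred regions a dense pseudo-inverse becomes slow, which is why the notebook flags the inversion step as computationally expensive and relies on sparse matrices.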