stan-dev
diff --git a/knitr/car-iar-poisson/update_2021_02/BYM2 islands.ipynb b/knitr/car-iar-poisson/update_2021_02/BYM2 islands.ipynb
Lines changed: 210 additions & 0 deletions
diff --git a/knitr/car-iar-poisson/update_2021_02/bym2.stan b/knitr/car-iar-poisson/update_2021_02/bym2.stan
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,210 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Freni-Sterrantino et al 2017 - BYM2 connected, disconnected for Scotland Lip Cancer Dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The BYM2 model for areal data adds two components to a GLM:  an ICAR component, which accounts for the spatial structure of the data, and a random effects component.  See the Stan case study [Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) for details on the ICAR, BYM, and BYM2 models.  The implementation in that case study assumes that the spatial structure is a single, fully connected component, i.e., a graph where any node in the graph can be reached from any other node.\n",
+    "\n",
+    "In [A note on intrinsic Conditional Autoregressive models for disconnected graphs](https://arxiv.org/abs/1705.04854), Freni-Sterrantino et al. show how to implement this model for disconnected graphs.  In this notebook, we present a Stan implementation of their proposal."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Areal data:  the counties in Scotland, circa 1980"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The canonical dataset used to test and compare different parameterizations of ICAR models is a study on the incidence of lip cancer in Scotland in the 1970s and 1980s.  The data, including the names and coordinates for the counties of Scotland, are available from the R package [SpatialEpi](https://cran.r-project.org/web/packages/SpatialEpi/SpatialEpi.pdf), dataset `scotland`.\n",
+    "\n",
+    "Three of these counties are islands:  the Outer Hebrides (western.isles), Shetland, and Orkney.  In the canonical dataset, these islands are connected to the mainland, so that the adjacency graph consists of a single, fully connected component.  However, different maps are possible:  a map with 4 components, the mainland and the 3 islands; or a map with 3 components:  the mainland, a component consisting of Shetland and Orkney, and a singleton consisting of the Hebrides. The following plots demonstrate the differences:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import matplotlib.image as mpimg\n",
+    "from matplotlib import rcParams\n",
+    "\n",
+    "%matplotlib inline\n",
+    "\n",
+    "# optional: figure size in inches\n",
+    "rcParams['figure.figsize'] = (11, 8)\n",
+    "\n",
+    "# read images\n",
+    "img_A = mpimg.imread('scot_connected.png')\n",
+    "img_B = mpimg.imread('scot_3_comp.png')\n",
+    "img_C = mpimg.imread('scot_islands.png')\n",
+    "\n",
+    "\n",
+    "# display images\n",
+    "fig, ax = plt.subplots(1,3)\n",
+    "ax[0].imshow(img_A);\n",
+    "ax[1].imshow(img_B);\n",
+    "ax[2].imshow(img_C);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Areal data munging:  from spatial polygon to 2D array of edges\n",
+    "\n",
+    "Inputs to the Stan model must match the set of variables declared in the `data` block.\n",
+    "\n",
+    "The Stan implementation of the ICAR model computes with a 2D array of size 2 $\\times$ J, where J is the number of edges in the graph.  Each column of this array represents one undirected edge in the graph:  for each edge j, entries [1,j] and [2,j] index the nodes connected by that edge.  Treating the two rows as parallel arrays and using Stan's vectorized operations provides a transparent implementation of the pairwise difference formula used to compute the ICAR component.\n",
+    "\n",
+    "The `scotland` data is a set of spatial polygons, i.e., a description of the shape of each county in terms of its lat,lon coordinates.  The R package [spdep](https://r-spatial.github.io/spdep/index.html) extracts the adjacency relations as a `nb` object.\n",
+    "We have written a set of helper functions which convert the `nb` object for each graph into the set of data structures needed by the Stan models; these are in file `bym2_helpers.R`.  \n",
+    "The three versions of the Scotland spatial structure are in files `scotland_nbs.data.R`, `scotland_3_comp_nbs.data.R`, and `scotland_islands_nbs.data.R`.\n",
+    "The file `munge_scotland.R` munges the data, and the results have been saved as JSON data files."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Fit connected graph on Scotland Lip cancer dataset with BYM2 model implemented in Stan."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cmdstanpy import cmdstan_path, CmdStanModel, install_cmdstan\n",
+    "# install_cmdstan()  # run as needed - installs the latest CmdStan release"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The dataset `scot_connected.data.json` contains the cancer dataset together with the spatial structure.  The cancer study data is:\n",
+    "\n",
+    "- `y`: observed outcome - number of cases of lip cancer\n",
+    "- `x`: single predictor - percent of population working in agriculture, forestry, or fisheries.\n",
+    "- `E`: exposure (declared as `E` in the Stan `data` block and used as the offset `log(E)`)\n",
+    "\n",
+    "The spatial structure comprises:\n",
+    "\n",
+    "- I: `int<lower = 0> I;  // number of nodes`\n",
+    "- J: `int<lower = 0> J;  // number of edges`\n",
+    "- edges: `int<lower = 1, upper = I> edges[2, J];  // node[1, j] adjacent to node[2, j]`\n",
+    "- tau: `real tau; // scaling factor`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cmdstanpy import cmdstan_path, CmdStanModel\n",
+    "bym2_model = CmdStanModel(stan_file='bym2.stan')\n",
+    "bym2_fit = bym2_model.sample(data='scot_connected.data.json')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bym2_fit.summary()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Fit disconnected graphs on Scotland Lip cancer dataset with BYM2 model implemented in Stan, following Freni-Sterrantino\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from cmdstanpy import cmdstan_path, CmdStanModel\n",
+    "bym2_model = CmdStanModel(stan_file='bym2_islands.stan')\n",
+    "print(bym2_model.code())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "with open('scot_3_comp.data.json') as fd:\n",
+    "    scot_data = json.load(fd)\n",
+    "print(scot_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bym2_fit = bym2_model.sample(data=scot_data)\n",
+    "bym2_fit.summary()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
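Note on the notebook's munging section: the 2 x J `edges` layout it describes can be illustrated with a minimal Python sketch. This is not the code in `bym2_helpers.R`, which is not shown in this diff; the function name `neighbors_to_edges` and the toy chain graph are hypothetical, and the only assumption carried over from the source is that node indexes are 1-based and each undirected edge appears once.

```python
def neighbors_to_edges(neighbors):
    """Convert 1-based neighbor lists into two parallel lists of node indexes.

    `neighbors` maps each node id to the ids of its adjacent nodes.
    Each undirected edge is kept once, with the smaller index first.
    """
    node1, node2 = [], []
    for i in sorted(neighbors):
        for j in sorted(neighbors[i]):
            if i < j:  # skip the mirrored copy (j, i) of every edge
                node1.append(i)
                node2.append(j)
    return [node1, node2]


# Hypothetical toy graph: a chain 1 - 2 - 3 - 4 (not the Scotland map).
toy_neighbors = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
toy_edges = neighbors_to_edges(toy_neighbors)

# Fragment of the data dictionary, using the names from the Stan data block.
stan_data_fragment = {
    "I": len(toy_neighbors),  # number of nodes
    "J": len(toy_edges[0]),   # number of edges
    "edges": toy_edges,       # edges[0][j] is adjacent to edges[1][j]
}
print(stan_data_fragment)
```

For the chain graph this reproduces the adjacency array `{{1, 2, 3}, {2, 3, 4}}` used as the example in the doc comment of `standard_icar_lpdf`.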
@@ -0,0 +1,100 @@
+functions {
+  /**
+   * Return the log probability density for the vector of coefficients
+   * under the ICAR model with unit variance, where the second two
+   * arguments are parallel vectors of coefficients for adjacent
+   * regions.  For example, a series of three coeffs phi[1:3] making
+   * up a time series would have b1 = [phi[1], phi[2]]' and b2 =
+   * [phi[2], phi[3]]'.
+   * 
+   * @param phi vector of varying effects
+   * @param b1 parallel vector of elements of phi
+   * @param b2 second parallel vector of adjacent elements of phi
+   * @return ICAR log density
+   * @reject if b1 and b2 are not the same size
+   */
+  real soft_ctr_std_icar_lpdf(vector phi, vector b1, vector b2) {
+    return -0.5 * dot_self(b1 - b2)  // equiv normal_lpdf(b1 | b2, 1)
+      + normal_lpdf(sum(phi) | 0, 0.001 * rows(phi));
+  }
+
+  /**
+   * Return the log probability density of the specified vector of
+   * coefficients under the ICAR model with unit variance, where
+   * adjacency is determined by the adjacency array. The adjacency
+   * array contains two parallel arrays of adjacent element indexes.
+   * For example, a series of four coefficients phi[1:4] making up a
+   * time series would have adjacency array {{1, 2, 3}, {2, 3, 4}},
+   * signaling that coefficient 1 is adjacent to coefficient 2, 2
+   * adjacent to 3, and 3 adjacent to 4.
+   *
+   * @param phi vector of varying effects
+   * @param adjacency parallel arrays of indexes of adjacent elements of phi
+   * @return ICAR log probability density
+   * @reject if the adjacency array does not have two rows
+   */
+  real standard_icar_lpdf(vector phi, int[ , ] adjacency) {
+    if (size(adjacency) != 2)
+      reject("require 2 rows for adjacency array;",
+             " found rows = ", size(adjacency));
+    return soft_ctr_std_icar_lpdf(phi | phi[adjacency[1]], phi[adjacency[2]]);
+  }
+}
+data {
+  // spatial structure
+  int<lower = 0> I;  // number of nodes
+  int<lower = 0> J;  // number of edges
+  int<lower = 1, upper = I> edges[2, J];  // node[1, j] adjacent to node[2, j]
+
+  real tau; // scaling factor
+
+  int<lower=0> y[I];              // count outcomes
+  vector<lower=0>[I] E;           // exposure
+  vector[I] x;                 // predictor
+}
+transformed data {
+  vector[I] log_E = log(E);
+}
+parameters {
+  real alpha;      // intercept
+  real beta;       // covariate
+
+  // spatial effects
+  real logit_rho;  // proportion of spatial effect that's spatially smoothed
+  real<lower = 0> sigma;  // scale of spatial effects
+  vector[I] theta;  // standardized heterogeneous spatial effects
+  vector[I] phi;  // standardized spatially smoothed spatial effects
+}
+transformed parameters {
+  real<lower=0, upper=1> rho = inv_logit(logit_rho);
+  // standardized spatial effects (heterogeneous + smoothed); scaled by sigma in the linear predictor
+  vector[I] gamma = sqrt(1 - rho) * theta + sqrt(rho) * sqrt(1 / tau) * phi;
+}
+model {
+  y ~ poisson_log(log_E + alpha + x * beta + gamma * sigma);  // co-variates
+
+  alpha ~ normal(0, 1);
+  beta ~ normal(0, 1);
+
+  // spatial hyperpriors and priors
+  sigma ~ normal(0, 1);
+  logit_rho ~ normal(0, 1);
+  theta ~ normal(0, 1);
+  phi ~ standard_icar(edges);
+}
+generated quantities {
+  vector[I] eta = log_E + alpha + x * beta + gamma * sigma;
+  vector[I] y_prime = exp(eta);
+  int y_rep[I,10];
+  for (j in 1:10) {
+    if (max(eta) > 20) {
+      // avoid overflow in poisson_log_rng
+      print("max eta too big: ", max(eta));
+      for (i in 1:I)
+        y_rep[i,j] = -1;
+    } else {
+      for (i in 1:I)
+        y_rep[i,j] = poisson_log_rng(eta[i]);
+    }
+  }
+}
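Note on `soft_ctr_std_icar_lpdf`: the pairwise difference term and the soft sum-to-zero constraint can be checked outside Stan with a small pure-Python sketch. It mirrors the two terms of the Stan function for a 2 x J edge array with 1-based indexes; the helper `normal_lpdf` and the toy vectors are illustrative assumptions, not part of this commit.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Log density of a normal distribution evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


def soft_ctr_std_icar_lpdf(phi, edges):
    """Unit-variance ICAR log density with a soft sum-to-zero constraint.

    `edges` holds two parallel lists of 1-based node indexes.
    """
    node1, node2 = edges
    # Sum of squared differences between the effects of adjacent nodes.
    pairwise = sum((phi[a - 1] - phi[b - 1]) ** 2 for a, b in zip(node1, node2))
    # Same soft centering term as the Stan code: sum(phi) ~ normal(0, 0.001 * I).
    return -0.5 * pairwise + normal_lpdf(sum(phi), 0.0, 0.001 * len(phi))


edges = [[1, 2, 3], [2, 3, 4]]      # chain 1 - 2 - 3 - 4
smooth = [-0.3, -0.1, 0.1, 0.3]     # neighbors differ by 0.2
rough = [-0.3, 0.3, -0.3, 0.3]      # neighbors differ by 0.6
shifted = [v + 1.0 for v in smooth] # same differences, but sum is 4

for name, phi in [("smooth", smooth), ("rough", rough), ("shifted", shifted)]:
    print(name, soft_ctr_std_icar_lpdf(phi, edges))
```

The smooth vector scores higher than the rough one because only neighbor differences enter the first term. The shifted vector has the same differences but is penalized heavily by the centering term, which is what makes `phi` identifiable alongside the intercept `alpha`.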