|
32 | 32 | "\n", |
33 | 33 | "When we ask \"What is the effect of a medical treatment?\" or \"Does quitting smoking cause weight gain?\" or \"Do job training programs increase earnings?\", we are not simply asking about the treatment itself. We are asking: What world are we operating in? This perspective is easier to see if you imagine a causal analyst as a pet-shop owner introducing a new fish to one of their many aquariums. The new fish's survival and behavior depend less on its intrinsic properties than on how it fits within this complex, interconnected system. In which tank will the new fish thrive? \n", |
34 | 34 | "\n", |
35 | | - "There are a number of complementary paradigms in the causal inference literature, and in `CausalPy` we have not sought to be advocate for any one view over another. There are valuable lessons to be learned from econometrics, psychology and statistics whether they adopt a DAG or Potential outcomes framing for their causal work. Where the methods are statistically sound and practical we will seek to adopt and evangelise their usage. But in this article we want to focus on the idea of a causal model as a probabilistic program. An inferential routine designed to explicitly yield insights into the effect of some intervention or treatment on an outcome of interest. \n", |
| 35 | + "There are a number of complementary paradigms in the causal inference literature, and in `CausalPy` we have not sought to advocate for any one view over another. There are valuable lessons to be learned from econometrics, psychology and statistics, whether they adopt a Pearlian or potential-outcomes framing for their causal work. Where the methods are statistically sound and practical we will seek to adopt and evangelise their usage. But in this article we want to focus on the idea of a causal model as a probabilistic program: an inferential routine designed to explicitly yield insights into the effect of some intervention or treatment on an outcome of interest. \n", |
36 | 36 | "\n", |
37 | | - "Some of the algorithms and routines available in `CausalPy` focus deriving insight from a known environment e.g. stationarity with interrupted time-series, parallel trends in difference-in-differences, positivity in propensity score weighting and strong instruments with instrumental variable designs. These methods rely on stated assumptions or facts about the environment in which the treatment takes place to justify their conclusions as causal claims. In Bayesian structural causal inference the focus is slightly different in that we wish to model both the treatment but also the environment i.e. the fish and the fishtank. In this article we'll outline a species of modelling that tries to infer structural attributes of the environment to underwrite causal claims. \n", |
| 37 | + "#### Modelling Worlds and Counterfactual Worlds\n", |
| 38 | + "\n", |
| 39 | + "Some of the algorithms and routines available in `CausalPy` focus on deriving insight from a known environment, e.g. stationarity with interrupted time series, parallel trends in difference-in-differences, positivity in propensity score weighting, and strong instruments in instrumental variable designs. These methods rely on stated assumptions or facts about the environment in which the treatment takes place to justify their conclusions as causal claims. In Bayesian structural causal inference the focus is slightly different, in that we wish to model not only the treatment but also the environment, i.e. the fish and the fishtank. In this article we'll outline a species of modelling that tries to infer structural attributes of the environment to underwrite causal claims. \n", |
38 | 40 | "\n", |
39 | 41 | "This is a two-step move in the Bayesian paradigm. First we infer \"backwards\": what is the most plausible state of the world $w$ conditioned on the observable data? Then we assess the probabilistic predictive distribution of treatment and outcome across the plausible range of worlds. \n", |
40 | 42 | "\n", |
41 | 43 | "\n", |
42 | 44 | "\n", |
43 | | - "The important point is that we characterise the plausible worlds by how much structure we learn about in the model specification. The more structure we seek to infer, the more we risk model misspecification, but simultaneously, the more structure we learn the more useful and transparent our conclusions. The \"world\" of the model is defined by the graph structure, latent confounders, link functions, measurement models and the implementation of selection mechanisms. Contrast this with the simpler case. When we regress an outcome $Y$ on a treatment $T$ and a set of covariates $X$,\n", |
| 45 | + "The important point is that we characterise the plausible worlds by how much structure we learn in the model specification. The more structure we seek to infer, the more we risk model misspecification; but, simultaneously, the more structure we learn, the more useful and transparent our conclusions become. The \"world\" of the model is defined by the graph structure, latent confounders, link functions, measurement models and the implementation of selection mechanisms. Contrast this with the simpler case. \n", |
| 46 | + "\n", |
| 47 | + "#### Not mere Association\n", |
| 48 | + "\n", |
| 49 | + "When we regress an outcome $Y$ on a treatment $T$ and a set of covariates $X$,\n", |
44 | 50 | "\n", |
45 | 51 | "$$Y = \\alpha T + X \\beta + \\epsilon$$\n", |
46 | 52 | "\n", |
|
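The bias this regression suffers under endogeneity can be demonstrated in a few lines. The following sketch (coefficients and sample size are illustrative, not the notebook's actual generator) fits OLS to data whose treatment and outcome errors share a correlation $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
true_alpha = 3.0

def ols_alpha(rho):
    """Return the OLS estimate of the treatment effect for a given error correlation."""
    X = rng.normal(size=n)
    U, V = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    T = 0.5 * X + U                    # treatment depends on X and its own error U
    Y = true_alpha * T + 1.0 * X + V   # outcome depends on T, X and error V
    Z = np.column_stack([np.ones(n), T, X])
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef[1]                     # coefficient on T

alpha_exog = ols_alpha(rho=0.0)   # close to the true value of 3
alpha_endog = ols_alpha(rho=0.8)  # biased upward by the correlated errors
print(alpha_exog, alpha_endog)
```

After partialling out $X$, the asymptotic bias is $\text{cov}(U, V)/\text{var}(U)$, so with $\rho = 0.8$ the naive estimate lands near 3.8 rather than 3.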
176 | 182 | "source": [ |
177 | 183 | "Each simulated observation has a treatment $T$, an outcome $Y$, and a set of covariates $X$ with distinct causal roles. Two covariates influence both the treatment and the outcome—these are the confounders. Two others affect only the treatment and serve as valid instruments. A final covariate affects only the outcome. The treatment and outcome errors are drawn from a correlated bivariate normal distribution, introducing endogeneity through their correlation parameter $\\rho$. When $\\rho$ is low the treatment is effectively exogenous and standard regression should recover the correct effect; when $\\rho$ is high, naive estimates are biased.\n", |
178 | 184 | "\n", |
| 185 | + "#### Confounding Structure\n", |
| 186 | + "\n", |
179 | 187 | "The function produces both continuous and binary versions of the treatment and the outcome. This dual design lets us explore two worlds side by side: one where the treatment is a continuous dosage, and another where it is a binary decision. In both cases, the true causal effect of the treatment on the outcome is set to three. Because we know the truth, we can evaluate how well our Bayesian models recover the true parameters. Even here you can see that the \"structure\" we impose on the world is an abstraction over the concrete mechanisms acting in the world. We bundle the idea of selection into the treatment as a potential correlation between the treatment and outcome errors. This is a convenient and tractable proxy for a range of concrete settings where there is a risk of selection effects in the real world. \n", |
180 | 188 | "\n", |
181 | 189 | "" |
|
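One way such a generator might look is sketched below. All coefficients and names here are assumptions for illustration; the notebook's actual simulation function will differ in detail, but the causal roles match the description above:

```python
import numpy as np

def simulate_data(n=2_000, rho=0.6, alpha=3.0, seed=1):
    """Toy data generator: confounders, instruments, and correlated errors."""
    rng = np.random.default_rng(seed)
    C1, C2 = rng.normal(size=(2, n))          # confounders: affect T and Y
    Z1, Z2 = rng.normal(size=(2, n))          # instruments: affect T only
    W = rng.normal(size=n)                    # covariate: affects Y only
    # Correlated errors (U, V) introduce endogeneity through rho
    U, V = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    T = 0.7 * C1 + 0.7 * C2 + 1.0 * Z1 + 1.0 * Z2 + U
    Y = alpha * T + 0.5 * C1 + 0.5 * C2 + 0.8 * W + V
    return {
        "T": T, "Y": Y,                       # continuous versions
        "T_bin": (T > 0).astype(int),         # binary treatment decision
        "Y_bin": (Y > Y.mean()).astype(int),  # binary outcome
        "C": np.column_stack([C1, C2]),
        "Z": np.column_stack([Z1, Z2]),
        "W": W,
    }

data = simulate_data()
```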
376 | 384 | "source": [ |
377 | 385 | "We now move from diagnosing bias to building a model that can recover causal effects under controlled conditions. To keep things interpretable, we begin with the unconfounded case, where the treatment and outcome share no latent correlation ($\\rho=0$). This setting lets us isolate what a Bayesian structural model actually does before we expose it to the challenges of endogeneity.\n", |
378 | 386 | "\n", |
| 387 | + "#### Joint Modelling and Prior Structure\n", |
| 388 | + "\n", |
379 | 389 | "At the heart of our approach is joint modelling: instead of fitting separate regressions for treatment and outcome, we model them together as draws from a joint multivariate distribution. The treatment equation captures how covariates predict exposure, while the outcome equation captures how both treatment and covariates predict the response. By expressing them jointly, we retain the covariance structure between their errors—an essential ingredient for causal inference once we later introduce confounding.\n", |
380 | 390 | "\n", |
381 | 391 | "The model is built using PyMC and organized through the function `make_joint_model()`. Each version shares the same generative logic but differs in how the priors handle variable selection and identification. We can think of these as different “dial settings” for how strongly the model shrinks irrelevant coefficients or searches for valid instruments. Four prior configurations are explored:\n", |
|
393 | 403 | "The following code defines the model and instantiates it under several prior choices. The model’s graphical representation, produced by `pm.model_to_graphviz()`, visualizes its structure: covariates feed into both the treatment and the outcome equations, the treatment coefficient $\\alpha$ links them, and the two residuals \n", |
394 | 404 | "$U$ and $V$ are connected through a correlation parameter $\\rho$, which we can freely set to zero or more substantive values. These parameterisations offer us a way to derive insight into the structure of the causal system under study. \n", |
395 | 405 | "\n", |
396 | | - "### Fitting the Continuous Treatment Model" |
| 406 | + "### Fitting the Continuous Treatment Model\n", |
| 407 | + "\n", |
| 408 | + "In this next code block we articulate the joint model for the continuous outcome and continuous treatment variable. " |
397 | 409 | ] |
398 | 410 | }, |
399 | 411 | { |
|
1126 | 1138 | "source": [ |
1127 | 1139 | "This section orchestrates the fitting and sampling workflow for the suite of Bayesian models defined earlier. Having specified several variants of the joint outcome–treatment model—each differing only in its prior structure or treatment of the correlation parameter $\\rho$—we now turn to posterior inference.\n", |
1128 | 1140 | "\n", |
| 1141 | + "#### Various Model Specifications\n", |
| 1142 | + "\n", |
1129 | 1143 | "The functions `sample_model()` and `fit_models()` provide a compact, repeatable sampling pipeline. Within each model context, the pipeline first draws from the prior predictive distribution, capturing what the model believes about the data before seeing any observations. These draws are comparable across each of the models specified.\n", |
1130 | 1144 | "We're moving from describing how the data are assumed to arise to actually learning from the simulated observations. This is the backwards inference step. The output `idata_unconfounded` contains all posterior draws, prior predictive samples, and posterior predictive simulations for every model variant under the assumption of no confounding. This will allow us to compare the inferences achieved under each setting, and to gauge which are the most plausible parameterisations of the world-state conditioned on the data and our model specification." |
1131 | 1145 | ] |
|
1585 | 1599 | "In econometric terms, what we’ve done so far sits squarely within the structural modelling tradition. We’ve written down a joint model for both the treatment and the outcome, specified their stochastic dependencies explicitly, and interpreted the slope $\\alpha$ as a structural parameter — a feature of the data-generating process itself. This parameter has a causal meaning only insofar as the model is correctly specified: if the structural form reflects how the world actually works, \n", |
1586 | 1600 | "$\\alpha$ recovers the true causal effect. By contrast, reduced-form econometrics focuses less on modelling the underlying mechanisms and more on identifying causal effects through research design — instrumental variables, difference-in-differences, or randomization. Reduced-form approaches avoid the need to specify the joint distribution of unobservables but often sacrifice interpretability: they estimate relationships that are valid for specific interventions or designs, not necessarily structural primitives.\n", |
1587 | 1601 | "\n", |
1588 | | - "### Plotting Treatment Estimates\n", |
| 1602 | + "#### Comparing Treatment Estimates\n", |
1589 | 1603 | "\n", |
1590 | 1604 | "The comparison of models is a form of robustness check. We want to inspect how consistent our parameter estimates are across different model specifications. " |
1591 | 1605 | ] |
|
2445 | 2459 | "cell_type": "markdown", |
2446 | 2460 | "metadata": {}, |
2447 | 2461 | "source": [ |
| 2462 | + "#### Comparing Treatment Estimates under Confounding\n", |
| 2463 | + "\n", |
2448 | 2464 | "The forest plot below compares posterior estimates of the treatment effect ($\\alpha$) and the confounding correlation ($\\rho$) across model specifications when \n", |
2449 | 2465 | "$\\rho = .6$ in the data-generating process. The baseline normal model (which places diffuse priors on all parameters) clearly reflects the presence of endogeneity. Its posterior mean for $\\alpha$ is biased upward relative to the true value of 3, and the estimated $\\rho$ is positive, confirming that the model detects correlation between treatment and outcome disturbances. This behaviour mirrors the familiar bias of OLS under confounding: without structural constraints or informative priors, the model attributes part of the outcome variation caused by unobserved factors to the treatment itself. This inflates and corrupts our treatment effect estimate. \n", |
2450 | 2466 | "\n", |
|
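A forest plot of this kind can be produced directly from the fitted `InferenceData` objects with ArviZ. This helper is a sketch, not the notebook's actual plotting code; it assumes the models share the parameter names `alpha` and `rho`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import arviz as az

def compare_estimates(idatas, var_names=("alpha", "rho")):
    """Forest plot of the structural parameters across model variants."""
    return az.plot_forest(
        list(idatas.values()),
        model_names=list(idatas.keys()),
        var_names=list(var_names),
        combined=True,  # pool chains into a single interval per model
    )
```

Lining the intervals up on one axis makes the upward bias of the unconstrained model visible at a glance against the true value of 3.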
3539 | 3555 | "cell_type": "markdown", |
3540 | 3556 | "metadata": {}, |
3541 | 3557 | "source": [ |
| 3558 | + "#### Comparing Treatment Estimates with BART Components\n", |
| 3559 | + "\n", |
3542 | 3560 | "Three of our four approaches successfully recover the true causal effect of 3.0, with tight uncertainty bands and accurate confounding estimates. But when BART enters the outcome equation, the results collapse: the treatment effect estimate drops to near-zero. This is not a sampling failure. Diagnostics show healthy chains, good ESS, and converged r-hat values. The model is doing exactly what we asked it to do. The problem is what we asked the model to do!" |
3543 | 3561 | ] |
3544 | 3562 | }, |
|
4232 | 4250 | "cell_type": "markdown", |
4233 | 4251 | "metadata": {}, |
4234 | 4252 | "source": [ |
| 4253 | + "#### Conditional Average Treatment Effects\n", |
| 4254 | + "\n", |
4235 | 4255 | "The BART-treatment model demonstrated that flexibility in the treatment equation doesn't harm identification. Can we also introduce flexibility in how treatment effects vary with covariates, while preserving the interpretability and identifiability of structural parameters? Our `bart_treatment_cate` model allows this by interacting the treatment parameter with the covariates, explicitly parameterizing effect heterogeneity. Unlike BART in the outcome equation (which failed because it absorbed the entire treatment signal), interaction terms allow treatment effects to vary while retaining a structural interpretation and identifiability. \n", |
4236 | 4256 | "\n", |
4237 | 4257 | "We can see this flexibility by pulling out the ITE (individual treatment effects) estimates, using the potential outcomes imputations. We can compare the ITEs across the `bart_treatment_cate` and `linear_no_bart` models. " |
|
4664 | 4684 | "cell_type": "markdown", |
4665 | 4685 | "metadata": {}, |
4666 | 4686 | "source": [ |
4667 | | - "The model is specified without the observed outcomes deliberately. We feed in the predictor $X$ and now we validate how the model specification can recover accurate treatment effects. We forward sample from the system with known parameters. This generates a synthetic observation data that we will feed back into the model, to condition on data known to have been sampled from this model. This makes use of PyMC's do-syntax. We are intervening on the data generating process to set values of the parameters in the system. " |
| 4687 | + "The model is specified without the observed outcomes deliberately. We feed in the predictor $X$ and now we validate how the model specification can recover accurate treatment effects.\n", |
| 4688 | + "\n", |
| 4689 | + "#### Parameter Recovery\n", |
| 4690 | + "\n", |
| 4691 | + "We \"forward\" sample from the system with known parameters. This generates synthetic observational data that we will feed back into the model, conditioning on data known to have been sampled from it. This makes use of PyMC's do-syntax: we intervene on the data-generating process to set the values of the parameters in the system. " |
4668 | 4692 | ] |
4669 | 4693 | }, |
4670 | 4694 | { |
|
4823 | 4847 | "cell_type": "markdown", |
4824 | 4848 | "metadata": {}, |
4825 | 4849 | "source": [ |
| 4850 | + "#### Conditional Update on Observed Data\n", |
| 4851 | + "\n", |
4826 | 4852 | "Next we will condition on the actual observed data and apply a range of priors to the $\\rho$ term to test the sensitivity of our findings to prior weights. " |
4827 | 4853 | ] |
4828 | 4854 | }, |
|