
Commit 152535a

committed: friday's updates
1 parent 988807f commit 152535a

File tree: 2 files changed (+362, -124 lines)


examples/causal_inference/moderation_analysis.ipynb

Lines changed: 256 additions & 106 deletions
Large diffs are not rendered by default.

examples/causal_inference/moderation_analysis.myst.md

Lines changed: 106 additions & 18 deletions
@@ -23,7 +23,16 @@ This notebook covers Bayesian [moderation analysis](https://en.wikipedia.org/wik
 
 This is not intended as a one-stop solution to a wide variety of data analysis problems; rather, it is intended as an educational exposition to show how moderation analysis works and how to conduct Bayesian parameter estimation in PyMC.
 
+Moderation analysis can be approached in a variety of ways:
+* Statistical approaches: It is entirely possible to treat moderation analysis from a purely statistical perspective. In this approach we might build a linear model (for example) whose aim is purely to _describe_ the data we have, making no claims about causality.
+* Path analysis: This approach asserts that the variables in the model are causally related, and is exemplified in {cite:t}`hayes2017introduction`. It cannot be considered 'fully causal', however, as it lacks many of the concepts present in the causal inference approach.
+* Causal inference: This approach builds upon path analysis in that causal relationships between the variables are claimed, but it goes further by bringing additional causal concepts to bear.
+
++++
+
+:::{attention}
 Note that this is sometimes mixed up with [mediation analysis](https://en.wikipedia.org/wiki/Mediation_(statistics)). Mediation analysis is appropriate when we believe the effect of a predictor variable upon an outcome variable is (partially, or fully) mediated through a 3rd mediating variable. Readers are referred to the textbook by {cite:t}`hayes2017introduction` as a comprehensive (albeit Frequentist) guide to moderation and related models as well as the PyMC example {ref}`mediation_analysis`.
+:::
 
 ```{code-cell} ipython3
 import arviz as az
@@ -146,17 +155,66 @@ def plot_moderation_effect(result, m, m_quantiles, ax=None):
 )
 ```
 
-# Does the effect of training upon muscularity decrease with age?
+## Does the effect of training upon muscularity decrease with age?
 
 I've taken inspiration from a blog post {cite:t}`vandenbergSPSS` which examines whether age influences (moderates) the effect of training on muscle percentage. We might speculate that more training results in higher muscle mass, at least for younger people. But it might be the case that the relationship between training and muscle mass changes with age - perhaps training is less effective at increasing muscle mass in older age?
 
-The schematic box and arrow notation often used in the _statistical_ literature to represent moderation is shown by an arrow from the moderating variable to the line between a predictor and an outcome variable.
+Let's see how we can visualise this in three different ways.
+
++++
+
+### Statistical diagram
+
+In this approach we might model the outcome variable (muscle mass) as a function of the predictor variables. In this case they would be age, training, and the interaction term between age and training. This is a purely statistical approach and does not make any claims about causality or the direction of the relationships.
+
+```{code-cell} ipython3
+:tags: [hide-input]
+
+pgm = daft.PGM(dpi=200)
+
+pgm.add_node("i", "1", 0, 2, aspect=3)
+pgm.add_node("t", "training", 0, 1, aspect=3)
+pgm.add_node("a", "age", 0, 0, aspect=3)
+pgm.add_node("m", r"age $\times$ training", 0, -1, aspect=3)
+pgm.add_node("y", "muscle mass", 4, 0.5, aspect=3)
+pgm.add_node("e", r"$\epsilon$", 4, 1.5, fixed=True)
+
+pgm.add_edge("i", "y", label=r"$\beta_0$")
+pgm.add_edge("t", "y", label=r"$\beta_1$")
+pgm.add_edge("a", "y", label=r"$\beta_2$")
+pgm.add_edge("m", "y", label=r"$\beta_3$")
+pgm.add_edge("e", "y")
+
+
+pgm.render();
+```
+
+This diagram makes it explicit that the moderation effect is the interaction term between age and training. We'll come back to why this is the case below.
+
++++
+
+We could also write this in the form of an equation:
+
+$$
+\text{muscle mass} = \beta_0 + \beta_1 \times \text{training} + \beta_2 \times \text{age} + \beta_3 \times \text{training} \times \text{age} + \epsilon.
+$$
+
++++
+
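For illustration, here is a minimal sketch (an editorial addition, not part of the committed notebook) that simulates data from this statistical model with made-up coefficient values; a negative interaction coefficient would produce the hypothesised weakening of the training effect with age.

```python
# Editorial sketch: simulate data from the purely statistical model above.
# All coefficient values are hypothetical and chosen only for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 200
training = rng.uniform(0, 10, size=n)  # weekly training hours
age = rng.uniform(18, 70, size=n)      # years

beta0, beta1, beta2, beta3, sigma = 10.0, 2.0, 0.05, -0.03, 1.0

# muscle mass = b0 + b1*training + b2*age + b3*training*age + noise
muscle_mass = (
    beta0
    + beta1 * training
    + beta2 * age
    + beta3 * training * age
    + rng.normal(0, sigma, size=n)
)
```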
+### Conceptual or path diagram
+We can also draw moderation in a more conceptual manner. This is perhaps visually simpler and easier to parse, but it is less explicit. The moderation is shown by an arrow from the moderating variable to the line between a predictor and an outcome variable.
+
+But the diagram represents exactly the same equation as shown above.
 
 ![](moderation_figure.png)
 
 +++
 
-It is useful to draw the same diagram out using the visual notation of _structural causal modeling_ (see below). This notation shows that both age and training causally influence muscle mass. The causal relationship also states that muscle mass is a function of both age and training. There is no specific visual notation in the SCM approach to represent moderation. Instead, that would be captured by the functional form of the relationship $f$. Note that the operator $:=$ is similar to the traditional $=$ operator, but it is used to denote a _causal_ or directional relationship rather than just equality.
+### Causal diagram
+
++++
+
+Finally, we could draw the same diagram from the perspective of _structural causal modeling_. This notation shows that both age and training causally influence muscle mass. There is no specific visual notation to represent moderation in this approach. Instead, that would be captured by the functional form of the relationship $f$.
 
 ```{code-cell} ipython3
 :tags: [hide-input]
@@ -175,41 +233,71 @@ pgm.add_text(-0.25, -0.75, r"muscle mass := $f$(training, age)")
 pgm.render();
 ```
 
+Note that the operator $:=$ is similar to the traditional $=$ operator, but it is used to denote a _causal_ or directional relationship rather than just equality.
+
+And we could, if we wanted to assume linearity, model this just as above:
+$$
+\text{muscle mass} := \beta_0 + \beta_1 \times \text{training} + \beta_2 \times \text{age} + \beta_3 \times \text{training} \times \text{age} + \epsilon.
+$$
+
++++
+
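To make the role of $f$ concrete, the structural assignment could be sketched as an ordinary function (an editorial illustration, assuming the linear form above and hypothetical coefficient values); the moderation lives entirely in the functional form of $f$.

```python
# Editorial sketch of the structural assignment: muscle mass := f(training, age).
# There is no special notation for moderation; it is carried by the form of f,
# here via the training * age interaction term. Coefficients are hypothetical.
def f(training, age, beta0=10.0, beta1=2.0, beta2=0.05, beta3=-0.03):
    return beta0 + beta1 * training + beta2 * age + beta3 * training * age
```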
+## The moderation model
+
 Because we want to focus on the moderation concept and not the specific example, it can be useful to use consistent and more abstract notation, so we will define:
 - $x$ as the main predictor variable. In this example it is training.
 - $y$ as the outcome variable. In this example it is muscle percentage.
 - $m$ as the moderator. In this example it is age.
 
-## The moderation model
++++
+
+### Why is the interaction term the moderation effect?
+We can see that the mean of $y$ is simply a multiple linear regression with an interaction term between the two predictors, $x$ and $m$.
 
-While the visual schematic (above) is a useful shorthand to understand complex models when you already know what moderation is, you can't derive it from the diagram alone. So let us formally specify the moderation model - it defines an outcome variable $y$ as:
+We can get some insight into why this is the case by thinking about this as a multiple linear regression with $x$ and $m$ as predictor variables, but where the value of $m$ influences the relationship between $x$ and $y$. This is achieved by making the regression coefficient for $x$ a function of $m$:
 
 $$
-y \sim \mathrm{Normal}(\beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m, \sigma^2)
+y \sim \beta_0 + f(m) \cdot x + \beta_3 \cdot m
 $$
 
-where $y$, $x$, and $m$ are your observed data, and the following are the model parameters:
-- $\beta_0$ is the intercept, its value does not have that much importance in the interpretation of this model.
-- $\beta_1$ is the rate at which $y$ (muscle percentage) increases per unit of $x$ (training hours).
-- $\beta_2$ is the coefficient for the interaction term $x \cdot m$.
-- $\beta_3$ is the rate at which $y$ (muscle percentage) increases per unit of $m$ (age).
-- $\sigma$ is the standard deviation of the observation noise.
+and if we define that as a linear function, $f(m) = \beta_1 + \beta_2 \cdot m$, we get
 
-We can see that the mean $y$ is simply a multiple linear regression with an interaction term between the two predictors, $x$ and $m$.
+$$
+y \sim \beta_0 + (\beta_1 + \beta_2 \cdot m) \cdot x + \beta_3 \cdot m
+$$
 
-We can get some insight into why this is the case by thinking about this as a multiple linear regression with $x$ and $m$ as predictor variables, but where the value of $m$ influences the relationship between $x$ and $y$. This is achieved by making the regression coefficient for $x$ is a function of $m$:
+which multiplies out to
 
 $$
-y \sim \mathrm{Normal}(\beta_0 + f(m) \cdot x + \beta_3 \cdot m, \sigma^2)
+y \sim \beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m
 $$
 
-and if we define that as a linear function, $f(m) = \beta_1 + \beta_2 \cdot m$, we get
+:::{note}
+We can use $f(m) = \beta_1 + \beta_2 \cdot m$ later to visualise the moderation effect.
+:::
+
++++
+
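As a quick sanity check of the algebra above (an editorial sketch, not part of the notebook), sympy confirms that the "moderated slope" form and the interaction form are identical:

```python
# Editorial sketch: verify that beta0 + (beta1 + beta2*m)*x + beta3*m
# expands to the linear model with an x*m interaction term.
import sympy as sp

x, m, b0, b1, b2, b3 = sp.symbols("x m beta0 beta1 beta2 beta3")
moderated = b0 + (b1 + b2 * m) * x + b3 * m      # slope of x depends on m
interaction = b0 + b1 * x + b2 * x * m + b3 * m  # regression with interaction
print(sp.simplify(moderated - interaction))      # prints 0
```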
+### Specifying a Bayesian moderation model
+
+OK, so let's define our moderation model in a Bayesian manner. For this example we will treat the outcome variable as normally distributed around its mean.
 
 $$
-y \sim \mathrm{Normal}(\beta_0 + (\beta_1 + \beta_2 \cdot m) \cdot x + \beta_3 \cdot m, \sigma^2)
+\begin{aligned}
+\beta_0, \ldots, \beta_3 & \sim \mathrm{Normal}(0, 10)\\
+\sigma & \sim \mathrm{HalfCauchy}(1)\\
+\mu &= \beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m\\
+y &\sim \mathrm{Normal}(\mu, \sigma^2)
+\end{aligned}
 $$
 
-We can use $f(m) = \beta_1 + \beta_2 \cdot m$ later to visualise the moderation effect.
+where $y$, $x$, and $m$ are your observed data, $\mu$ is the expected outcome value, and the following are the model parameters on which we place priors:
+- $\beta_0$ is the intercept; its value is not particularly important for the interpretation of this model.
+- $\beta_1$ is the rate at which $y$ (muscle percentage) increases per unit of $x$ (training hours).
+- $\beta_2$ is the coefficient for the interaction term $x \cdot m$.
+- $\beta_3$ is the rate at which $y$ (muscle percentage) increases per unit of $m$ (age).
+- $\sigma$ is the standard deviation of the observation noise.
+
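As a minimal sketch, the specification above could be expressed in PyMC along these lines (an editorial illustration; `x`, `m`, and `y` are assumed to be observed NumPy arrays, and the notebook's own implementation may differ):

```python
# Editorial sketch of the moderation model above, assuming observed arrays x, m, y.
import pymc as pm

with pm.Model() as moderation_model:
    beta = pm.Normal("beta", mu=0, sigma=10, shape=4)  # beta_0 ... beta_3
    sigma = pm.HalfCauchy("sigma", beta=1)
    mu = beta[0] + beta[1] * x + beta[2] * x * m + beta[3] * m
    pm.Normal("y", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()
```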
 
 +++
 