@@ -23,7 +23,16 @@ This notebook covers Bayesian [moderation analysis](https://en.wikipedia.org/wik
23
23
24
24
This is not intended as a one-stop solution to a wide variety of data analysis problems; rather, it is an educational exposition showing how moderation analysis works and how to conduct Bayesian parameter estimation in PyMC.
25
25
26
+
Moderation analysis has been approached in a variety of ways:
27
+
* Statistical approaches: It is entirely possible to approach moderation analysis from a purely statistical perspective. In this approach we might build a linear model (for example) whose aim is purely to _describe_ the data we have while making no claims about causality.
28
+
* Path analysis: This approach asserts that the variables in the model are causally related and is exemplified in {cite:t}`hayes2017introduction`, for example. This approach cannot be considered as 'fully causal' as it lacks a variety of the concepts present in the causal approach.
29
+
* Causal inference: This approach builds upon the path analysis approach in that there is a claim of causal relationships between the variables. But it goes further in that there are additional causal concepts which can be brought to bear.
30
+
31
+
+++
32
+
33
+
:::{attention}
26
34
Note that moderation analysis is sometimes confused with [mediation analysis](https://en.wikipedia.org/wiki/Mediation_(statistics)). Mediation analysis is appropriate when we believe the effect of a predictor variable upon an outcome variable is (partially or fully) mediated through a third, mediating variable. Readers are referred to the textbook by {cite:t}`hayes2017introduction` as a comprehensive (albeit Frequentist) guide to moderation and related models, as well as to the PyMC example {ref}`mediation_analysis`.
35
+
:::
27
36
28
37
```{code-cell} ipython3
29
38
import arviz as az
@@ -146,17 +155,66 @@ def plot_moderation_effect(result, m, m_quantiles, ax=None):
146
155
)
147
156
```
148
157
149
-
# Does the effect of training upon muscularity decrease with age?
158
+
## Does the effect of training upon muscularity decrease with age?
150
159
151
160
I've taken inspiration from a blog post {cite:t}`vandenbergSPSS` which examines whether age influences (moderates) the effect of training on muscle percentage. We might speculate that more training results in higher muscle mass, at least for younger people. But it might be the case that the relationship between training and muscle mass changes with age - perhaps training is less effective at increasing muscle mass in older age?
152
161
153
-
The schematic box and arrow notation often used in the _statistical_ literature to represent moderation is shown by an arrow from the moderating variable to the line between a predictor and an outcome variable.
162
+
Let's see how we can visualize this in 3 different ways.
163
+
164
+
+++
165
+
166
+
### Statistical diagram
167
+
168
+
In this approach we might model the outcome variable (muscle mass) as a function of the predictor variables. In this case they would be age, training, and the interaction term between age and training. This is a purely statistical approach and does not make any claims about causality or the direction of the relationships.
This diagram makes it explicit that the moderation effect is the interaction term between age and training. We'll come back to why this is the case below.
193
+
194
+
+++
195
+
196
+
We could also write this in the form of an equation:
We can also draw moderation in a more conceptual manner. This is perhaps visually simpler and easier to parse, but is less explicit. The moderation is shown by an arrow from the moderating variable to the line between a predictor and an outcome variable.
206
+
207
+
But the diagram would represent the exact same equation as shown above.
154
208
155
209

156
210
157
211
+++
158
212
159
-
It is useful to draw the same diagram out using the visual notation of _structural causal modeling_ (see below). This notation shows that both age and training causally influence muscle mass. The causal relationship also states that muscle mass is a function of both age and training. There is no specific visual notation in the SCM approach to represent moderation. Instead, that would be captured by the functional form of the relationship $f$. Note that the operator $:=$ is similar to the traditional $=$ operator, but it is used to denote a _causal_ or directional relationship rather than just equality.
213
+
### Causal diagram
214
+
215
+
+++
216
+
217
+
Finally, we could draw the same diagram from the perspective of _structural causal modeling_. This notation shows that both age and training causally influence muscle mass. There is no specific visual notation to represent moderation in this approach. Instead, that would be captured by the functional form of the relationship $f$.
Note that the operator $:=$ is similar to the traditional $=$ operator, but it is used to denote a _causal_ or directional relationship rather than just equality.
237
+
238
+
And we could, if we wanted to assume linearity, model this just as above:
Because we want to focus on the moderation concept and not the specific example it can be useful to use consistent and more abstract notation, so we will define:
179
248
- $x$ as the main predictor variable. In this example it is training.
180
249
- $y$ as the outcome variable. In this example it is muscle percentage.
181
250
- $m$ as the moderator. In this example it is age.
182
251
183
-
## The moderation model
252
+
+++
253
+
254
+
### Why is the interaction term the moderation effect?
255
+
We can see that the mean $y$ is simply a multiple linear regression with an interaction term between the two predictors, $x$ and $m$.
184
256
185
-
While the visual schematic (above) is a useful shorthand to understand complex models when you already know what moderation is, you can't derive it from the diagram alone. So let us formally specify the moderation model - it defines an outcome variable $y$ as:
257
+
We can get some insight into why this is the case by thinking about this as a multiple linear regression with $x$ and $m$ as predictor variables, but where the value of $m$ influences the relationship between $x$ and $y$. This is achieved by making the regression coefficient for $x$ a function of $m$:
186
258
187
259
$$
188
-
y \sim \mathrm{Normal}(\beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m, \sigma^2)
260
+
y \sim \beta_0 + f(m) \cdot x + \beta_3 \cdot m
189
261
$$
190
262
191
-
where $y$, $x$, and $m$ are your observed data, and the following are the model parameters:
192
-
- $\beta_0$ is the intercept, its value does not have that much importance in the interpretation of this model.
193
-
- $\beta_1$ is the rate at which $y$ (muscle percentage) increases per unit of $x$ (training hours).
194
-
- $\beta_2$ is the coefficient for the interaction term $x \cdot m$.
195
-
- $\beta_3$ is the rate at which $y$ (muscle percentage) increases per unit of $m$ (age).
196
-
- $\sigma$ is the standard deviation of the observation noise.
263
+
and if we define that as a linear function, $f(m) = \beta_1 + \beta_2 \cdot m$, we get
197
264
198
-
We can see that the mean $y$ is simply a multiple linear regression with an interaction term between the two predictors, $x$ and $m$.
265
+
$$
266
+
y \sim \beta_0 + (\beta_1 + \beta_2 \cdot m) \cdot x + \beta_3 \cdot m
267
+
$$
199
268
200
-
We can get some insight into why this is the case by thinking about this as a multiple linear regression with $x$ and $m$ as predictor variables, but where the value of $m$ influences the relationship between $x$ and $y$. This is achieved by making the regression coefficient for $x$ is a function of $m$:
269
+
which multiplies out to
201
270
202
271
$$
203
-
y \sim \mathrm{Normal}(\beta_0 + f(m) \cdot x + \beta_3 \cdot m, \sigma^2)
272
+
y \sim \beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m
204
273
$$
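As a quick numerical check of the algebra above, the sketch below evaluates both forms of the mean and confirms they agree. The coefficient values and the random inputs are hypothetical, chosen purely for illustration; they are not taken from the notebook's dataset.

```python
import numpy as np

# Hypothetical coefficient values, chosen only for illustration.
beta0, beta1, beta2, beta3 = 1.0, 0.5, -0.02, 0.1

# Arbitrary predictor (x) and moderator (m) values.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
m = rng.normal(size=5)

# Form 1: the coefficient on x is itself a linear function of m.
mu_nested = beta0 + (beta1 + beta2 * m) * x + beta3 * m

# Form 2: the expanded regression with an explicit interaction term x * m.
mu_expanded = beta0 + beta1 * x + beta2 * x * m + beta3 * m

# The two forms should be equal up to floating point error.
print(np.allclose(mu_nested, mu_expanded))
```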
205
274
206
-
and if we define that as a linear function, $f(m) = \beta_1 + \beta_2 \cdot m$, we get
275
+
:::{note}
276
+
We can use $f(m) = \beta_1 + \beta_2 \cdot m$ later to visualise the moderation effect.
277
+
:::
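To make the role of $f(m)$ concrete, here is a minimal sketch that evaluates the slope of $x$ at a few moderator values. The helper name `slope_of_x`, the coefficient values, and the ages are all assumptions made for illustration; in practice $\beta_1$ and $\beta_2$ would come from the fitted model.

```python
import numpy as np


def slope_of_x(m, beta1, beta2):
    """Return f(m) = beta1 + beta2 * m, the slope of x at moderator value m."""
    return beta1 + beta2 * np.asarray(m, dtype=float)


# Hypothetical coefficients: training helps, but less so as age increases.
beta1, beta2 = 0.5, -0.01

# Hypothetical moderator values (ages) at which to evaluate the slope.
ages = np.array([20.0, 40.0, 60.0])

slopes = slope_of_x(ages, beta1, beta2)
for age, slope in zip(ages, slopes):
    print(f"age {age:.0f}: slope of x = {slope:.2f}")
```

With a negative $\beta_2$, the slope shrinks as $m$ grows and can eventually change sign, which is exactly the kind of pattern a moderation plot is meant to reveal.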
278
+
279
+
+++
280
+
281
+
### Specifying a Bayesian moderation model
282
+
283
+
Ok, so let's start to define our moderation model in a Bayesian manner. For this example we will treat the outcome variable as normally distributed around the mean.
207
284
208
285
$$
209
-
y \sim \mathrm{Normal}(\beta_0 + (\beta_1 + \beta_2 \cdot m) \cdot x + \beta_3 \cdot m, \sigma^2)
\begin{aligned}
\mu &\sim \beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m\\
290
+
y &\sim \mathrm{Normal}(\mu, \sigma^2)
291
+
\end{aligned}
210
292
$$
211
293
212
-
We can use $f(m) = \beta_1 + \beta_2 \cdot m$ later to visualise the moderation effect.
294
+
where $y$, $x$, and $m$ are your observed data, $\mu$ is the expected outcome value, and the following are the model parameters upon which we place priors:
295
+
- $\beta_0$ is the intercept; its value is not especially important for interpreting this model.
296
+
- $\beta_1$ is the rate at which $y$ (muscle percentage) increases per unit of $x$ (training hours).
297
+
- $\beta_2$ is the coefficient for the interaction term $x \cdot m$.
298
+
- $\beta_3$ is the rate at which $y$ (muscle percentage) increases per unit of $m$ (age).
299
+
- $\sigma$ is the standard deviation of the observation noise.
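
One way to get a feel for these parameters is to simulate data from the model and check that they can be recovered. The sketch below uses NumPy only, with hypothetical parameter values and standardised predictors, and it uses ordinary least squares as a quick sanity check. It is not the Bayesian estimation carried out with PyMC in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Hypothetical "true" parameter values, for illustration only.
beta_true = np.array([0.5, 1.0, -0.3, 0.2])  # beta0, beta1, beta2, beta3
sigma = 0.5

# Simulated predictor (x) and moderator (m), both standardised.
x = rng.normal(size=n)
m = rng.normal(size=n)

# Design matrix columns match the order of the betas: 1, x, x*m, m.
X = np.column_stack([np.ones(n), x, x * m, m])

# Generative model: mu is the expected outcome, y is mu plus Normal noise.
mu = X @ beta_true
y = rng.normal(loc=mu, scale=sigma)

# Least-squares point estimates of beta0..beta3.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

The estimate for the `x * m` column corresponds to $\beta_2$, the moderation effect; a Bayesian fit would instead return a full posterior distribution over each parameter.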