Commit d15639a

update with full run and text

Signed-off-by: Nathaniel <[email protected]>

1 parent e9a5beb

File tree

2 files changed: +767 −700 lines

examples/case_studies/CFA_SEM.ipynb

Lines changed: 694 additions & 694 deletions
Large diffs are not rendered by default.

examples/case_studies/CFA_SEM.myst.md

Lines changed: 73 additions & 6 deletions
@@ -21,7 +21,9 @@ kernelspec:

+++

In the psychometrics literature, data is often derived from a strategically constructed survey aimed at a particular target phenomenon: some intuited, but not yet measured, concept that arguably plays a role in human action, motivation or sentiment. The relative “fuzziness” of the subject matter in psychometrics has had a catalyzing effect on the methodological rigour sought in the science. Survey designs are agonized over for the correct tone and rhythm of their sentence structure. Measurement scales are double-checked for reliability and correctness. The literature is consulted and questions are refined. Analysis steps are justified and tested under a wealth of modelling routines.

Model architectures are defined and refined to better express the hypothesized structures in the data-generating process. We will see how such due diligence leads to powerful and expressive models that grant us tractability on thorny questions of human affect.

```{code-cell} ipython3
import arviz as az
@@ -41,6 +43,11 @@ rng = np.random.default_rng(42)

### Latent Constructs and Measurement

Our data is borrowed from work by Boris Mayer and Andrew Ellis found [here](https://methodenlehre.github.io/SGSCLM-R-course/cfa-and-sem-with-lavaan.html#structural-equation-modelling-sem), where they demonstrate CFA and SEM modelling with lavaan. We'll load up their data: survey responses from ~300 individuals who answered questions regarding their upbringing, self-efficacy and reported life-satisfaction. The hypothesized dependency structure in this life-satisfaction dataset posits a moderated relationship between scores related to life-satisfaction, parental and family support, and self-efficacy. It is no trivial task to design a survey that can elicit answers plausibly mapped to each of these “factors” or themes, never mind to find a model of their relationships that can inform us of the relative impact of each on life-satisfaction outcomes.

First we'll pull out the data and examine some summary statistics.

```{code-cell} ipython3
df = pd.read_csv("../data/sem_data.csv")
df.head()
@@ -52,13 +59,43 @@ drivers = [c for c in df.columns if not c in ["region", "gender", "age", "ID"]]
corr_df = df[drivers].corr()
mask = np.triu(np.ones_like(corr_df, dtype=bool))
sns.heatmap(corr_df, annot=True, cmap="Blues", ax=ax, center=0, mask=mask)
ax.set_title("Sample Correlations between Indicator Metrics")

fig, ax = plt.subplots(figsize=(20, 7))
sns.heatmap(df[drivers].cov(), annot=True, cmap="Blues", ax=ax, center=0, mask=mask)
ax.set_title("Sample Covariances between Indicator Metrics");
```

Next we'll plot the pairplot to visualise the nature of the correlations.

```{code-cell} ipython3
ax = sns.pairplot(df[drivers], kind="reg", corner=True, diag_kind="kde")
plt.suptitle("Pair Plot of Indicator Metrics with Regression Fits", fontsize=30);
```
## Measurement Models

+++

A measurement model is a key component within the more general structural equation model. A measurement model specifies the relationships between observed variables (typically survey items or indicators) and their underlying latent constructs (often referred to as factors or latent variables). We start our presentation of SEM models more generally by focusing on a type of measurement model with its own history - the confirmatory factor model (CFA), which specifies a particular structure for the relationships between our indicator variables and the latent factors. It is this structure which is up for confirmation in our modelling.

We'll start by fitting a "simple" CFA model in `PyMC` to demonstrate how the pieces fit together; we'll then expand our focus. Here we ignore the majority of our indicator variables and focus on the idea that there are two latent constructs: (1) Social Self-efficacy and (2) Life Satisfaction.

We're aiming to articulate a mathematical structure where our indicator variables $y_{ij}$ are determined by a latent factor $\text{Ksi}_{j}$ through an estimated factor loading $\lambda_{ij}$. Functionally we have a set of equations with error terms $\psi_{i}$

$$ y_{1} = \tau_{1} + \lambda_{11}\text{Ksi}_{1} + \psi_{1} \\
y_{2} = \tau_{2} + \lambda_{21}\text{Ksi}_{1} + \psi_{2} \\
... \\
y_{n} = \tau_{n} + \lambda_{n2}\text{Ksi}_{2} + \psi_{n}
$$

The goal is to articulate the relationship between the different factors in terms of the covariances between these latent terms, and to estimate the relationships each latent factor has with the manifest indicator variables. At a high level, we're saying the joint distribution can be represented through conditionalisation in the following schema

$$ p(x_{1}, \ldots, x_{n} \mid \text{Ksi}, \Psi, \tau, \Lambda) \sim \text{Normal}(\tau + \Lambda \cdot \text{Ksi}, \Psi) $$

We will show how to build these structures into our model below.
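Before the full PyMC model, a toy numpy sketch can clarify the shape mechanics of the mean structure $\tau + \Lambda \cdot \text{Ksi}$. The loadings and scores below are hypothetical numbers for illustration only, not estimates from the survey data.

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_latent, n_indicators = 5, 2, 6

# Hypothetical factor loadings: indicators 1-3 load on factor 1, indicators 4-6 on factor 2
Lambda = np.zeros((n_indicators, n_latent))
Lambda[:3, 0] = [1.0, 0.9, 1.1]  # first loading fixed at 1 to anchor the scale
Lambda[3:, 1] = [1.0, 0.8, 1.2]

tau = np.full(n_indicators, 3.0)          # indicator intercepts
ksi = rng.normal(size=(n_obs, n_latent))  # latent factor scores, one pair per respondent

# Model-implied means for every respondent and indicator: shape (n_obs, n_indicators)
mu = tau + ksi @ Lambda.T
print(mu.shape)
```

The PyMC model below builds exactly this structure, but with priors over `tau`, `Lambda` and the latent scores instead of fixed numbers.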
```{code-cell} ipython3
# Set up coordinates for appropriate indexing of latent factors
coords = {
    "obs": list(range(len(df))),
    "indicators": ["se_social_p1", "se_social_p2", "se_social_p3", "ls_p1", "ls_p2", "ls_p3"],
@@ -71,7 +108,7 @@ coords = {
obs_idx = list(range(len(df)))
with pm.Model(coords=coords) as model:

    # Set up Factor Loadings
    lambdas_ = pm.Normal("lambdas_1", 1, 10, dims=("indicators_1"))
    # Force a fixed scale on the factor loadings for factor 1
    lambdas_1 = pm.Deterministic(
@@ -82,14 +119,15 @@ with pm.Model(coords=coords) as model:
    lambdas_2 = pm.Deterministic(
        "lambdas2", pt.set_subtensor(lambdas_[0], 1), dims=("indicators_2")
    )

    # Specify covariance structure between latent factors
    kappa = 0
    sd_dist = pm.Exponential.dist(1.0, shape=2)
    chol, _, _ = pm.LKJCholeskyCov("chol_cov", n=2, eta=2, sd_dist=sd_dist, compute_corr=True)
    ksi = pm.MvNormal("ksi", kappa, chol=chol, dims=("obs", "latent"))

    # Construct Pseudo Observation matrix based on Factor Loadings
    tau = pm.Normal("tau", 3, 10, dims="indicators")
    m1 = tau[0] + ksi[obs_idx, 0] * lambdas_1[0]
    m2 = tau[1] + ksi[obs_idx, 0] * lambdas_1[1]
    m3 = tau[2] + ksi[obs_idx, 0] * lambdas_1[2]
@@ -98,6 +136,11 @@ with pm.Model(coords=coords) as model:
    m6 = tau[5] + ksi[obs_idx, 1] * lambdas_2[2]

    mu = pm.Deterministic("mu", pm.math.stack([m1, m2, m3, m4, m5, m6]).T)

    ## Error Terms
    Psi = pm.InverseGamma("Psi", 5, 10, dims="indicators")

    # Likelihood
    _ = pm.Normal(
        "likelihood",
        mu,
@@ -115,20 +158,30 @@ with pm.Model(coords=coords) as model:
pm.model_to_graphviz(model)
```

### Measurement Model Structure

We can now see how the covariance structure among the latent constructs is an integral piece of the overarching model design, fed forward into our pseudo-regression components and weighted with the respective lambda terms.
```{code-cell} ipython3
az.summary(idata, var_names=["lambdas1", "lambdas2"])
```

These factor loadings are generally important to interpret in terms of construct validity. Because we've fixed one of the indicator variables at 1, the other indicators that load on the same factor should have loading coefficients on broadly the same scale as the fixed indicator that defines the construct scale. We're looking for consistency among the loadings to assess whether the indicators are reliable measures of the construct.
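As a hypothetical sketch of that consistency check (illustrative posterior-mean loadings, not the fitted values from `idata`), we can compare each loading against the fixed reference indicator:

```{code-cell} ipython3
# Hypothetical posterior-mean loadings; the first is fixed at 1 to set the scale
loading_means = {"se_social_p1": 1.0, "se_social_p2": 0.92, "se_social_p3": 1.08}

reference = loading_means["se_social_p1"]
for name, lam in loading_means.items():
    ratio = lam / reference
    consistent = 0.5 < ratio < 2.0  # crude heuristic band for "broadly the same scale"
    print(f"{name}: loading={lam:.2f}, ratio to fixed indicator={ratio:.2f}, consistent={consistent}")
```

In practice you would read these ratios off the `az.summary` table above, and also check that the posterior intervals exclude zero.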
```{code-cell} ipython3
idata
```

Let's plot the trace diagnostics to validate that the sampler has converged to the posterior distribution.

```{code-cell} ipython3
az.plot_trace(idata, var_names=["lambdas1", "lambdas2", "tau", "Psi", "ksi"]);
```

### Sampling the Latent Constructs

One thing to highlight about the Bayesian manner of fitting CFA and SEM models is that we now have access to the posterior distribution of the latent quantities. These samples can offer insight into particular individuals in our survey that is harder to glean from the multivariate presentation of the manifest variables.

```{code-cell} ipython3
fig, axs = plt.subplots(1, 2, figsize=(20, 9))
axs = axs.flatten()
@@ -156,8 +209,14 @@ ax2.set_title("Individual Life Satisfaction Metric \n On Latent Factor LS")
plt.show();
```

In this way we can identify and zero in on individuals that appear to be outliers on one or more of the latent constructs.
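A minimal sketch of such an outlier screen, using synthetic latent-score means as a stand-in for the posterior means of `ksi`: standardise each individual's score on each latent factor and flag those beyond a z-score threshold.

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for posterior-mean latent scores, shape (individuals, factors)
ksi_mean = rng.normal(size=(283, 2))
ksi_mean[5, 0] = 4.5  # plant one extreme individual on factor 1

# Standardise per factor and flag |z| > 3
z = (ksi_mean - ksi_mean.mean(axis=0)) / ksi_mean.std(axis=0)
outliers = np.argwhere(np.abs(z) > 3)
print(outliers)  # rows of (individual index, factor index)
```

With the real model you would use `idata.posterior["ksi"].mean(dim=["chain", "draw"])` in place of the synthetic array, and could additionally use the posterior spread per individual rather than a point estimate.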
+++
### Posterior Predictive Checks

As in more traditional Bayesian modelling, a core component of model evaluation is the assessment of the posterior predictive distribution, i.e. the implied outcome distribution. Here too we can pull out draws against each of the indicator variables to assess them for coherence and adequacy.

```{code-cell} ipython3
def make_ppc(
    idata,
@@ -193,6 +252,8 @@ del idata
### Intermediate Cross-Loading Model

The idea of a measurement model is maybe a little opaque when we only see models that fit well. Instead we want to briefly show how an inapt measurement model gets reflected in the estimated parameters for the factor loadings. Here we specify a measurement model which attempts to couple the `se_social` and `sup_parents` indicators and bundle them into the same factor.

```{code-cell} ipython3
coords = {
    "obs": list(range(len(df))),
@@ -285,6 +346,8 @@ with pm.Model(coords=coords) as model:
pm.model_to_graphviz(model)
```

Again our model samples well, but the parameter estimates suggest that there is some inconsistency in the scale onto which we're trying to force both sets of metrics.

```{code-cell} ipython3
az.summary(idata, var_names=["lambdas1", "lambdas2"])
```
@@ -296,6 +359,10 @@ az.plot_energy(idata, ax=axs[0])
az.plot_forest(idata, var_names=["lambdas1"], combined=True, ax=axs[1]);
```

This hints at measurement model misspecification and should force us back to the drawing board.
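One way to make misfit concrete is to inspect residuals between the sample correlation matrix and the model-implied one; large residuals flag indicator pairs the factor structure fails to capture. Here is a hedged sketch with small synthetic matrices (hypothetical loadings, latent correlations and residual variances, not the fitted quantities):

```{code-cell} ipython3
import numpy as np

# Model-implied covariance from hypothetical loadings: Lambda Phi Lambda' + Psi
Lambda = np.array([[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 1.1]])  # 4 indicators, 2 factors
Phi = np.array([[1.0, 0.3], [0.3, 1.0]])  # hypothetical latent factor correlations
Psi = np.diag([0.5, 0.5, 0.5, 0.5])  # residual variances

implied_cov = Lambda @ Phi @ Lambda.T + Psi
d = np.sqrt(np.diag(implied_cov))
implied_corr = implied_cov / np.outer(d, d)

# Pretend the sample correlations disagree on one cross-factor pair
sample_corr = implied_corr.copy()
sample_corr[0, 2] = sample_corr[2, 0] = 0.6

residuals = sample_corr - implied_corr
print(np.round(residuals, 2))  # the large (0, 2) entry flags the misfit
```

This is the same logic behind the "Residuals between Model Implied and Sample Correlations" plot used for the full model.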
+++
## Full Measurement Model

```{code-cell} ipython3

@@ -627,7 +694,7 @@ ax.set_title("Residuals between Model Implied and Sample Correlations", fontsize
```

```{code-cell} ipython3
make_ppc(idata_sem0, 100, drivers=drivers, dims=(3, 5))
```

```{code-cell} ipython3
