Skip to content

Commit 62d788b

Browse files
committed
Add sem structure section
Signed-off-by: Nathaniel <[email protected]>
1 parent 47da13d commit 62d788b

File tree

3 files changed

+1740
-1155
lines changed

3 files changed

+1740
-1155
lines changed

examples/case_studies/CFA_SEM.ipynb

Lines changed: 1586 additions & 1115 deletions
Large diffs are not rendered by default.

examples/case_studies/CFA_SEM.myst.md

Lines changed: 154 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -23,16 +23,20 @@ kernelspec:
2323

2424
In the psychometrics literature the data is often derived from a strategically constructed survey aimed at a particular target phenomena. Some intuited, but not yet measured, concept that arguably plays a role in human action, motivation or sentiment. The relative “fuzziness” of the subject matter in psychometrics has had a catalyzing effect on the methodological rigour sought in the science. Survey designs are agonized over for correct tone and rhythm of sentence structure. Measurement scales are doubly checked for reliability and correctness. The literature is consulted and questions are refined. Analysis steps are justified and tested under a wealth of modelling routines.
2525

26-
Model architectures are defined and refined to better express the hypothesized structures in the data-generating process. We will see how such due diligence leads to powerful and expressive models that grant us tractability on thorny questions of human affect.
26+
Model architectures are defined and refined to better express the hypothesized structures in the data-generating process. We will see how such due diligence leads to powerful and expressive models that grant us tractability on thorny questions of human affect. We draw on Roy Levy and Robert J. Mislevy's _Bayesian Psychometric Modeling_.
2727

2828
```{code-cell} ipython3
29+
import warnings
30+
2931
import arviz as az
3032
import matplotlib.pyplot as plt
3133
import numpy as np
3234
import pandas as pd
3335
import pymc as pm
3436
import pytensor.tensor as pt
3537
import seaborn as sns
38+
39+
warnings.filterwarnings("ignore", category=RuntimeWarning)
3640
```
3741

3842
```{code-cell} ipython3
@@ -65,7 +69,7 @@ sns.heatmap(df[drivers].cov(), annot=True, cmap="Blues", ax=ax, center=0, mask=m
6569
ax.set_title("Sample Covariances between indicator Metrics");
6670
```
6771

68-
Next we'll plot the pairplot to visualise the nature of the correlations
72+
The lens here on the sample covariance matrix is common in the traditional SEM models are often fit to the data by optimising a fit to the covariance matrix. Model assessment routines often gauge the model's ability to recover the sample covariance relations. There is a slightyly different approach taken in the Bayesian approach to estimating these models which focuses on the observed data rather than the derived summary statistics. Next we'll plot the pairplot to visualise the nature of the correlations
6973

7074
```{code-cell} ipython3
7175
ax = sns.pairplot(df[drivers], kind="reg", corner=True, diag_kind="kde")
@@ -88,11 +92,11 @@ y_{2} = \tau_{2} + \lambda_{21}\text{Ksi}_{1} + \psi_{2} \\
8892
y_{n} = \tau_{n} + \lambda_{n2}\text{Ksi}_{2} + \psi_{3}
8993
$$
9094

91-
The goal is to articulate the relationship between the different factors in terms of the covariances between these latent terms and estimate the relationships each latent factor has with the manifest indicator variables. At a high level, we're saying the joint distribution can be represented through conditionalisation in the following schema
95+
The goal is to articulate the relationship between the different factors in terms of the covariances between these latent terms and estimate the relationships each latent factor has with the manifest indicator variables. At a high level, we're saying the joint distribution of the observed data can be represented through conditionalisation in the following schema
9296

9397
$$p(x_{i}.....x_{n} | \text{Ksi}, \Psi, \tau, \Lambda) \sim Normal(\tau + \Lambda\cdot \text{Ksi}, \Psi) $$
9498

95-
We will show how to build these structures into our model below
99+
This is the Bayesian approach to the estimation of CFA and SEM models. We're seeking a conditionalisation structure that can retrodict the observed data based on latent constructs and hypothetical relationships among the constructs and the observed data points. We will show how to build these structures into our model below
96100

97101
```{code-cell} ipython3
98102
# Set up coordinates for appropriate indexing of latent factors
@@ -158,6 +162,10 @@ with pm.Model(coords=coords) as model:
158162
pm.model_to_graphviz(model)
159163
```
160164

165+
Here the model structure and dependency graph becomes a little clearer. Our likelihood term models a outcome matrix of 283x6 observations i.e. the survey responses for 6 questions. These survey responses are modelled as draws from a multivariate normal $Ksi$ with a prior correlation structure between the latent constructs. We then specify how each of the outcome measures is a function of one of the latent factor modified by the appropriate factor loading `lambda`.
166+
167+
+++
168+
161169
### Meausurement Model Structure
162170

163171
We can now see how the covariance structure among the latent constructs is integral piece of the overarching model design which is fed forward into our pseudo regression components and weighted with the respective lambda terms.
@@ -183,6 +191,8 @@ az.plot_trace(idata, var_names=["lambdas1", "lambdas2", "tau", "Psi", "ksi"]);
183191
One thing to highlight in particular about the Bayesian manner of fitting CFA and SEM models is that we now have access to the posterior distribution of the latent quantities. These samples can offer insight into particular individuals in our survey that is harder to glean from the multivariate presentation of the manifest variables.
184192

185193
```{code-cell} ipython3
194+
:tags: [hide-input]
195+
186196
fig, axs = plt.subplots(1, 2, figsize=(20, 9))
187197
axs = axs.flatten()
188198
ax1 = axs[0]
@@ -342,8 +352,6 @@ with pm.Model(coords=coords) as model:
342352
random_seed=114,
343353
)
344354
idata.extend(pm.sample_posterior_predictive(idata))
345-
346-
pm.model_to_graphviz(model)
347355
```
348356

349357
Again our model samples well but the parameter estimates suggest that there is some inconsistency between the scale on which we're trying to force both sets of metrics.
@@ -367,7 +375,7 @@ This hints at a variety of measurement model misspecification and should force u
367375

368376
## Full Measurement Model
369377

370-
With this in mind we'll now specify a full measurement that maps each of our thematically similar indicator metrics to an indicidual latent construct. This mandates the postulation of 5 distinct constructs.
378+
With this in mind we'll now specify a full measurement that maps each of our thematically similar indicator metrics to an indicidual latent construct. This mandates the postulation of 5 distinct constructs where we only admit three metrics load on each construct. The choice of which metric loads on the latent construct is determined in our case by the constructs each measure is intended to measure.
371379

372380
```{code-cell} ipython3
373381
drivers = [
@@ -463,13 +471,19 @@ with pm.Model(coords=coords) as model:
463471
idata_mm.extend(pm.sample_posterior_predictive(idata_mm))
464472
```
465473

474+
### Model Evaluation Checks
475+
476+
We can see quickly here how this factor structure seems to sample better and retains a consistency of scale.
477+
466478
```{code-cell} ipython3
467479
fig, axs = plt.subplots(1, 2, figsize=(20, 9))
468480
axs = axs.flatten()
469481
az.plot_energy(idata_mm, ax=axs[0])
470482
az.plot_forest(idata_mm, var_names=["lambdas1", "lambdas2", "lambdas3"], combined=True, ax=axs[1]);
471483
```
472484

485+
We can also pull out the more typical patterns of model evaluation by assessing the fit between the posterior predicted covariances and the sample covariances. This is a sanity check to assess local model fit statistics. The below code iterates over draws from the posterior predictive distribution and calculates the covariance or correlation matrix on eah draw, we calculate the residuals on each draw between each of the covariances and then average across the draws.
486+
473487
```{code-cell} ipython3
474488
def get_posterior_resids(idata, samples=100, metric="cov"):
475489
resids = []
@@ -499,6 +513,8 @@ residuals_posterior_cov = get_posterior_resids(idata_mm, 2500)
499513
residuals_posterior_corr = get_posterior_resids(idata_mm, 2500, metric="corr")
500514
```
501515

516+
These tables lend themselves to nice plots where we can highlight the deviation from the sample covariance and correlation statistics.
517+
502518
```{code-cell} ipython3
503519
fig, ax = plt.subplots(figsize=(20, 7))
504520
mask = np.triu(np.ones_like(residuals_posterior_corr, dtype=bool))
@@ -512,56 +528,77 @@ ax = sns.heatmap(residuals_posterior_cov, annot=True, cmap="bwr", mask=mask)
512528
ax.set_title("Residuals between Model Implied and Sample Covariances", fontsize=25);
513529
```
514530

531+
But we can also do more contemporary Bayesian posterior predictive checks as we pull out the predictive posterior distribution for each of the observed metrics.
532+
515533
```{code-cell} ipython3
516534
make_ppc(idata_mm, 100, drivers=residuals_posterior_cov.columns, dims=(3, 5));
517535
```
518536

537+
### Model Analysis
538+
539+
We're not just interested in recovering the observed data patterns we also want a way of pulling out the inferences relating to the latent constructs. For instance we can pull out the factor loadings and calculate measures of variance accounted for by each of the indicator variables in this factor system and for the factors themselves.
540+
519541
```{code-cell} ipython3
520-
cov_df = pd.DataFrame(az.extract(idata_mm["posterior"])["cov"].mean(axis=2))
521-
cov_df.index = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
522-
cov_df.columns = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
542+
def make_factor_loadings_df(idata):
543+
factor_loadings = pd.DataFrame(
544+
az.summary(
545+
idata_mm, var_names=["lambdas1", "lambdas2", "lambdas3", "lambdas4", "lambdas5"]
546+
)["mean"]
547+
).reset_index()
548+
factor_loadings["factor"] = factor_loadings["index"].str.split("[", expand=True)[0]
549+
factor_loadings.columns = ["factor_loading", "factor_loading_weight", "factor"]
550+
factor_loadings["factor_loading_weight_sq"] = factor_loadings["factor_loading_weight"] ** 2
551+
factor_loadings["sum_sq_loadings"] = factor_loadings.groupby("factor")[
552+
"factor_loading_weight_sq"
553+
].transform(sum)
554+
factor_loadings["error_variances"] = az.summary(idata_mm, var_names=["Psi"])["mean"].values
555+
factor_loadings["total_indicator_variance"] = (
556+
factor_loadings["factor_loading_weight_sq"] + factor_loadings["error_variances"]
557+
)
558+
factor_loadings["total_variance"] = factor_loadings["total_indicator_variance"].sum()
559+
factor_loadings["indicator_explained_variance"] = (
560+
factor_loadings["factor_loading_weight_sq"] / factor_loadings["total_variance"]
561+
)
562+
factor_loadings["factor_explained_variance"] = (
563+
factor_loadings["sum_sq_loadings"] / factor_loadings["total_variance"]
564+
)
565+
num_cols = [c for c in factor_loadings.columns if not c in ["factor_loading", "factor"]]
566+
return factor_loadings
523567
524-
correlation_df = pd.DataFrame(az.extract(idata_mm["posterior"])["chol_cov_corr"].mean(axis=2))
525-
correlation_df.index = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
526-
correlation_df.columns = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
527568
528-
factor_loadings = pd.DataFrame(
529-
az.summary(idata_mm, var_names=["lambdas1", "lambdas2", "lambdas3", "lambdas4", "lambdas5"])[
530-
"mean"
531-
]
532-
).reset_index()
533-
factor_loadings["factor"] = factor_loadings["index"].str.split("[", expand=True)[0]
534-
factor_loadings.columns = ["factor_loading", "factor_loading_weight", "factor"]
535-
factor_loadings["factor_loading_weight_sq"] = factor_loadings["factor_loading_weight"] ** 2
536-
factor_loadings["sum_sq_loadings"] = factor_loadings.groupby("factor")[
537-
"factor_loading_weight_sq"
538-
].transform(sum)
539-
factor_loadings["error_variances"] = az.summary(idata_mm, var_names=["Psi"])["mean"].values
540-
factor_loadings["total_indicator_variance"] = (
541-
factor_loadings["factor_loading_weight_sq"] + factor_loadings["error_variances"]
542-
)
543-
factor_loadings["total_variance"] = factor_loadings["total_indicator_variance"].sum()
544-
factor_loadings["indicator_explained_variance"] = (
545-
factor_loadings["factor_loading_weight_sq"] / factor_loadings["total_variance"]
546-
)
547-
factor_loadings["factor_explained_variance"] = (
548-
factor_loadings["sum_sq_loadings"] / factor_loadings["total_variance"]
549-
)
550-
factor_loadings.style.background_gradient(
569+
factor_loadings = make_factor_loadings_df(idata_mm)
570+
num_cols = [c for c in factor_loadings.columns if not c in ["factor_loading", "factor"]]
571+
factor_loadings.style.format("{:.2f}", subset=num_cols).background_gradient(
551572
axis=0, subset=["indicator_explained_variance", "factor_explained_variance"]
552573
)
553574
```
554575

576+
We can pull out and plot the ordered weightings as a kind of feature importance plot.
577+
555578
```{code-cell} ipython3
556-
fig, ax = plt.subplots(figsize=(20, 6))
579+
fig, ax = plt.subplots(figsize=(15, 6))
557580
temp = factor_loadings[["factor_loading", "indicator_explained_variance"]].sort_values(
558581
by="indicator_explained_variance"
559582
)
560-
ax.barh(temp["factor_loading"], temp["indicator_explained_variance"], align="center")
561-
ax.set_title("Explained Variance")
583+
ax.barh(
584+
temp["factor_loading"], temp["indicator_explained_variance"], align="center", color="slateblue"
585+
)
586+
ax.set_title("Explained Variance");
562587
```
563588

589+
The goal of this kind of view isn't necessarily to find useful features as in the machine learning context, but to help understand the nature of the variation in our system. We can also pull out covariances and correlations among the latent factors
590+
564591
```{code-cell} ipython3
592+
:tags: [hide-input]
593+
594+
cov_df = pd.DataFrame(az.extract(idata_mm["posterior"])["cov"].mean(axis=2))
595+
cov_df.index = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
596+
cov_df.columns = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
597+
598+
correlation_df = pd.DataFrame(az.extract(idata_mm["posterior"])["chol_cov_corr"].mean(axis=2))
599+
correlation_df.index = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
600+
correlation_df.columns = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
601+
565602
fig, axs = plt.subplots(1, 2, figsize=(20, 6))
566603
axs = axs.flatten()
567604
mask = np.triu(np.ones_like(cov_df, dtype=bool))
@@ -571,8 +608,61 @@ axs[1].set_title("Covariance of Latent Constructs")
571608
sns.heatmap(correlation_df, annot=True, cmap="Blues", ax=axs[1], mask=mask);
572609
```
573610

611+
Which highlights the strong relationships between life-satisfaction `LS` construct, parental support `SUP_P` and social self-efficacy `SE_SOCIAL`. We can observe these patterns in the draws of our latent constructs too
612+
613+
```{code-cell} ipython3
614+
:tags: [hide-input]
615+
616+
fig, axs = plt.subplots(1, 3, figsize=(15, 10))
617+
axs = axs.flatten()
618+
ax = axs[0]
619+
ax1 = axs[1]
620+
ax2 = axs[2]
621+
az.plot_forest(idata_mm, var_names=["ksi"], combined=True, ax=ax, coords={"latent": ["SUP_P"]})
622+
az.plot_forest(
623+
idata_mm,
624+
var_names=["ksi"],
625+
combined=True,
626+
ax=ax1,
627+
colors="forestgreen",
628+
coords={"latent": ["SE_SOCIAL"]},
629+
)
630+
az.plot_forest(
631+
idata_mm,
632+
var_names=["ksi"],
633+
combined=True,
634+
ax=ax2,
635+
colors="slateblue",
636+
coords={"latent": ["LS"]},
637+
)
638+
ax.set_yticklabels([])
639+
ax.set_xlabel("SUP_P")
640+
ax1.set_yticklabels([])
641+
ax1.set_xlabel("SE_SOCIAL")
642+
ax2.set_yticklabels([])
643+
ax2.set_xlabel("LS")
644+
ax.axvline(-2, color="red")
645+
ax1.axvline(-2, color="red")
646+
ax2.axvline(-2, color="red")
647+
ax.set_title("Individual Parental Support Metric \n On Latent Factor SUP_P")
648+
ax1.set_title("Individual Social Self Efficacy \n On Latent Factor SE_SOCIAL")
649+
ax2.set_title("Individual Life Satisfaction Metric \n On Latent Factor LS")
650+
plt.show();
651+
```
652+
574653
## Bayesian Structural Equation Models
575654

655+
We've now seen how measurement models help us understand the relationships between disparate indicator variables in a kind of crude way. We have postulated a system of latent factors and derive the correlations between these factors to help us understand the strength of relationships between the broader constructs of interest. But this is kind a special case of a structural equation models. In the SEM tradition we're interested in figuring out aspects of the structural relations between variables that means want to posit dependence and independence relationship to interrogate our beliefs about influence flows through the system. For our data set we can postulate the following chain of dependencies
656+
657+
![Candidate Structural Model](structural_model_sem.png)
658+
659+
This model introduces the specific claims of dependence and the question then becomes how to model these patterns? In the next section we'll build on the structures of the basic measurement model to articulate these chain of dependence as functional equations of the "root" constructs. This allows to evaluate the same questions of model adequacy as before, but additionally we can now phrase questions about direct and indirect relationships between the latent constructs. In particular, since our focus is on what drives life-satisfaction, we can ask about the mediated effects of parental support.
660+
661+
### Model Complexity and Bayesian Sensitivity Analysis
662+
663+
These models are complicated, we're adding a bunch of new parameters and structure to the model. Each of the parameters is equipped with a prior that shapes the implications of the model specification. This is hugely expressive framework where we can encode a huge variety of dependencies and correlations With this freedom to structure of inferential model we need to be careful to assess the robustness of our inferences. As such we will here perform a quick sensitivity analysis to show how the central implications of this model vary under prior settings.
664+
665+
576666
```{code-cell} ipython3
577667
drivers = [
578668
"se_acad_p1",
@@ -732,8 +822,16 @@ model_sem2, idata_sem2 = make_indirect_sem(
732822
)
733823
```
734824

825+
The main structural feature to observe is that we've now added a bunch of regressions to our model such that some of the constructs that we took as given in the measurement model are now derived as a linear combination of others. Because we removed the correlation effect between `SE_SOCIAL` AND `SE_ACAD` we re-introduce the possibility of their correlation by adding correlated error terms to their regression equations.
826+
827+
```{code-cell} ipython3
828+
pm.model_to_graphviz(model_sem0)
829+
```
830+
831+
Next we'll see how the parameter estimates change across our prior specifications for the model
832+
735833
```{code-cell} ipython3
736-
fig, ax = plt.subplots(figsize=(10, 15))
834+
fig, ax = plt.subplots(figsize=(15, 15))
737835
az.plot_forest(
738836
[idata_sem0, idata_sem1, idata_sem2],
739837
model_names=["SEM0", "SEM1", "SEM2"],
@@ -743,6 +841,10 @@ az.plot_forest(
743841
);
744842
```
745843

844+
### Model Evaluation Checks
845+
846+
A quick evaluation of model performance suggests we do somewhat less well in recovering the sample covariance structures than we did with simpler measurement model
847+
746848
```{code-cell} ipython3
747849
residuals_posterior_cov = get_posterior_resids(idata_sem0, 2500)
748850
residuals_posterior_corr = get_posterior_resids(idata_sem0, 2500, metric="corr")
@@ -753,10 +855,18 @@ ax = sns.heatmap(residuals_posterior_corr, annot=True, cmap="bwr", center=0, mas
753855
ax.set_title("Residuals between Model Implied and Sample Correlations", fontsize=25);
754856
```
755857

858+
But the posterior predictive checks still look reasonable.
859+
756860
```{code-cell} ipython3
757861
make_ppc(idata_sem0, 100, drivers=drivers, dims=(3, 5))
758862
```
759863

864+
We can also continue to assess questions of direct and indirect effects that were obscure in the simpler measurement model.
865+
866+
+++
867+
868+
### Indirect and Direct Effects
869+
760870
```{code-cell} ipython3
761871
def calculate_effects(idata, var="SUP_P"):
762872
summary_df = az.summary(idata, var_names=["beta_r", "beta_r2"])
@@ -812,6 +922,10 @@ summary_f.index = ["SEM0", "SEM1", "SEM2"]
812922
summary_f
813923
```
814924

925+
# Conclusion
926+
927+
+++
928+
815929
## Authors
816930
- Authored by [Nathaniel Forde](https://nathanielf.github.io/posts/post-with-code/CFA_AND_SEM/CFA_AND_SEM.html) in September 2024
817931

102 KB
Loading

0 commit comments

Comments
 (0)