
Commit eb4ed98

update plot sizes
Signed-off-by: Nathaniel <[email protected]>
1 parent 8ec7863

File tree

2 files changed: +1094 -1073 lines changed


examples/case_studies/CFA_SEM.ipynb

Lines changed: 1074 additions & 1062 deletions
Large diffs are not rendered by default.

examples/case_studies/CFA_SEM.myst.md

Lines changed: 20 additions & 11 deletions
@@ -23,7 +23,9 @@ kernelspec:
 
 In the psychometrics literature the data is often derived from a strategically constructed survey aimed at a particular target phenomena. Some intuited, but not yet measured, concept that arguably plays a role in human action, motivation or sentiment. The relative “fuzziness” of the subject matter in psychometrics has had a catalyzing effect on the methodological rigour sought in the science.
 
-Survey designs are agonized over for correct tone and rhythm of sentence structure. Measurement scales are doubly checked for reliability and correctness. The literature is consulted and questions are refined. Analysis steps are justified and tested under a wealth of modelling routines. Model architectures are defined and refined to better express the hypothesized structures in the data-generating process. We will see how such due diligence leads to powerful and expressive models that grant us tractability on thorny questions of human affect. We draw on Roy Levy and Robert J. Mislevy's _Bayesian Psychometric Modeling_.
+Survey designs are agonized over for correct tone and rhythm of sentence structure. Measurement scales are doubly checked for reliability and correctness. The literature is consulted and questions are refined. Analysis steps are justified and tested under a wealth of modelling routines. Model architectures are defined and refined to better express the hypothesized structures in the data-generating process. We will see how such due diligence leads to powerful and expressive models that grant us tractability on thorny questions of human affect.
+
+Throughout we draw on Roy Levy and Robert J. Mislevy's excellent _Bayesian Psychometric Modeling_.
 
 ```{code-cell} ipython3
 import warnings
@@ -47,35 +49,42 @@ rng = np.random.default_rng(42)
 
 ### Latent Constructs and Measurement
 
-Our data is borrowed from work by Boris Mayer and Andrew Ellis found [here](https://methodenlehre.github.io/SGSCLM-R-course/cfa-and-sem-with-lavaan.html#structural-equation-modelling-sem). They demonstrate CFA and SEM modelling with lavaan. We’ll load up their data. We have survey responses from ~300 individuals who have answered questions regarding their upbringing, self-efficacy and reported life-satisfaction. The hypothetical dependency structure in this life-satisfaction data-set posits a moderated relationship between scores related to life-satisfaction, parental and family support and self-efficacy. It is not a trivial task to be able to design a survey that can elicit answers plausibly mapped to each of these “factors” or themes, never mind finding a model of their relationship that can inform us as to the relative of impact of each on life-satisfaction outcomes.
+Our data is borrowed from work by Boris Mayer and Andrew Ellis found [here](https://methodenlehre.github.io/SGSCLM-R-course/cfa-and-sem-with-lavaan.html#structural-equation-modelling-sem). They demonstrate CFA and SEM modelling with lavaan.
 
-First we'll pull out the data and examine some summary statistics.
+We have survey responses from ~300 individuals who have answered questions regarding their upbringing, self-efficacy and reported life-satisfaction. The hypothetical dependency structure in this life-satisfaction data-set posits a moderated relationship between scores related to life-satisfaction, parental and family support and self-efficacy. It is no trivial task to design a survey that can elicit answers plausibly mapped to each of these “factors” or themes, never mind to find a model of their relationship that can inform us as to the relative impact of each on life-satisfaction outcomes.
 
+First let's pull out the data and examine some summary statistics.
 
 ```{code-cell} ipython3
 df = pd.read_csv("../data/sem_data.csv")
 df.head()
 ```
 
 ```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(20, 7))
+fig, ax = plt.subplots(figsize=(20, 10))
 drivers = [c for c in df.columns if not c in ["region", "gender", "age", "ID"]]
 corr_df = df[drivers].corr()
 mask = np.triu(np.ones_like(corr_df, dtype=bool))
 sns.heatmap(corr_df, annot=True, cmap="Blues", ax=ax, center=0, mask=mask)
 ax.set_title("Sample Correlations between indicator Metrics")
-fig, ax = plt.subplots(figsize=(20, 7))
+fig, ax = plt.subplots(figsize=(20, 10))
 sns.heatmap(df[drivers].cov(), annot=True, cmap="Blues", ax=ax, center=0, mask=mask)
 ax.set_title("Sample Covariances between indicator Metrics");
 ```
 
-The lens here on the sample covariance matrix is common in the traditional SEM models are often fit to the data by optimising a fit to the covariance matrix. Model assessment routines often gauge the model's ability to recover the sample covariance relations. There is a slightyly different approach taken in the Bayesian approach to estimating these models which focuses on the observed data rather than the derived summary statistics. Next we'll plot the pairplot to visualise the nature of the correlations
+This lens on the sample covariance matrix is common in traditional SEM modeling: CFA and SEM models are often estimated by optimising parameters to reproduce the sample covariance matrix, and model assessment routines then gauge the model's ability to recover the sample covariance relations. The Bayesian approach to estimating these models takes a slightly different (less constrained) tack, focusing on the observed data rather than on derived summary statistics.
+
+Next we'll plot the pairplot to visualise the nature of the correlations.
 
 ```{code-cell} ipython3
 ax = sns.pairplot(df[drivers], kind="reg", corner=True, diag_kind="kde")
 plt.suptitle("Pair Plot of Indicator Metrics with Regression Fits", fontsize=30);
 ```
 
+It's this wide-ranging set of relationships that we seek to distill in our CFA models. How can we take this complex joint distribution and structure it in a way that is plausible and interpretable?
+
++++
+
 ## Measurement Models
 
 +++
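To make that distillation concrete, here is a minimal one-factor sketch in PyMC of the kind of measurement model the notebook goes on to build. It is an illustrative reduction rather than the notebook's actual model (which posits several correlated constructs), and the parameter names (`ksi`, `lambdas`, `tau`) are placeholders:

```python
import numpy as np
import pandas as pd
import pymc as pm

df = pd.read_csv("../data/sem_data.csv")
drivers = [c for c in df.columns if c not in ["region", "gender", "age", "ID"]]
obs = df[drivers].values

with pm.Model() as one_factor_cfa:
    # One latent construct score per respondent
    ksi = pm.Normal("ksi", 0, 1, shape=obs.shape[0])
    # Fix the first loading to 1 to identify the latent scale;
    # the remaining loadings are free parameters
    lambdas_free = pm.Normal("lambdas_free", 1, 0.5, shape=len(drivers) - 1)
    lambdas = pm.Deterministic(
        "lambdas", pm.math.concatenate([np.ones(1), lambdas_free])
    )
    tau = pm.Normal("tau", 3, 1, shape=len(drivers))  # indicator intercepts
    sigma = pm.HalfNormal("sigma", 1, shape=len(drivers))  # residual scales
    # Each indicator is a noisy linear function of the single factor
    mu = tau + ksi[:, None] * lambdas
    pm.Normal("likelihood", mu=mu, sigma=sigma, observed=obs)
    idata = pm.sample()
```

Fixing the first loading to 1 is one standard way to pin down the scale of an otherwise unidentified latent factor.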
@@ -535,14 +544,14 @@ residuals_posterior_corr = get_posterior_resids(idata_mm, 2500, metric="corr")
 These tables lend themselves to nice plots where we can highlight the deviation from the sample covariance and correlation statistics.
 
 ```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(20, 7))
+fig, ax = plt.subplots(figsize=(20, 10))
 mask = np.triu(np.ones_like(residuals_posterior_corr, dtype=bool))
 ax = sns.heatmap(residuals_posterior_corr, annot=True, cmap="bwr", mask=mask)
 ax.set_title("Residuals between Model Implied and Sample Correlations", fontsize=25);
 ```
 
 ```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(20, 7))
+fig, ax = plt.subplots(figsize=(20, 10))
 ax = sns.heatmap(residuals_posterior_cov, annot=True, cmap="bwr", mask=mask)
 ax.set_title("Residuals between Model Implied and Sample Covariances", fontsize=25);
 ```
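The helper `get_posterior_resids` is defined earlier in the notebook and falls outside this diff. For orientation, a rough sketch of the kind of computation such a helper performs: average the (co)variances implied by posterior-predictive draws, then difference against the sample statistic. The variable name `"likelihood"` and the dimension handling are assumptions, not the notebook's actual code:

```python
import arviz as az
import numpy as np
import pandas as pd


def posterior_cov_resids(idata, n_draws, observed_df, metric="cov"):
    """Average model-implied (co)variances over posterior-predictive
    draws and subtract the sample statistic."""
    # Flatten chains/draws and subsample; dims of "likelihood" are assumed
    # to be (observation, indicator, sample) after extraction
    yrep = az.extract(
        idata, group="posterior_predictive", num_samples=n_draws
    )["likelihood"].values
    implied = np.stack(
        [np.cov(yrep[..., s], rowvar=False) for s in range(yrep.shape[-1])]
    ).mean(axis=0)
    sample = observed_df.cov().values
    if metric == "corr":
        # Normalise both matrices to correlations before differencing
        def to_corr(m):
            d = np.sqrt(np.diag(m))
            return m / np.outer(d, d)

        implied, sample = to_corr(implied), to_corr(sample)
    return pd.DataFrame(
        implied - sample, index=observed_df.columns, columns=observed_df.columns
    )
```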
@@ -595,7 +604,7 @@ factor_loadings.style.format("{:.2f}", subset=num_cols).background_gradient(
 We can pull out and plot the ordered weightings as a kind of feature importance plot.
 
 ```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(15, 6))
+fig, ax = plt.subplots(figsize=(15, 8))
 temp = factor_loadings[["factor_loading", "indicator_explained_variance"]].sort_values(
     by="indicator_explained_variance"
 )
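For reference, under a standardized latent factor, the share of an indicator's variance explained by the factor (its communality) is conventionally

$$
\text{explained variance}_i = \frac{\lambda_i^2}{\lambda_i^2 + \sigma_i^2},
$$

where $\lambda_i$ is the factor loading and $\sigma_i^2$ the residual variance of indicator $i$. This is presumably how the `indicator_explained_variance` column is derived, though its construction sits outside this diff.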
@@ -618,7 +627,7 @@ correlation_df = pd.DataFrame(az.extract(idata_mm["posterior"])["chol_cov_corr"]
 correlation_df.index = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
 correlation_df.columns = ["SE_ACAD", "SE_SOCIAL", "SUP_F", "SUP_P", "LS"]
 
-fig, axs = plt.subplots(1, 2, figsize=(20, 6))
+fig, axs = plt.subplots(1, 2, figsize=(20, 10))
 axs = axs.flatten()
 mask = np.triu(np.ones_like(cov_df, dtype=bool))
 sns.heatmap(cov_df, annot=True, cmap="Blues", ax=axs[0], mask=mask)
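A note on `chol_cov_corr`: the name is consistent with the deterministic PyMC registers automatically when a covariance prior is declared with `pm.LKJCholeskyCov(..., compute_corr=True)`. A minimal self-contained sketch of such a declaration, where `n=5` matches the five constructs named above and the `eta` and `sd_dist` choices are placeholders:

```python
import pymc as pm

with pm.Model():
    # compute_corr=True makes LKJCholeskyCov return the Cholesky factor
    # along with the derived correlation matrix and standard deviations;
    # given the name "chol_cov", PyMC stores the latter two in the trace
    # as "chol_cov_corr" and "chol_cov_stds"
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol_cov", n=5, eta=2, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
```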
@@ -891,7 +900,7 @@ A quick evaluation of model performance suggests we do somewhat less well in rec
 residuals_posterior_cov = get_posterior_resids(idata_sem0, 2500)
 residuals_posterior_corr = get_posterior_resids(idata_sem0, 2500, metric="corr")
 
-fig, ax = plt.subplots(figsize=(20, 7))
+fig, ax = plt.subplots(figsize=(20, 10))
 mask = np.triu(np.ones_like(residuals_posterior_corr, dtype=bool))
 ax = sns.heatmap(residuals_posterior_corr, annot=True, cmap="bwr", center=0, mask=mask)
 ax.set_title("Residuals between Model Implied and Sample Correlations", fontsize=25);
