@@ -211,7 +211,7 @@ differently generated (and also subjective) measure of physical health. In the d
`b1602`, `b102`, ... `d450`, `d455`, ... `s770` etc, and the target feature is named `phcs`.

Per the Bürkner paper we will subselect 2 features `d450`, `d455` (which measure an impairment of patient
- walking ability on a scale `[0 to 4]` [no problem to complete problem]) and use them to predict `phcs`.
+ walking ability on a scale `[0 to 4]` [`"no problem"` to `"complete problem"`]) and use them to predict `phcs`.

Quite interestingly, for feature `d450`, the highest ordinal level value `4` is not seen in the dataset, so we have a
missing data problem which could further encourage the misuse of a numeric coefficient to average or "interpolate" a
@@ -346,14 +346,16 @@ df.loc[idx, ft] = df.loc[idx, ft].apply(lambda x: f"c{x}")
df[ft].unique()
```

+ NOTE force the categorical levels to include c4 which is valid in the data domain but unobserved
+
``` {code-cell} ipython3
---
colab:
  base_uri: https://localhost:8080/
id: Hk-DR6akJagC
outputId: bb70e394-2623-470a-8621-efacdb107b72
---
- lvls = ["c0", "c1", "c2", "c3"]
+ lvls = ["c0", "c1", "c2", "c3", "c4"]
df[ft] = pd.Categorical(df[ft].values, categories=lvls, ordered=True)
df[ft].cat.categories
```
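
The effect of declaring the unobserved level can be checked on a toy series. This is a minimal sketch, not the notebook's data: the series `s` and the names `cat`, `n_levels`, `counts` and `codes` are hypothetical stand-ins, and only `pd.Categorical(..., categories=lvls, ordered=True)` mirrors the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the d450 column: level "c4" never occurs
s = pd.Series(["c0", "c1", "c3", "c2", "c1"])

# Declare all five levels of the data domain, as in the changed cell
lvls = ["c0", "c1", "c2", "c3", "c4"]
cat = pd.Series(pd.Categorical(s.values, categories=lvls, ordered=True))

n_levels = len(cat.cat.categories)     # declared levels, including unobserved "c4"
counts = cat.value_counts(sort=False)  # one row per declared level, in level order
codes = cat.cat.codes.to_numpy()       # integer codes of the observed values
```

With the shorter level list the categorical would carry only four levels, so anything sized from `cat.categories` downstream would presumably have no slot for `c4`.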
@@ -474,8 +476,8 @@ def plot_numeric_vs_cat(df, ftnum="phcs", ftcat="d450") -> plt.figure:
meanprops=dict(markerfacecolor="w", markeredgecolor="#333333", marker="d", markersize=12),
)

- _ = sns.boxplot(x=ftnum, y=ftcat, hue=ftcat, data=df, **kws_box, ax=ax0)
- _ = sns.countplot(y=ftcat, hue=ftcat, data=df, ax=ax1)
+ _ = sns.boxplot(x=ftnum, y=ftcat, data=df, **kws_box, ax=ax0)  # hue=ftcat,
+ _ = sns.countplot(y=ftcat, data=df, ax=ax1)  # hue=ftcat, seaborn >= 0.13
_ = ax0.invert_yaxis()
_ = ax1.yaxis.label.set_visible(False)
_ = plt.setp(ax1.get_yticklabels(), visible=False)
@@ -1081,7 +1083,7 @@ plot_predicted_phcshat_d450_d455(idata=ida_ppc, mdlname="mdla")
This is an improved linear model where we acknowledge that the categorical features are ordinal _and_ allow the ordinal
values to have a non-equal spacing, For example, it might well be that `A > B > C`, but the spacing is not metric:
instead `A >>> B > C`. We achieve this using a Dirichlet hyperprior to allocate hetrogenously spaced sections of a
- linear ceofficient:
+ linear coefficient:

$$
\begin{align}
@@ -1090,11 +1092,11 @@ $$
\\
\beta_{d450} &\sim \text{Normal}(0, \sigma_{\beta}) \\
\chi_{d450} &\sim \text{Dirichlet}(1, \text{shape}=k_{d450}) \\
- \nu_{d450} &\sim \beta_{d450} * \sum_{i=0}^{i=k_{d450}}\chi_{d450} \\
+ \nu_{d450} &\sim \beta_{d450} * \sum_{k=0}^{k=k_{d450}}\chi_{d450} \\
\\
\beta_{d455} &\sim \text{Normal}(0, \sigma_{\beta}) \\
\chi_{d455} &\sim \text{Dirichlet}(1, \text{shape}=k_{d455}) \\
- \nu_{d455} &\sim \beta_{d455} * \sum_{i=0}^{i=k_{d455}}\chi_{d455} \\
+ \nu_{d455} &\sim \beta_{d455} * \sum_{k=0}^{k=k_{d455}}\chi_{d455} \\
\\
lm &= \beta^{T}\mathbb{x}_{i,j} + \nu_{d450}[x_{i,d450}] + \nu_{d455}[x_{i,d455}]\\
\epsilon &\sim \text{InverseGamma}(11, 10) \\
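
The Dirichlet allocation in the hunks above can be illustrated numerically. This is a rough NumPy sketch under stated assumptions, not the notebook's PyMC code: it reads the sum over `\chi` as a cumulative sum up to each ordinal level, pins the lowest level at zero, and uses a fixed hypothetical `beta` in place of a draw from the Normal prior. The names `k`, `chi`, `nu`, `x_d450` and `contribution` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

k = 4        # assumed: number of increments between five ordinal levels c0..c4
beta = -2.0  # hypothetical fixed value standing in for the Normal(0, sigma_beta) draw

# chi: one draw from a flat Dirichlet(1); entries are positive and sum to 1
chi = rng.dirichlet(np.ones(k))

# nu: share of beta accumulated at each level (assumed cumulative reading of the sum)
# the lowest level contributes 0 and the highest contributes the full beta
nu = beta * np.concatenate([[0.0], np.cumsum(chi)])

# hypothetical ordinal codes for five observations, used to index nu as in nu[x_{i,d450}]
x_d450 = np.array([0, 1, 1, 3, 4])
contribution = nu[x_d450]
```

Because the entries of `chi` are free to differ, the steps between adjacent levels are unequal while `nu` stays monotone, which is the `A >>> B > C` behaviour the text describes.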