Commit 37bdf03

committed
+ minor update: forced addtional level c4 into d450 categorical feature
+ fixed a couple of typos
1 parent 8f224da commit 37bdf03

File tree

2 files changed

+513
-506
lines changed

examples/generalized_linear_models/GLM-ordinal-features.ipynb

Lines changed: 504 additions & 499 deletions
Large diffs are not rendered by default.

examples/generalized_linear_models/GLM-ordinal-features.myst.md

Lines changed: 9 additions & 7 deletions
@@ -211,7 +211,7 @@ differently generated (and also subjective) measure of physical health. In the d
 `b1602`, `b102`, ... `d450`, `d455`, ... `s770` etc, and the target feature is named `phcs`.

 Per the Bürkner paper we will subselect 2 features `d450`, `d455` (which measure an impairment of patient
-walking ability on a scale `[0 to 4]` [no problem to complete problem]) and use them to predict `phcs`.
+walking ability on a scale `[0 to 4]` [`"no problem"` to `"complete problem"`]) and use them to predict `phcs`.

 Quite interestingly, for feature `d450`, the highest ordinal level value `4` is not seen in the dataset, so we have a
 missing data problem which could further encourage the misuse of a numeric coefficient to average or "interpolate" a
@@ -346,14 +346,16 @@ df.loc[idx, ft] = df.loc[idx, ft].apply(lambda x: f"c{x}")
 df[ft].unique()
 ```

+NOTE force the categorical levels to include c4 which is valid in the data domain but unobserved
+
 ```{code-cell} ipython3
 ---
 colab:
 base_uri: https://localhost:8080/
 id: Hk-DR6akJagC
 outputId: bb70e394-2623-470a-8621-efacdb107b72
 ---
-lvls = ["c0", "c1", "c2", "c3"]
+lvls = ["c0", "c1", "c2", "c3", "c4"]
 df[ft] = pd.Categorical(df[ft].values, categories=lvls, ordered=True)
 df[ft].cat.categories
 ```
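
The effect of adding `c4` to `lvls` can be checked on toy data. The sketch below is not taken from the notebook: the `observed` values are hypothetical, and only the `lvls` list and the `pd.Categorical(..., ordered=True)` call mirror the diff. It illustrates that a declared but unobserved level still appears in the categories and gets a zero count.

```python
import pandas as pd

# Hypothetical toy data: "c4" is valid in the data domain but never observed.
observed = ["c0", "c2", "c1", "c3", "c0", "c1"]
lvls = ["c0", "c1", "c2", "c3", "c4"]

cat = pd.Categorical(observed, categories=lvls, ordered=True)

# The declared categories include the unobserved level.
n_levels = len(cat.categories)

# Integer codes follow the declared category order, so "c3" maps to 3.
codes = cat.codes.tolist()

# Counting a categorical Series keeps unobserved categories with count 0.
counts = pd.Series(cat).value_counts(sort=False).to_dict()
```

With the extra level declared, anything sized from the number of categories (for example a coefficient vector with one entry per level) would have five entries rather than four, which appears to be the point of the NOTE in this hunk.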
@@ -474,8 +476,8 @@ def plot_numeric_vs_cat(df, ftnum="phcs", ftcat="d450") -> plt.figure:
 meanprops=dict(markerfacecolor="w", markeredgecolor="#333333", marker="d", markersize=12),
 )

-_ = sns.boxplot(x=ftnum, y=ftcat, hue=ftcat, data=df, **kws_box, ax=ax0)
-_ = sns.countplot(y=ftcat, hue=ftcat, data=df, ax=ax1)
+_ = sns.boxplot(x=ftnum, y=ftcat, data=df, **kws_box, ax=ax0) # hue=ftcat,
+_ = sns.countplot(y=ftcat, data=df, ax=ax1) # hue=ftcat, seaborn >= 0.13
 _ = ax0.invert_yaxis()
 _ = ax1.yaxis.label.set_visible(False)
 _ = plt.setp(ax1.get_yticklabels(), visible=False)
@@ -1081,7 +1083,7 @@ plot_predicted_phcshat_d450_d455(idata=ida_ppc, mdlname="mdla")
 This is an improved linear model where we acknowledge that the categorical features are ordinal _and_ allow the ordinal
 values to have a non-equal spacing, For example, it might well be that `A > B > C`, but the spacing is not metric:
 instead `A >>> B > C`. We achieve this using a Dirichlet hyperprior to allocate hetrogenously spaced sections of a
-linear ceofficient:
+linear coefficient:

 $$
 \begin{align}
@@ -1090,11 +1092,11 @@ $$
 \\
 \beta_{d450} &\sim \text{Normal}(0, \sigma_{\beta}) \\
 \chi_{d450} &\sim \text{Dirichlet}(1, \text{shape}=k_{d450}) \\
-\nu_{d450} &\sim \beta_{d450} * \sum_{i=0}^{i=k_{d450}}\chi_{d450} \\
+\nu_{d450} &\sim \beta_{d450} * \sum_{k=0}^{k=k_{d450}}\chi_{d450} \\
 \\
 \beta_{d455} &\sim \text{Normal}(0, \sigma_{\beta}) \\
 \chi_{d455} &\sim \text{Dirichlet}(1, \text{shape}=k_{d455}) \\
-\nu_{d455} &\sim \beta_{d455} * \sum_{i=0}^{i=k_{d455}}\chi_{d455} \\
+\nu_{d455} &\sim \beta_{d455} * \sum_{k=0}^{k=k_{d455}}\chi_{d455} \\
 \\
 lm &= \beta^{T}\mathbb{x}_{i,j} + \nu_{d450}[x_{i,d450}] + \nu_{d455}[x_{i,d455}]\\
 \epsilon &\sim \text{InverseGamma}(11, 10) \\
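
One way to read the `\nu` lines in this hunk is as a running (cumulative) sum over the Dirichlet weights, scaled by `\beta`, so that each ordinal level gets its own share of one linear coefficient. The sketch below is a standard-library illustration under that assumption and is not the notebook's PyMC code: `beta`, `k`, the seed and the helper `sample_dirichlet` are all hypothetical.

```python
import random
from itertools import accumulate


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)  # arbitrary seed, for repeatability only
k = 5                    # assumed number of ordinal levels, e.g. c0..c4
beta = 2.0               # hypothetical fixed value; the model has beta ~ Normal(0, sigma_beta)

# chi lies on the simplex: k positive weights that sum to 1.
chi = sample_dirichlet([1.0] * k, rng)

# Assumed reading of the sum: nu[j] = beta * (chi[0] + ... + chi[j]).
nu = [beta * partial for partial in accumulate(chi)]

# The linear model then indexes nu by the observed ordinal level.
level = 3
effect = nu[level]
```

Under this reading the gaps between consecutive entries of `nu` are `beta * chi[j]`, so the levels stay ordered while their spacing is free to be unequal, matching the `A >>> B > C` description in the preceding hunk.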
