Commit 56eaed5

+ added more explanatory text to 2.4.4
+ fixed a couple of typos
1 parent 318393f commit 56eaed5

2 files changed

+25
-17
lines changed

examples/generalized_linear_models/GLM-missing-values-in-covariates.ipynb

Lines changed: 13 additions & 9 deletions
@@ -5373,11 +5373,14 @@
 "**Observe:**\n",
 "\n",
 "+ Here's our auto-imputed posterior distributions (boxplots) for the missing data on the in-sample dataset `dfx_train`\n",
-"+ Some observations (e.g. `o00`, `o03`) have missing values in both `c` and `d`, others (e.g `o06`, `o09`) have only one\n",
 "+ These are a (very helpful!) side-effect of our model construction and let us fill-in the real-world missing values for\n",
 " `c`, and `d` in `df_train`\n",
-"+ We also overplot the known true values from the synthetic dataset: and the match is very close for all: usually \n",
-" well-within the HDI94"
+"+ Some observations (e.g. `o00`, `o03`) have missing values in both `c` and `d`, others (e.g `o04`, `o06`) have only one\n",
+"+ We also overplot the known true values from the synthetic dataset: and the match is close for all: usually \n",
+" well-within the HDI94\n",
+"+ Where observations have more than one missing value (e.g. `o00`, `o8`, `o18`, `023` are good examples), we see the\n",
+" possibility of a lack of identifiability: this is an interesting and not easily avoided side-effect of the data and\n",
+" model architecture, and in the real-world we might seek to mitigate through removing observations or features."
 ]
 },
 {
@@ -5396,19 +5399,20 @@
 "The following process is a bit of a hack:\n",
 "\n",
 "1. Firstly: Re-specify the model entirely, using `dfx_holdout`, because doing `mdla.set_data` \n",
-" doesn't update the `np.ma.masked_array`, and as noted above $\\S 2.1$ **Build Model Object** we can't put nans or \n",
+" doesn't update the `np.ma.masked_array`, and as noted above $\\S 2.1$ **Build Model Object** we can't put `nans` or \n",
 " a `masked_array` into a `pm.Data`\n",
 "1. Secondly: Sample_ppc the `xk_unobserved`, because this is a precursor to computing `yhat`, and we can't specify\n",
-" a conditional order in sample_posterior_predictive\n",
+" a conditional order in `sample_posterior_predictive`\n",
 "2. Thirdly: Use those predictions to sample_ppc the `yhat`\n",
 "\n",
 "**REALITIES**\n",
 "\n",
 "+ This process is suboptimal for a real-world scenario wherein we want to forecast new incoming data, because we have to\n",
-" keep re-specifying the model in Step 1, and Steps 2 & 3 involve manipulations of idata objects. \n",
-"+ It might still be suitable for a relatively slow, (potentially batched) forecasting process on the order of seconds, \n",
-" not sub-second.\n",
-"+ In any case, if this were to be deployed to handle a stream of sub-second inputs, a miuch simpler way to rectify the \n",
+" keep re-specifying the model in Step 1 (which opens opportunities for simple human error), and Steps 2 & 3 involve \n",
+" manipulations of idata objects, which is a faff\n",
+"+ It should still be suitable for a relatively slow, (potentially batched) forecasting process on the order of seconds, \n",
+" not sub-second\n",
+"+ In any case, if this were to be deployed to handle a stream of sub-second inputs, a much simpler way to rectify the \n",
 " situation would be to ensure proper data validation / hygiene upstream and require no missing data!"
 ]
 },

examples/generalized_linear_models/GLM-missing-values-in-covariates.myst.md

Lines changed: 12 additions & 8 deletions
@@ -1267,11 +1267,14 @@ _ = plot_xkhat_vs_xk(df_xk_unobs, mdlnm="mdla")
 **Observe:**
 
 + Here's our auto-imputed posterior distributions (boxplots) for the missing data on the in-sample dataset `dfx_train`
-+ Some observations (e.g. `o00`, `o03`) have missing values in both `c` and `d`, others (e.g `o06`, `o09`) have only one
 + These are a (very helpful!) side-effect of our model construction and let us fill-in the real-world missing values for
 `c`, and `d` in `df_train`
-+ We also overplot the known true values from the synthetic dataset: and the match is very close for all: usually
++ Some observations (e.g. `o00`, `o03`) have missing values in both `c` and `d`, others (e.g `o04`, `o06`) have only one
++ We also overplot the known true values from the synthetic dataset: and the match is close for all: usually
 well-within the HDI94
++ Where observations have more than one missing value (e.g. `o00`, `o8`, `o18`, `023` are good examples), we see the
+possibility of a lack of identifiability: this is an interesting and not easily avoided side-effect of the data and
+model architecture, and in the real-world we might seek to mitigate through removing observations or features.
 
 +++
 
@@ -1284,19 +1287,20 @@ _ = plot_xkhat_vs_xk(df_xk_unobs, mdlnm="mdla")
 The following process is a bit of a hack:
 
 1. Firstly: Re-specify the model entirely, using `dfx_holdout`, because doing `mdla.set_data`
-doesn't update the `np.ma.masked_array`, and as noted above $\S 2.1$ **Build Model Object** we can't put nans or
+doesn't update the `np.ma.masked_array`, and as noted above $\S 2.1$ **Build Model Object** we can't put `nans` or
 a `masked_array` into a `pm.Data`
 1. Secondly: Sample_ppc the `xk_unobserved`, because this is a precursor to computing `yhat`, and we can't specify
-a conditional order in sample_posterior_predictive
+a conditional order in `sample_posterior_predictive`
 2. Thirdly: Use those predictions to sample_ppc the `yhat`
 
 **REALITIES**
 
 + This process is suboptimal for a real-world scenario wherein we want to forecast new incoming data, because we have to
-keep re-specifying the model in Step 1, and Steps 2 & 3 involve manipulations of idata objects.
-+ It might still be suitable for a relatively slow, (potentially batched) forecasting process on the order of seconds,
-not sub-second.
-+ In any case, if this were to be deployed to handle a stream of sub-second inputs, a miuch simpler way to rectify the
+keep re-specifying the model in Step 1 (which opens opportunities for simple human error), and Steps 2 & 3 involve
+manipulations of idata objects, which is a faff
++ It should still be suitable for a relatively slow, (potentially batched) forecasting process on the order of seconds,
+not sub-second
++ In any case, if this were to be deployed to handle a stream of sub-second inputs, a much simpler way to rectify the
 situation would be to ensure proper data validation / hygiene upstream and require no missing data!
 
 +++
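
Note on the **Observe** text in this diff: it judges the imputations by whether each known true value falls within the HDI94 of the corresponding posterior. As a rough illustration of that check, here is a minimal pure-Python sketch. It is not the notebook's code. The helper names `hdi` and `true_value_in_hdi` and the sample draws are hypothetical, and it assumes a unimodal posterior, taking the HDI as the narrowest window that holds 94% of the sorted draws.

```python
import math


def hdi(samples, prob=0.94):
    """Return (low, high) for the narrowest interval holding `prob` of the draws.

    Illustrative helper only (hypothetical name); assumes a unimodal posterior.
    """
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    if not 0 < prob <= 1:
        raise ValueError("prob must be in (0, 1]")
    # Number of draws that must fall inside the interval
    k = max(1, math.ceil(prob * n))
    # Slide a window of k consecutive sorted draws and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def true_value_in_hdi(samples, true_value, prob=0.94):
    """Check whether a known true value lies inside the HDI of the draws."""
    low, high = hdi(samples, prob)
    return low <= true_value <= high


# Made-up posterior draws for one imputed value, plus a made-up "true" value
draws = [0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 1.4, 5.0]
print(hdi(draws, prob=0.9))
print(true_value_in_hdi(draws, true_value=1.15, prob=0.9))
```

Applied per observation and per feature, a check like this turns "usually well-within the HDI94" into a count of hits and misses, which may help when comparing observations with one missing value against those with several.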
