|
5373 | 5373 | "**Observe:**\n",
|
5374 | 5374 | "\n",
|
5375 | 5375 | "+ Here are our auto-imputed posterior distributions (boxplots) for the missing data on the in-sample dataset `dfx_train`\n",
|
5376 |
| - "+ Some observations (e.g. `o00`, `o03`) have missing values in both `c` and `d`, others (e.g `o06`, `o09`) have only one\n", |
5377 | 5376 | "+ These are a (very helpful!) side-effect of our model construction and let us fill-in the real-world missing values for\n",
|
5378 | 5377 | " `c`, and `d` in `df_train`\n",
|
5379 |
| - "+ We also overplot the known true values from the synthetic dataset: and the match is very close for all: usually \n", |
5380 |
| - " well-within the HDI94" |
| 5378 | + "+ Some observations (e.g. `o00`, `o03`) have missing values in both `c` and `d`, others (e.g. `o04`, `o06`) have only one\n", |
| 5379 | + "+ We also overplot the known true values from the synthetic dataset, and the match is close for all: usually \n", |
| 5380 | + " well within the HDI94\n", |
| 5381 | + "+ Where observations have more than one missing value (e.g. `o00`, `o08`, `o18`, `o23` are good examples), we see the\n", |
| 5382 | + " possibility of a lack of identifiability: this is an interesting and not easily avoided side-effect of the data and\n", |
| 5383 | + " model architecture, and in the real-world we might seek to mitigate through removing observations or features." |
5381 | 5384 | ]
|
5382 | 5385 | },
|
5383 | 5386 | {
|
|
5396 | 5399 | "The following process is a bit of a hack:\n",
|
5397 | 5400 | "\n",
|
5398 | 5401 | "1. Firstly: Re-specify the model entirely, using `dfx_holdout`, because doing `mdla.set_data` \n",
|
5399 |
| - " doesn't update the `np.ma.masked_array`, and as noted above $\\S 2.1$ **Build Model Object** we can't put nans or \n", |
| 5402 | + " doesn't update the `np.ma.masked_array`, and as noted above $\\S 2.1$ **Build Model Object** we can't put `nans` or \n", |
5400 | 5403 | " a `masked_array` into a `pm.Data`\n",
|
5401 | 5404 | "1. Secondly: Sample_ppc the `xk_unobserved`, because this is a precursor to computing `yhat`, and we can't specify\n",
|
5402 |
| - " a conditional order in sample_posterior_predictive\n", |
| 5405 | + " a conditional order in `sample_posterior_predictive`\n", |
5403 | 5406 | "2. Thirdly: Use those predictions to sample_ppc the `yhat`\n",
|
5404 | 5407 | "\n",
|
5405 | 5408 | "**REALITIES**\n",
|
5406 | 5409 | "\n",
|
5407 | 5410 | "+ This process is suboptimal for a real-world scenario wherein we want to forecast new incoming data, because we have to\n",
|
5408 |
| - " keep re-specifying the model in Step 1, and Steps 2 & 3 involve manipulations of idata objects. \n", |
5409 |
| - "+ It might still be suitable for a relatively slow, (potentially batched) forecasting process on the order of seconds, \n", |
5410 |
| - " not sub-second.\n", |
5411 |
| - "+ In any case, if this were to be deployed to handle a stream of sub-second inputs, a miuch simpler way to rectify the \n", |
| 5411 | + " keep re-specifying the model in Step 1 (which opens opportunities for simple human error), and Steps 2 & 3 involve \n", |
| 5412 | + " manipulations of idata objects, which is a faff\n", |
| 5413 | + "+ It should still be suitable for a relatively slow, (potentially batched) forecasting process on the order of seconds, \n", |
| 5414 | + " not sub-second\n", |
| 5415 | + "+ In any case, if this were to be deployed to handle a stream of sub-second inputs, a much simpler way to rectify the \n", |
5412 | 5416 | " situation would be to ensure proper data validation / hygiene upstream and require no missing data!"
|
5413 | 5417 | ]
|
5414 | 5418 | },
|
|