- 
                Notifications
    You must be signed in to change notification settings 
- Fork 2.1k
Allow passing masked arrays with missing values to pm.Data() #6645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
passed to pm.Data() and pm.Model().set_data() - Integer masked arrays trigger an error message and provide suggested alternatives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening the PR.
We need to test if automatic imputation is working as that is the point of passing nan to observed in PyMC.
We also need to decide what to do if a user passes nan to a Mutabledata that is used as observed, only after defining the observed variable, as that won't trigger the automatic imputation routine. Maybe raise in that case?
I am also afraid MutableData+Imputation won't ever be useful during posterior predictive so maybe we shouldn't allow nan in Mutabledata at all? That avoids the problem above.
|  | ||
| def test_masked_integer_data(): | ||
| with pm.Model(): | ||
| data = np.ma.MaskedArray([1, 2, 3], [0, 0, 1]) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integers should be fine, otherwise we can't input discrete variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we can't - you cannot have an integer NumPy array with nan values, ie. this throws an error: np.array([1,2,3, np.nan[. dtype=int). That's because nan is strictly a float concept. So yes, we would not be able to allow users pass an integer masked array into pm.Data(). If they want to benefit from automatic imputation, they can today (and will be after this PR) pass a masked integer array directly into observed parameter of an RV. I have an error message that explains the options in the code:
28d15f8#diff-823b37f218229d363550b4cc387cfffa180c5c6e0e5ad0e174f2f0be7aa4692aR102
if isinstance(data, np.ma.MaskedArray):
        if "int" in str(data.dtype):
            raise TypeError(
                "Masked integer arrays (integer type datasets with missing values) are not supported by pm.Data() / pm.Model.set_data() at this time. \n"
                "Consider if using a float type fits your use case. \n"
                "Alternatively, if you want to benefit from automatic imputation in pyMC, pass a masked array directly to `observed=` parameter when defining a distribution."
            )
        else:
            ret = data.filled(fill_value=np.nan)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't clear. Is any error raised if user pass floats observed to discrete variables? I think it works just fine.
| Thanks for the comments. Agreed, the PR still needs to include the part that makes Mutable/Constant data to be used in the automatic imputation process (I just opened a draft PR as per guidelines to indicate I started working on it). I planned to extend the code that does automatic imputation to work directly with  I'm not 100% sure what will happen with posterior predictive, but I think it should be the same as when passing a masked array to  | 
| The issue is what happens when you change Mutabledata between sampling and posterior predictive. PyMC will resample any variable that depends on a MutableData variable that has changed. This would include the imputed variables. If that makes sense then it's fine. I am just not sure what people need when they use MutableData for both imputation and prediction. | 
| And apologies, I didn't see you marked the PR as draft. No worries anyway :) | 
* ⬆️ UPGRADE: Autoupdate pre-commit config --------- Co-authored-by: pymc-bot <[email protected]>
passed to pm.Data() and pm.Model().set_data() - Integer masked arrays trigger an error message and provide suggested alternatives
…c into pm_data_imputation_support
| No problem at all :) I have added the rest of the PR which enables automatic imputation support for  Regarding posterior predictive - good question. I looked at how pyMC does it right now (when a masked numpy array is passed to observed) for a simple model where a covariate is missing, and I can see that: 
 The  So in a simple case, we don't seem to re-sample imputed values. However, if a user provides a new set of (missing) data, I would intuitively say it would make sense to resample the imputations, too. The missing entries in the new data are not the same as the old ones, after all. Having said that - my current PR does not seem to work with  
 I suspect it's because the automatic imputation logic only gets triggered when the model is created, and so in the second case pps fails because of nan values. I am unsure why the first case doesn't work - possibly because  Edit: actually, this even applies to situations where simple  | 
| Stumbling into this 18 months later, I'd like to be able to support the usual workflow, and allow for data to be missing in specified features: 
 The auto-impute works great for the in-sample dataset, but when I want to replace data in the  E.g.  inside model context I want to replace  Any ideas of alternative methods? | 
This PR implements the changes required to support passing masked arrays to pm.Data() as discussed in issue #6626.
(rest of PR description TBD)
...
Checklist
Major / Breaking Changes
New features
Bugfixes
Documentation
Maintenance
📚 Documentation preview 📚: https://pymc--6645.org.readthedocs.build/en/6645/