-
Notifications
You must be signed in to change notification settings - Fork 10
Move QRF implementation to microimpute package #414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Removed policyengine_us_data/utils/qrf.py as QRF is now in microimpute - Updated imports in extended_cps.py and puf.py to use microimpute.models.qrf.QRF - Kept impute_income_variables, impute_pension_contributions_to_puf, and impute_missing_demographics functions in policyengine-us-data as they are domain-specific Fixes #400
|
The CI is failing on the Test job. The changes in this PR:
The smoke test passes but the full test suite fails. This might be due to:
Since I can't see the specific error in the CI logs, it would be helpful to investigate further. |
…s.qrf The utils.qrf.QRF has the same interface as our old QRF implementation, while models.qrf.QRF has a different interface designed for the Imputer framework.
|
Update: microimpute 1.0.2 is now available on PyPI. However, there's a Python version compatibility issue:
This means the package will fail to install on Python 3.11 and 3.12. The CI is using Python 3.13 for tests, so the smoke test passes, but the full test suite is still failing. Possible solutions:
|
The QRF implementation expects single-output regression. This change trains and predicts one variable at a time.
|
Latest updates:
The smoke test passes but the full test suite is still failing. This might be due to:
|
The wealth imputation code was importing microimpute.utils.qrf.QRF but using the API from microimpute.models.qrf.QRF. This commit fixes the import to use the correct models version which has the fit() method with X_train, predictors, imputed_variables parameters.
|
I think we should be using |
|
Yeah, that got the AIs (and me) - filed PolicyEngine/microimpute#94 and refactoring now |
|
cool! let me know if you'd like me to jump in |
- Changed all imports from microimpute.utils.qrf to microimpute.models.qrf - Refactored impute_income_variables to use models.QRF API with X_train, predictors, imputed_variables - Updated impute_pension_contributions_to_puf to use models.QRF API - Updated impute_missing_demographics to use models.QRF API - Updated rent/real_estate_taxes imputation in cps.py to use models.QRF API - models.QRF provides better handling of multiple imputation variables and sequential imputation
The models.QRF expects all imputed_variables to be present in X_train. Changed impute_income_variables to calculate predictors and outputs separately, then combine them to ensure all required columns are present.
…ables The models.QRF uses sequential imputation which grows to 82+ features by the end of the imputation list. This causes memory issues and fails when variables like 'recapture_of_investment_credit' can't be calculated in the PUF dataset. Reverted to simpler approach that: - Imputes each variable independently (no sequential dependencies) - Handles failures gracefully by using zeros for failed imputations - Logs warnings for variables that can't be imputed - Matches the behavior of the original implementation
…ing variables" This reverts commit b286416.
Sequential imputation of 84 variables with growing feature sets (up to 84 features by the end) was causing memory issues. Added: - Garbage collection before and after imputation - Sampling training data to max 5000 rows to reduce memory - Reduced n_estimators from 50 to 30 - Reduced max_depth from 10 to 8 - Increased min_samples_leaf to 30 - Explicit cleanup of model objects after use
Instead of imputing all 84 variables at once (which creates 84 models with up to 84 features), batch them into groups of 20. This approach: - Still uses sequential imputation within each batch for correlations - Never has more than 27 features (7 predictors + 20 imputed) - Creates fresh QRF model for each batch - Cleans up memory between batches - Should handle the memory limitations of CI runners
|
This is taking longer and datasets are breaking likely due to the sequential imputation, now default behavior in microimpute. I'm trying a couple approaches to shrink down the ML task; @juaristi22 have you had to do this before? |
|
sorry what is "this task"? reduce the computational cost of the ML imputation task? |
- Reduced batch size from 20 to 10 variables - Sample training data to 3000 rows (down from 5000) - Pre-sample once instead of per-batch for consistency - Reduced n_estimators to 20 (from 30) - Reduced max_depth to 6 (from 8) - Increased min_samples_leaf to 50 - Added n_jobs=1 to avoid multiprocessing memory overhead - Added variable names to batch logging for better debugging
|
im not sure if we want to risk the trade-off but if memory is a very large constraint maybe we want to use a SampleQRF for the imputations in the extended_cps specfically? |
Summary
policyengine_us_data/utils/qrf.pymicroimpute.models.qrf.QRFinstead of local implementationChanges
policyengine_us_data/utils/qrf.py- QRF is now in microimpute packageextended_cps.py- refactored to use models.QRF for bulk imputationpuf.py- updated both pension and demographics imputationcps.py- updated rent/real_estate_taxes and wealth imputationsipp.py- already using correct APIutils/__init__.pyThis change avoids code duplication since microimpute already provides the same QRF functionality with better error handling and logging.
Fixes #400
CI Status