Move QRF implementation to microimpute package #414

MaxGhenis · 2025-07-30T23:38:17Z

Summary

Removed local QRF implementation from policyengine_us_data/utils/qrf.py
Updated all imports to use microimpute.models.qrf.QRF instead of local implementation
Refactored all imputation code to use the models.QRF API which provides:
- Better handling of multiple imputation variables
- Sequential imputation support
- Consistent API across all usages

Changes

Deleted policyengine_us_data/utils/qrf.py - QRF is now in microimpute package
Updated imports in:
- extended_cps.py - refactored to use models.QRF for bulk imputation
- puf.py - updated both pension and demographics imputation
- cps.py - updated rent/real_estate_taxes and wealth imputation
- sipp.py - already using correct API
Removed QRF import from utils/__init__.py
Updated microimpute version requirement to >=1.0.2

This change avoids code duplication since microimpute already provides the same QRF functionality with better error handling and logging.

Fixes #400

CI Status

Lint: ✅ Passing
Smoke tests: ✅ Passing
Full test: ❌ Failing during dataset build (investigating)

- Removed policyengine_us_data/utils/qrf.py as QRF is now in microimpute
- Updated imports in extended_cps.py and puf.py to use microimpute.models.qrf.QRF
- Kept impute_income_variables, impute_pension_contributions_to_puf, and impute_missing_demographics functions in policyengine-us-data as they are domain-specific

Fixes #400

MaxGhenis · 2025-07-30T23:47:56Z

The CI is failing on the Test job. The changes in this PR:

Remove the local QRF implementation from
Update all imports to use instead
Fix a bug in where the wrong DataFrame was passed to predict

The smoke test passes but the full test suite fails. This might be due to:

Missing test data files that the full test suite requires
Different behavior between the local QRF and microimpute's QRF
A dependency issue with microimpute

Since I can't see the specific error in the CI logs, it would be helpful to investigate further.

…s.qrf The utils.qrf.QRF has the same interface as our old QRF implementation, while models.qrf.QRF has a different interface designed for the Imputer framework.

MaxGhenis · 2025-07-30T23:58:49Z

Update: microimpute 1.0.2 is now available on PyPI. However, there's a Python version compatibility issue:

microimpute 1.0.2 requires Python >=3.13
policyengine-us-data supports Python >=3.11, <3.14.0

This means the package will fail to install on Python 3.11 and 3.12. The CI is using Python 3.13 for tests, so the smoke test passes, but the full test suite is still failing.
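
To make the overlap concrete, here is a minimal sketch that intersects the two version ranges quoted above. The `satisfies` helper and the tuple encoding are illustrative assumptions, not part of either package; real installers evaluate the `requires-python` metadata themselves.

```python
# Illustrative only: encode each range as (minimum inclusive, maximum exclusive)
# using (major, minor) tuples. None means "no upper bound".
MICROIMPUTE_RANGE = ((3, 13), None)      # microimpute 1.0.2: >=3.13
US_DATA_RANGE = ((3, 11), (3, 14))       # policyengine-us-data: >=3.11, <3.14.0


def satisfies(version, bounds):
    """Return True if `version` falls inside `bounds` (hypothetical helper)."""
    lower, upper = bounds
    if version < lower:
        return False
    return upper is None or version < upper


# Minor versions that policyengine-us-data claims to support.
candidates = [(3, minor) for minor in range(11, 14)]

# Versions on which both packages can be installed together.
installable = [
    version
    for version in candidates
    if satisfies(version, MICROIMPUTE_RANGE) and satisfies(version, US_DATA_RANGE)
]
print(installable)
```

Under these assumptions only Python 3.13 satisfies both ranges, which is consistent with the smoke test passing on the 3.13 CI runner.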

Possible solutions:

Update policyengine-us-data to require Python >=3.13 (breaking change)
Ask microimpute to support Python 3.11 and 3.12
Keep the local QRF implementation for now

The QRF implementation expects single-output regression. This change trains and predicts one variable at a time.

MaxGhenis · 2025-07-31T00:06:11Z

Latest updates:

Fixed the impute_income_variables function to handle variables one by one, as QRF expects single-output regression
Fixed prediction indexing to use iloc instead of column name

The smoke test passes but the full test suite is still failing. This might be due to:

Python version compatibility (microimpute requires Python >=3.13)
Different behavior between the old and new QRF implementations
The test building actual datasets which might expose edge cases

The wealth imputation code was importing microimpute.utils.qrf.QRF but using the API from microimpute.models.qrf.QRF. This commit fixes the import to use the correct models version which has the fit() method with X_train, predictors, imputed_variables parameters.

juaristi22 · 2025-07-31T08:43:56Z

I think we should be using microimpute.models.qrf instead of microimpute.utils.qrf. The utils version is only the base implementation; it's the model wrapper built on the Imputer class that handles the sequential imputation logic.

MaxGhenis · 2025-07-31T10:18:46Z

Yeah, that got the AIs (and me) - filed PolicyEngine/microimpute#94 and refactoring now

juaristi22 · 2025-07-31T10:20:50Z

cool! let me know if you'd like me to jump in

- Changed all imports from microimpute.utils.qrf to microimpute.models.qrf
- Refactored impute_income_variables to use models.QRF API with X_train, predictors, imputed_variables
- Updated impute_pension_contributions_to_puf to use models.QRF API
- Updated impute_missing_demographics to use models.QRF API
- Updated rent/real_estate_taxes imputation in cps.py to use models.QRF API
- models.QRF provides better handling of multiple imputation variables and sequential imputation

The models.QRF expects all imputed_variables to be present in X_train. Changed impute_income_variables to calculate predictors and outputs separately, then combine them to ensure all required columns are present.
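
The column requirement described here can be sketched in plain Python. This is a hedged illustration only: the two helper functions, the dict-of-lists table layout, and the `predictor_*` / `output_*` names are all hypothetical and do not come from policyengine-us-data or microimpute.

```python
def combine_training_columns(predictor_columns, output_columns):
    """Merge two column-name -> values mappings into one training table."""
    overlap = set(predictor_columns) & set(output_columns)
    if overlap:
        raise ValueError(f"Columns computed twice: {sorted(overlap)}")
    return {**predictor_columns, **output_columns}


def missing_training_columns(training_columns, predictors, imputed_variables):
    """List required columns that are absent from the training table."""
    required = list(predictors) + list(imputed_variables)
    return [name for name in required if name not in training_columns]


# Hypothetical names and values, for illustration only.
predictors = ["predictor_a", "predictor_b"]
imputed_variables = ["output_a", "output_b"]
predictor_columns = {"predictor_a": [1, 2], "predictor_b": [3, 4]}
output_columns = {"output_a": [5.0, 6.0], "output_b": [7.0, 8.0]}

# Predictors alone are not enough: the imputed variables are missing.
print(missing_training_columns(predictor_columns, predictors, imputed_variables))

# After combining, every required column is present.
X_train = combine_training_columns(predictor_columns, output_columns)
print(missing_training_columns(X_train, predictors, imputed_variables))
```

Running a check like this before fitting would surface a missing column as an explicit list rather than as a KeyError inside the model.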

…ables

The models.QRF uses sequential imputation which grows to 82+ features by the end of the imputation list. This causes memory issues and fails when variables like 'recapture_of_investment_credit' can't be calculated in the PUF dataset.

Reverted to simpler approach that:
- Imputes each variable independently (no sequential dependencies)
- Handles failures gracefully by using zeros for failed imputations
- Logs warnings for variables that can't be imputed
- Matches the behavior of the original implementation
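
The fallback behavior described in this commit message can be sketched as follows. This is not the code from the commit: `impute_independently`, `fake_impute_one`, and `variable_a` are hypothetical stand-ins, and the real model fit and predict step is replaced by a stub.

```python
import logging

logger = logging.getLogger(__name__)


def impute_independently(variables, impute_one, n_rows):
    """Impute each variable on its own; fall back to zeros on failure.

    `impute_one` is a hypothetical callable that returns `n_rows` predicted
    values for one variable, or raises if that variable cannot be imputed.
    """
    results = {}
    for name in variables:
        try:
            results[name] = impute_one(name)
        except Exception as error:  # broad on purpose: any failure falls back
            logger.warning("Could not impute %s (%r); using zeros", name, error)
            results[name] = [0.0] * n_rows
    return results


def fake_impute_one(name):
    """Stub standing in for a real fit-and-predict step."""
    if name == "recapture_of_investment_credit":
        raise KeyError(name)  # simulate a variable that cannot be calculated
    return [1.0, 2.0, 3.0]


imputed = impute_independently(
    ["variable_a", "recapture_of_investment_credit"],
    fake_impute_one,
    n_rows=3,
)
print(imputed)
```

Because each variable is handled in its own iteration, one failure does not affect the others, at the cost of losing the correlations that sequential imputation would capture.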

…ing variables" This reverts commit b286416.

…able handling

Sequential imputation of 84 variables with growing feature sets (up to 84 features by the end) was causing memory issues. Added:
- Garbage collection before and after imputation
- Sampling training data to max 5000 rows to reduce memory
- Reduced n_estimators from 50 to 30
- Reduced max_depth from 10 to 8
- Increased min_samples_leaf to 30
- Explicit cleanup of model objects after use

Instead of imputing all 84 variables at once (which creates 84 models with up to 84 features), batch them into groups of 20. This approach:
- Still uses sequential imputation within each batch for correlations
- Never has more than 27 features (7 predictors + 20 imputed)
- Creates fresh QRF model for each batch
- Cleans up memory between batches
- Should handle the memory limitations of CI runners
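
The batching arithmetic above can be checked with a short sketch. The helper names and the `var_N` placeholders are hypothetical; only the counts (84 variables, 7 predictors, batches of 20) come from the commit message.

```python
def batch_variables(variables, batch_size):
    """Split a list of variable names into consecutive batches."""
    return [
        variables[start:start + batch_size]
        for start in range(0, len(variables), batch_size)
    ]


def max_feature_count(n_predictors, batch):
    """Upper bound on feature columns for one batch: base predictors plus
    every variable imputed within that batch."""
    return n_predictors + len(batch)


# Placeholder names standing in for the 84 imputed variables.
variables = [f"var_{i}" for i in range(84)]
batches = batch_variables(variables, batch_size=20)

batch_sizes = [len(batch) for batch in batches]
widest = max(max_feature_count(7, batch) for batch in batches)
print(batch_sizes, widest)
```

Each batch starts again from the 7 base predictors, so the feature count is bounded by the batch size rather than by the full variable list.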

MaxGhenis · 2025-07-31T11:59:08Z

This is taking longer, and datasets are breaking, likely due to sequential imputation, which is now the default behavior in microimpute. I'm trying a couple of approaches to shrink the ML task; @juaristi22, have you had to do this before?

juaristi22 · 2025-07-31T12:05:18Z

Sorry, what is "this task"? Reducing the computational cost of the ML imputation task?
I don't think what I have done would be useful in this case. I mostly parallelized imputation tasks in autoimpute, but that's probably not useful here, as we need imputations to happen sequentially rather than in parallel.

- Reduced batch size from 20 to 10 variables
- Sample training data to 3000 rows (down from 5000)
- Pre-sample once instead of per-batch for consistency
- Reduced n_estimators to 20 (from 30)
- Reduced max_depth to 6 (from 8)
- Increased min_samples_leaf to 50
- Added n_jobs=1 to avoid multiprocessing memory overhead
- Added variable names to batch logging for better debugging
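
A minimal sketch of the "pre-sample once" idea, assuming rows are addressed by integer position. The function name, the seed, and the 10,000-row table size are hypothetical; only the 3000-row cap comes from the commit message.

```python
import random


def sample_row_indices(n_rows, max_rows, seed=0):
    """Pick at most `max_rows` distinct row positions, reproducibly."""
    if n_rows <= max_rows:
        return list(range(n_rows))
    rng = random.Random(seed)  # local generator, so global state is untouched
    return sorted(rng.sample(range(n_rows), max_rows))


# Sample once, before the batch loop, so every batch trains on the same rows.
row_indices = sample_row_indices(n_rows=10_000, max_rows=3_000)

for batch in (["var_0", "var_1"], ["var_2"]):
    # A real loop would fit a fresh model on the rows in `row_indices` here.
    print(batch, len(row_indices))
```

Sampling inside the loop instead would give each batch a different training subset, which makes results across batches harder to compare.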

juaristi22 · 2025-07-31T12:17:17Z

I'm not sure we want to risk the trade-off, but if memory is a very large constraint, maybe we want to use a SampleQRF for the imputations in the extended_cps specifically?
https://sklearn-quantile.readthedocs.io/en/latest/generated/sklearn_quantile.SampleRandomForestQuantileRegressor.html

MaxGhenis added 4 commits July 30, 2025 19:36

Fix utils/__init__.py to remove QRF import

ead55e9

Add changelog entry

876a189

Fix QRF predict call to use correct columns

2c96dc5

MaxGhenis added 2 commits July 30, 2025 19:49

Fix imports to use microimpute.utils.qrf instead of microimpute.model…

e587a39

…s.qrf The utils.qrf.QRF has the same interface as our old QRF implementation, while models.qrf.QRF has a different interface designed for the Imputer framework.

Update microimpute to >=1.0.2

2d73684

MaxGhenis added 3 commits July 30, 2025 20:00

Fix impute_income_variables to handle variables one by one

63f9b38

Format code with Black

cac79ed

Fix prediction indexing to use iloc instead of column name

842f64e

MaxGhenis added 3 commits July 30, 2025 21:27

Trigger CI re-run

0a435b3

Handle both DataFrame and Series return types from predict

b697214

MaxGhenis marked this pull request as draft July 31, 2025 10:18

MaxGhenis added 8 commits July 31, 2025 06:31

Format code with black

6335794

Fix KeyError for missing imputation variables

74071fe

Revert "Use non-sequential imputation to avoid memory issues and miss…

f0eb281

…ing variables" This reverts commit b286416.

Keep sequential imputation with memory optimizations and missing vari…

2f39624

…able handling

juaristi22 and others added 4 commits August 1, 2025 11:10

Update microimpute version

af2c828

skip missing imputed variables

895b921

lint

32cc33e

update microimpute version

4b73891

juaristi22 self-assigned this Aug 4, 2025

juaristi22 marked this pull request as ready for review August 4, 2025 09:10

MaxGhenis merged commit f1a29bb into main Aug 5, 2025
11 of 14 checks passed

juaristi22 deleted the move-imputations-to-microimpute branch January 15, 2026 09:36
