@MaxGhenis
Contributor

@MaxGhenis MaxGhenis commented Jul 30, 2025

Summary

  • Removed local QRF implementation from policyengine_us_data/utils/qrf.py
  • Updated all imports to use microimpute.models.qrf.QRF instead of local implementation
  • Refactored all imputation code to use the models.QRF API which provides:
    • Better handling of multiple imputation variables
    • Sequential imputation support
    • Consistent API across all usages

Changes

  1. Deleted policyengine_us_data/utils/qrf.py - QRF is now in microimpute package
  2. Updated imports in:
    • extended_cps.py - refactored to use models.QRF for bulk imputation
    • puf.py - updated both pension and demographics imputation
    • cps.py - updated rent/real_estate_taxes and wealth imputation
    • sipp.py - already using correct API
  3. Removed QRF import from utils/__init__.py
  4. Updated microimpute version requirement to >=1.0.2

This change avoids code duplication since microimpute already provides the same QRF functionality with better error handling and logging.

Fixes #400

CI Status

  • Lint: ✅ Passing
  • Smoke tests: ✅ Passing
  • Full test: ❌ Failing during dataset build (investigating)

- Removed policyengine_us_data/utils/qrf.py as QRF is now in microimpute
- Updated imports in extended_cps.py and puf.py to use microimpute.models.qrf.QRF
- Kept impute_income_variables, impute_pension_contributions_to_puf, and impute_missing_demographics functions in policyengine-us-data as they are domain-specific

Fixes #400
@MaxGhenis
Contributor Author

The CI is failing on the Test job. The changes in this PR:

  1. Remove the local QRF implementation from
  2. Update all imports to use instead
  3. Fix a bug in where the wrong DataFrame was passed to predict

The smoke test passes but the full test suite fails. This might be due to:

  • Missing test data files that the full test suite requires
  • Different behavior between the local QRF and microimpute's QRF
  • A dependency issue with microimpute

Since I can't see the specific error in the CI logs, it would be helpful to investigate further.

…s.qrf

The utils.qrf.QRF has the same interface as our old QRF implementation, while models.qrf.QRF has a different interface designed for the Imputer framework.
@MaxGhenis
Contributor Author

Update: microimpute 1.0.2 is now available on PyPI. However, there's a Python version compatibility issue:

  • microimpute 1.0.2 requires Python >=3.13
  • policyengine-us-data supports Python >=3.11, <3.14.0

This means the package will fail to install on Python 3.11 and 3.12. The CI is using Python 3.13 for tests, so the smoke test passes, but the full test suite is still failing.
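The conflict can be checked mechanically. The following is a minimal sketch that assumes only the version bounds quoted in this comment; the `installable` helper is hypothetical and is not part of either package.

```python
# Bounds as quoted above, written as (major, minor) tuples.
MICROIMPUTE_MIN = (3, 13)                      # microimpute 1.0.2: >=3.13
US_DATA_MIN, US_DATA_MAX = (3, 11), (3, 14)    # policyengine-us-data: >=3.11, <3.14.0


def installable(python: tuple[int, int]) -> bool:
    """Return True if this Python version satisfies both packages' bounds."""
    in_us_data_range = US_DATA_MIN <= python < US_DATA_MAX
    return in_us_data_range and python >= MICROIMPUTE_MIN


# Versions that policyengine-us-data currently claims to support.
supported = [(3, 11), (3, 12), (3, 13)]
broken = [version for version in supported if not installable(version)]
print(broken)  # [(3, 11), (3, 12)]
```

Only 3.13 satisfies both constraints, which is consistent with CI (running 3.13) installing the package without error.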

Possible solutions:

  1. Update policyengine-us-data to require Python >=3.13 (breaking change)
  2. Ask microimpute to support Python 3.11 and 3.12
  3. Keep the local QRF implementation for now

The QRF implementation expects single-output regression. This change trains and predicts one variable at a time.
@MaxGhenis
Contributor Author

Latest updates:

  • Fixed the impute_income_variables function to handle variables one by one, as QRF expects single-output regression
  • Fixed prediction indexing to use iloc instead of column name

The smoke test passes but the full test suite is still failing. This might be due to:

  1. Python version compatibility (microimpute requires Python >=3.13)
  2. Different behavior between the old and new QRF implementations
  3. The test building actual datasets which might expose edge cases

The wealth imputation code was importing microimpute.utils.qrf.QRF but using the API from microimpute.models.qrf.QRF. This commit fixes the import to use the correct models version which has the fit() method with X_train, predictors, imputed_variables parameters.
@juaristi22
Collaborator

I think we should be using microimpute.models.qrf instead of microimpute.utils.qrf. The utils version is only the base implementation; it's the model wrapper built on the Imputer class that handles the sequential imputation logic.

@MaxGhenis MaxGhenis marked this pull request as draft July 31, 2025 10:18
@MaxGhenis
Contributor Author

Yeah, that got the AIs (and me) - filed PolicyEngine/microimpute#94 and refactoring now

@juaristi22
Collaborator

cool! let me know if you'd like me to jump in

- Changed all imports from microimpute.utils.qrf to microimpute.models.qrf
- Refactored impute_income_variables to use models.QRF API with X_train, predictors, imputed_variables
- Updated impute_pension_contributions_to_puf to use models.QRF API
- Updated impute_missing_demographics to use models.QRF API
- Updated rent/real_estate_taxes imputation in cps.py to use models.QRF API
- models.QRF provides better handling of multiple imputation variables and sequential imputation
The models.QRF expects all imputed_variables to be present in X_train.
Changed impute_income_variables to calculate predictors and outputs
separately, then combine them to ensure all required columns are present.
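
A minimal sketch of that combine-then-check step, using plain dictionaries of columns rather than the project's actual data structures. The helper name and the `age` and `is_male` columns are hypothetical, chosen only for illustration.

```python
def build_training_columns(predictors: dict, outputs: dict) -> dict:
    """Merge predictor and output columns into one column mapping."""
    overlap = predictors.keys() & outputs.keys()
    if overlap:
        raise ValueError(f"columns appear in both inputs: {sorted(overlap)}")
    combined = {**predictors, **outputs}
    lengths = {len(values) for values in combined.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    return combined


# Hypothetical column names and values, for illustration only.
predictor_columns = {"age": [34, 51], "is_male": [1, 0]}
output_columns = {"rent": [1200.0, 0.0]}

x_train = build_training_columns(predictor_columns, output_columns)
imputed_variables = list(output_columns)

# Every variable to be imputed must exist as a column in the training data.
missing = [name for name in imputed_variables if name not in x_train]
print(missing)  # []
```

With pandas DataFrames, the same merge is usually written as `pd.concat([predictors_df, outputs_df], axis=1)`.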
…ables

The models.QRF uses sequential imputation which grows to 82+ features by the
end of the imputation list. This causes memory issues and fails when variables
like 'recapture_of_investment_credit' can't be calculated in the PUF dataset.

Reverted to simpler approach that:
- Imputes each variable independently (no sequential dependencies)
- Handles failures gracefully by using zeros for failed imputations
- Logs warnings for variables that can't be imputed
- Matches the behavior of the original implementation
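
The fallback behavior described above might look roughly like the sketch below. It assumes a hypothetical `impute_one` callable that fits and predicts a single variable. `fake_imputer` and the `employment_income` name are stand-ins for illustration, not the code in this PR.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def impute_independently(
    variables: Sequence[str],
    n_rows: int,
    impute_one: Callable[[str], list[float]],  # hypothetical per-variable imputer
) -> dict[str, list[float]]:
    """Impute each variable on its own, falling back to zeros on failure."""
    results: dict[str, list[float]] = {}
    for name in variables:
        try:
            results[name] = impute_one(name)
        except Exception as exc:  # broad on purpose: one bad variable must not stop the rest
            logger.warning("Could not impute %s (%s); using zeros", name, exc)
            results[name] = [0.0] * n_rows
    return results


def fake_imputer(name: str) -> list[float]:
    # Stand-in for fitting and predicting a single-variable model.
    if name == "recapture_of_investment_credit":
        raise ValueError("variable cannot be calculated in this dataset")
    return [1.0, 2.0, 3.0]


out = impute_independently(
    ["employment_income", "recapture_of_investment_credit"], 3, fake_imputer
)
print(out["recapture_of_investment_credit"])  # [0.0, 0.0, 0.0]
```

Because no variable depends on another's imputed values, a failure stays local to that variable. The cost is that correlations between imputed variables are not modeled.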
Sequential imputation of 84 variables with growing feature sets (up to 84
features by the end) was causing memory issues. Added:
- Garbage collection before and after imputation
- Sampling training data to max 5000 rows to reduce memory
- Reduced n_estimators from 50 to 30
- Reduced max_depth from 10 to 8
- Increased min_samples_leaf to 30
- Explicit cleanup of model objects after use
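
A rough sketch of the sampling and cleanup pattern, using only the standard library. The 5000-row cap comes from the commit message. The helper names, the fixed seed, and the placeholder model are assumptions for illustration.

```python
import gc
import random

MAX_TRAIN_ROWS = 5000  # cap quoted in the commit message


def sample_rows(rows: list, max_rows: int = MAX_TRAIN_ROWS, seed: int = 0) -> list:
    """Return at most max_rows rows, sampled without replacement."""
    if len(rows) <= max_rows:
        return rows
    return random.Random(seed).sample(rows, max_rows)


def fit_and_predict(train_rows: list) -> int:
    # Placeholder for fitting a model and predicting; returns the rows used.
    model = {"n_train": len(train_rows)}  # stands in for a fitted model object
    used = model["n_train"]
    del model  # drop the reference explicitly once finished
    return used


gc.collect()  # collect before the memory-heavy step
train = sample_rows(list(range(20_000)))
rows_used = fit_and_predict(train)
gc.collect()  # and again after the model has been released
print(rows_used)  # 5000
```

Seeding the sampler keeps the training subset reproducible across runs, which helps when comparing builds.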
Instead of imputing all 84 variables at once (which creates 84 models with
up to 84 features), batch them into groups of 20. This approach:
- Still uses sequential imputation within each batch for correlations
- Never has more than 27 features (7 predictors + 20 imputed)
- Creates fresh QRF model for each batch
- Cleans up memory between batches
- Should handle the memory limitations of CI runners
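
The batching arithmetic can be sketched as follows. The counts (84 variables, 7 predictors, batches of 20) come from the commit message. The `batches` helper and the placeholder variable names are hypothetical.

```python
from typing import Iterator, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


N_PREDICTORS = 7
BATCH_SIZE = 20
variables = [f"var_{i}" for i in range(84)]  # placeholder names

batch_sizes = [len(batch) for batch in batches(variables, BATCH_SIZE)]
# Columns in play per batch: the fixed predictors plus that batch's variables.
max_columns = max(N_PREDICTORS + n for n in batch_sizes)

print(batch_sizes)   # [20, 20, 20, 20, 4]
print(max_columns)   # 27
```

In the real pipeline, each batch would get its own freshly constructed model, with memory released before the next batch starts.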
@MaxGhenis
Contributor Author

This is taking longer, and datasets are breaking, likely due to sequential imputation, which is now the default behavior in microimpute. I'm trying a couple of approaches to shrink down the ML task; @juaristi22, have you had to do this before?

@juaristi22
Collaborator

juaristi22 commented Jul 31, 2025

sorry, what is "this task"? reducing the computational cost of the ML imputation task?
I don't think what I have done would be useful in this case. I mostly parallelized imputation tasks in autoimpute, but that's probably not useful here since we need imputations to happen sequentially rather than in parallel.

- Reduced batch size from 20 to 10 variables
- Sample training data to 3000 rows (down from 5000)
- Pre-sample once instead of per-batch for consistency
- Reduced n_estimators to 20 (from 30)
- Reduced max_depth to 6 (from 8)
- Increased min_samples_leaf to 50
- Added n_jobs=1 to avoid multiprocessing memory overhead
- Added variable names to batch logging for better debugging
@juaristi22
Collaborator

I'm not sure we want to risk the trade-off, but if memory is a very large constraint, maybe we want to use a SampleQRF for the imputations in extended_cps specifically?
https://sklearn-quantile.readthedocs.io/en/latest/generated/sklearn_quantile.SampleRandomForestQuantileRegressor.html

@juaristi22 juaristi22 self-assigned this Aug 4, 2025
@juaristi22 juaristi22 marked this pull request as ready for review August 4, 2025 09:10
@MaxGhenis MaxGhenis merged commit f1a29bb into main Aug 5, 2025
11 of 14 checks passed
@juaristi22 juaristi22 deleted the move-imputations-to-microimpute branch January 15, 2026 09:36

Development

Successfully merging this pull request may close these issues.

Move all imputations to microimpute

3 participants