README.md: 8 additions & 0 deletions
@@ -14,6 +14,14 @@ which installs the development dependencies in a reference-only manner (so that
to the package code will be reflected immediately); `policyengine-us-data` is a dev package
and not intended for direct access.

+
+## SSA Data Sources
+
+The following SSA data sources are used in this project:
+
+- [Latest Trustee's Report (2025)](https://www.ssa.gov/oact/TR/2025/index.html) - Source for `social_security_aux.csv` (extracted via `extract_ssa_costs.py`)
+- [Single Year Supplementary Tables (2025)](https://www.ssa.gov/oact/tr/2025/lrIndex.html) - Long-range demographic and economic projections
+- [Single Year Age Demographic Projections (2024, latest published)](https://www.ssa.gov/oact/HistEst/Population/2024/Population2024.html) - Source for `SSPopJul_TR2024.csv` population data
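As a quick orientation (not part of the repository's documented workflow), the snippet below simply loads the two extracted files named in the list above with pandas; the working directory and any column handling are assumptions.

```python
# Hypothetical usage sketch: load the SSA tables named in the README.
# The file names come from the list above; everything else is an assumption.
import pandas as pd

# Cost projections extracted from the 2025 Trustees Report by extract_ssa_costs.py.
ssa_costs = pd.read_csv("social_security_aux.csv")

# Single-year-of-age population projections from the 2024 demographic tables.
population = pd.read_csv("SSPopJul_TR2024.csv")

print(ssa_costs.head())
print(population.head())
```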
-We present a methodology for creating enhanced microsimulation datasets by combining the Current Population Survey (CPS) with the IRS Public Use File (PUF). Our two-stage approach uses quantile regression forests to impute 72 tax variables from the PUF onto CPS records, preserving distributional characteristics while maintaining household composition and member relationships. The imputation process alone does not guarantee consistency with official statistics, necessitating a reweighting step to align the combined dataset with known population totals and administrative benchmarks. We apply a reweighting algorithm that calibrates the dataset to over 7,000 targets from six sources: IRS Statistics of Income, Census population projections, Congressional Budget Office program estimates, Treasury expenditure data, Joint Committee on Taxation tax expenditure estimates, and healthcare spending patterns. The reweighting employs dropout-regularized gradient descent optimization to ensure consistency with administrative benchmarks. Validation shows the enhanced dataset reduces error in key tax components by [TO BE CALCULATED]% relative to the baseline CPS. The dataset maintains the CPS's demographic detail and geographic granularity while incorporating tax reporting data from administrative sources. We release the enhanced dataset, source code, and documentation to support policy analysis.
+We present a methodology for creating enhanced microsimulation datasets by combining the
+Current Population Survey (CPS) with the IRS Public Use File (PUF). Our approach uses
+quantile regression forests to impute 67 tax variables from the PUF onto CPS records,
+preserving distributional characteristics while maintaining household composition and member
+relationships. The imputation process alone does not guarantee consistency with official
+statistics, necessitating a reweighting step to align the combined dataset with known
+population totals and administrative benchmarks. We apply a reweighting algorithm that
+calibrates the dataset to 2,813 targets from
+the IRS Statistics of Income, Census population projections, Congressional Budget
+Office benefit program estimates, Treasury
+expenditure data, Joint Committee on Taxation tax expenditure estimates, healthcare
+spending patterns, and other benefit program costs. The reweighting employs dropout-regularized
+gradient descent optimization
+to ensure consistency with administrative benchmarks. Validation shows the enhanced dataset
+reduces error in key tax components by [TO BE CALCULATED]% relative to the baseline CPS.
+The dataset maintains the CPS's demographic detail and geographic granularity while
+incorporating tax reporting data from administrative sources. We release the enhanced
+dataset, source code, and documentation to support policy analysis.
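To make the first stage of the abstract concrete, here is a minimal sketch of quantile regression forest imputation. It is illustrative only: it assumes the third-party `quantile-forest` package and made-up column names, and it uses a simple random-quantile draw to retain the donor's conditional spread; the paper's pipeline may differ in all of these details.

```python
# Illustrative sketch of QRF-based imputation (not the production code).
# Assumes the `quantile-forest` package; the donor is a PUF-like table and
# the recipient is a CPS-like table sharing the predictor columns.
import numpy as np
import pandas as pd
from quantile_forest import RandomForestQuantileRegressor

def impute_from_donor(
    donor: pd.DataFrame,       # e.g. PUF records where the variable is observed
    recipient: pd.DataFrame,   # e.g. CPS records missing the variable
    predictors: list,          # shared columns, e.g. ["age", "employment_income"]
    target: str,               # variable to impute, e.g. "qualified_dividends"
    seed: int = 0,
) -> np.ndarray:
    qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=seed)
    qrf.fit(donor[predictors], donor[target])

    # Predict a grid of conditional quantiles for each recipient record.
    grid = np.linspace(0.05, 0.95, 19)
    conditional_quantiles = qrf.predict(recipient[predictors], quantiles=list(grid))

    # Draw one random quantile per record so imputed values reproduce the
    # donor's conditional spread rather than collapsing to the median.
    rng = np.random.default_rng(seed)
    picks = rng.integers(0, len(grid), size=len(recipient))
    return conditional_quantiles[np.arange(len(recipient)), picks]
```

Repeating this for each of the 67 imputed variables, possibly chaining earlier imputations into later predictor sets, would approximate the stage the abstract describes.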
docs/conclusion.md: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ We present a methodology for creating enhanced microsimulation datasets that com
Our work makes several key contributions:

-**Methodological Innovation**: The use of Quantile Regression Forests for imputation preserves distributional characteristics while maintaining computational efficiency. The large-scale calibration to 7,000+ targets pushes the boundaries of survey data enhancement.
+**Methodological Innovation**: The use of Quantile Regression Forests for imputation preserves distributional characteristics while maintaining computational efficiency. The large-scale calibration to 2,813 targets pushes the boundaries of survey data enhancement.

**Practical Tools**: We provide open-source implementations that enable researchers to apply, modify, and extend these methods. The modular design facilitates experimentation with alternative approaches.
docs/discussion.md: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ The use of Quantile Regression Forests for imputation represents an advance over
- Maintains realistic variable correlations
- Allows uncertainty quantification

-The large-scale calibration to 7,000+ targets ensures consistency with administrative benchmarks across multiple dimensions simultaneously.
+The large-scale calibration to 2,813 targets ensures consistency with administrative benchmarks across multiple dimensions simultaneously.

### Practical Advantages
@@ -44,7 +44,7 @@ These assumptions may not hold perfectly, particularly for subpopulations that t
### Calibration Trade-offs

-With 7,000+ targets, perfect fit to all benchmarks is impossible. The optimization must balance competing objectives across target types, the relative importance of different statistics, stability of resulting weights, and preservation of household relationships.
+With 2,813 targets, perfect fit to all benchmarks is impossible. The optimization must balance competing objectives across target types, the relative importance of different statistics, stability of resulting weights, and preservation of household relationships.

Users should consult validation metrics for targets most relevant to their analysis.
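As a rough illustration of the calibration being discussed, the sketch below implements one reading of dropout-regularized gradient descent reweighting. The loss (mean squared relative error), the hyperparameters, and the decision to apply dropout to household weights are assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of dropout-regularized reweighting; details are assumptions.
import torch

def calibrate_weights(
    target_matrix: torch.Tensor,    # (n_households, n_targets): each household's contribution per target
    targets: torch.Tensor,          # (n_targets,): administrative benchmark totals (assumed nonzero)
    initial_weights: torch.Tensor,  # (n_households,): starting survey weights
    epochs: int = 1000,
    dropout_rate: float = 0.05,
    lr: float = 0.1,
) -> torch.Tensor:
    # Optimize log-weights so the calibrated weights stay positive.
    log_weights = torch.log(initial_weights).detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([log_weights], lr=lr)

    for _ in range(epochs):
        optimizer.zero_grad()
        weights = torch.exp(log_weights)

        # Dropout: randomly mask a fraction of households each step so the
        # fit cannot lean on a handful of extreme weights.
        mask = (torch.rand_like(weights) > dropout_rate).float()
        estimates = (weights * mask / (1 - dropout_rate)) @ target_matrix

        # Mean squared relative error against the benchmark totals.
        loss = (((estimates - targets) / targets) ** 2).mean()
        loss.backward()
        optimizer.step()

    return torch.exp(log_weights).detach()
```

Under this reading, dropout trades a little in-sample fit for weights that do not hinge on a few extreme households, which is the stability concern raised in the trade-offs above.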
The Current Population Survey (CPS) Annual Social and Economic Supplement provides detailed household demographics, family relationships, and program participation data for a representative sample of US households. However, it suffers from well-documented income underreporting, particularly at the top of the distribution. The IRS Public Use File (PUF) contains accurate tax return information but lacks household structure, demographic detail, and state identifiers needed for comprehensive policy analysis.

-This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through a two-stage enhancement process—imputation followed by reweighting—we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.
+This paper presents a methodology for creating an Enhanced CPS dataset that combines the strengths of both sources. Through an enhancement process—imputation followed by reweighting—we create a dataset suitable for analyzing both tax and transfer policies at federal and state levels.

## Related Work
@@ -14,13 +14,13 @@ Economic researchers address dataset limitations through various strategies. The
Statistical agencies and researchers employ reweighting methods to align survey data with administrative totals. The Luxembourg Income Study uses calibration to improve cross-national comparability {cite:p}`gornick2013`. The Urban-Brookings Tax Policy Center employs reweighting in their microsimulation model but relies on proprietary data that cannot be shared publicly {cite:p}`khitatrakun2016`.

-Our approach differs from previous efforts in three key ways. First, we employ quantile regression forests to preserve distributional characteristics during imputation, improving upon traditional hot-deck and regression-based methods that may distort variable relationships. We conduct robustness checks comparing QRF performance to gradient boosting and neural network approaches, finding QRF provides the best balance of accuracy and interpretability. Second, we calibrate to over 7,000 targets from multiple administrative sources, far exceeding the scope of previous calibration efforts which typically use fewer than 100 targets. Third, we provide a fully open-source implementation enabling reproducibility and collaborative improvement, addressing the transparency limitations of existing proprietary models.
+Our approach differs from previous efforts in three key ways. First, we employ quantile regression forests to preserve distributional characteristics during imputation, improving upon traditional hot-deck and regression-based methods that may distort variable relationships. We conduct robustness checks comparing QRF performance to gradient boosting and neural network approaches, finding QRF provides the best balance of accuracy and interpretability. Second, we calibrate to 2,813 targets from multiple administrative sources, far exceeding the scope of previous calibration efforts which typically use fewer than 100 targets. Third, we provide a fully open-source implementation enabling reproducibility and collaborative improvement, addressing the transparency limitations of existing proprietary models.

## Contributions

This paper makes three main contributions to the economic and public policy literature. Methodologically, we demonstrate how quantile regression forests can effectively impute detailed tax variables while preserving their joint distribution and relationship to demographics. This advances the statistical matching literature by showing how modern machine learning methods can overcome limitations of traditional hot-deck and parametric approaches. The preservation of distributional characteristics is particularly important for tax policy analysis where outcomes often depend on complex interactions between income sources and household characteristics.

-Our empirical contribution involves creating and validating a publicly available enhanced dataset that addresses longstanding data limitations in microsimulation modeling. By combining the demographic richness of the CPS with the tax precision of the PUF, we enable analyses that were previously infeasible with public data. The dataset's calibration to over 7,000 administrative targets ensures consistency with official statistics across multiple dimensions simultaneously.
+Our empirical contribution involves creating and validating a publicly available enhanced dataset that addresses longstanding data limitations in microsimulation modeling. By combining the demographic richness of the CPS with the tax precision of the PUF, we enable analyses that were previously infeasible with public data. The dataset's calibration to 2,813 administrative targets ensures consistency with official statistics across multiple dimensions simultaneously.

From a practical perspective, we provide open-source tools and comprehensive documentation that enable researchers to apply these methods, modify the approach, or build upon our work. This transparency contrasts with existing proprietary models and supports reproducible research. Government agencies could use our framework to enhance their own microsimulation capabilities, while academic researchers gain access to data suitable for analyzing distributional impacts of tax and transfer policies. The modular design allows incremental improvements as new data sources become available.