diff --git a/changelog_entry.yaml b/changelog_entry.yaml index e69de29b..8aa47b95 100644 --- a/changelog_entry.yaml +++ b/changelog_entry.yaml @@ -0,0 +1,5 @@ +- bump: minor + changes: + added: + - SSN card type imputation algorithm. + - Family correlation adjustment to align parent-child SSN status. \ No newline at end of file diff --git a/docs/SSN_statuses_imputation.ipynb b/docs/SSN_statuses_imputation.ipynb new file mode 100644 index 00000000..f4f14efa --- /dev/null +++ b/docs/SSN_statuses_imputation.ipynb @@ -0,0 +1,311 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cell-1", + "metadata": {}, + "source": [ + "# Imputing SSN statuses\n", + "\n", + "This documentation outlines the implementation of SSN status imputation within the Enhanced CPS dataset, using the ASEC Undocumented Algorithm. The ASEC Undocumented Algorithm applies a process-of-elimination method to identify likely undocumented individuals in the CPS. It systematically removes people with clear evidence of legal immigration status, such as U.S. citizenship, lawful permanent residence, or work-authorized visas. Those remaining are flagged as likely undocumented and assigned an SSN card type accordingly.\n", + "\n", + "The Enhanced CPS dataset incorporates this imputation to improve accuracy in microsimulation analysis. This includes modelling eligibility and take-up for policies that depend on SSN status—such as the Child Tax Credit (CTC)—and validating distributional impacts under reform scenarios. Most of this implementation follows the methodology described in [Ryan (2022)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4662801)." ] }, { "cell_type": "markdown", "id": "8f2e6f2f", "metadata": {}, "source": [ "## Algorithm steps\n", "\n", "The algorithm is implemented in the `add_ssn_card_type()` function. The algorithm assigns SSN card type codes based on immigration status: **Code 0** (`\"NONE\"`) for likely undocumented immigrants, **Code 1** (`\"CITIZEN\"`) for U.S. citizens (by birth or naturalization), **Code 2** (`\"NON_CITIZEN_VALID_EAD\"`) for non-citizens with work or study authorization, and **Code 3** (`\"OTHER_NON_CITIZEN\"`) for non-citizens with other indicators of legal status.\n", + "\n", + "The following steps explain how individuals are classified and assigned SSN card types based on citizenship, legal status indicators, and probabilistic adjustments to match population targets.\n", + "\n", + "### Step 1: citizen classification \n", + "Individuals are reassigned to Code 1 if they are identified as U.S. citizens based on their `PRCITSHP` value. Codes 1 through 4 capture all forms of U.S. citizenship, including native-born and naturalized citizens. Non-citizens (`PRCITSHP == 5`) are retained for further evaluation.\n", + "\n", + "### Step 2: ASEC undocumented algorithm conditions \n", + "The algorithm applies 14 conditions, derived from the ASEC Undocumented Algorithm, to identify non-citizens with legal status indicators. These conditions rely on CPS variables such as arrival year, program participation, and employment history. Individuals meeting any condition are reassigned to Code 3 (other indicators of legal status).\n", + "\n", + "- *Condition 1*: Flags individuals who arrived before 1982 (`PEINUSYR` codes 1–7), a group eligible for IRCA amnesty. \n", + "- *Condition 2*: Identifies naturalized citizens (`PRCITSHP == 4`) who meet residency requirements—either 5+ years in the U.S. or 3+ years and married to a U.S. 
citizen. \n", + "- *Condition 3*: Reassigns individuals receiving Medicare (`MCARE == 1`), as eligibility implies legal status. \n", + "- *Condition 4*: Includes recipients of federal pensions (`PEN_SC1 == 3` or `PEN_SC2 == 3`), indicating lawful employment history. \n", + "- *Condition 5*: Captures those receiving Social Security Disability benefits (`RESNSS1 == 2` or `RESNSS2 == 2`). \n", + "- *Condition 6*: Identifies individuals with Indian Health Service coverage (`IHSFLG == 1`), indicating tribal affiliation or eligibility. \n", + "- *Condition 7*: Flags Medicaid recipients (`CAID == 1`), with the assumption that state-level restrictions are not modeled. \n", + "- *Condition 8*: Includes individuals with CHAMPVA health insurance (`CHAMPVA == 1`), which covers veterans’ families. \n", + "- *Condition 9*: Reassigns those with military health insurance (`MIL == 1`), such as TRICARE. \n", + "- *Condition 10*: Identifies government employees (`PEIO1COW` codes 1–3 or `A_MJOCC == 11`), assuming legal work authorization is required. \n", + "- *Condition 11*: Flags Social Security beneficiaries (`SS_YN == 1`). \n", + "- *Condition 12*: Uses housing subsidy participation (`SPM_CAPHOUSESUB > 0`, mapped from SPM units) as a legal status proxy. \n", + "- *Condition 13*: Identifies veterans or military personnel (`PEAFEVER == 1` or `A_MJOCC == 11`). \n", + "- *Condition 14*: Captures SSI recipients (`SSI_YN == 1`), assuming eligibility implies legal presence.\n", + "\n", + "### Step 3: target-driven EAD assignment for workers \n", + "To align with Pew Research estimates, the algorithm targets a remaining population of 8.3 million undocumented workers. Among non-citizens not already in Code 3, those with earnings (`WSAL_VAL > 0` or `SEMP_VAL > 0`) are eligible for reassignment. The `select_random_subset_to_target()` function, seeded with 0, randomly moves enough of these individuals to Code 2 (valid employment authorization, EAD) so that the workers remaining in Code 0 match the target.\n", + "\n", + "### Step 4: target-driven EAD assignment for students \n", + "A separate target is applied for undocumented students in higher education, estimated at roughly 399,000 (21% of 1.9 million, based on Higher Ed Immigration Portal data). Eligible individuals are non-citizens currently in college (`A_HSCOL == 2`) and not already in Code 3. A second call to `select_random_subset_to_target()` with seed 1 randomly moves enough of these students to Code 2 so that the students remaining in Code 0 match the target.\n", + "\n", + "### Step 5: probabilistic family correlation adjustment \n", + "As a final step, the algorithm ensures the total undocumented population reaches the 13 million target. If needed, it uses a probabilistic adjustment to move some Code 3 household members to Code 0 within mixed-status families—households already containing Code 0 individuals. The function identifies these households, calculates the additional undocumented count needed, and randomly selects Code 3 members to reassign using a random seed of 100. This adjustment accounts for under-identification in prior steps and reflects real-world family compositions.\n", + "\n", + "The sketch below illustrates the target-driven selection used in Steps 3–5; the section after it displays the population results."
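To make Steps 3–5 concrete, here is a standalone sketch of the weighted selection logic they share: draw a random subset of eligible people whose combined weight roughly covers the gap between the current weighted count and the target. It is illustrative only; it mirrors just the excess-above-target branch of `select_random_subset_to_target()`, and the helper name and toy inputs below are hypothetical rather than taken from the CPS pipeline.

```python
import numpy as np

def select_to_target(eligible_idx, person_weights, current_total, target_total, seed=0):
    """Randomly pick eligible people so the weight left behind approaches the target."""
    eligible_idx = np.asarray(eligible_idx)
    if eligible_idx.size == 0 or current_total <= target_total:
        return np.array([], dtype=int)
    # Share of the eligible weight that has to leave the pool to hit the target.
    share_to_move = min(
        (current_total - target_total) / person_weights[eligible_idx].sum(), 1.0
    )
    rng = np.random.default_rng(seed)
    # Bernoulli selection: each eligible person is moved with probability share_to_move.
    return eligible_idx[rng.random(eligible_idx.size) < share_to_move]

# Toy example loosely echoing Step 3: ~11.5 million weighted workers trimmed
# toward an 8.3 million target of remaining Code 0 workers.
rng = np.random.default_rng(42)
weights = rng.uniform(5_000, 18_000, size=1_000)
eligible = np.arange(weights.size)
current = weights.sum()
moved = select_to_target(eligible, weights, current, target_total=8.3e6)
remaining = current - weights[moved].sum()
print(f"Moved {moved.size} records to Code 2; remaining weighted workers: {remaining:,.0f}")
```

In the actual implementation this selection runs with fixed seeds (0 for workers, 1 for students, 100 for the family correlation step), which keeps the imputation reproducible across dataset builds.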
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "cell-6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "### Initialization\n", + "- **Step 0 - Initial**: Code 0 people = 320,890,854\n", + "\n", + "### Citizen Classification\n", + "- **Step 1 - Citizens**: Moved to Code 1 = 295,419,820\n", + "\n", + "### ASEC Conditions Analysis\n", + "- **ASEC Conditions**: Current Code 0 people = 25,471,035\n", + "\n", + "### Individual ASEC Conditions (Detailed Breakdown)\n", + "*Each condition identifies people with indicators of legal status who qualify for Code 3*\n", + "\n", + "- **Condition 1 - Pre-1982 arrivals**: 981,447 people qualify for Code 3\n", + "- **Condition 2 - Eligible naturalized citizens**: 0 people qualify for Code 3\n", + "- **Condition 3 - Medicare recipients**: 1,918,043 people qualify for Code 3\n", + "- **Condition 4 - Federal retirement benefits**: 6,783 people qualify for Code 3\n", + "- **Condition 5 - Social Security disability**: 197,206 people qualify for Code 3\n", + "- **Condition 6 - Indian Health Service coverage**: 1,776 people qualify for Code 3\n", + "- **Condition 7 - Medicaid recipients**: 5,406,195 people qualify for Code 3\n", + "- **Condition 8 - CHAMPVA recipients**: 13,149 people qualify for Code 3\n", + "- **Condition 9 - Military health insurance**: 155,431 people qualify for Code 3\n", + "- **Condition 10 - Government employees**: 696,278 people qualify for Code 3\n", + "- **Condition 11 - Social Security recipients**: 1,408,194 people qualify for Code 3\n", + "- **Condition 12 - Housing assistance**: 886,623 people qualify for Code 3\n", + "- **Condition 13 - Veterans/Military personnel**: 75,330 people qualify for Code 3\n", + "- **Condition 14 - SSI recipients**: 231,009 people qualify for Code 3\n", + "\n", + "- **After conditions**: Code 0 people = 16,982,869\n", + "\n", + "### Target Information\n", + "- **Before adjustment**: Undocumented workers = 11,819,403\n", + "- **Target**: Undocumented workers target = 8,300,000\n", + "- **Before adjustment**: Undocumented students = 911,958\n", + "- **Target**: Undocumented students target = 399,000\n", + "\n", + "### EAD Assignment\n", + "- **Step 3 - EAD workers**: Moved from Code 0 to Code 2 = 3,524,259\n", + "- **Step 4 - EAD students**: Moved from Code 0 to Code 2 = 529,383\n", + "- **After EAD assignment**: Code 0 people = 12,988,180\n", + "\n", + "### Family Correlation (Final Step)\n", + "- **Step 5 - Family correlation**: Changed from Code 3 to Code 0 = 12,161\n", + "- **After family correlation**: Code 0 people = 13,000,341\n", + "\n", + "### Final Results\n", + "- **Final**: Code 0 (NONE) = 13,000,341\n", + "- **Final**: Code 1 (CITIZEN) = 295,419,820\n", + "- **Final**: Code 2 (NON_CITIZEN_VALID_EAD) = 3,994,689\n", + "- **Final**: Code 3 (OTHER_NON_CITIZEN) = 8,476,005\n", + "- **Final**: Total undocumented (Code 0) = 13,000,341\n", + "- **Final**: Undocumented target = 13,000,000\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "import os\n", + "\n", + "csv_path = \"asec_population_log.csv\"\n", + "df = pd.read_csv(csv_path)\n", + "\n", + "if not df.empty:\n", + " def get_population(step, description):\n", + " \"\"\"Helper function to get population for a specific step and description\"\"\"\n", + " result = df[(df['step'] == step) & (df['description'] == description)]\n", + " if not result.empty:\n", + " return f\"{result.iloc[0]['population']:,.0f}\"\n", + " return \"Not found\"\n", + " \n", + " print(\"### 
Initialization\")\n", + " print(f\"- **Step 0 - Initial**: Code 0 people = {get_population('Step 0 - Initial', 'Code 0 people')}\")\n", + " print()\n", + " \n", + " print(\"### Citizen Classification\")\n", + " print(f\"- **Step 1 - Citizens**: Moved to Code 1 = {get_population('Step 1 - Citizens', 'Moved to Code 1')}\")\n", + " print()\n", + " \n", + " print(\"### ASEC Conditions Analysis\")\n", + " print(f\"- **ASEC Conditions**: Current Code 0 people = {get_population('ASEC Conditions', 'Current Code 0 people')}\")\n", + " print()\n", + " \n", + " print(\"### Individual ASEC Conditions (Detailed Breakdown)\")\n", + " print(\"*Each condition identifies people with indicators of legal status who qualify for Code 3*\\n\")\n", + " \n", + " # Define condition mappings for lookup\n", + " condition_names = {\n", + " 1: \"Pre-1982 arrivals\",\n", + " 2: \"Eligible naturalized citizens\", \n", + " 3: \"Medicare recipients\",\n", + " 4: \"Federal retirement benefits\",\n", + " 5: \"Social Security disability\",\n", + " 6: \"Indian Health Service coverage\",\n", + " 7: \"Medicaid recipients\",\n", + " 8: \"CHAMPVA recipients\",\n", + " 9: \"Military health insurance\",\n", + " 10: \"Government employees\",\n", + " 11: \"Social Security recipients\",\n", + " 12: \"Housing assistance\",\n", + " 13: \"Veterans/Military personnel\",\n", + " 14: \"SSI recipients\"\n", + " }\n", + " \n", + " for i in range(1, 15):\n", + " condition_name = condition_names[i]\n", + " condition_pop = get_population(f'Condition {i}', f'{condition_name} qualify for Code 3')\n", + " print(f\"- **Condition {i:2d} - {condition_name}**: {condition_pop} people qualify for Code 3\")\n", + " \n", + " print()\n", + " print(f\"- **After conditions**: Code 0 people = {get_population('After conditions', 'Code 0 people')}\")\n", + " print()\n", + " \n", + " print(\"### Target Information\")\n", + " print(f\"- **Before adjustment**: Undocumented workers = {get_population('Before adjustment', 'Undocumented workers')}\")\n", + " print(f\"- **Target**: Undocumented workers target = {get_population('Target', 'Undocumented workers target')}\")\n", + " print(f\"- **Before adjustment**: Undocumented students = {get_population('Before adjustment', 'Undocumented students')}\")\n", + " print(f\"- **Target**: Undocumented students target = {get_population('Target', 'Undocumented students target')}\")\n", + " print()\n", + " \n", + " print(\"### EAD Assignment\")\n", + " print(f\"- **Step 3 - EAD workers**: Moved from Code 0 to Code 2 = {get_population('Step 3 - EAD workers', 'Moved from Code 0 to Code 2')}\")\n", + " print(f\"- **Step 4 - EAD students**: Moved from Code 0 to Code 2 = {get_population('Step 4 - EAD students', 'Moved from Code 0 to Code 2')}\")\n", + " print(f\"- **After EAD assignment**: Code 0 people = {get_population('After EAD assignment', 'Code 0 people')}\")\n", + " print()\n", + " \n", + " print(\"### Family Correlation (Final Step)\")\n", + " print(f\"- **Step 5 - Family correlation**: Changed from Code 3 to Code 0 = {get_population('Step 5 - Family correlation', 'Changed from Code 3 to Code 0')}\")\n", + " print(f\"- **After family correlation**: Code 0 people = {get_population('After family correlation', 'Code 0 people')}\")\n", + " print()\n", + " \n", + " print(\"### Final Results\")\n", + " print(f\"- **Final**: Code 0 (NONE) = {get_population('Final', 'Code 0 (NONE)')}\")\n", + " print(f\"- **Final**: Code 1 (CITIZEN) = {get_population('Final', 'Code 1 (CITIZEN)')}\")\n", + " print(f\"- **Final**: Code 2 
(NON_CITIZEN_VALID_EAD) = {get_population('Final', 'Code 2 (NON_CITIZEN_VALID_EAD)')}\")\n", + " print(f\"- **Final**: Code 3 (OTHER_NON_CITIZEN) = {get_population('Final', 'Code 3 (OTHER_NON_CITIZEN)')}\")\n", + " print(f\"- **Final**: Total undocumented (Code 0) = {get_population('Final', 'Total undocumented (Code 0)')}\")\n", + " print(f\"- **Final**: Undocumented target = {get_population('Final', 'Undocumented target')}\")" + ] + }, + { + "cell_type": "markdown", + "id": "cell-9", + "metadata": {}, + "source": [ + "## SSN card type calibration\n", + "\n", + "The ASEC Undocumented Algorithm is integrated into PolicyEngine's calibration system to ensure that the simulated undocumented population aligns with authoritative external estimates. The calibration specifically targets individuals assigned an SSN card type of \"NONE\" (likely undocumented) and adjusts their share in the population to match year-specific benchmarks. These targets are drawn from high-quality sources, including the DHS Office of Homeland Security Statistics ([11.0 million](https://ohss.dhs.gov/sites/default/files/2024-06/2024_0418_ohss_estimates-of-the-unauthorized-immigrant-population-residing-in-the-united-states-january-2018%25E2%2580%2593january-2022.pdf) for 2022), the Center for Migration Studies ([12.2 million](https://cmsny.org/publications/the-undocumented-population-in-the-united-states-increased-to-12-million-in-2023/) for 2023), and a Reuters synthesis of expert projections ([13.0 million](https://www.reuters.com/data/who-are-immigrants-who-could-be-targeted-trumps-mass-deportation-plans-2024-12-18/) for 2024 and 2025). This integration into the loss function ensures that PolicyEngine’s microsimulations remain grounded in current demographic realities." + ] + }, + { + "cell_type": "markdown", + "id": "cell-10", + "metadata": {}, + "source": [ + "## Child Tax Credit reform impact by immigration status\n", + "\n", + "In the following analysis, we use the SSN card type imputation to evaluate how immigration status shapes eligibility for the Child Tax Credit (CTC). Specifically, we assess the effect of CTC reform on the number of child recipients, comparing baseline and reform scenarios to validate the dataset’s ability to capture policy-driven changes across mixed-status and undocumented households." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "fc5aed00", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Baseline child CTC recipients: 61,893,530\n", + "Reform child CTC recipients: 58,258,260\n", + "Difference: 3,635,270\n" + ] + } + ], + "source": [ + "# Child Tax Credit Reform Recipient Difference Analysis\n", + "\n", + "from policyengine_us_data.datasets.cps import EnhancedCPS_2024\n", + "from policyengine_us import Microsimulation\n", + "from policyengine_core.reforms import Reform\n", + "\n", + "# Define the CTC reform (makes the reconciliation CTC permanently active)\n", + "ctc_reform = Reform.from_dict(\n", + " {\n", + " \"gov.contrib.reconciliation.ctc.in_effect\": {\n", + " \"2025-01-01.2100-12-31\": True\n", + " }\n", + " },\n", + " country_id=\"us\",\n", + ")\n", + "\n", + "# Create microsimulations for baseline and reform scenarios\n", + "baseline_sim = Microsimulation(dataset=EnhancedCPS_2024)\n", + "reform_sim = Microsimulation(dataset=EnhancedCPS_2024, reform=ctc_reform)\n", + "\n", + "# Compute CTC recipients in baseline\n", + "baseline_is_child = baseline_sim.calculate(\"is_child\")\n", + "baseline_ctc_value = baseline_sim.calculate(\"ctc_value\", map_to=\"person\")\n", + "baseline_ctc_max = baseline_sim.calculate(\"ctc_individual_maximum\")\n", + "baseline_recipients = (\n", + " baseline_is_child * (baseline_ctc_value > 0) * (baseline_ctc_max > 0)\n", + ").sum()\n", + "\n", + "# Compute CTC recipients in reform\n", + "reform_is_child = reform_sim.calculate(\"is_child\")\n", + "reform_ctc_value = reform_sim.calculate(\"ctc_value\", map_to=\"person\")\n", + "reform_ctc_max = reform_sim.calculate(\"ctc_individual_maximum\")\n", + "reform_recipients = (\n", + " reform_is_child * (reform_ctc_value > 0) * (reform_ctc_max > 0)\n", + ").sum()\n", + "\n", + "# Difference in number of child CTC recipients\n", + "recipient_difference = baseline_recipients - reform_recipients\n", + "\n", + "# Report results\n", + "print(f\"Baseline child CTC recipients: {baseline_recipients:,.0f}\")\n", + "print(f\"Reform child CTC recipients: {reform_recipients:,.0f}\")\n", + "print(f\"Difference: {recipient_difference:,.0f}\")\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/_toc.yml b/docs/_toc.yml index 168ae5f7..2852bcba 100644 --- a/docs/_toc.yml +++ b/docs/_toc.yml @@ -1,6 +1,7 @@ format: jb-book root: intro chapters: +- file: SSN_statuses_imputation.ipynb - file: validation.ipynb - file: results.ipynb - file: imputation.ipynb diff --git a/policyengine_us_data/datasets/cps/census_cps.py b/policyengine_us_data/datasets/cps/census_cps.py index cc689571..ee35947d 100644 --- a/policyengine_us_data/datasets/cps/census_cps.py +++ b/policyengine_us_data/datasets/cps/census_cps.py @@ -302,5 +302,24 @@ class CensusCPS_2018(CensusCPS): "PRCITSHP", "NOW_GRP", "POCCU2", + "PEINUSYR", + "MCARE", + "PEN_SC1", + "PEN_SC2", + "RESNSS1", + "RESNSS2", + "IHSFLG", + "CAID", + "CHAMPVA", + "PEIO1COW", + "A_MJOCC", + "SS_YN", + "PEAFEVER", + "SSI_YN", + "RESNSSI1", + "RESNSSI2", + "PENATVTY", + "PEIOOCC", + "MIL", "A_HRS1", ] diff --git 
a/policyengine_us_data/datasets/cps/cps.py b/policyengine_us_data/datasets/cps/cps.py index 5336db72..72cbb7e6 100644 --- a/policyengine_us_data/datasets/cps/cps.py +++ b/policyengine_us_data/datasets/cps/cps.py @@ -61,7 +61,14 @@ def generate(self): logging.info("Adding previous year income variables") add_previous_year_income(self, cps) logging.info("Adding SSN card type") - add_ssn_card_type(cps, person) + add_ssn_card_type( + cps, + person, + spm_unit, + undocumented_target=13e6, + undocumented_workers_target=8.3e6, + undocumented_students_target=0.21 * 1.9e6, + ) logging.info("Adding family variables") add_spm_variables(cps, spm_unit) logging.info("Adding household variables") @@ -682,48 +689,769 @@ def add_previous_year_income(self, cps: h5py.File) -> None: ].values -def add_ssn_card_type(cps: h5py.File, person: pd.DataFrame) -> None: +def add_ssn_card_type( + cps: h5py.File, + person: pd.DataFrame, + spm_unit: pd.DataFrame, + undocumented_target: float = 13e6, + undocumented_workers_target: float = 8.3e6, + undocumented_students_target: float = 0.21 * 1.9e6, +) -> None: """ - Deterministically assign SSA card type based on PRCITSHP and student/employment status. - Code: - - 1: Citizen (PRCITSHP 1–4) - - 2: Foreign-born, noncitizen but likely on valid EAD (student or worker) - - 0: Other noncitizens (to refine or default) + Assign SSN card type using PRCITSHP, employment status, and ASEC-UA conditions. + Codes: + - 0: "NONE" - Likely undocumented immigrants + - 1: "CITIZEN" - US citizens (born or naturalized) + - 2: "NON_CITIZEN_VALID_EAD" - Non-citizens with work/study authorization + - 3: "OTHER_NON_CITIZEN" - Non-citizens with indicators of legal status """ + + # Initialize CSV logging for population tracking + population_log = [] + + def select_random_subset_to_target( + eligible_ids, current_weighted, target_weighted, random_seed=None + ): + """ + Randomly select subset to move current weighted population to target. 
+ + Args: + eligible_ids: Array of person indices eligible for selection + current_weighted: Current weighted total + target_weighted: Target weighted total + random_seed: Random seed for reproducibility + + Returns: + Array of selected person indices + """ + if len(eligible_ids) == 0: + return np.array([], dtype=int) + + # Calculate how much weighted population needs to be moved + if current_weighted > target_weighted: + excess_weighted = current_weighted - target_weighted + # Calculate fraction to move randomly + total_reassignable_weight = np.sum(person_weights[eligible_ids]) + share_to_move = excess_weighted / total_reassignable_weight + share_to_move = min(share_to_move, 1.0) # Cap at 100% + else: + # Calculate how much to move to reach target (for EAD case) + needed_weighted = ( + current_weighted - target_weighted + ) # Will be negative + total_weight = np.sum(person_weights[eligible_ids]) + share_to_move = abs(needed_weighted) / total_weight + share_to_move = min(share_to_move, 1.0) # Cap at 100% + + if share_to_move > 0: + if random_seed is not None: + if current_weighted > target_weighted: + # Use new rng for refinement + rng = np.random.default_rng(seed=random_seed) + random_draw = rng.random(len(eligible_ids)) + assign_mask = random_draw < share_to_move + selected = eligible_ids[assign_mask] + else: + # Use old np.random for EAD to maintain compatibility + np.random.seed(random_seed) + n_to_move = int(len(eligible_ids) * share_to_move) + selected = np.random.choice( + eligible_ids, size=n_to_move, replace=False + ) + else: + selected = np.array([], dtype=int) + else: + selected = np.array([], dtype=int) + + return selected + + # Get household weights for population calculations + household_ids = cps["household_id"] + household_weights = cps["household_weight"] + person_household_ids = cps["person_household_id"] + household_to_weight = dict(zip(household_ids, household_weights)) + person_weights = np.array( + [household_to_weight.get(hh_id, 0) for hh_id in person_household_ids] + ) + + # Initialize all persons as code 0 ssn_card_type = np.full(len(person), 0) + initial_population = np.sum(person_weights[ssn_card_type == 0]) + print(f"Step 0 - Initial: Code 0 people: {initial_population:,.0f}") + population_log.append( + { + "step": "Step 0 - Initial", + "description": "Code 0 people", + "population": initial_population, + } + ) + + # ============================================================================ + # PRIMARY CLASSIFICATIONS + # ============================================================================ + + # Code 1: All US Citizens (naturalized and born) + citizens_mask = np.isin(person.PRCITSHP, [1, 2, 3, 4]) + ssn_card_type[citizens_mask] = 1 + noncitizens = person.PRCITSHP == 5 + citizens_moved = np.sum(person_weights[citizens_mask]) + print(f"Step 1 - Citizens: Moved {citizens_moved:,.0f} people to Code 1") + population_log.append( + { + "step": "Step 1 - Citizens", + "description": "Moved to Code 1", + "population": citizens_moved, + } + ) + + # ============================================================================ + # ASEC UNDOCUMENTED ALGORITHM CONDITIONS + # Remove individuals with indicators of legal status from code 0 pool + # ============================================================================ + + # paper source: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4662801 + # Helper mask: Only apply conditions to non-citizens without clear authorization + potentially_undocumented = ~np.isin(ssn_card_type, [1, 2]) + + current_code_0 = 
np.sum(person_weights[ssn_card_type == 0]) + print(f"\nASEC Conditions - Current Code 0 people: {current_code_0:,.0f}") + population_log.append( + { + "step": "ASEC Conditions", + "description": "Current Code 0 people", + "population": current_code_0, + } + ) + + # CONDITION 1: Pre-1982 Arrivals (IRCA Amnesty Eligible) + # PEINUSYR values indicating arrival before 1982: + # 01 = Before 1950 + # 02 = 1950–1959 + # 03 = 1960–1964 + # 04 = 1965–1969 + # 05 = 1970–1974 + # 06 = 1975–1979 + # 07 = 1980–1981 + arrived_before_1982 = np.isin(person.PEINUSYR, [1, 2, 3, 4, 5, 6, 7]) + condition_1_mask = potentially_undocumented & arrived_before_1982 + condition_1_count = np.sum(person_weights[condition_1_mask]) + print( + f"Condition 1 - Pre-1982 arrivals: {condition_1_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 1", + "description": "Pre-1982 arrivals qualify for Code 3", + "population": condition_1_count, + } + ) + + # CONDITION 2: Eligible Naturalized Citizens + is_naturalized = person.PRCITSHP == 4 + is_adult = person.A_AGE >= 18 + # 5+ years in US (codes 8-26: 1982-2019) + has_five_plus_years = np.isin(person.PEINUSYR, list(range(8, 27))) + # 3+ years in US + married (codes 8-27: 1982-2021) + has_three_plus_years = np.isin(person.PEINUSYR, list(range(8, 28))) + is_married = person.A_MARITL.isin([1, 2]) & (person.A_SPOUSE > 0) + eligible_naturalized = ( + is_naturalized + & is_adult + & (has_five_plus_years | (has_three_plus_years & is_married)) + ) + condition_2_mask = potentially_undocumented & eligible_naturalized + condition_2_count = np.sum(person_weights[condition_2_mask]) + print( + f"Condition 2 - Eligible naturalized citizens: {condition_2_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 2", + "description": "Eligible naturalized citizens qualify for Code 3", + "population": condition_2_count, + } + ) + + # CONDITION 3: Medicare Recipients + has_medicare = person.MCARE == 1 + condition_3_mask = potentially_undocumented & has_medicare + condition_3_count = np.sum(person_weights[condition_3_mask]) + print( + f"Condition 3 - Medicare recipients: {condition_3_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 3", + "description": "Medicare recipients qualify for Code 3", + "population": condition_3_count, + } + ) + + # CONDITION 4: Federal Retirement Benefits + has_federal_pension = np.isin(person.PEN_SC1, [3]) | np.isin( + person.PEN_SC2, [3] + ) # Federal government pension + condition_4_mask = potentially_undocumented & has_federal_pension + condition_4_count = np.sum(person_weights[condition_4_mask]) + print( + f"Condition 4 - Federal retirement benefits: {condition_4_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 4", + "description": "Federal retirement benefits qualify for Code 3", + "population": condition_4_count, + } + ) + + # CONDITION 5: Social Security Disability + has_ss_disability = np.isin(person.RESNSS1, [2]) | np.isin( + person.RESNSS2, [2] + ) # Disabled (adult or child) + condition_5_mask = potentially_undocumented & has_ss_disability + condition_5_count = np.sum(person_weights[condition_5_mask]) + print( + f"Condition 5 - Social Security disability: {condition_5_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 5", + "description": "Social Security disability qualify for Code 3", + "population": condition_5_count, + } + ) + + # CONDITION 6: Indian 
Health Service Coverage + has_ihs = person.IHSFLG == 1 + condition_6_mask = potentially_undocumented & has_ihs + condition_6_count = np.sum(person_weights[condition_6_mask]) + print( + f"Condition 6 - Indian Health Service coverage: {condition_6_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 6", + "description": "Indian Health Service coverage qualify for Code 3", + "population": condition_6_count, + } + ) + + # CONDITION 7: Medicaid Recipients (simplified - no state adjustments) + has_medicaid = person.CAID == 1 + condition_7_mask = potentially_undocumented & has_medicaid + condition_7_count = np.sum(person_weights[condition_7_mask]) + print( + f"Condition 7 - Medicaid recipients: {condition_7_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 7", + "description": "Medicaid recipients qualify for Code 3", + "population": condition_7_count, + } + ) + + # CONDITION 8: CHAMPVA Recipients + has_champva = person.CHAMPVA == 1 + condition_8_mask = potentially_undocumented & has_champva + condition_8_count = np.sum(person_weights[condition_8_mask]) + print( + f"Condition 8 - CHAMPVA recipients: {condition_8_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 8", + "description": "CHAMPVA recipients qualify for Code 3", + "population": condition_8_count, + } + ) + + # CONDITION 9: Military Health Insurance + has_military_insurance = person.MIL == 1 + condition_9_mask = potentially_undocumented & has_military_insurance + condition_9_count = np.sum(person_weights[condition_9_mask]) + print( + f"Condition 9 - Military health insurance: {condition_9_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 9", + "description": "Military health insurance qualify for Code 3", + "population": condition_9_count, + } + ) + + # CONDITION 10: Government Employees + is_government_worker = np.isin( + person.PEIO1COW, [1, 2, 3] + ) # Fed/state/local gov + is_military_occupation = person.A_MJOCC == 11 # Military occupation + is_government_employee = is_government_worker | is_military_occupation + condition_10_mask = potentially_undocumented & is_government_employee + condition_10_count = np.sum(person_weights[condition_10_mask]) + print( + f"Condition 10 - Government employees: {condition_10_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 10", + "description": "Government employees qualify for Code 3", + "population": condition_10_count, + } + ) + + # CONDITION 11: Social Security Recipients + has_social_security = person.SS_YN == 1 + condition_11_mask = potentially_undocumented & has_social_security + condition_11_count = np.sum(person_weights[condition_11_mask]) + print( + f"Condition 11 - Social Security recipients: {condition_11_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 11", + "description": "Social Security recipients qualify for Code 3", + "population": condition_11_count, + } + ) + + # CONDITION 12: Housing Assistance + spm_housing_map = dict(zip(spm_unit.SPM_ID, spm_unit.SPM_CAPHOUSESUB)) + has_housing_assistance = person.SPM_ID.map(spm_housing_map).fillna(0) > 0 + condition_12_mask = potentially_undocumented & has_housing_assistance + condition_12_count = np.sum(person_weights[condition_12_mask]) + print( + f"Condition 12 - Housing assistance: {condition_12_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 
12", + "description": "Housing assistance qualify for Code 3", + "population": condition_12_count, + } + ) + + # CONDITION 13: Veterans/Military Personnel + is_veteran = person.PEAFEVER == 1 + is_current_military = person.A_MJOCC == 11 + is_military_connected = is_veteran | is_current_military + condition_13_mask = potentially_undocumented & is_military_connected + condition_13_count = np.sum(person_weights[condition_13_mask]) + print( + f"Condition 13 - Veterans/Military personnel: {condition_13_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 13", + "description": "Veterans/Military personnel qualify for Code 3", + "population": condition_13_count, + } + ) - # Code 1: Citizens - ssn_card_type[np.isin(person.PRCITSHP, [1, 2, 3, 4])] = 1 + # CONDITION 14: SSI Recipients (simplified - assumes all SSI is for recipient) + has_ssi = person.SSI_YN == 1 + condition_14_mask = potentially_undocumented & has_ssi + condition_14_count = np.sum(person_weights[condition_14_mask]) + print( + f"Condition 14 - SSI recipients: {condition_14_count:,.0f} people qualify for Code 3" + ) + population_log.append( + { + "step": "Condition 14", + "description": "SSI recipients qualify for Code 3", + "population": condition_14_count, + } + ) - # Code 2: Noncitizens (PRCITSHP == 5) who are working or studying - noncitizen_mask = person.PRCITSHP == 5 - is_worker = (person.WSAL_VAL > 0) | (person.SEMP_VAL > 0) # worker - is_student = person.A_HSCOL == 2 # student - ead_like_mask = noncitizen_mask & (is_worker | is_student) - ssn_card_type[ead_like_mask] = 2 + # ============================================================================ + # CONSOLIDATED ASSIGNMENT OF ASSUMED DOCUMENTED STATUS + # ============================================================================ + + # Combine all conditions that indicate legal status + assumed_documented = ( + arrived_before_1982 + | eligible_naturalized + | has_medicare + | has_federal_pension + | has_ss_disability + | has_ihs + | has_medicaid + | has_champva + | has_military_insurance + | is_government_employee + | has_social_security + | has_housing_assistance + | is_military_connected + | has_ssi + ) - # Step 3: Refine remaining 0s into 0 or 3 - share_code_3 = 0.3 # IRS/SSA target share of SSA-benefit-only cards - rng = np.random.default_rng(seed=42) - to_refine = (ssn_card_type == 0) & noncitizen_mask - refine_indices = np.where(to_refine)[0] + # Apply single assignment for all conditions + ssn_card_type[potentially_undocumented & assumed_documented] = 3 + # print(f"Step 2 - Documented indicators: Moved {np.sum(person_weights[potentially_undocumented & assumed_documented]):,.0f} people from Code 0 to Code 3") - if len(refine_indices) > 0: - draw = rng.random(len(refine_indices)) - assign_code_3 = draw < share_code_3 - ssn_card_type[refine_indices[assign_code_3]] = 3 + # Calculate undocumented workers and students after ASEC conditions + undocumented_workers_mask = ( + (ssn_card_type == 0) + & noncitizens + & ((person.WSAL_VAL > 0) | (person.SEMP_VAL > 0)) + ) + undocumented_students_mask = ( + (ssn_card_type == 0) & noncitizens & (person.A_HSCOL == 2) + ) + undocumented_workers_count = np.sum( + person_weights[undocumented_workers_mask] + ) + undocumented_students_count = np.sum( + person_weights[undocumented_students_mask] + ) + + after_conditions_code_0 = np.sum(person_weights[ssn_card_type == 0]) + print(f"After conditions - Code 0 people: {after_conditions_code_0:,.0f}") + print( + f" - Undocumented workers before 
adjustment: {undocumented_workers_count:,.0f} (target: {undocumented_workers_target:,.0f})" + ) + print( + f" - Undocumented students before adjustment: {undocumented_students_count:,.0f} (target: {undocumented_students_target:,.0f})" + ) + + population_log.append( + { + "step": "After conditions", + "description": "Code 0 people", + "population": after_conditions_code_0, + } + ) + population_log.append( + { + "step": "Before adjustment", + "description": "Undocumented workers", + "population": undocumented_workers_count, + } + ) + population_log.append( + { + "step": "Target", + "description": "Undocumented workers target", + "population": undocumented_workers_target, + } + ) + population_log.append( + { + "step": "Before adjustment", + "description": "Undocumented students", + "population": undocumented_students_count, + } + ) + population_log.append( + { + "step": "Target", + "description": "Undocumented students target", + "population": undocumented_students_target, + } + ) + + # ============================================================================ + # CODE 2 NON-CITIZEN WITH WORK/STUDY AUTHORIZATION + # ============================================================================ + + # Code 2: Non-citizens with work/study authorization (likely valid EAD) + # Only consider people still in Code 0 (undocumented) after ASEC conditions + worker_mask = ( + (ssn_card_type != 3) + & noncitizens + & ((person.WSAL_VAL > 0) | (person.SEMP_VAL > 0)) + ) + student_mask = (ssn_card_type != 3) & noncitizens & (person.A_HSCOL == 2) + + # Calculate target-driven worker assignment + # Target: 8.3 million undocumented workers (from Pew Research) + # https://www.pewresearch.org/short-reads/2024/07/22/what-we-know-about-unauthorized-immigrants-living-in-the-us/ + + # Get worker IDs + worker_ids = person[worker_mask].index + + # Use function to select workers for EAD + total_weighted_workers = np.sum(person_weights[worker_ids]) + selected_workers = select_random_subset_to_target( + worker_ids, + total_weighted_workers, + undocumented_workers_target, + random_seed=0, + ) + + # Calculate target-driven student assignment + # Target: 21% of 1.9 million = ~399k undocumented students (from Higher Ed Immigration Portal) + # https://www.higheredimmigrationportal.org/research/immigrant-origin-students-in-u-s-higher-education-updated-august-2024/ + + student_ids = person[student_mask].index + + # Use function to select students for EAD + total_weighted_students = np.sum(person_weights[student_ids]) + selected_students = select_random_subset_to_target( + student_ids, + total_weighted_students, + undocumented_students_target, + random_seed=1, + ) + + # Assign code 2 + ssn_card_type[selected_workers] = 2 + ssn_card_type[selected_students] = 2 + ead_workers_moved = np.sum(person_weights[selected_workers]) + ead_students_moved = np.sum(person_weights[selected_students]) + after_ead_code_0 = np.sum(person_weights[ssn_card_type == 0]) + + print( + f"Step 3 - EAD workers: Moved {ead_workers_moved:,.0f} people from Code 0 to Code 2" + ) + print( + f"Step 4 - EAD students: Moved {ead_students_moved:,.0f} people from Code 0 to Code 2" + ) + print(f"After EAD assignment - Code 0 people: {after_ead_code_0:,.0f}") + + population_log.append( + { + "step": "Step 3 - EAD workers", + "description": "Moved from Code 0 to Code 2", + "population": ead_workers_moved, + } + ) + population_log.append( + { + "step": "Step 4 - EAD students", + "description": "Moved from Code 0 to Code 2", + "population": ead_students_moved, + } + ) + 
population_log.append( + { + "step": "After EAD assignment", + "description": "Code 0 people", + "population": after_ead_code_0, + } + ) + + final_counts = pd.Series(ssn_card_type).value_counts().sort_index() + + # ============================================================================ + # PROBABILISTIC FAMILY CORRELATION ADJUSTMENT + # ============================================================================ + + # Probabilistic family correlation: Only move code 3 household members to code 0 + # if needed to hit the undocumented target. This preserves mixed-status families + # (citizens living with undocumented) while still achieving target-driven correlation. + + # Use existing household data + person_household_ids = cps["person_household_id"] + + # Track before state + code_0_before = np.sum(person_weights[ssn_card_type == 0]) + + # Calculate how many more undocumented people we need to hit target + current_undocumented = code_0_before + undocumented_needed = max(0, undocumented_target - current_undocumented) + + print( + f"Current undocumented: {current_undocumented:,.0f}, Target: {undocumented_target:,.0f}" + ) + print(f"Additional undocumented needed: {undocumented_needed:,.0f}") + + families_adjusted = 0 + + if undocumented_needed > 0: + # Identify households with mixed status (code 0 + code 3 members) + mixed_household_candidates = [] + + unique_households = np.unique(person_household_ids) + + for household_id in unique_households: + household_mask = person_household_ids == household_id + household_ssn_codes = ssn_card_type[household_mask] + + # Check if household has both undocumented (code 0) AND code 3 members + has_undocumented = (household_ssn_codes == 0).any() + has_code3 = (household_ssn_codes == 3).any() + + if has_undocumented and has_code3: + # Find code 3 indices in this household + household_indices = np.where(household_mask)[0] + code_3_indices = household_indices[household_ssn_codes == 3] + mixed_household_candidates.extend(code_3_indices) + + # Randomly select from eligible code 3 members in mixed households to hit target + if len(mixed_household_candidates) > 0: + mixed_household_candidates = np.array(mixed_household_candidates) + candidate_weights = person_weights[mixed_household_candidates] + + # Use probabilistic selection to hit target + selected_indices = select_random_subset_to_target( + mixed_household_candidates, + current_undocumented, + undocumented_target, + random_seed=100, # Different seed for family correlation + ) + + if len(selected_indices) > 0: + ssn_card_type[selected_indices] = 0 + families_adjusted = len(selected_indices) + print( + f"Selected {len(selected_indices)} people from {len(mixed_household_candidates)} candidates in mixed households" + ) + else: + print( + "No additional family members selected (target already reached)" + ) + else: + print("No mixed-status households found for family correlation") + else: + print( + "No additional undocumented people needed - target already reached" + ) + + # Calculate the weighted impact + code_0_after = np.sum(person_weights[ssn_card_type == 0]) + weighted_change = code_0_after - code_0_before + + print( + f"Step 5 - Probabilistic family correlation: Changed {weighted_change:,.0f} people from Code 3 to Code 0" + ) + print(f"After family correlation - Code 0 people: {code_0_after:,.0f}") + + population_log.append( + { + "step": "Step 5 - Family correlation", + "description": "Changed from Code 3 to Code 0", + "population": weighted_change, + } + ) + population_log.append( + { + "step": "After 
family correlation", + "description": "Code 0 people", + "population": code_0_after, + } + ) + + # ============================================================================ + # CONVERT TO STRING LABELS AND STORE + # ============================================================================ code_to_str = { - 0: "NONE", - 1: "CITIZEN", - 2: "NON_CITIZEN_VALID_EAD", - 3: "OTHER_NON_CITIZEN", + 0: "NONE", # Likely undocumented immigrants + 1: "CITIZEN", # US citizens + 2: "NON_CITIZEN_VALID_EAD", # Non-citizens with work/study authorization + 3: "OTHER_NON_CITIZEN", # Non-citizens with indicators of legal status } ssn_card_type_str = ( pd.Series(ssn_card_type).map(code_to_str).astype("S").values ) cps["ssn_card_type"] = ssn_card_type_str + # Final population summary + print(f"\nFinal populations:") + for code, label in code_to_str.items(): + pop = np.sum(person_weights[ssn_card_type == code]) + print(f" Code {code} ({label}): {pop:,.0f}") + population_log.append( + { + "step": "Final", + "description": f"Code {code} ({label})", + "population": pop, + } + ) + + final_undocumented = np.sum(person_weights[ssn_card_type == 0]) + print( + f"Total undocumented (Code 0): {final_undocumented:,.0f} (target: {undocumented_target:,.0f})" + ) + population_log.append( + { + "step": "Final", + "description": "Total undocumented (Code 0)", + "population": final_undocumented, + } + ) + population_log.append( + { + "step": "Final", + "description": "Undocumented target", + "population": undocumented_target, + } + ) + + # Save population log to CSV + import os + + log_df = pd.DataFrame(population_log) + csv_path = os.path.join( + os.path.dirname(__file__), + "..", + "..", + "..", + "docs", + "asec_population_log.csv", + ) + log_df.to_csv(csv_path, index=False) + print(f"Population log saved to: {csv_path}") + + # Update documentation with actual numbers + _update_documentation_with_numbers(log_df, os.path.dirname(csv_path)) + + +def _update_documentation_with_numbers(log_df, docs_dir): + """Update the documentation file with actual population numbers from CSV""" + import os + + doc_path = os.path.join(docs_dir, "SSN_statuses_imputation.ipynb") + + if not os.path.exists(doc_path): + print(f"Documentation file not found at: {doc_path}") + return + + # Create mapping of step/description to population for easy lookup + data_map = {} + for _, row in log_df.iterrows(): + key = (row["step"], row["description"]) + data_map[key] = row["population"] + + # Read the documentation file + with open(doc_path, "r", encoding="utf-8") as f: + content = f.read() + + # Define replacements based on our logging structure + replacements = { + "- **Step 0 - Initial**: Code 0 people = *[Run cps.py to populate]*": lambda: f"- **Step 0 - Initial**: Code 0 people = {data_map.get(('Step 0 - Initial', 'Code 0 people'), 0):,.0f}", + "- **Step 1 - Citizens**: Moved to Code 1 = *[Run cps.py to populate]*": lambda: f"- **Step 1 - Citizens**: Moved to Code 1 = {data_map.get(('Step 1 - Citizens', 'Moved to Code 1'), 0):,.0f}", + "- **ASEC Conditions**: Current Code 0 people = *[Run cps.py to populate]*": lambda: f"- **ASEC Conditions**: Current Code 0 people = {data_map.get(('ASEC Conditions', 'Current Code 0 people'), 0):,.0f}", + "- **After conditions**: Code 0 people = *[Run cps.py to populate]*": lambda: f"- **After conditions**: Code 0 people = {data_map.get(('After conditions', 'Code 0 people'), 0):,.0f}", + "- **Before adjustment**: Undocumented workers = *[Run cps.py to populate]*": lambda: f"- **Before adjustment**: Undocumented 
workers = {data_map.get(('Before adjustment', 'Undocumented workers'), 0):,.0f}", + "- **Target**: Undocumented workers target = *[Run cps.py to populate]*": lambda: f"- **Target**: Undocumented workers target = {data_map.get(('Target', 'Undocumented workers target'), 0):,.0f}", + "- **Before adjustment**: Undocumented students = *[Run cps.py to populate]*": lambda: f"- **Before adjustment**: Undocumented students = {data_map.get(('Before adjustment', 'Undocumented students'), 0):,.0f}", + "- **Target**: Undocumented students target = *[Run cps.py to populate]*": lambda: f"- **Target**: Undocumented students target = {data_map.get(('Target', 'Undocumented students target'), 0):,.0f}", + "- **Step 3 - EAD workers**: Moved from Code 0 to Code 2 = *[Run cps.py to populate]*": lambda: f"- **Step 3 - EAD workers**: Moved from Code 0 to Code 2 = {data_map.get(('Step 3 - EAD workers', 'Moved from Code 0 to Code 2'), 0):,.0f}", + "- **Step 4 - EAD students**: Moved from Code 0 to Code 2 = *[Run cps.py to populate]*": lambda: f"- **Step 4 - EAD students**: Moved from Code 0 to Code 2 = {data_map.get(('Step 4 - EAD students', 'Moved from Code 0 to Code 2'), 0):,.0f}", + "- **After EAD assignment**: Code 0 people = *[Run cps.py to populate]*": lambda: f"- **After EAD assignment**: Code 0 people = {data_map.get(('After EAD assignment', 'Code 0 people'), 0):,.0f}", + "- **Step 5 - Family correlation**: Changed from Code 3 to Code 0 = *[Run cps.py to populate]*": lambda: f"- **Step 5 - Family correlation**: Changed from Code 3 to Code 0 = {data_map.get(('Step 5 - Family correlation', 'Changed from Code 3 to Code 0'), 0):,.0f}", + "- **After family correlation**: Code 0 people = *[Run cps.py to populate]*": lambda: f"- **After family correlation**: Code 0 people = {data_map.get(('After family correlation', 'Code 0 people'), 0):,.0f}", + "- **Final**: Code 0 (NONE) = *[Run cps.py to populate]*": lambda: f"- **Final**: Code 0 (NONE) = {data_map.get(('Final', 'Code 0 (NONE)'), 0):,.0f}", + "- **Final**: Code 1 (CITIZEN) = *[Run cps.py to populate]*": lambda: f"- **Final**: Code 1 (CITIZEN) = {data_map.get(('Final', 'Code 1 (CITIZEN)'), 0):,.0f}", + "- **Final**: Code 2 (NON_CITIZEN_VALID_EAD) = *[Run cps.py to populate]*": lambda: f"- **Final**: Code 2 (NON_CITIZEN_VALID_EAD) = {data_map.get(('Final', 'Code 2 (NON_CITIZEN_VALID_EAD)'), 0):,.0f}", + "- **Final**: Code 3 (OTHER_NON_CITIZEN) = *[Run cps.py to populate]*": lambda: f"- **Final**: Code 3 (OTHER_NON_CITIZEN) = {data_map.get(('Final', 'Code 3 (OTHER_NON_CITIZEN)'), 0):,.0f}", + "- **Final**: Total undocumented (Code 0) = *[Run cps.py to populate]*": lambda: f"- **Final**: Total undocumented (Code 0) = {data_map.get(('Final', 'Total undocumented (Code 0)'), 0):,.0f}", + "- **Final**: Undocumented target = *[Run cps.py to populate]*": lambda: f"- **Final**: Undocumented target = {data_map.get(('Final', 'Undocumented target'), 0):,.0f}", + } + + # Apply replacements + for old_text, replacement_func in replacements.items(): + if old_text in content: + content = content.replace(old_text, replacement_func()) + + # Write updated content back to file + with open(doc_path, "w", encoding="utf-8") as f: + f.write(content) + + print(f"Documentation updated with population numbers: {doc_path}") + def add_tips(self, cps: h5py.File): self.save_dataset(cps) diff --git a/policyengine_us_data/tests/test_datasets/test_enhanced_cps.py b/policyengine_us_data/tests/test_datasets/test_enhanced_cps.py index 280ffbb2..9cab790a 100644 --- 
a/policyengine_us_data/tests/test_datasets/test_enhanced_cps.py +++ b/policyengine_us_data/tests/test_datasets/test_enhanced_cps.py @@ -97,7 +97,7 @@ def test_ssn_card_type_none_target(): from policyengine_us import Microsimulation import numpy as np - TARGET_COUNT = 11e6 + TARGET_COUNT = 13e6 TOLERANCE = 0.2 # Allow ±20% error sim = Microsimulation(dataset=EnhancedCPS_2024) @@ -114,6 +114,73 @@ def test_ssn_card_type_none_target(): assert pct_error < TOLERANCE +def test_ctc_reform_child_recipient_difference(): + """ + Test CTC reform impact for validation purposes only. + Note: This is no longer actively targeted in loss matrix calibration + due to uncertainty around assumptions from hearing comments. + """ + from policyengine_us_data.datasets.cps import EnhancedCPS_2024 + from policyengine_us import Microsimulation + from policyengine_core.reforms import Reform + + TARGET_COUNT = 2e6 + TOLERANCE = 4 # Allow ±400% error + + # Define the CTC reform + ctc_reform = Reform.from_dict( + { + "gov.contrib.reconciliation.ctc.in_effect": { + "2025-01-01.2100-12-31": True + } + }, + country_id="us", + ) + + # Create baseline and reform simulations + baseline_sim = Microsimulation(dataset=EnhancedCPS_2024) + reform_sim = Microsimulation(dataset=EnhancedCPS_2024, reform=ctc_reform) + + # Calculate baseline CTC recipients (children with ctc_individual_maximum > 0 and ctc_value > 0) + baseline_is_child = baseline_sim.calculate("is_child") + baseline_ctc_individual_maximum = baseline_sim.calculate( + "ctc_individual_maximum" + ) + baseline_ctc_value = baseline_sim.calculate("ctc_value", map_to="person") + baseline_child_ctc_recipients = ( + baseline_is_child + * (baseline_ctc_individual_maximum > 0) + * (baseline_ctc_value > 0) + ).sum() + + # Calculate reform CTC recipients (children with ctc_individual_maximum > 0 and ctc_value > 0) + reform_is_child = reform_sim.calculate("is_child") + reform_ctc_individual_maximum = reform_sim.calculate( + "ctc_individual_maximum" + ) + reform_ctc_value = reform_sim.calculate("ctc_value", map_to="person") + reform_child_ctc_recipients = ( + reform_is_child + * (reform_ctc_individual_maximum > 0) + * (reform_ctc_value > 0) + ).sum() + + # Calculate the difference (baseline - reform child CTC recipients) + ctc_recipient_difference = ( + baseline_child_ctc_recipients - reform_child_ctc_recipients + ) + + pct_error = abs((ctc_recipient_difference - TARGET_COUNT) / TARGET_COUNT) + + print( + f"CTC reform child recipient difference: {ctc_recipient_difference:.0f}, target: {TARGET_COUNT:.0f}, error: {pct_error:.2%}" + ) + print( + "Note: CTC targeting removed from calibration - this is validation only" + ) + assert pct_error < TOLERANCE + + def test_aca_calibration(): import pandas as pd diff --git a/policyengine_us_data/utils/loss.py b/policyengine_us_data/utils/loss.py index e885cbab..9480a046 100644 --- a/policyengine_us_data/utils/loss.py +++ b/policyengine_us_data/utils/loss.py @@ -426,10 +426,23 @@ def build_loss_matrix(dataset: type, time_period): ssn_type_mask, "person", "household" ) - # Target value - replace with actual target values from SSA/IRS data + # Target undocumented population by year based on various sources if card_type_str == "NONE": - # https://www.pewresearch.org/race-and-ethnicity/2018/11/27/u-s-unauthorized-immigrant-total-dips-to-lowest-level-in-a-decade/ - target_count = 11e6 + undocumented_targets = { + 2022: 11.0e6, # Official DHS Office of Homeland Security Statistics estimate for 1 Jan 2022 + # 
https://ohss.dhs.gov/sites/default/files/2024-06/2024_0418_ohss_estimates-of-the-unauthorized-immigrant-population-residing-in-the-united-states-january-2018%25E2%2580%2593january-2022.pdf + 2023: 12.2e6, # Center for Migration Studies ACS-based residual estimate (published May 2025) + # https://cmsny.org/publications/the-undocumented-population-in-the-united-states-increased-to-12-million-in-2023/ + 2024: 13.0e6, # Reuters synthesis of experts ahead of 2025 change ("~13-14 million") - central value + # https://www.reuters.com/data/who-are-immigrants-who-could-be-targeted-trumps-mass-deportation-plans-2024-12-18/ + 2025: 13.0e6, # Same midpoint carried forward - CBP data show 95% drop in border apprehensions + } + if time_period <= 2022: + target_count = 11.0e6 # Use 2022 value for earlier years + elif time_period >= 2025: + target_count = 13.0e6 # Use 2025 value for later years + else: + target_count = undocumented_targets[time_period] targets_array.append(target_count) diff --git a/pyproject.toml b/pyproject.toml index ac07f46f..5d56c9c8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -27,6 +27,8 @@ dependencies = [ "pip-system-certs", "google-cloud-storage", "google-auth", + "scipy<1.13", + "statsmodels>=0.14.0", ] [project.optional-dependencies]
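As a closing note on the calibration and test changes above, a quick spot-check of the imputed undocumented population against the year-specific targets in `loss.py` could look like the sketch below. This is an illustrative snippet rather than part of the diff; it assumes that the imputed `ssn_card_type` decodes to its string labels and that summing a boolean MicroSeries applies the survey weights, mirroring the approach of `test_ssn_card_type_none_target()`.

```python
from policyengine_us import Microsimulation
from policyengine_us_data.datasets.cps import EnhancedCPS_2024

# Year-specific undocumented population targets mirrored from build_loss_matrix().
UNDOCUMENTED_TARGETS = {2022: 11.0e6, 2023: 12.2e6, 2024: 13.0e6, 2025: 13.0e6}

sim = Microsimulation(dataset=EnhancedCPS_2024)
ssn_card_type = sim.calculate("ssn_card_type")

# Weighted count of people imputed as likely undocumented (Code 0 / "NONE").
undocumented = (ssn_card_type == "NONE").sum()
target = UNDOCUMENTED_TARGETS[2024]

print(f"Imputed undocumented population: {undocumented:,.0f}")
print(f"Calibration target (2024): {target:,.0f}")
print(f"Relative error: {abs(undocumented - target) / target:.1%}")
```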