FEAT: Integrate 2024 EAVS data and stabilize CLC cleaning pipeline #28

yashinlin · 2025-12-07T12:49:14Z

Summary of Changes
This branch integrates the 2024 EAVS data and completes a refactoring of the cleaning pipeline for consistency across all years (2020, 2022, 2024).

The history has been interactively rebased into three clean, logical commits for easy review:

Commit	Scope	Description
b08d576	Core Logic	Implements the main clean_2024() logic and updates all YAML files (2020, 2022, 2024) to standardize column mappings based on CLC priority variables (A-list, C1-C9, E1-E3).
a87a133	Output Pipeline	Enables multi-format output (parquet, xlsx, csv) for combined and single-year datasets, simplifying final data delivery.
59ce3ee	Utility	Adds the survey SHA256 calculator to utils/.

Functional Outcomes & Fixes
2024 Data Integration: The pipeline successfully downloads, cleans, and integrates the 2024 EAVS data source.

Pipeline Stability & Validation: The combined pipeline runs cleanly and validates all three years (2020, 2022, 2024) end-to-end.

Fixes Included:

Schema Compatibility: Resolved fips_code column validation failure in Pandera by changing the column type from Series[str] to Series[String] (Pandas nullable string dtype).
File Reading Integrity: Resolved issues reading the 2024 Excel file caused by Windows metadata streams in WSL environments.
Timeseries Cleaner Integration: Integrated the timeseries cleaner into the main clean pipeline so it runs automatically when executing python -m eavs.clean, while keeping the timeseries logic in its own module for standalone use. This change also cleans up schema placement and ensures consistent Parquet outputs across datasets. Both clean.py and clean_timeseries.py were run locally to confirm validation and outputs.
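
One plausible reading of the file-reading fix above: when files are copied from Windows into WSL, alternate data streams can show up as sidecar files such as `name.xlsx:Zone.Identifier`, which a broad file listing may pick up alongside the real workbook. A minimal sketch of filtering them out, assuming that is the cause; `list_excel_files` is a hypothetical helper, not the function in eavs/clean.py:

```python
from pathlib import Path


def list_excel_files(raw_data_dir: Path) -> list[Path]:
    """Return real .xlsx workbooks, skipping Windows metadata sidecars.

    Assumption: sidecars are named like 'survey.xlsx:Zone.Identifier',
    so their name contains ':' and their suffix is not '.xlsx'.
    """
    return sorted(
        p
        for p in raw_data_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".xlsx" and ":" not in p.name
    )
```

Filtering on the exact suffix rather than a substring match keeps the check cheap and avoids handing a non-workbook file to the Excel reader.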

Testing and Verification Steps
The history of this branch was cleaned and successfully rebased onto main (Andrew's last commit: 861308f). To verify locally, reviewers can ensure the local environment is up-to-date and run the download and clean steps:

Update local main:

git checkout main
git pull upstream main

Run Pipeline:

uv run -m eavs.download
uv run -m eavs.clean

Expected Result: The pipeline should finish with a SUCCESS message after validating the combined data, generating new multi-format files in the data/cleaned directory.

Adds the clean_2024() function, creates the 2024.yaml file, and updates all year YAMLs (2020, 2022) with CLC priority variables (A-list, C1-C9, E1-E3). This also incorporates robust column filtering and essential pipeline stabilization fixes.

Implements data output into parquet, xlsx, and csv formats for both individual years and combined data. Fixes pipeline execution and updates final column mapping logic.

Introduces the calculate_sha256 utility script to ensure data integrity.
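
For readers unfamiliar with this kind of utility, a minimal sketch of a chunked SHA-256 file hash using only the standard library. The function name `sha256_of_file` and the 1 MiB chunk size are illustrative assumptions, not the contents of calculate_sha256.py:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large survey files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting digest can be compared against a recorded hash to confirm that a downloaded survey file has not changed.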

yashinlin · 2025-12-16T09:18:17Z

Update: I’ve resolved the merge conflicts and integrated timeseries cleaning so it runs via uv run -m eavs.clean. Both per-year and timeseries pipelines were run locally.

tuanpham96

I think things generally look OK. There still seem to be issues with the FIPS codes, though.

tuanpham96 · 2026-01-21T22:37:09Z

calculate_sha256.py

Just a comment: I believe shasum usually comes with most OSes.

Anyway, there are two copies of calculate_sha256.py.

I think it's ok to leave it in here as a utility function in the utils folder and delete this file instead.

tuanpham96 · 2026-01-21T22:51:36Z

eavs/assets/column_mappings/2020.yaml

+  - name: provisional_ballots_rejected_total
+    dtype: int64
+    raw_name: E1d
+    description: "Number of provisional ballots rejected."


I'm not entirely sure what the difference between E1d and E3a is. Is this a duplicate?

  - name: provisional_ballots_rejected_total
    dtype: int64
    raw_name: E1d
    description: "Number of provisional ballots rejected."
  - name: provisional_ballots_rejected_total_2
    dtype: int64
    raw_name: E3a
    description: "Total provisional ballots rejected."

tuanpham96 · 2026-01-21T22:53:45Z

eavs/assets/column_mappings/2020.yaml

+  # -------------------------------
+  # Permanent Mail Registrants (C2)
+  # -------------------------------
+  - name: total_transmitted_permanent_mail


Since you typically name variables by category first, I wonder if this should be permanent_mail_transmitted_total for consistency. It's fine either way.

tuanpham96 · 2026-01-21T23:16:42Z

eavs/clean.py

+        log.warning(f"Raw EAVS file not found for year {year} within {raw_data_dir}")
+        return pd.DataFrame()
+
+    data_path = excel_files[0]


We should probably add an assert statement here, just in case more than one file matches:

assert len(excel_files) == 1
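
A variant of the same check that still runs under `python -O` (which strips assert statements) is to raise an explicit error. A minimal sketch; `select_raw_file` is a hypothetical helper name, and the zero-file case is assumed to be handled by the existing warning branch before this is called:

```python
from pathlib import Path


def select_raw_file(excel_files: list[Path]) -> Path:
    """Return the single candidate file, or raise if the match is ambiguous."""
    if len(excel_files) != 1:
        names = [p.name for p in excel_files]
        raise ValueError(
            f"Expected exactly one raw Excel file, found {len(excel_files)}: {names}"
        )
    return excel_files[0]
```

The error message lists the matching file names, which makes it easier to see which stray file caused the ambiguity.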

tuanpham96 · 2026-01-21T23:28:19Z

eavs/clean.py

+
+    # Add year column and normalize FIPS
+    df['year'] = year
+    df['fips_code'] = df['fips_code'].astype(str).str.zfill(5).str[:5]


I'd suggest adding a bit more comment here for future reference, since this was a recurring issue/discussion.

Also, I think there's still an issue with this. Some entries from Wisconsin and Maine have FIPS codes starting with 00. For example, after cleaning, the entries from Adams County in Wisconsin have multiple FIPS codes:

    jurisdiction_name                       state      fips_code
0   CITY OF ADAMS - ADAMS COUNTY            WISCONSIN  00275
1   TOWN OF ADAMS - ADAMS COUNTY            WISCONSIN  00300
2   TOWN OF BIG FLATS - ADAMS COUNTY        WISCONSIN  07300
3   TOWN OF COLBURN - ADAMS COUNTY          WISCONSIN  16075
4   TOWN OF DELL PRAIRIE - ADAMS COUNTY     WISCONSIN  19575
5   TOWN OF EASTON - ADAMS COUNTY           WISCONSIN  22000
6   VILLAGE OF FRIENDSHIP - ADAMS COUNTY    WISCONSIN  27950
7   TOWN OF JACKSON - ADAMS COUNTY          WISCONSIN  37625
8   TOWN OF LEOLA - ADAMS COUNTY            WISCONSIN  43425
9   TOWN OF LINCOLN - ADAMS COUNTY          WISCONSIN  44250
10  TOWN OF MONROE - ADAMS COUNTY           WISCONSIN  53725
11  TOWN OF NEW CHESTER - ADAMS COUNTY      WISCONSIN  56525
12  TOWN OF NEW HAVEN - ADAMS COUNTY        WISCONSIN  56750
13  TOWN OF PRESTON - ADAMS COUNTY          WISCONSIN  65450
14  TOWN OF QUINCY - ADAMS COUNTY           WISCONSIN  65825
15  TOWN OF RICHFIELD - ADAMS COUNTY        WISCONSIN  67425
16  TOWN OF ROME - ADAMS COUNTY             WISCONSIN  69275
17  TOWN OF SPRINGVILLE - ADAMS COUNTY      WISCONSIN  76350
18  TOWN OF STRONGS PRAIRIE - ADAMS COUNTY  WISCONSIN  77800

According to Wikipedia, the FIPS code for Adams County is 55001.

Based on the Census website for Wisconsin, these are likely FIPS codes for county subdivisions.

Is that right? Do we need county level or subdivision level for downstream analyses?
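
One way to surface such rows automatically is to compare the first two digits of each code against the state's FIPS prefix (55 for Wisconsin, 23 for Maine). A minimal sketch, assuming rows are plain dicts with the `state` and `fips_code` columns shown above; `STATE_FIPS` and `find_prefix_mismatches` are illustrative names, and only the two states mentioned here are listed:

```python
# Two-digit state FIPS prefixes; only the states discussed above are included.
STATE_FIPS = {"WISCONSIN": "55", "MAINE": "23"}


def find_prefix_mismatches(rows: list[dict]) -> list[dict]:
    """Return rows whose fips_code does not start with the state's prefix.

    States missing from STATE_FIPS are skipped rather than flagged.
    """
    mismatches = []
    for row in rows:
        expected = STATE_FIPS.get(row["state"])
        if expected is not None and not row["fips_code"].startswith(expected):
            mismatches.append(row)
    return mismatches


rows = [
    {"jurisdiction_name": "CITY OF ADAMS - ADAMS COUNTY", "state": "WISCONSIN", "fips_code": "00275"},
    {"jurisdiction_name": "ADAMS COUNTY", "state": "WISCONSIN", "fips_code": "55001"},
]
flagged = find_prefix_mismatches(rows)
```

Here the second row is a made-up county-level entry for contrast; only the first row, taken from the table above, would be flagged.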

yashinlin added 3 commits December 7, 2025 19:50

FEAT: Enable multi-format data export and streamline pipeline execution

864a9fe

Implements data output into parquet, xlsx, and csv formats for both individual years and combined data. Fixes pipeline execution and updates final column mapping logic.

CHORE: Add utility script for calculating survey SHA256 hashes

3d76709

Introduces the calculate_sha256 utility script to ensure data integrity.

yashinlin requested review from amitchharper and tuanpham96 December 7, 2025 12:49

"FIX: Add openpyxl dependency to pyproject.toml for Excel file reading."

fac7455

tuanpham96 requested changes Jan 21, 2026
