@yashinlin yashinlin commented Dec 7, 2025

Summary of Changes
This branch integrates the 2024 EAVS data and completes a refactoring of the cleaning pipeline for consistency across all years (2020, 2022, 2024).

The history has been interactively rebased into three clean, logical commits for easy review:

| Commit | Scope | Description |
| --- | --- | --- |
| b08d576 | Core Logic | Implements the main clean_2024() logic and updates all YAML files (2020, 2022, 2024) to standardize column mappings based on CLC priority variables (A-list, C1-C9, E1-E3). |
| a87a133 | Output Pipeline | Enables multi-format output (parquet, xlsx, csv) for combined and single-year datasets, simplifying final data delivery. |
| 59ce3ee | Utility | Adds the survey SHA256 calculator to utils/. |

Functional Outcomes & Fixes
2024 Data Integration: The pipeline successfully downloads, cleans, and integrates the 2024 EAVS data source.

Pipeline Stability & Validation: The combined pipeline runs cleanly and validates all three years (2020, 2022, 2024) end-to-end.

Fixes Included:

  • Schema Compatibility: Resolved fips_code column validation failure in Pandera by changing the column type from Series[str] to Series[String] (Pandas nullable string dtype).

  • File Reading Integrity: Resolved issues reading the 2024 Excel file caused by Windows metadata streams in WSL environments.

  • Timeseries Cleaner Integration: Integrated the timeseries cleaner into the main clean pipeline so it runs automatically when executing python -m eavs.clean, while keeping the timeseries logic in its own module for standalone use. This change also cleans up schema placement and ensures consistent Parquet outputs across datasets. Both clean.py and clean_timeseries.py were run locally to confirm validation and outputs.

Testing and Verification Steps
The history of this branch was cleaned and successfully rebased onto main (Andrew's last commit: 861308f). To verify locally, reviewers can ensure the local environment is up-to-date and run the download and clean steps:

Update local main:

    git checkout main
    git pull upstream main

Run Pipeline:

    uv run -m eavs.download
    uv run -m eavs.clean

Expected Result: The pipeline should finish with a SUCCESS message after validating the combined data, generating new multi-format files in the data/cleaned directory.

Adds the clean_2024() function, creates the 2024.yaml file, and updates all year YAMLs (2020, 2022) with CLC priority variables (A-list, C1-C9, E1-E3). This also incorporates robust column filtering and essential pipeline stabilization fixes.
Implements data output into parquet, xlsx, and csv formats for both individual years and combined data. Fixes pipeline execution and updates final column mapping logic.
Introduces the calculate_sha256 utility script to ensure data integrity.
@yashinlin
Collaborator Author

Update: I’ve resolved the merge conflicts and integrated timeseries cleaning so it runs via uv run -m eavs.clean. Both per-year and timeseries pipelines were run locally.


@tuanpham96 tuanpham96 left a comment

I think things generally look OK. There still seem to be issues with the FIPS codes, though.

Just a comment: I believe shasum usually comes with most OSes.

Anyway, there are two copies of calculate_sha256.py.

I think it's ok to leave it in here as a utility function in the utils folder and delete this file instead.
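
For reference, the digest that `shasum -a 256 <file>` prints can also be computed with the Python standard library alone. The following is only a minimal sketch: `file_sha256` and its `chunk_size` parameter are hypothetical names chosen for illustration and are not necessarily how the utility in utils/ is written.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file.

    Hypothetical helper for illustration. The file is read in chunks so
    that a large survey workbook does not need to fit in memory at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned string against a recorded checksum is enough to detect a changed or corrupted download.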

Comment on lines +490 to +493
  - name: provisional_ballots_rejected_total
    dtype: int64
    raw_name: E1d
    description: "Number of provisional ballots rejected."

I'm not entirely sure what the difference between E1d and E3a is. Is this a duplicate?

  - name: provisional_ballots_rejected_total
    dtype: int64
    raw_name: E1d
    description: "Number of provisional ballots rejected."
  - name: provisional_ballots_rejected_total_2
    dtype: int64
    raw_name: E3a
    description: "Total provisional ballots rejected."

# -------------------------------
# Permanent Mail Registrants (C2)
# -------------------------------
- name: total_transmitted_permanent_mail

Since you typically name variables category-first, I wonder if this should be permanent_mail_transmitted_total for consistency. It's fine either way.

log.warning(f"Raw EAVS file not found for year {year} within {raw_data_dir}")
return pd.DataFrame()

data_path = excel_files[0]

We should probably add an assert statement just in case:

assert len(excel_files) == 1
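
As an alternative to a bare assert (which Python strips when run with `-O`), the same guard could be written as an explicit check that also reports which files were found. This is only a sketch: `require_single_file` is a hypothetical helper name, and it assumes `excel_files` is a list of `pathlib.Path` objects. It raises on an empty list too, so it would need to be called after the existing "not found" branch.

```python
from pathlib import Path
from typing import Sequence


def require_single_file(
    candidates: Sequence[Path], what: str = "raw EAVS Excel file"
) -> Path:
    """Return the only candidate, or raise if there are zero or several.

    Hypothetical helper for illustration; unlike `assert`, the check
    still runs when Python is started with -O.
    """
    if len(candidates) != 1:
        names = sorted(p.name for p in candidates)
        raise ValueError(
            f"Expected exactly one {what}, found {len(candidates)}: {names}"
        )
    return candidates[0]
```

With this in place, `data_path = require_single_file(excel_files)` would replace the `excel_files[0]` lookup.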


# Add year column and normalize FIPS
df['year'] = year
df['fips_code'] = df['fips_code'].astype(str).str.zfill(5).str[:5]

I'd suggest adding a more detailed comment here for future reference, since this was a recurring issue/discussion.

Also, I think there's still an issue with this. Some entries from Wisconsin and Maine have FIPS codes starting with 00. For example, after cleaning, the entries from Adams County in Wisconsin have multiple FIPS codes:

                         jurisdiction_name      state fips_code
0             CITY OF ADAMS - ADAMS COUNTY  WISCONSIN     00275
1             TOWN OF ADAMS - ADAMS COUNTY  WISCONSIN     00300
2         TOWN OF BIG FLATS - ADAMS COUNTY  WISCONSIN     07300
3           TOWN OF COLBURN - ADAMS COUNTY  WISCONSIN     16075
4      TOWN OF DELL PRAIRIE - ADAMS COUNTY  WISCONSIN     19575
5            TOWN OF EASTON - ADAMS COUNTY  WISCONSIN     22000
6     VILLAGE OF FRIENDSHIP - ADAMS COUNTY  WISCONSIN     27950
7           TOWN OF JACKSON - ADAMS COUNTY  WISCONSIN     37625
8             TOWN OF LEOLA - ADAMS COUNTY  WISCONSIN     43425
9           TOWN OF LINCOLN - ADAMS COUNTY  WISCONSIN     44250
10           TOWN OF MONROE - ADAMS COUNTY  WISCONSIN     53725
11      TOWN OF NEW CHESTER - ADAMS COUNTY  WISCONSIN     56525
12        TOWN OF NEW HAVEN - ADAMS COUNTY  WISCONSIN     56750
13          TOWN OF PRESTON - ADAMS COUNTY  WISCONSIN     65450
14           TOWN OF QUINCY - ADAMS COUNTY  WISCONSIN     65825
15        TOWN OF RICHFIELD - ADAMS COUNTY  WISCONSIN     67425
16             TOWN OF ROME - ADAMS COUNTY  WISCONSIN     69275
17      TOWN OF SPRINGVILLE - ADAMS COUNTY  WISCONSIN     76350
18  TOWN OF STRONGS PRAIRIE - ADAMS COUNTY  WISCONSIN     77800

According to Wikipedia, the FIPS code for Adams County, Wisconsin is 55001.

Based on the Census website for Wisconsin, these are likely FIPS codes for county subdivisions.

Is that right? Do we need county level or subdivision level for downstream analyses?
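
To make the county vs. subdivision question concrete, here is a minimal sketch of how both levels could be kept side by side. It rests on an assumption not confirmed in this thread: that the raw EAVS FIPS value is a 10-digit code laid out as state (2) + county (3) + subdivision (5). The name `split_eavs_fips` and the example value `5500100275` (built from 55001 and the 00275 shown above) are hypothetical.

```python
from typing import NamedTuple


class FipsParts(NamedTuple):
    state: str        # 2 digits
    county: str       # 5 digits: state + county
    subdivision: str  # trailing 5 digits


def split_eavs_fips(raw, width: int = 10) -> FipsParts:
    """Split a raw FIPS value into state, county, and subdivision parts.

    Assumes (unverified here) a 10-digit state+county+subdivision layout.
    Integers are left-padded with zeros so leading-zero states survive.
    """
    code = str(raw).strip().zfill(width)
    if len(code) != width or not (code.isascii() and code.isdigit()):
        raise ValueError(f"Unexpected FIPS value: {raw!r}")
    return FipsParts(state=code[:2], county=code[:5], subdivision=code[5:])


# Hypothetical example value for the City of Adams row above.
parts = split_eavs_fips("5500100275")
print(parts.county, parts.subdivision)
```

Values that are not purely numeric after padding (for example a float rendered as `5500100275.0`) raise instead of being silently truncated, which would surface the kind of mismatch described above.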

3 participants