FEAT: Integrate 2024 EAVS data and stabilize CLC cleaning pipeline #28
Adds the clean_2024() function, creates the 2024.yaml file, and updates all year YAMLs (2020, 2022) with CLC priority variables (A-list, C1-C9, E1-E3). This also incorporates robust column filtering and essential pipeline stabilization fixes.
Implements data output into parquet, xlsx, and csv formats for both individual years and combined data. Fixes pipeline execution and updates final column mapping logic.
Introduces the calculate_sha256 utility script to ensure data integrity.
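For readers who have not opened the script, a checksum utility of this kind can be sketched as follows. This is a minimal illustration using only the standard library, not the code in this PR; the function signature and the `chunk_size` parameter are assumptions.

```python
import hashlib
from pathlib import Path


def calculate_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Hypothetical signature: the script in this PR may differ.
    Reading in chunks keeps memory use flat for large raw data files.
    """
    digest = hashlib.sha256()
    with open(Path(path), "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned digest against a recorded value after download is enough to detect a corrupted or silently changed raw file.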
Update: I’ve resolved the merge conflicts and integrated timeseries cleaning so it runs via uv run -m eavs.clean. Both per-year and timeseries pipelines were run locally.
tuanpham96
left a comment
I think things generally look ok. There still seem to be issues with the FIPS codes, though.
Just a comment: I believe shasum usually comes with most OSes.
Anyway, there are two copies of calculate_sh256.py.
I think it's ok to keep the one in the utils folder as a utility function and delete this file instead.
- name: provisional_ballots_rejected_total
  dtype: int64
  raw_name: E1d
  description: "Number of provisional ballots rejected."
I'm not entirely sure what the difference between E1d and E3a is. Is this a duplicate?
- name: provisional_ballots_rejected_total
  dtype: int64
  raw_name: E1d
  description: "Number of provisional ballots rejected."
- name: provisional_ballots_rejected_total_2
  dtype: int64
  raw_name: E3a
  description: "Total provisional ballots rejected."

# -------------------------------
# Permanent Mail Registrants (C2)
# -------------------------------
- name: total_transmitted_permanent_mail
Since you typically name variables by category first, I wonder if this should be permanent_mail_transmitted_total for consistency. It's fine either way.
    log.warning(f"Raw EAVS file not found for year {year} within {raw_data_dir}")
    return pd.DataFrame()

data_path = excel_files[0]
Should probably add an assert statement just in case:
assert len(excel_files) == 1
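If a bare assert feels too terse, the same check can raise a descriptive error instead. The sketch below is only an illustration: the helper name `find_single_raw_file`, the `*.xlsx` pattern, and the assumption that the year appears in the file name are all hypothetical and not taken from this PR.

```python
from pathlib import Path


def find_single_raw_file(raw_data_dir, year, pattern="*.xlsx"):
    """Return the one raw file for `year`, or None if there is none.

    Hypothetical helper: assumes the year appears in the file name.
    Raises ValueError when several files match, rather than silently
    picking the first one.
    """
    candidates = sorted(
        p for p in Path(raw_data_dir).glob(pattern) if str(year) in p.name
    )
    if not candidates:
        return None
    if len(candidates) > 1:
        names = ", ".join(p.name for p in candidates)
        raise ValueError(f"Expected one raw file for {year}, found: {names}")
    return candidates[0]
```

Unlike assert, the explicit check still runs when Python is started with -O, and the message lists the conflicting files.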
# Add year column and normalize FIPS
df['year'] = year
df['fips_code'] = df['fips_code'].astype(str).str.zfill(5).str[:5]
I'd suggest adding a more detailed comment here for future reference, since this was a recurring issue/discussion.
Also, I think there's still an issue with this. Some entries from Wisconsin and Maine have FIPS codes starting with 00. For example, after cleaning, the entries from Adams County in Wisconsin have multiple FIPS codes:
    jurisdiction_name                       state      fips_code
0   CITY OF ADAMS - ADAMS COUNTY            WISCONSIN  00275
1   TOWN OF ADAMS - ADAMS COUNTY            WISCONSIN  00300
2   TOWN OF BIG FLATS - ADAMS COUNTY        WISCONSIN  07300
3   TOWN OF COLBURN - ADAMS COUNTY          WISCONSIN  16075
4   TOWN OF DELL PRAIRIE - ADAMS COUNTY     WISCONSIN  19575
5   TOWN OF EASTON - ADAMS COUNTY           WISCONSIN  22000
6   VILLAGE OF FRIENDSHIP - ADAMS COUNTY    WISCONSIN  27950
7   TOWN OF JACKSON - ADAMS COUNTY          WISCONSIN  37625
8   TOWN OF LEOLA - ADAMS COUNTY            WISCONSIN  43425
9   TOWN OF LINCOLN - ADAMS COUNTY          WISCONSIN  44250
10  TOWN OF MONROE - ADAMS COUNTY           WISCONSIN  53725
11  TOWN OF NEW CHESTER - ADAMS COUNTY      WISCONSIN  56525
12  TOWN OF NEW HAVEN - ADAMS COUNTY        WISCONSIN  56750
13  TOWN OF PRESTON - ADAMS COUNTY          WISCONSIN  65450
14  TOWN OF QUINCY - ADAMS COUNTY           WISCONSIN  65825
15  TOWN OF RICHFIELD - ADAMS COUNTY        WISCONSIN  67425
16  TOWN OF ROME - ADAMS COUNTY             WISCONSIN  69275
17  TOWN OF SPRINGVILLE - ADAMS COUNTY      WISCONSIN  76350
18  TOWN OF STRONGS PRAIRIE - ADAMS COUNTY  WISCONSIN  77800
According to Wikipedia, the FIPS code for Adams County is 55001.
Based on the Census website for Wisconsin, the values above are likely FIPS codes for county subdivisions.
Is that right? Do we need county level or subdivision level for downstream analyses?
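To make the question concrete, here is one way the two levels could be kept apart. It is a rough sketch that assumes the raw value is a 10-digit code laid out as 2 state digits, 3 county digits, and 5 county-subdivision digits; that layout, the sample value 5500100275, and the helper names are assumptions to check against the raw file, not something confirmed in this PR.

```python
from typing import NamedTuple


class FipsParts(NamedTuple):
    state: str        # 2 digits
    county: str       # 3 digits
    subdivision: str  # 5 digits, "00000" for county-level jurisdictions


def split_eavs_fips(raw, width=10):
    """Split a raw code into state, county, and subdivision parts.

    Assumes a 10-digit layout (2 + 3 + 5). Codes read as integers lose
    leading zeros, so the value is left-padded back to `width` first.
    """
    text = str(raw).strip()
    if not text.isdigit() or len(text) > width:
        raise ValueError(f"Unexpected FIPS value: {raw!r}")
    text = text.zfill(width)
    return FipsParts(text[:2], text[2:5], text[5:])


def county_fips(raw):
    """Return the 5-digit state + county code."""
    parts = split_eavs_fips(raw)
    return parts.state + parts.county
```

Under that assumption, county_fips("5500100275") gives 55001 and the subdivision part is 00275, which would match the City of Adams row above. Keeping both parts as separate columns would let downstream analyses pick either level.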
Summary of Changes
This branch integrates the 2024 EAVS data and completes a refactoring of the cleaning pipeline for consistency across all years (2020, 2022, 2024).
The history has been interactively rebased into three clean, logical commits for easy review:
Functional Outcomes & Fixes
2024 Data Integration: The pipeline successfully downloads, cleans, and integrates the 2024 EAVS data source.
Pipeline Stability & Validation: The combined pipeline runs cleanly and validates all three years (2020, 2022, 2024) end-to-end.
Fixes Included:
Schema Compatibility: Resolved fips_code column validation failure in Pandera by changing the column type from Series[str] to Series[String] (Pandas nullable string dtype).
File Reading Integrity: Resolved issues reading the 2024 Excel file caused by Windows metadata streams in WSL environments.
Integration of timeseries cleaner into main pipeline
Integrated the timeseries cleaner into the main clean pipeline so it runs automatically when executing python -m eavs.clean, while keeping the timeseries logic in its own module for standalone use. This change also cleans up schema placement and ensures consistent Parquet outputs across datasets. Both clean.py and clean_timeseries.py were run locally to confirm validation and outputs.
Testing and Verification Steps
The history of this branch was cleaned and successfully rebased onto main (Andrew's last commit: 861308f). To verify locally, reviewers can ensure the local environment is up-to-date and run the download and clean steps:
Update local main:
Bash
Bash
Expected Result: The pipeline should finish with a SUCCESS message after validating the combined data, generating new multi-format files in the data/cleaned directory.