General data processing utilities for OpenElections projects.
A Python utility for consolidating county-level precinct CSV files into statewide precinct results files.
This module provides functionality to merge multiple county-level precinct result CSV files into a single statewide file. It automatically introspects vote type columns (such as early_voting, election_day, mail, absentee, provisional, etc.) and handles varying data structures across counties.
Based on: openelections-data-tx statewide_generator.py
- Automatic column introspection: Discovers vote type columns by excluding standard columns (
county,precinct,office,district,party,candidate) - Flexible data handling: Works with any combination of vote type columns across different counties
- Office filtering: Optionally include only specific offices
- Custom file patterns: Support for different file naming conventions
- Reporting: Generate reports showing which vote columns are present in each county
- Reusable: Designed to be imported into any OpenElections state repository
Copy precinct_results.py to your OpenElections data repository:
# From your openelections-data-XX repository
curl -O https://raw.githubusercontent.com/openelections/utils/main/precinct_results.pyfrom precinct_results import generate_statewide_precinct_file
generate_statewide_precinct_file(
state_abbr='tx',
election_date='20201103',
source_directory='2020/counties',
output_file='20201103__tx__general__precinct.csv'
)offices = ['President', 'U.S. Senate', 'U.S. House', 'Governor']
generate_statewide_precinct_file(
state_abbr='tx',
election_date='20201103',
source_directory='2020/counties',
output_file='20201103__tx__general__precinct.csv',
offices_filter=offices
)generate_statewide_precinct_file(
state_abbr='ca',
election_date='20201103',
source_directory='2020/counties',
output_file='20201103__ca__general__precinct.csv',
file_pattern='*_precinct_results.csv'
)from precinct_results import generate_vote_columns_report
generate_vote_columns_report(
source_directory='2020/counties',
file_pattern='20201103*precinct.csv',
output_file='vote_columns_report.csv'
)See example_usage.py for more comprehensive examples.
Generate a consolidated statewide precinct results file from county CSV files.
Parameters:
state_abbr(str): Two-letter state abbreviation (e.g., 'tx', 'ca')election_date(str): Election date in YYYYMMDD format (e.g., '20201103')source_directory(str): Directory containing county precinct CSV filesoutput_file(str): Path for the output consolidated CSV filefile_pattern(str, optional): Glob pattern for matching files. Defaults to'{election_date}*precinct.csv'offices_filter(list, optional): List of office names to include. If None, includes all officesverbose(bool, optional): If True, print progress messages. Defaults to True
Generate a report showing which vote type columns are present in each county file.
Parameters:
source_directory(str): Directory containing county precinct CSV filesfile_pattern(str): Glob pattern for matching filesoutput_file(str): Path for the output report CSV fileverbose(bool, optional): If True, print progress messages. Defaults to True
- Column Discovery: Scans all matching CSV files and identifies vote type columns by excluding standard columns
- Column Unification: Collects all unique vote type columns across all counties
- Data Consolidation: Reads each county file and consolidates into the output with consistent column ordering
- Missing Data Handling: Fills
Nonefor vote type columns not present in a particular county
County-level CSV files should have:
Required columns:
county,precinct,office,district,candidate,party
Vote type columns (any combination):
votes,early_voting,election_day,mail,absentee,provisional, etc.
- Standard columns first:
county,precinct,office,district,candidate,party - All discovered vote type columns in sorted order
- One row per precinct/office/candidate combination
Utilities to detect inconsistent party naming conventions in CSV files.
These functions help identify variations in party values (e.g., "DEM", "Democratic", "Democrat") that likely refer to the same political party. This is useful for:
- Data quality assurance
- Identifying data entry inconsistencies
- Standardizing party names before processing
- Single file analysis: Check party variations in a single CSV file
- Directory-wide analysis: Check party variations across multiple files
- Similarity detection: Uses string similarity to identify potential variations
- Frequency reporting: Shows occurrence counts for each party value
- Optional CSV reports: Export detailed analysis to CSV files
from precinct_results import check_party_variations
results = check_party_variations('20201103__tx__general__precinct.csv')
print(results['unique_parties'])
print(results['potential_variations'])from precinct_results import check_party_variations_directory
results = check_party_variations_directory(
source_directory='2020/counties',
file_pattern='20201103*precinct.csv',
output_file='party_variations_report.csv'
)Check for variations in party values within a single CSV file.
Parameters:
csv_path(str): Path to the CSV file to analyzesimilarity_threshold(float, optional): Minimum similarity (0.0-1.0) for variation detection. Default: 0.7output_file(str, optional): Path to write detailed variation reportverbose(bool, optional): If True, print summary and variations found. Default: True
Returns:
Dictionary with:
unique_parties: Set of all unique party values foundtotal_count: Total number of party occurrencesparty_counts: Dict mapping party value to occurrence countpotential_variations: List of (party1, party2, similarity) tuplesempty_count: Number of rows with empty/missing party values
Check for party value variations across multiple CSV files in a directory.
Parameters:
source_directory(str): Directory containing CSV filesfile_pattern(str): Glob pattern for matching files (e.g., '20201103*precinct.csv')similarity_threshold(float, optional): Minimum similarity (0.0-1.0) for variation detection. Default: 0.7output_file(str, optional): Path to write detailed variation reportverbose(bool, optional): If True, print summary and variations found. Default: True
Returns:
Dictionary with:
all_parties: Set of all unique party values across all filesby_file: Dict mapping filename to set of party values in that filepotential_variations: List of (party1, party2, similarity) tuplesfile_count: Number of files processed
Party Value Analysis:
Total rows with party values: 125000
Empty/missing party values: 150
Unique party values found: 8
Party value frequency:
DEM: 55000
REP: 52000
Democratic: 8500
Republican: 7200
LIB: 1800
GRN: 450
IND: 50
Potential variations detected: 2
'DEM' (n=55000) ↔ 'Democratic' (n=8500) - similarity: 0.82
'REP' (n=52000) ↔ 'Republican' (n=7200) - similarity: 0.79
See the compare_precinct_names() function for comparing precinct names between elections.
A comprehensive utility for comparing two CSV election files to identify and report differences.
The compare_csv_files() function compares two CSV files from the same state, county, and election to detect:
- Missing rows (rows in File A but not in File B)
- Extra rows (rows in File B but not in File A)
- Value mismatches in matching rows
- Structural differences (missing/extra columns)
- Vote total discrepancies
This is useful for:
- Data validation and quality assurance
- Verifying corrections were applied correctly
- Comparing original source data vs. processed data
- Migration testing and format validation
- Flexible row identification: Auto-detects key columns or accepts custom column specification
- Multiple output formats: Terminal (CLI), HTML web report, or both
- CSV export: Export detailed differences for further analysis
- Numeric tolerance: Optionally ignore small numeric differences
- Column filtering: Exclude specific columns from comparison
- Vote total analysis: Automatic calculation and comparison of vote totals
- Color-coded output: Easy-to-read terminal output with ANSI colors
- Detailed reporting: Summary statistics plus detailed difference listings
from precinct_results import compare_csv_files
# Compare two files
results = compare_csv_files(
'original.csv',
'corrected.csv',
verbose=True
)
# Check if files match
if results['summary']['percentage_match'] == 100.0:
print("Files are identical!")compare_csv_files(
'file_a.csv',
'file_b.csv',
output_format='web',
output_file='comparison_report.html'
)
# Open comparison_report.html in a browsercompare_csv_files(
'file_a.csv',
'file_b.csv',
csv_export='differences.csv',
verbose=True
)
# Opens in Excel or can be imported for analysisCompare two CSV files and report differences.
Parameters:
file_a(str): Path to first CSV filefile_b(str): Path to second CSV filekey_columns(list, optional): Columns to use as row identifier. If None, auto-detects from standard election columns (county, precinct, office, district, candidate, party)ignore_columns(list, optional): Columns to exclude from comparisontolerance(float, optional): Numeric comparison tolerance (default: 0.0)max_differences(int, optional): Stop after N differences (default: unlimited)output_format(str, optional): Output format - 'cli', 'web', or 'both' (default: 'cli')output_file(str, optional): Output file path (HTML for web, text for cli)csv_export(str, optional): Export detailed differences to CSV fileverbose(bool, optional): Print detailed output (default: True)color(bool, optional): Use colored terminal output (default: True)
Returns:
Dictionary with comparison results containing:
metadata: File information and timestampssummary: High-level statistics (match percentage, difference counts)column_differences: Missing/extra columnsrow_differences: Missing/extra rowsvalue_differences: Value mismatches with detailsvote_totals: Vote totals and differences by vote type
# If your CSV has different structure, specify key columns
results = compare_csv_files(
'file_a.csv',
'file_b.csv',
key_columns=['county', 'precinct', 'office', 'candidate']
)# Exclude columns you know will differ
compare_csv_files(
'file_a.csv',
'file_b.csv',
ignore_columns=['provisional', 'absentee']
)# Ignore differences of 5 votes or less
compare_csv_files(
'file_a.csv',
'file_b.csv',
tolerance=5.0
)# For use in scripts/CI pipelines
results = compare_csv_files(
'file_a.csv',
'file_b.csv',
verbose=False,
color=False
)
if results['summary']['total_differences'] > 0:
print(f"FAIL: {results['summary']['total_differences']} differences")
exit(1)
else:
print("PASS: Files match")
exit(0)results = compare_csv_files('file_a.csv', 'file_b.csv', verbose=False)
# Analyze vote total discrepancies
for vote_type, diff in results['vote_totals']['differences'].items():
if diff != 0:
print(f"{vote_type}: {diff:+,} vote difference")
# Count mismatches by column
mismatches_by_column = {}
for diff in results['value_differences']:
col = diff['column']
mismatches_by_column[col] = mismatches_by_column.get(col, 0) + 1
for col, count in sorted(mismatches_by_column.items()):
print(f"{col}: {count} mismatch(es)")# Complete validation workflow
results = compare_csv_files(
'original.csv',
'processed.csv',
csv_export='validation_differences.csv',
verbose=False
)
acceptable_error_rate = 1.0 # 1%
error_rate = 100.0 - results['summary']['percentage_match']
if error_rate <= acceptable_error_rate:
print("✓ PASS: Error rate within acceptable threshold")
else:
print("✗ FAIL: Error rate exceeds threshold")
# Generate detailed HTML report for review
compare_csv_files(
'original.csv',
'processed.csv',
output_format='web',
output_file='validation_report.html',
verbose=False
)CSV Comparison Report
======================================================================
Files:
File A: original.csv
File B: corrected.csv
Structure:
Rows: 1,234 vs 1,235 (+1)
Columns: 12 vs 12 (identical)
Results:
✓ Identical rows: 1,180 (95.6%)
✗ Missing rows: 10 (0.8%)
✗ Extra rows: 11 (0.9%)
✗ Value mismatches: 34 (2.7%)
Vote Totals:
early_voting File A: 125,432 File B: 125,450 (Δ +18)
election_day File A: 234,567 File B: 234,567 (✓ Match)
mail File A: 45,678 File B: 45,680 (Δ +2)
Overall Match: 95.6%
The HTML report includes:
- Summary dashboard with match rate and difference counts
- Interactive tables showing all differences
- Vote total comparison charts
- Sortable and filterable difference listings
- Responsive design for desktop/mobile
Exported CSV includes columns:
type: Type of difference (missing_row, extra_row, value_mismatch)county,precinct,office,district,candidate,party: Row identificationcolumn: Column with difference (for value mismatches)value_a,value_b: Values from each filedifference: Numeric difference (if applicable)notes: Additional information
# Example: Compare before and after correction
from precinct_results import compare_csv_files
# Step 1: Make corrections
# ... your correction code ...
# Step 2: Validate corrections
results = compare_csv_files(
'before_correction.csv',
'after_correction.csv',
csv_export='corrections_applied.csv',
output_format='web',
output_file='correction_report.html'
)
# Step 3: Verify expected changes
print(f"Corrections applied: {results['summary']['value_mismatches']}")
print(f"Match rate: {results['summary']['percentage_match']:.1f}%")See example_usage.py for more comprehensive examples.
A specialized utility for comparing county-level summary files to aggregated precinct-level totals.
The compare_county_to_precinct_totals() function verifies that county-level election summary files match the aggregated totals from their corresponding precinct-level files. This is essential for data quality assurance when you have both county-level summaries and detailed precinct breakdowns.
Use case: When given a county name (e.g., "adams") and election prefix (e.g., "20251104__pa__general"), this tool:
- Reads the county summary file:
20251104__pa__general__adams__county.csv - Reads the precinct file:
20251104__pa__general__adams__precinct.csv - Aggregates all precinct-level data by office/candidate/party
- Compares the aggregated precinct totals to the county summary
- Reports any differences
- Automatic aggregation: Sums precinct-level data across all precincts
- Flexible matching: Uses office, district, candidate, and party to match rows
- Vote column detection: Automatically identifies all vote type columns
- Multiple output formats: Terminal (CLI) or HTML web report
- Numeric tolerance: Optionally ignore small rounding differences
- Detailed reporting: Shows exactly which candidates have mismatches
- Batch processing: Easy to check multiple counties
from precinct_results import compare_county_to_precinct_totals
results = compare_county_to_precinct_totals(
election_prefix='20251104__pa__general',
county_name='adams',
directory='data/2025'
)
if results['summary']['total_differences'] == 0:
print("County and precinct totals match!")# Basic comparison
python compare_county_precinct.py 20251104__pa__general adams
# With custom directory
python compare_county_precinct.py 20251104__pa__general adams --directory data/2025
# Generate HTML report
python compare_county_precinct.py 20251104__pa__general adams --web --output report.html
# Check multiple counties
for county in adams allegheny beaver; do
python compare_county_precinct.py 20251104__pa__general $county
doneCompare county-level summary totals to aggregated precinct-level totals.
Parameters:
election_prefix(str): Election prefix (e.g., "20251104__pa__general")county_name(str): County name (e.g., "adams")directory(str, optional): Directory containing CSV files (default: '.')tolerance(float, optional): Numeric comparison tolerance (default: 0.0)verbose(bool, optional): Print detailed output (default: True)output_format(str, optional): Output format - 'cli', 'web', or 'both' (default: 'cli')output_file(str, optional): Optional output file path
Returns:
Dictionary with comparison results containing:
metadata: File information and comparison detailssummary: High-level statistics (total differences, match percentage)differences: List of rows with value mismatchesmissing_in_precinct: Rows in county file but not in precinct datamissing_in_county: Rows in precinct data but not in county file
Raises:
FileNotFoundError: If either file doesn't existValueError: If files are invalid or incompatible
compare_county_to_precinct_totals(
election_prefix='20251104__pa__general',
county_name='adams',
directory='data/2025',
output_format='web',
output_file='adams_county_check.html'
)# Ignore differences of 5 votes or less
results = compare_county_to_precinct_totals(
election_prefix='20251104__pa__general',
county_name='adams',
directory='data/2025',
tolerance=5.0
)counties = ['adams', 'allegheny', 'beaver', 'berks', 'blair']
election_prefix = '20251104__pa__general'
issues_found = []
for county in counties:
results = compare_county_to_precinct_totals(
election_prefix=election_prefix,
county_name=county,
directory='data/2025',
verbose=False
)
if results['summary']['total_differences'] > 0:
issues_found.append({
'county': county,
'differences': results['summary']['total_differences']
})
if issues_found:
print(f"Counties with issues: {len(issues_found)}")
for item in issues_found:
print(f" {item['county']}: {item['differences']} differences")
else:
print("All counties match perfectly!")The compare_county_precinct.py script provides easy command-line access:
# Basic usage
python compare_county_precinct.py ELECTION_PREFIX COUNTY_NAME
# Options
-d, --directory DIR Directory containing CSV files (default: current)
-t, --tolerance NUM Numeric comparison tolerance (default: 0.0)
--web Generate HTML web report
-o, --output FILE Output file path
--quiet Suppress verbose output
# Exit codes
0: Files match
1: Differences found
2: File not found
3: Other errorCounty vs Precinct Totals Comparison
======================================================================
Election: 20251104__pa__general
County: adams
County file rows: 45
Precinct file rows: 523
Aggregated unique rows: 45
✓ MATCH: County and precinct totals match perfectly!
When differences exist:
✗ MISMATCH: Found 3 issue(s)
Value mismatches: 3
Value Mismatches:
----------------------------------------------------------------------
1. office: President | candidate: Jane Smith | party: DEM
Vote type: election_day
County total: 12,345
Precinct total: 12,350
Difference: +5
2. office: U.S. Senate | candidate: John Doe | party: REP
Vote type: mail
County total: 8,234
Precinct total: 8,230
Difference: -4
The HTML report includes:
- Summary cards showing match status and total issues
- Detailed table of all value mismatches
- Vote type breakdown showing county vs precinct totals
- Difference highlighting with color coding
- Load Files: Reads both county and precinct CSV files
- Identify Vote Columns: Automatically detects vote type columns (excluding standard columns like county, precinct, office, etc.)
- Aggregate Precinct Data: Groups precinct data by office, district, candidate, and party, summing all vote totals
- Compare: Matches aggregated precinct totals to county summary totals
- Report: Shows any discrepancies found
County file: {election_prefix}__{county}__county.csv
- One row per office/candidate/party
- Contains aggregated vote totals for the entire county
Precinct file: {election_prefix}__{county}__precinct.csv
- Multiple rows per office/candidate/party (one per precinct)
- Contains vote totals broken down by precinct
Both files should have:
- Standard columns:
county,precinct(precinct file only),office,district,candidate,party - Vote columns: Any combination like
votes,early_voting,election_day,mail,absentee,provisional
# Validate all counties after data processing
import os
from precinct_results import compare_county_to_precinct_totals
election_prefix = '20251104__pa__general'
directory = 'data/2025'
# Find all county files
county_files = [f for f in os.listdir(directory)
if f.startswith(election_prefix) and f.endswith('__county.csv')]
counties = [f.replace(f'{election_prefix}__', '').replace('__county.csv', '')
for f in county_files]
# Validate each county
for county in counties:
print(f"Validating {county}...")
results = compare_county_to_precinct_totals(
election_prefix=election_prefix,
county_name=county,
directory=directory,
verbose=False
)
if results['summary']['total_differences'] == 0:
print(f" ✓ {county}: Match")
else:
print(f" ✗ {county}: {results['summary']['total_differences']} differences")
# Generate detailed report
compare_county_to_precinct_totals(
election_prefix=election_prefix,
county_name=county,
directory=directory,
output_format='web',
output_file=f'{county}_discrepancies.html',
verbose=False
)See example_usage.py for more comprehensive examples.
See LICENSE file for details.
Submit issues and pull requests to this repository.