
Conversation

@mauriciogtec (Collaborator) commented Sep 23, 2025

This pull request introduces a synthetic health data pipeline that generates synthetic health datasets with realistic spatial, temporal, and population effects. The main changes are a configuration file for synthetic data generation, a Snakemake workflow to orchestrate the process, and a preprocessing script that creates the synthetic health data using vectorized operations and saves the results in the expected format.

Synthetic Health Data Pipeline Implementation

  • Added a new configuration file conf/synthetic/config.yaml specifying parameters for synthetic data generation, including Poisson distribution parameters, date ranges, data paths, and debug options.
  • Introduced a Snakemake workflow in snakefile_synthetic_health.smk to automate the preprocessing of synthetic health data for each variable and year, producing daily output files.
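A minimal sketch of the per-variable, per-year rule such a workflow could contain; the rule name, output path, and CLI flags below are illustrative, not the actual contents of snakefile_synthetic_health.smk:

```snakemake
# Illustrative Snakemake rule (assumes the preprocessing script exposes a CLI).
rule preprocess_synth_health:
    input:
        config="conf/synthetic/config.yaml",
    output:
        "synthetic_health/{variable}/{year}.parquet",
    shell:
        "python src/preprocessing_synth_health.py "
        "--variable {wildcards.variable} --year {wildcards.year}"
```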

Preprocessing and Data Generation

  • Implemented src/preprocessing_synth_health.py, a script that loads ZCTA geographic and population data, generates synthetic health counts using vectorized Poisson sampling with spatial and seasonal effects, and writes daily horizon files in parquet format.
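For orientation, a minimal sketch of this kind of vectorized generation; the rate model, parameter values, and column names are illustrative, not the script's actual code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_zcta = 1_000

# Illustrative per-ZCTA attributes: population and a fixed spatial effect.
zcta = pd.DataFrame({
    "zcta": np.arange(n_zcta),
    "population": rng.lognormal(mean=8.5, sigma=1.2, size=n_zcta),
    "spatial_effect": rng.normal(0.0, 0.1, size=n_zcta),
})

# Seasonal modulation for one target date, then a single vectorized Poisson draw.
day_of_year = 180
seasonal_effect = 0.2 * np.sin(2 * np.pi * day_of_year / 365.25)
base_rate = 1e-4  # expected daily events per person (illustrative)
lam = zcta["population"] * base_rate * np.exp(zcta["spatial_effect"] + seasonal_effect)
counts = rng.poisson(lam)  # one draw per ZCTA, no Python-level loop over rows
```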

Configuration and Data Path Updates

  • Updated conf/datapaths/datapaths_cannon.yaml to add new output directories for synthetic health and covariate data.

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds a comprehensive synthetic health data tool for generating realistic health data that mimics the structure of the LEGO dataset. The tool creates synthetic health data with spatial, temporal, and population-based variations using configurable Poisson parameters.

  • Implements vectorized synthetic data generation with geographic and demographic effects
  • Adds configuration support for synthetic health data parameters and file paths
  • Provides caching mechanisms to optimize performance for large datasets

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/preprocessing_synth_health.py | Main synthetic health data preprocessing tool with ZCTA data loading, vectorized data generation, and caching |
| conf/synthetic/config.yaml | Configuration file defining synthetic data parameters, paths, and Poisson distribution settings |
| conf/datapaths/datapaths_cannon.yaml | Updated data paths configuration to include synthetic health output directory |


Comment on lines 76 to 77
```python
'longitude': np.random.uniform(-125, -65, len(df_unique)),  # Approximate US bounds
'latitude': np.random.uniform(25, 50, len(df_unique))
```

Copilot AI Sep 23, 2025


The hardcoded longitude and latitude bounds (-125, -65, 25, 50) are magic numbers. Consider defining these as named constants at the module level, e.g., US_LONGITUDE_MIN = -125, US_LONGITUDE_MAX = -65, etc.
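A minimal sketch of that refactor; the constant names are illustrative, the values come from the snippet above:

```python
import numpy as np
import pandas as pd

US_LONGITUDE_MIN, US_LONGITUDE_MAX = -125, -65  # approximate mainland-US bounds
US_LATITUDE_MIN, US_LATITUDE_MAX = 25, 50

def random_us_coordinates(n: int, rng: np.random.Generator) -> pd.DataFrame:
    """Sample n points uniformly within the approximate US bounding box."""
    return pd.DataFrame({
        "longitude": rng.uniform(US_LONGITUDE_MIN, US_LONGITUDE_MAX, n),
        "latitude": rng.uniform(US_LATITUDE_MIN, US_LATITUDE_MAX, n),
    })
```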



```python
# Pre-calculate all spatial effects (these don't change by date)
zcta_data = zcta_data.copy()
zcta_hashes = [hash(str(zcta)) % 1000 / 100.0 for zcta in zcta_data['zcta']]
```

Copilot AI Sep 23, 2025


The expression % 1000 / 100.0 uses magic numbers. Consider defining these as named constants like HASH_MOD = 1000 and HASH_DIVISOR = 100.0 to make the normalization logic clearer.
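A minimal sketch of the suggested naming (names are illustrative):

```python
HASH_MOD = 1000       # keep only the last three digits of the hash
HASH_DIVISOR = 100.0  # rescale the remainder into [0, 10)

def zcta_hash_effect(zctas) -> list[float]:
    """Per-ZCTA offset in [0, 10), as in the original expression."""
    return [hash(str(z)) % HASH_MOD / HASH_DIVISOR for z in zctas]
```

Note that Python salts str hashes per process unless PYTHONHASHSEED is fixed, so this offset is not reproducible across runs; the later commit's switch to a direct numeric conversion sidesteps that.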


Comment on lines +161 to +162
```python
lat_normalized = (zcta_data['latitude'] - 35) / 15
lon_normalized = (zcta_data['longitude'] + 95) / 30
```

Copilot AI Sep 23, 2025


The geographic normalization uses magic numbers (35, 15, 95, 30). These appear to be US geographic center and scaling factors. Consider defining these as named constants like US_LAT_CENTER = 35, LAT_SCALE_FACTOR = 15, etc.
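A minimal sketch; constant names are illustrative, and subtracting a center of -95 reproduces the original + 95:

```python
import pandas as pd

US_LAT_CENTER, US_LAT_RANGE = 35, 15
US_LON_CENTER, US_LON_RANGE = -95, 30

def normalize_coordinates(zcta_data: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Center and scale latitude/longitude around the approximate US centroid."""
    lat_normalized = (zcta_data["latitude"] - US_LAT_CENTER) / US_LAT_RANGE
    lon_normalized = (zcta_data["longitude"] - US_LON_CENTER) / US_LON_RANGE
    return lat_normalized, lon_normalized
```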


```python
LOGGER.warning("No population data found, using synthetic population")
df_pop = pd.DataFrame({
    'zcta': df_unique['zcta'],
    'population': np.random.lognormal(mean=8.5, sigma=1.2, size=len(df_unique))  # Realistic population dist
```

Copilot AI Sep 23, 2025


The lognormal distribution parameters (mean=8.5, sigma=1.2) are magic numbers. Consider defining these as named constants like SYNTHETIC_POP_MEAN = 8.5 and SYNTHETIC_POP_SIGMA = 1.2 or making them configurable.
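A minimal sketch of the constant-based variant; names are illustrative, and wiring the values through conf/synthetic/config.yaml would work equally well:

```python
import numpy as np
import pandas as pd

SYNTHETIC_POP_MEAN = 8.5   # mean of log-population; exp(8.5) is roughly 4,900 people
SYNTHETIC_POP_SIGMA = 1.2  # spread of log-population

def synthetic_population(zctas: pd.Series, rng: np.random.Generator) -> pd.DataFrame:
    """Fallback population draw when no real population data is available."""
    return pd.DataFrame({
        "zcta": zctas,
        "population": rng.lognormal(SYNTHETIC_POP_MEAN, SYNTHETIC_POP_SIGMA, len(zctas)),
    })
```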


```python
for target_date in date_list:
    # Calculate seasonal effect for this date
    day_of_year = target_date.timetuple().tm_yday
    seasonal_effect = poisson_params['seasonal_amplitude'] * np.sin(2 * np.pi * day_of_year / 365.25)
```

Copilot AI Sep 23, 2025


The magic number 365.25 (days per year including leap years) should be defined as a named constant like DAYS_PER_YEAR = 365.25 for better maintainability.
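A minimal sketch (the helper name is illustrative):

```python
import numpy as np

DAYS_PER_YEAR = 365.25  # average year length, accounting for leap years

def seasonal_effect(day_of_year: int, amplitude: float) -> float:
    """Sinusoidal seasonal term; peaks about a quarter of the way into the year."""
    return amplitude * np.sin(2 * np.pi * day_of_year / DAYS_PER_YEAR)
```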


- Remove unnecessary caching logic (pickle, @lru_cache)
- Simplify ZCTA hash calculation using direct numeric conversion
- Adjust Poisson parameters for realistic sparsity (~79% zeros)
- Maintain mainland US filtering (32,657 ZCTAs)
- Generate proper synthetic health data with horizons [0, 30, 90, 180]
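One plausible reading of the "direct numeric conversion" item (an assumption, since the commit diff is not shown here): because ZCTA codes are five-digit strings, they can be cast to integers, giving an offset that is stable across runs, unlike Python's salted str hash:

```python
import pandas as pd

# Hypothetical example; the series and the modulus mirror the earlier hash expression.
zctas = pd.Series(["02138", "10001", "94105"])
zcta_effect = zctas.astype(int) % 1000 / 100.0  # deterministic across runs
```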
Copilot AI (Contributor) left a comment


Pull Request Overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.



Comment on lines 60 to 61
```python
'longitude': np.random.uniform(-125, -65, len(df_unique)),  # Approximate US bounds
'latitude': np.random.uniform(25, 50, len(df_unique))
```

Copilot AI Sep 23, 2025


The hardcoded geographic bounds (-125, -65, 25, 50) should be defined as named constants at the module level to improve maintainability and make the values more explicit.


```python
LOGGER.warning("No population data found, using synthetic population")
df_pop = pd.DataFrame({
    'zcta': df_unique['zcta'],
    'population': np.random.lognormal(mean=8.5, sigma=1.2, size=len(df_unique))  # Realistic population dist
```

Copilot AI Sep 23, 2025


The lognormal distribution parameters (mean=8.5, sigma=1.2) should be defined as named constants or moved to the configuration file to make them configurable and improve maintainability.


Comment on lines +139 to +140
```python
lat_normalized = (zcta_data['latitude'] - 35) / 15
lon_normalized = (zcta_data['longitude'] + 95) / 30
```

Copilot AI Sep 23, 2025


The normalization constants (35, 15, -95, 30) should be defined as named constants to clarify their purpose as geographic center points and ranges for the US.

Suggested change

```diff
-lat_normalized = (zcta_data['latitude'] - 35) / 15
-lon_normalized = (zcta_data['longitude'] + 95) / 30
+lat_normalized = (zcta_data['latitude'] - US_LAT_CENTER) / US_LAT_RANGE
+lon_normalized = (zcta_data['longitude'] - US_LON_CENTER) / US_LON_RANGE
```


```python
for target_date in date_list:
    # Calculate seasonal effect for this date
    day_of_year = target_date.timetuple().tm_yday
    seasonal_effect = poisson_params['seasonal_amplitude'] * np.sin(2 * np.pi * day_of_year / 365.25)
```

Copilot AI Sep 23, 2025


The value 365.25 (days per year accounting for leap years) should be defined as a named constant to make its purpose explicit.


```python
LOGGER.info(f"Found {len(zcta_data)} ZCTAs for year {year} with complete data")

# get days list for a given year with calendar days
days_list = [(year, month, day) for month in range(1, 13) for day in range(1, calendar.monthrange(year, month)[1] + 1)]
```

Copilot AI Sep 23, 2025


This complex list comprehension for generating all days in a year should be extracted into a separate helper function with a descriptive name like generate_year_days() to improve readability.
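A minimal sketch of that extraction:

```python
import calendar

def generate_year_days(year: int) -> list[tuple[int, int, int]]:
    """All (year, month, day) calendar dates in the given year."""
    return [
        (year, month, day)
        for month in range(1, 13)
        for day in range(1, calendar.monthrange(year, month)[1] + 1)
    ]
```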

