
Conversation

@mauriciogtec (Collaborator) commented Sep 23, 2025

This pull request introduces a synthetic health data pipeline that generates synthetic health datasets with realistic spatial, temporal, and population effects. The main changes are a configuration file for synthetic data generation, a Snakemake workflow to orchestrate the process, and a preprocessing script that creates the synthetic health data using vectorized operations and saves the results in the expected format.

Synthetic Health Data Pipeline Implementation

  • Added a new configuration file conf/synthetic/config.yaml specifying parameters for synthetic data generation, including Poisson distribution parameters, date ranges, data paths, and debug options.
  • Introduced a Snakemake workflow in snakefile_synthetic_health.smk to automate the preprocessing of synthetic health data for each variable and year, producing daily output files.
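A minimal sketch of the per-variable, per-year rule such a workflow could contain; the rule name, output path, and CLI flags below are illustrative, not the actual contents of snakefile_synthetic_health.smk:

```snakemake
# Illustrative Snakemake rule (assumes the preprocessing script exposes a CLI).
rule preprocess_synth_health:
    input:
        config="conf/synthetic/config.yaml",
    output:
        "synthetic_health/{variable}/{year}.parquet",
    shell:
        "python src/preprocessing_synth_health.py "
        "--variable {wildcards.variable} --year {wildcards.year}"
```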

Preprocessing and Data Generation

  • Implemented src/preprocessing_synth_health.py, a script that loads ZCTA geographic and population data, generates synthetic health counts using vectorized Poisson sampling with spatial and seasonal effects, and writes daily horizon files in parquet format.
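For orientation, a minimal sketch of this kind of vectorized generation; the rate model, parameter values, and column names are illustrative, not the script's actual code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_zcta = 1_000

# Illustrative per-ZCTA attributes: population and a fixed spatial effect.
zcta = pd.DataFrame({
    "zcta": np.arange(n_zcta),
    "population": rng.lognormal(mean=8.5, sigma=1.2, size=n_zcta),
    "spatial_effect": rng.normal(0.0, 0.1, size=n_zcta),
})

# Seasonal modulation for one target date, then a single vectorized Poisson draw.
day_of_year = 180
seasonal_effect = 0.2 * np.sin(2 * np.pi * day_of_year / 365.25)
base_rate = 1e-4  # expected daily events per person (illustrative)
lam = zcta["population"] * base_rate * np.exp(zcta["spatial_effect"] + seasonal_effect)
counts = rng.poisson(lam)  # one draw per ZCTA, no Python-level loop over rows
```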

Configuration and Data Path Updates

  • Updated conf/datapaths/datapaths_cannon.yaml to add new output directories for synthetic health and covariate data.

Copilot AI (Contributor) left a comment


Pull Request Overview

This PR adds a comprehensive synthetic health data tool for generating realistic health data that mimics the structure of the LEGO dataset. The tool creates synthetic health data with spatial, temporal, and population-based variations using configurable Poisson parameters.

  • Implements vectorized synthetic data generation with geographic and demographic effects
  • Adds configuration support for synthetic health data parameters and file paths
  • Provides caching mechanisms to optimize performance for large datasets

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/preprocessing_synth_health.py | Main synthetic health data preprocessing tool with ZCTA data loading, vectorized data generation, and caching |
| conf/synthetic/config.yaml | Configuration file defining synthetic data parameters, paths, and Poisson distribution settings |
| conf/datapaths/datapaths_cannon.yaml | Updated data paths configuration to include synthetic health output directory |


Comment on lines 76 to 77
```python
'longitude': np.random.uniform(-125, -65, len(df_unique)),  # Approximate US bounds
'latitude': np.random.uniform(25, 50, len(df_unique))
```

Copilot AI Sep 23, 2025


The hardcoded longitude and latitude bounds (-125, -65, 25, 50) are magic numbers. Consider defining these as named constants at the module level, e.g., US_LONGITUDE_MIN = -125, US_LONGITUDE_MAX = -65, etc.
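A minimal sketch of that refactor; the constant names are illustrative, the values come from the snippet above:

```python
import numpy as np
import pandas as pd

US_LONGITUDE_MIN, US_LONGITUDE_MAX = -125, -65  # approximate mainland-US bounds
US_LATITUDE_MIN, US_LATITUDE_MAX = 25, 50

def random_us_coordinates(n: int, rng: np.random.Generator) -> pd.DataFrame:
    """Sample n points uniformly within the approximate US bounding box."""
    return pd.DataFrame({
        "longitude": rng.uniform(US_LONGITUDE_MIN, US_LONGITUDE_MAX, n),
        "latitude": rng.uniform(US_LATITUDE_MIN, US_LATITUDE_MAX, n),
    })
```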



```python
# Pre-calculate all spatial effects (these don't change by date)
zcta_data = zcta_data.copy()
zcta_hashes = [hash(str(zcta)) % 1000 / 100.0 for zcta in zcta_data['zcta']]
```

Copilot AI Sep 23, 2025


The expression % 1000 / 100.0 uses magic numbers. Consider defining these as named constants like HASH_MOD = 1000 and HASH_DIVISOR = 100.0 to make the normalization logic clearer.
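A minimal sketch of the suggested naming (names are illustrative):

```python
HASH_MOD = 1000       # keep only the last three digits of the hash
HASH_DIVISOR = 100.0  # rescale the remainder into [0, 10)

def zcta_hash_effect(zctas) -> list[float]:
    """Per-ZCTA offset in [0, 10), as in the original expression."""
    return [hash(str(z)) % HASH_MOD / HASH_DIVISOR for z in zctas]
```

Note that Python salts str hashes per process unless PYTHONHASHSEED is fixed, so this offset is not reproducible across runs; the later commit's switch to a direct numeric conversion sidesteps that.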


Comment on lines +161 to +162
```python
lat_normalized = (zcta_data['latitude'] - 35) / 15
lon_normalized = (zcta_data['longitude'] + 95) / 30
```

Copilot AI Sep 23, 2025


The geographic normalization uses magic numbers (35, 15, 95, 30). These appear to be US geographic center and scaling factors. Consider defining these as named constants like US_LAT_CENTER = 35, LAT_SCALE_FACTOR = 15, etc.
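A minimal sketch; constant names are illustrative, and subtracting a center of -95 reproduces the original + 95:

```python
import pandas as pd

US_LAT_CENTER, US_LAT_RANGE = 35, 15
US_LON_CENTER, US_LON_RANGE = -95, 30

def normalize_coordinates(zcta_data: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Center and scale latitude/longitude around the approximate US centroid."""
    lat_normalized = (zcta_data["latitude"] - US_LAT_CENTER) / US_LAT_RANGE
    lon_normalized = (zcta_data["longitude"] - US_LON_CENTER) / US_LON_RANGE
    return lat_normalized, lon_normalized
```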


```python
LOGGER.warning("No population data found, using synthetic population")
df_pop = pd.DataFrame({
    'zcta': df_unique['zcta'],
    'population': np.random.lognormal(mean=8.5, sigma=1.2, size=len(df_unique))  # Realistic population dist
```

Copilot AI Sep 23, 2025


The lognormal distribution parameters (mean=8.5, sigma=1.2) are magic numbers. Consider defining these as named constants like SYNTHETIC_POP_MEAN = 8.5 and SYNTHETIC_POP_SIGMA = 1.2 or making them configurable.
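A minimal sketch of the constant-based variant; names are illustrative, and wiring the values through conf/synthetic/config.yaml would work equally well:

```python
import numpy as np
import pandas as pd

SYNTHETIC_POP_MEAN = 8.5   # mean of log-population; exp(8.5) is roughly 4,900 people
SYNTHETIC_POP_SIGMA = 1.2  # spread of log-population

def synthetic_population(zctas: pd.Series, rng: np.random.Generator) -> pd.DataFrame:
    """Fallback population draw when no real population data is available."""
    return pd.DataFrame({
        "zcta": zctas,
        "population": rng.lognormal(SYNTHETIC_POP_MEAN, SYNTHETIC_POP_SIGMA, len(zctas)),
    })
```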


```python
for target_date in date_list:
    # Calculate seasonal effect for this date
    day_of_year = target_date.timetuple().tm_yday
    seasonal_effect = poisson_params['seasonal_amplitude'] * np.sin(2 * np.pi * day_of_year / 365.25)
```

Copilot AI Sep 23, 2025


The magic number 365.25 (days per year including leap years) should be defined as a named constant like DAYS_PER_YEAR = 365.25 for better maintainability.
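A minimal sketch (the helper name is illustrative):

```python
import numpy as np

DAYS_PER_YEAR = 365.25  # average year length, accounting for leap years

def seasonal_effect(day_of_year: int, amplitude: float) -> float:
    """Sinusoidal seasonal term; peaks about a quarter of the way into the year."""
    return amplitude * np.sin(2 * np.pi * day_of_year / DAYS_PER_YEAR)
```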


- Remove unnecessary caching logic (pickle, @lru_cache)
- Simplify ZCTA hash calculation using direct numeric conversion
- Adjust Poisson parameters for realistic sparsity (~79% zeros)
- Maintain mainland US filtering (32,657 ZCTAs)
- Generate proper synthetic health data with horizons [0, 30, 90, 180]
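One plausible reading of the "direct numeric conversion" item (an assumption, since the commit diff is not shown here): because ZCTA codes are five-digit strings, they can be cast to integers, giving an offset that is stable across runs, unlike Python's salted str hash:

```python
import pandas as pd

# Hypothetical example; the series and the modulus mirror the earlier hash expression.
zctas = pd.Series(["02138", "10001", "94105"])
zcta_effect = zctas.astype(int) % 1000 / 100.0  # deterministic across runs
```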
Copilot AI (Contributor) left a comment


Pull Request Overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.



Comment on lines 60 to 61
```python
'longitude': np.random.uniform(-125, -65, len(df_unique)),  # Approximate US bounds
'latitude': np.random.uniform(25, 50, len(df_unique))
```

Copilot AI Sep 23, 2025


The hardcoded geographic bounds (-125, -65, 25, 50) should be defined as named constants at the module level to improve maintainability and make the values more explicit.


```python
LOGGER.warning("No population data found, using synthetic population")
df_pop = pd.DataFrame({
    'zcta': df_unique['zcta'],
    'population': np.random.lognormal(mean=8.5, sigma=1.2, size=len(df_unique))  # Realistic population dist
```

Copilot AI Sep 23, 2025


The lognormal distribution parameters (mean=8.5, sigma=1.2) should be defined as named constants or moved to the configuration file to make them configurable and improve maintainability.


Comment on lines +139 to +140
```python
lat_normalized = (zcta_data['latitude'] - 35) / 15
lon_normalized = (zcta_data['longitude'] + 95) / 30
```

Copilot AI Sep 23, 2025


The normalization constants (35, 15, -95, 30) should be defined as named constants to clarify their purpose as geographic center points and ranges for the US.

Suggested change

```diff
-lat_normalized = (zcta_data['latitude'] - 35) / 15
-lon_normalized = (zcta_data['longitude'] + 95) / 30
+lat_normalized = (zcta_data['latitude'] - US_LAT_CENTER) / US_LAT_RANGE
+lon_normalized = (zcta_data['longitude'] - US_LON_CENTER) / US_LON_RANGE
```


```python
for target_date in date_list:
    # Calculate seasonal effect for this date
    day_of_year = target_date.timetuple().tm_yday
    seasonal_effect = poisson_params['seasonal_amplitude'] * np.sin(2 * np.pi * day_of_year / 365.25)
```

Copilot AI Sep 23, 2025


The value 365.25 (days per year accounting for leap years) should be defined as a named constant to make its purpose explicit.


```python
LOGGER.info(f"Found {len(zcta_data)} ZCTAs for year {year} with complete data")

# get days list for a given year with calendar days
days_list = [(year, month, day) for month in range(1, 13) for day in range(1, calendar.monthrange(year, month)[1] + 1)]
```

Copilot AI Sep 23, 2025


This complex list comprehension for generating all days in a year should be extracted into a separate helper function with a descriptive name like generate_year_days() to improve readability.
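A minimal sketch of that extraction:

```python
import calendar

def generate_year_days(year: int) -> list[tuple[int, int, int]]:
    """All (year, month, day) calendar dates in the given year."""
    return [
        (year, month, day)
        for month in range(1, 13)
        for day in range(1, calendar.monthrange(year, month)[1] + 1)
    ]
```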

