
Commit 690d500

Author: Griffin Sharps
Updated Methodology section to reflect current work
1 parent 2d39ebc commit 690d500

File tree

1 file changed (+64 −18 lines)

docs/index.qmd

Lines changed: 64 additions & 18 deletions
### Load profiles

Our ComEd data consists of kWh load profiles in 30-minute increments for the complete set of households across the entire ComEd service area. To preserve anonymity under rules set by the Illinois Commerce Commission, customer data can only be included in a utility data release if it passes a screening process: a customer's individual data cannot be released if there are 15 or fewer customers in the given geographic area, or if that customer represents more than 15% of the area's load. In our case, the geographic area of interest is the nine-digit ZIP+4 postal code.

::: {.callout-note}
## Ameren data
Ameren data was also pulled for a hypothetical second project with CUB. It is not part of the scope of this round of work; CUB explicitly asked for the ComEd data first and separately.
:::


Furthermore, even when individual customers' usage data is provided, the identification number associated with each customer is not retained month to month. In other words, while the usage data may feature the same customers in September as in August, ComEd assigns each customer a new account ID each month. As a result, we only have consistent household identifiers within a given calendar month.

Since Illinois's grid peaked in July 2023, we choose load profiles for the month of July 2023 for our initial test analysis. We denote a household's **monthly** 30-minute-interval usage series as $L_i$ (for household $i$). For July 2023, each $L_i$ is a 1,488-point time series observed at 30-minute intervals (48 half-hour readings per day over 31 days).

For clustering, however, we work with **household-day observations** derived from these monthly series. Specifically, we partition each $L_i$ into daily 48-point vectors $L_{id}$, where $d\in\{1,\ldots,31\}$ indexes days in July. Each household-day vector $L_{id}$ is then normalized to a daily load-shape vector $S_{id}$ prior to clustering. In our clustering implementation, this normalization is performed row-wise and can be specified as min–max scaling (primary specification), z-score scaling (robustness check), or no additional scaling.

These normalized household-day observations $\{S_{id}\}$ are the inputs to k-means, and all subsequent aggregation and regression in Stage 2 is performed at the household-day level.
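
As a compact sketch of this partition-and-normalize step (a NumPy illustration of the conventions above; the function and variable names are ours for exposition, not from the project code):

```python
import numpy as np

def to_daily_shapes(monthly, intervals=48, method="minmax"):
    """Split a households x (days * intervals) array of monthly series L_i
    into household-day vectors L_id and normalize each row to a shape S_id."""
    daily = monthly.reshape(-1, intervals)  # one row per household-day
    if method == "minmax":
        lo = daily.min(axis=1, keepdims=True)
        rng = daily.max(axis=1, keepdims=True) - lo
        # Guard against flat days (max == min) to avoid division by zero.
        return (daily - lo) / np.where(rng > 0, rng, 1.0)
    if method == "zscore":
        mu = daily.mean(axis=1, keepdims=True)
        sd = daily.std(axis=1, keepdims=True)
        return (daily - mu) / np.where(sd > 0, sd, 1.0)
    return daily  # no additional scaling
```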

::: {.callout-note}
## Scale of the final analysis
I am still unsure about the scale of the final analysis. Will we show a full year of data, a particular month or set of months, or twelve months as a series of sub-analyses?
:::

### Demographic information

The highest-spatial-resolution demographic data available for our analysis comes from the 5-year 2023 US Census Bureau American Community Survey (2023 ACS) and the decennial 2020 Census Demographic and Housing Characteristics File (2020 DHC), both at the block group level.

From this, we derive 47 block-group-level demographic features across five categories (all sourced from the ACS unless otherwise noted):

* Spatial (1 variable): `urban_percent`.
* Economic (7 variables): `median_household_income`, `unemployment_rate`, `pct_in_civilian_labor_force`, `pct_not_in_labor_force`, `pct_income_under_25k`, `pct_income_25k_to_75k`, `pct_income_75k_plus`.
* Housing (24 variables): `pct_owner_occupied`, `pct_renter_occupied`, `pct_heat_utility_gas`, `pct_heat_electric`, `pct_housing_built_2000_plus`, `pct_housing_built_1980_1999`, `old_building_pct`, `pct_structure_single_family_detached`, `pct_structure_single_family_attached`, `pct_structure_multifamily_2_to_4`, `pct_structure_multifamily_5_to_19`, `pct_structure_multifamily_10_plus`, `pct_structure_multifamily_20_plus`, `pct_structure_mobile_home`, `pct_vacant_housing_units`, `pct_home_value_under_150k`, `pct_home_value_150k_to_299k`, `pct_home_value_300k_plus`, `pct_rent_burden_30_plus`, `pct_rent_burden_50_plus`, `pct_owner_cost_burden_30_plus_mortgage`, `pct_owner_cost_burden_50_plus_mortgage`, `pct_owner_overcrowded_2plus_per_room`, `pct_renter_overcrowded_2plus_per_room`.
* Household (3 variables): `avg_household_size`, `avg_family_size`, `pct_single_parent_households`.
* Demographic (12 variables): `median_age`, `pct_white_alone`, `pct_black_alone`, `pct_asian_alone`, `pct_two_or_more_races`, `pct_population_under_5`, `pct_population_5_to_17`, `pct_population_18_to_24`, `pct_population_25_to_44`, `pct_population_45_to_64`, `pct_population_65_plus`, `pct_female`.

::: {.callout-note}
## Variable list
The variable list is still to be cleaned up and made more readable.
:::

The full variable list, including Census ACS table references and calculation methods, is specified in `census_specs.py`.
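
For a sense of the pattern, a spec entry might pair each derived feature with ACS numerator and denominator variables. The following is a hypothetical sketch of that structure, not the actual contents of `census_specs.py`; the ACS codes shown are from tenure table B25003:

```python
# Hypothetical spec entries in the style of census_specs.py: each derived
# feature is a ratio of ACS estimate variables.
CENSUS_SPECS = {
    "pct_owner_occupied": {
        "numerator": ["B25003_002E"],   # owner-occupied housing units
        "denominator": "B25003_001E",   # total occupied housing units
    },
    "pct_renter_occupied": {
        "numerator": ["B25003_003E"],   # renter-occupied housing units
        "denominator": "B25003_001E",
    },
}

def derive_features(row, specs=CENSUS_SPECS):
    """Compute percentage features for one block-group record."""
    out = {}
    for name, spec in specs.items():
        denom = row[spec["denominator"]]
        num = sum(row[v] for v in spec["numerator"])
        out[name] = 100 * num / denom if denom else None
    return out
```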
### Geographical crosswalk

Load profiles are identified by ZIP+4. Our demographic information from the 2023 ACS and 2020 DHC is available at the Census block group level. To join the two, we use a crosswalk from the commercial data firm Melissa, which matches every ZIP+4 postal code in Illinois to the Census Block it was associated with in 2023. From there we aggregate the Census Blocks to their Block Groups, allowing them to be associated with our demographic information. Thus, we are able to characterize Block Groups both demographically and by the usage data of their residents.

When a ZIP+4 maps to multiple block groups and crosswalk weights are unavailable, we enforce a deterministic one-to-one linkage by assigning each ZIP+4 to a single block group (selecting the smallest GEOID) to avoid double-counting household-day observations. This introduces potential geographic misclassification, which we treat as a limitation.
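
A minimal pandas sketch of this deterministic rule, with hypothetical column names (the real crosswalk schema may differ):

```python
import pandas as pd

# Illustrative crosswalk rows: one ZIP+4 can map to multiple block groups.
crosswalk = pd.DataFrame({
    "zip4": ["60601-0001", "60601-0001", "60601-0002"],
    "block_group_geoid": ["170318391001", "170318391002", "170318392001"],
})

# Deterministic 1-to-1 linkage: for each ZIP+4, keep the row with the
# smallest block-group GEOID.
linkage = (
    crosswalk.sort_values("block_group_geoid")
             .groupby("zip4", as_index=False)
             .first()
)
```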
### Clusters

We cluster the load profiles via k-means, using Euclidean distance on the normalized load shapes so that clustering reflects the shape of daily usage rather than its overall level. We aim for a small number of clusters (4–10) to aid interpretation. We selected $k=4$ using a combination of quantitative diagnostics and interpretability considerations. Quantitatively, we compared candidate values of $k$ using (i) the silhouette score, to assess the separation and cohesion of clusters, and (ii) the within-cluster sum of squares (WCSS) "elbow" curve, to identify diminishing returns in fit as $k$ increases. Substantively, we also inspected average normalized load shapes by cluster to ensure that the resulting patterns were distinct and interpretable, and that clusters were not degenerate (e.g., extremely small or poorly separated).
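
These diagnostics are straightforward to compute with scikit-learn; the sketch below is illustrative rather than our exact implementation, and subsamples for the silhouette computation to keep it tractable on large inputs:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def compare_k(shapes, k_values=range(4, 11), seed=0):
    """Fit k-means for each candidate k; report WCSS and silhouette score."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(shapes)
        results[k] = {
            "wcss": km.inertia_,  # within-cluster sum of squares (elbow curve)
            "silhouette": silhouette_score(
                shapes, km.labels_,
                sample_size=min(10_000, len(shapes)),  # subsample for speed
                random_state=seed,
            ),
        }
    return results
```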
TK details on each cluster, and brief description of each shape.

We selected cluster 1 as the baseline (reference category) because it contains the largest share of household-day observations and exhibits the most stable and representative load-shape pattern among the clusters, making it a natural comparison point for the log-ratio regressions.
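
Concretely, the largest-share criterion can be read off the cluster labels directly (the stability and representativeness checks are visual); a small illustrative sketch:

```python
import numpy as np

labels = np.array([0, 0, 1, 0, 2, 1, 0])  # illustrative k-means assignments
baseline_cluster = np.bincount(labels).argmax()  # cluster with largest share
```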
## Demographic predictors of load profile mix

We begin by aggregating load profile clusters at the block group level. If $j$ is the block group and $q$ is the cluster ($q\in\{1,\ldots,k\}$), then

$$
C_{jq} = \sum_{i\in \mathrm{bg}(j)} \mathrm{I}(c_i = q)
$$

is the count of household-day observations assigned to cluster $q$ in block group $j$, where $c_i$ denotes the cluster assignment of household-day observation $i$. Under this aggregation, a single household with multiple sampled days contributes multiple observations to the block group totals.

We further normalize these to $\pi_{jq} = \frac{C_{jq}}{\sum_{q'} C_{jq'}}$, the proportion of cluster assignments in each block group.
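
Both $C_{jq}$ and $\pi_{jq}$ fall out of a single cross-tabulation; a brief pandas sketch with illustrative data:

```python
import pandas as pd

# One row per household-day observation: its block group and cluster label.
assignments = pd.DataFrame({
    "block_group": ["BG1", "BG1", "BG1", "BG2", "BG2"],
    "cluster": [1, 1, 2, 3, 1],
})

# C_jq: count of household-day observations per block group and cluster.
C = pd.crosstab(assignments["block_group"], assignments["cluster"])

# pi_jq: within-block-group proportions of cluster assignments.
pi = C.div(C.sum(axis=1), axis=0)
```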

Our aim is to understand how block-group-level demographics predict the load cluster mix. Because $(\pi_{j1},\ldots,\pi_{jk})$ are compositional proportions that must sum to 1, we model log-ratios of cluster proportions. To prevent numerical instability from zero proportions, we apply Laplace smoothing (a pseudocount of $\alpha = 0.5$) before computing log-ratios. Specifically, we define the smoothed proportions

$$
\tilde{\pi}_{jq} = \frac{C_{jq} + \alpha}{\sum_{q'=1}^{k} C_{jq'} + k\alpha},
\qquad \alpha = 0.5,
$$

which ensures $\tilde{\pi}_{jq} > 0$ for all $q$ and prevents $\log(0) = -\infty$.

We fit one model per cluster outcome (except the baseline), modeling the log-ratio of each cluster's share relative to baseline cluster 1. Specifically, for each $q\in\{2,\ldots,k\}$, we fit:

$$
\log\left(\frac{\tilde{\pi}_{jq}}{\tilde{\pi}_{j1}}\right)
= \beta_{0q}
+ \beta_{1q} X_{1j}
+ \beta_{2q} X_{2j}
+ \cdots
+ \varepsilon_{jq}.
$$

We estimate these as separate weighted least squares (WLS) regressions, weighting each block group $j$ by $\mathrm{total\_obs}_j$, the total number of household-day observations available for that block group. This weighting gives greater influence to block groups with more observed data. As a robustness check, we also estimate unweighted OLS versions of the same log-ratio regressions; large differences between the WLS and OLS estimates would indicate that results are disproportionately driven by high-observation block groups.
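
A compact sketch of the smoothing and a single WLS log-ratio fit using `statsmodels` (the helper names and array layout are ours for exposition):

```python
import numpy as np
import statsmodels.api as sm

def smooth_proportions(counts, alpha=0.5):
    """Laplace-smoothed cluster proportions from (block groups x k) counts C_jq."""
    k = counts.shape[1]
    return (counts + alpha) / (counts.sum(axis=1, keepdims=True) + k * alpha)

def fit_log_ratio_wls(counts, X, q, baseline=0, alpha=0.5):
    """WLS regression of log(pi_jq / pi_j1) on block-group predictors X."""
    pi = smooth_proportions(counts, alpha)
    y = np.log(pi[:, q] / pi[:, baseline])
    weights = counts.sum(axis=1)  # total_obs_j per block group
    return sm.WLS(y, sm.add_constant(X), weights=weights).fit()
```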

Coefficients are interpreted on the log-ratio scale: $\beta_{pq}$ is the expected change in $\log(\tilde{\pi}_{jq}/\tilde{\pi}_{j1})$ for a one-unit increase in predictor $X_{pj}$, holding other predictors constant. Equivalently, $\exp(\beta_{pq})$ is the multiplicative effect on the proportion ratio $\tilde{\pi}_{jq}/\tilde{\pi}_{j1}$ of a one-unit increase in $X_{pj}$.
