Reduce peak memory in demand flex to fix OOM on ConEd/NiMo runs 15-16 (#379)

Merged
alxsmith merged 8 commits into main from
378-reduce-peak-memory-in-demand-flex-to-fix-oom-on-conednimo-runs-15-16
Mar 25, 2026

Conversation

@alxsmith
Contributor

Closes #378

This PR fixes OOM kills on ConEd and NiMo runs 15-16 (and intermittently 13-14) by reducing peak RSS during the demand-flex pipeline and adding correctness/config guardrails.

What's in this PR

Memory optimizations (committed earlier in 1348c4d, included here):

  • Vectorize process_residential_hourly_demand_response_shift: replace per-building groupby loop + pd.concat with groupby.transform + dict lookup, returning numpy arrays instead of DataFrames. This was the primary memory bottleneck for large utilities (~15k buildings).
  • Eliminate tou_df (full TOU cohort copy) and shifted_chunks in apply_runtime_tou_demand_response: extract each season slice just-in-time and write back in-place.
  • Add inplace=True mode so apply_demand_flex can skip a redundant DataFrame copy.
  • Precompute per-TOU-key original system loads as tiny 8760-row Series before copying raw_load_elec, then del raw_load_elec to free the original before shifting begins.
  • del raw_load_elec in run_scenario.py after the flex branch so the caller's reference is also freed before bs.simulate().
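The precompute-then-free pattern in the last two bullets can be sketched as follows. This is an illustration only: the column names (`tou_key`, `timestamp`, `load_kw`) and the exact grouping are invented for the example, not taken from the codebase.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: reduce each TOU cohort to an 8760-row hourly
# system-load Series *before* copying, then drop the original so only
# one full-size DataFrame is alive while the shift runs.
hours = pd.date_range("2025-01-01", periods=8760, freq="h")
raw_load_elec = pd.DataFrame({
    "tou_key": np.repeat(["hp", "nonhp"], 8760),
    "timestamp": np.tile(hours, 2),
    "load_kw": np.random.default_rng(0).random(2 * 8760),
})

# Tiny per-TOU-key originals: 8760 rows each instead of a full copy.
orig_system_loads = {
    key: grp.groupby("timestamp")["load_kw"].sum()
    for key, grp in raw_load_elec.groupby("tou_key")
}

effective_load = raw_load_elec.copy()  # the single working copy
del raw_load_elec                      # free the original before shifting
```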

Validated numerically: CenHud runs 13-16 produce zero diff vs pre-optimization gold baseline across all 8 output artifacts.

Phase 2.5 bypass (1e3b26b): Skip the per-TOU-subclass MC delta computation when run_includes_subclasses=False. Phase 2.5 scans the full effective load DataFrame for each TOU key — unnecessary for single-tariff runs (NiMo, CenHud, etc.) that don't split revenue requirements by subclass.

Config validation (8daef7d): validate_config.py now warns if run_includes_subclasses disagrees with the number of keys in path_tariffs_electric, catching YAML inconsistencies before a run starts.

ConEd TOU schedule fix (5779d9a): Correct the HP seasonal TOU peak window in coned_hp_seasonalTOU_flex.json — peak period started at hour 16 (4pm) but should start at hour 15 (3pm).

IDE memory (eb1ad7a): Set python.analysis.diagnosticMode: openFilesOnly in .vscode/settings.json to prevent Pyright from consuming 4+ GB on shared EC2 instances.

Validation tool (ad02c9c): utils/post/compare_cairo_runs.py — CLI to compare two CAIRO run directories on S3 numerically. Used throughout this branch to confirm outputs are unchanged after each optimization step.

Reviewer focus

  • The vectorized shift in process_residential_hourly_demand_response_shift (dict lookup + groupby.transform) is the highest-impact change — worth a close read to confirm the zero-sum writeback is correct.
  • The Phase 2.5 bypass is guarded by the same run_includes_subclasses flag that run_scenario.py already uses to decide whether to split revenue requirements — so the logic is consistent.

Made with Cursor

Five coordinated changes that together cut peak RSS during demand flex
by ~18 GB for large utilities (ConEd ~15k buildings):

1. Eliminate per-building loop + pd.concat in process_residential_hourly_demand_response_shift:
   replace with groupby.transform for Q_orig and a dict lookup for load_shift,
   avoiding a full merge that doubled memory for tens-of-millions-of-row slices.
   Return (shifted_net, hourly_shift, tracker) numpy arrays instead of DataFrames.

2. Eliminate tou_df (full TOU cohort copy) in apply_runtime_tou_demand_response:
   each season slice is now extracted just-in-time from the output DataFrame,
   and shifts are written back in-place rather than collected for a final concat.

3. Add inplace=True mode to apply_runtime_tou_demand_response:
   callers that already hold a copy can skip the internal copy entirely.

4. In apply_demand_flex, make one copy upfront, precompute the per-TOU-key
   original weighted system loads (tiny 8760-row Series) before copying,
   then del raw_load_elec so the original is released before the shift begins.
   Phase 2.5 uses the precomputed Series instead of the full original DataFrame.

5. In run_scenario.py, del raw_load_elec after the flex branch so the caller's
   reference is also freed before bs.simulate() runs.
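The loop-to-vectorized rewrite in change 1 can be sketched like this. The building IDs, peak window, and uniform shift fraction are invented for illustration; the point is the shape of the rewrite: a dict lookup via `.map` instead of a merge, `groupby.transform` instead of a per-building loop + `pd.concat`, and plain numpy arrays out.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_bldg, n_hr = 100, 24
df = pd.DataFrame({
    "bldg_id": np.repeat(np.arange(n_bldg), n_hr),
    "hour": np.tile(np.arange(n_hr), n_bldg),
    "load_kw": rng.random(n_bldg * n_hr),
})
shift_frac = {b: 0.1 for b in range(n_bldg)}  # dict lookup, no merge

peak = (df["hour"] >= 15) & (df["hour"] < 20)   # illustrative 3pm-8pm peak
frac = df["bldg_id"].map(shift_frac).to_numpy()

# Remove `frac` of each peak-hour load, then spread each building's total
# removal evenly over its off-peak hours (zero-sum per building).
removed = np.where(peak, df["load_kw"].to_numpy() * frac, 0.0)
df["removed"] = removed
per_bldg_total = df.groupby("bldg_id")["removed"].transform("sum").to_numpy()
n_offpeak = (~peak).sum() // n_bldg
hourly_shift = np.where(peak, -removed, per_bldg_total / n_offpeak)
shifted_net = df["load_kw"].to_numpy() + hourly_shift  # numpy arrays out
```

The zero-sum writeback is the property worth checking in review: the shift for every building sums to zero, so total energy is conserved.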

All changes validated numerically: CenHud runs 13-16 produce bit-identical
outputs (zero max abs/rel diff) vs the pre-optimization gold baseline across
all 8 artifacts (BAT, bills, elasticity tracker, metadata, tariff config).

Made-with: Cursor
Phase 2.5 computes per-TOU-subclass MC deltas for revenue requirement
splitting between HP and non-HP customer classes. This work is only
needed when a run has multiple tariff subclasses (e.g. ConEd runs 13-16
with hp/nonhp). Single-tariff runs (e.g. NiMo, CenHud) can skip it
entirely, saving memory and compute on the full effective_load_elec scan.

Pass run_includes_subclasses from ScenarioSettings through to
apply_demand_flex so it can guard the Phase 2.5 block.
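A minimal sketch of the guard, with a hypothetical signature and plain scalars standing in for the real load DataFrames:

```python
def apply_demand_flex(effective_load, orig_system_loads, run_includes_subclasses):
    """Hypothetical sketch: guard Phase 2.5 behind the same flag that
    run_scenario.py uses to decide revenue-requirement splitting."""
    mc_deltas = {}
    if run_includes_subclasses:
        # Phase 2.5: per-TOU-subclass MC deltas, multi-tariff runs only.
        for key, orig in orig_system_loads.items():
            mc_deltas[key] = effective_load[key] - orig
    return mc_deltas
```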

Made-with: Cursor
Add a pre-run cross-check to validate_config.py: for each run in the
scenario YAML, compare the explicit run_includes_subclasses flag against
whether path_tariffs_electric has more than one key (the canonical
source of truth). Print a warning to stderr if they disagree so config
mistakes are caught before CAIRO starts.
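The cross-check amounts to something like the sketch below (function name and config-dict shape are hypothetical; the real check reads the scenario YAML):

```python
import sys

def check_subclass_flag(run_name, run_cfg):
    """Hypothetical sketch: the key count of path_tariffs_electric is
    the canonical source of truth for whether a run has subclasses."""
    flag = run_cfg.get("run_includes_subclasses", False)
    n_tariffs = len(run_cfg.get("path_tariffs_electric", {}))
    if flag != (n_tariffs > 1):
        print(
            f"WARNING [{run_name}]: run_includes_subclasses={flag} but "
            f"path_tariffs_electric has {n_tariffs} key(s)",
            file=sys.stderr,
        )
        return False
    return True
```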

Made-with: Cursor
The TOU schedule had peak period (period 1/3) starting at hour index 16
(4pm); correct start is hour 15 (3pm). Shift the on-peak block back one
hour in both weekday and weekend schedules for all seasons in the flex
and flex_calibrated tariffs.
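As an illustration of the one-hour shift (the actual JSON layout in coned_hp_seasonalTOU_flex.json may differ; this assumes a 24-element array of period codes per day type, with 1 marking on-peak):

```python
# Before: on-peak block starts at hour index 16 (4pm).
before = [0] * 16 + [1] * 4 + [0] * 4
# After: the whole block shifted back one hour to start at 15 (3pm).
after = [0] * 15 + [1] * 4 + [0] * 5
```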

Made-with: Cursor
Set python.analysis.diagnosticMode to openFilesOnly so Pyright does not
index the entire workspace. On a shared EC2 instance this prevents the
language server from consuming 4+ GB of RAM that competes with CAIRO runs.
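The corresponding .vscode/settings.json entry looks like this (other existing settings in the file would sit alongside it):

```json
{
  "python.analysis.diagnosticMode": "openFilesOnly"
}
```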

Made-with: Cursor
New utils/post/compare_cairo_runs.py compares two S3 CAIRO run directories
file-by-file (BAT values, bills, elasticity tracker, metadata, tariff config)
using configurable rtol/atol tolerances. Exits non-zero if any diff exceeds
tolerance so it can be used in CI or as a manual regression check.

Used during this branch to validate that memory optimizations produced
bit-identical outputs against the CenHud gold baseline (zero max diff
across all 8 artifacts for runs 13-16).
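The core numeric check reduces to something like the sketch below (simplified: the real CLI walks two S3 run directories and reads each artifact, while this version takes arrays directly; function names here are illustrative):

```python
import sys
import numpy as np

def compare_arrays(base, chal, rtol=1e-9, atol=0.0):
    """Compare one artifact's values within rtol/atol tolerances."""
    base = np.asarray(base, dtype=float)
    chal = np.asarray(chal, dtype=float)
    if base.shape != chal.shape:
        return False, float("inf")
    max_abs = float(np.max(np.abs(base - chal))) if base.size else 0.0
    return bool(np.allclose(base, chal, rtol=rtol, atol=atol)), max_abs

def compare_runs(artifact_pairs, rtol=1e-9, atol=0.0):
    # Exit non-zero if any artifact diff exceeds tolerance (CI-friendly).
    ok = all(compare_arrays(b, c, rtol, atol)[0] for b, c in artifact_pairs)
    sys.exit(0 if ok else 1)
```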

Made-with: Cursor
@alxsmith alxsmith linked an issue Mar 25, 2026 that may be closed by this pull request

- cairo.py: cast get_level_values result to DatetimeIndex before
  accessing .month; ty stubs don't expose .month on the generic Index
  return type, suppress with type: ignore[attr-defined]
- compare_cairo_runs.py: guard df_chal.height behind an explicit
  is-not-None check; suppress overly-wide Polars .max() return type
  on float() conversion with type: ignore[arg-type]
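The cairo.py fix follows the standard pattern for this stub limitation: `get_level_values` is typed as returning a generic `Index`, which does not expose `.month`, so the result is wrapped in `DatetimeIndex` first (the index below is invented for illustration):

```python
import pandas as pd

# A MultiIndex with a datetime level, as in the hourly load frames.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2025-01-01", periods=3, freq="MS"), ["hp", "nonhp"]],
    names=["timestamp", "tou_key"],
)
# Wrapping in DatetimeIndex makes .month both type-safe and explicit.
months = pd.DatetimeIndex(idx.get_level_values("timestamp")).month
```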

Made-with: Cursor
@alxsmith alxsmith merged commit ab436c5 into main Mar 25, 2026
2 checks passed