This file orients AI agents (e.g. Cursor) so they can work effectively in this repo without reading the entire codebase.
rate-design-platform is Switchbox's simulation platform for electric rate design, starting with heat-pump-friendly rates that eliminate cross-subsidies. Its main job is to run CAIRO simulations and produce outputs on S3 that are then analyzed by Switchbox's reports2 repo, which contains all our reports in Quarto notebook format. The platform centers on running the Bill Alignment Test (BAT) on ResStock building loads and Cambium marginal costs; CAIRO also performs bill calculations.
The main inputs are:
- NREL's ResStock metadata and loads in Parquet format, stored on S3 at `s3://data.sb/nrel/resstock/` and downloaded with Switchbox's buildstock-fetch library.
  - Metadata lives at `s3://data.sb/nrel/resstock/res_2024_amy2018_2/metadata/state=<2-char state abbreviation>/upgrade=<0-padded integer>/*.parquet`
  - Hourly loads live at `s3://data.sb/nrel/resstock/res_2024_amy2018_2/load_curve_hourly/state=<2-char state abbreviation>/upgrade=<0-padded integer>/<bldg_id>_<upgrade_id>.parquet`; there are typically thousands of load files.
- NREL's Cambium dataset for marginal energy, generation capacity, and bulk transmission capacity costs. Parquet on S3 at `s3://data.sb/nrel/cambium/` with Hive-style partitions `{release_year}/scenario={name}/t={year}/gea={region}/r={ba}/data.parquet` (e.g. `2024/scenario=MidCase/t=2025/gea=ISONE/r=p133/data.parquet` for balancing area p133).
- Marginal sub-distribution and distribution costs, and hourly allocation logic, drawn from utility-specific PUC filings such as MCOS studies.
- NREL's CAIRO rate-simulation engine, which implements Simenone et al.'s bill alignment test paper (https://www.sciencedirect.com/science/article/abs/pii/S0957178723000516?via%3Dihub)
- Electric and gas tariffs in URDB JSON format (short guide to this format, official docs), downloaded with Switchbox's tariff-fetch library
- HUD Section 8 Income Limits (area-level AMI and income limits by household size), used for LMI/AMI in rate design. Parquet on S3 at `s3://data.sb/hud/ami/` with Hive-style partition `fy={year}/data.parquet` (e.g. fy=2016 … fy=2025). Schema harmonized across release years. Fetched and converted via `data/hud/ami/` (Justfile: `just prepare`, `just upload`).
- HUD State Median Income (SMI): state-level only, one row per state per year. Parquet on S3 at `s3://data.sb/hud/smi/`, partition `fy={year}/data.parquet` (fy=2017 … fy=2025), 50 states. Schema is a subset of AMI (same column names and types for overlapping columns: fy, state_fips, state_abbr, state_name, median_income, l50_1…l50_8, eli_1…eli_8, l80_1…l80_8). Pipeline in `data/hud/smi/` (Justfile: `just fetch`, `just convert`, `just upload`). Source: HUD API `il/statedata/{statecode}`; requires `HUD_API_KEY`.
- ISO loads: EIA zone loads (`data/eia/hourly_loads/`), EIA-861 utility stats (`data/eia/861/`)
- Census ACS PUMS (person and housing microdata) in Parquet on S3 at `s3://data.sb/census/pums/`. There are two surveys (acs1 1-year and acs5 5-year), each identified by end_year (e.g. 2023). Under each survey/year, data is split into person- and housing-level tables; within each, data is Hive-partitioned by state (51 partitions: 50 states + DC). Path pattern: `s3://data.sb/census/pums/{survey}/{end_year}/{person|housing}/state={XX}/data.parquet` (e.g. `s3://data.sb/census/pums/acs1/2023/housing/state=NY/data.parquet`). Pipeline: `data/census/pums/` Justfile (fetch zips → unzip → convert CSV to parquet → upload).
The main outputs are calibrated tariffs (when CAIRO is run in pre-calc mode); customer-level bills, marginal costs, residual cost allocations, and bill alignments; aggregated bill alignment tariffs grouped by post-processing group; and so on. This data lives on S3 at `s3://data.sb/switchbox/cairo/outputs/hp_rates/<state>/<utility>/<batch>/<cairo_ts>_<run_name>/`, where `<batch>` is a human-readable batch name (e.g. ny_20260305c_r1-8; see context/code/orchestration/run_orchestration.md for the naming convention) and `<cairo_ts>` is CAIRO's per-run timestamp. Each run directory contains the following files:
| Path | Purpose |
|---|---|
| bill_assistance_metrics.csv | Metrics for bill assistance programs (e.g., LMI customer impacts) |
| bills/ | Customer-level bill calculations |
| bills/comb_bills_year_run.csv | Annual combined (electric + gas) bills under the proposed rate structure |
| bills/comb_bills_year_target.csv | Annual combined (electric + gas) bills under the baseline/target rate structure |
| bills/elec_bills_year_run.csv | Annual electric-only bills under the proposed rate structure |
| bills/elec_bills_year_target.csv | Annual electric-only bills under the baseline/target rate structure |
| bills/gas_bills_year_run.csv | Annual gas-only bills under the proposed rate structure |
| bills/gas_bills_year_target.csv | Annual gas-only bills under the baseline/target rate structure |
| cross_subsidization/ | Bill Alignment Test (BAT) results |
| cross_subsidization/cross_subsidization_BAT_values.csv | Customer-level bill alignment metrics showing marginal cost recovery and cross-subsidies |
| customer_metadata.csv | ResStock building metadata (heating type, location, demographics, etc.) for each customer |
| tariff_final_config.json | Final calibrated tariff structure (CAIRO internal shape; one key per tariff). A copy utility writes one `<key>_calibrated.json` per key to config/tariffs/electric. |
After individual CAIRO runs complete, post-processing scripts consolidate results across all utilities in a batch into master tables at s3://data.sb/switchbox/cairo/outputs/hp_rates/<state>/all_utilities/<batch>/run_<delivery>+<supply>/. These are Hive-partitioned Parquet datasets (partitioned by sb.electric_utility) and are the primary data source for analysis notebooks and reports.
Master bills (comb_bills_year_target/) — one row per building per month (Jan–Dec + Annual). Created by utils/post/build_master_bills.py, invoked via just build-master-bills <batch> <run_delivery> <run_supply>. Combines the delivery-only run's comb_bills_year_target.csv (for electric delivery and gas/propane/oil bills) with the delivery+supply run's electric supply bills. Joins ResStock metadata (metadata_sb, utility_assignment) for building attributes.
| Column | Description |
|---|---|
| bldg_id | ResStock building identifier |
| sb.electric_utility, sb.gas_utility | Utility assignments |
| upgrade | ResStock upgrade ID (0 = baseline, 2 = HP) |
| postprocess_group.has_hp, postprocess_group.heating_type | HP status and heating classification |
| heats_with_electricity, heats_with_natgas, heats_with_oil, heats_with_propane | Fuel flags |
| month | "Jan"–"Dec" or "Annual" |
| weight | CAIRO sample weight |
| elec_fixed_charge | Electric fixed charge component |
| elec_delivery_bill | Electric delivery volumetric bill |
| elec_supply_bill | Electric supply bill (from supply run) |
| elec_total_bill | Total electric bill (fixed + delivery + supply) |
| gas_total_bill | Total gas bill |
| propane_total_bill, oil_total_bill | Delivered fuel bills |
| energy_total_bill | Sum of all fuel bills |
Master BAT (cross_subsidization_BAT_values/) — one row per building (annual). Created by utils/post/build_master_bat.py, invoked via just build-master-bat <batch> <run_delivery> <run_supply>. Computes delivery, supply, and total bill alignment by taking the delivery-only run's BAT values as delivery, the delivery+supply run's values as total, and deriving supply = total − delivery.
| Column | Description |
|---|---|
| bldg_id | ResStock building identifier |
| sb.electric_utility, sb.gas_utility | Utility assignments |
| upgrade, postprocess_group.has_hp, postprocess_group.heating_type | Building classification |
| heats_with_electricity, heats_with_natgas, heats_with_oil, heats_with_propane | Fuel flags |
| weight | CAIRO sample weight |
| BAT_vol_delivery, BAT_vol_supply, BAT_vol_total | Volumetric bill alignment (delivery / supply / total) |
| BAT_peak_delivery, BAT_peak_supply, BAT_peak_total | Peak bill alignment |
| BAT_percustomer_delivery, BAT_percustomer_supply, BAT_percustomer_total | Per-customer bill alignment |
| Path | Purpose |
|---|---|
| data/ | Data engineering scripts for ingesting and preparing datasets on S3. Each subdirectory (e.g. data/cambium/) holds scripts to fetch, convert, and optionally upload a dataset; run via that directory's Justfile (e.g. `just prepare` in data/cambium/). When adding or editing data pipelines or scripts, follow the conventions in data/README.md (recipe names, path variables, fetch→upload split, clean recipe, script naming). |
| rate_design/ | Package root. Heat pump rate design lives under rate_design/hp_rates/. |
| rate_design/hp_rates/ | Shared scenario entrypoint (run_scenario.py), shared Justfile (primary task interface for all states). State-specific thin Justfiles and config/ dirs live under rate_design/hp_rates/{ny,ri}/. |
| rate_design/hp_rates/{ny,ri}/ | State-specific thin Justfile (imports shared), state.env, and config/ (tariffs JSON in tariffs/electric and tariffs/gas, tariff_maps CSV in tariff_maps/electric and tariff_maps/gas, marginal_costs). Large artifacts (buildstock raw/processed, cairo_cases) are git-ignored; sync via S3 or keep local. |
| data/eia/hourly_loads/ | EIA zone load fetch and utility load aggregation; eia_region_config (state/utility config, get_aws_storage_options); Justfile for fetch-zone-data and aggregate-utility-loads. |
| data/eia/861/ | EIA-861 utility stats (PUDL yearly sales); fetch_electric_utility_stat_parquets.py; Justfile build-utility-stats (local parquet), update (upload to s3://data.sb/eia/861/electric_utility_stats/), fetch-utility-stats STATE (CSV to stdout). |
| data/fred/cpi/ | FRED CPI series; Justfile fetch-cpi (local parquet/), upload (sync to s3://data.sb/fred/cpi/). |
| data/aspe/fpl/ | ASPE Federal Poverty Guidelines fetch; Justfile fetch. Output: utils/post/data/fpl_guidelines.yaml (used by LMI discount logic). |
| data/resstock/ | ResStock metadata: identify HP customers, heating type, assign_utility_ny (NY). Justfile for fetch, test-download, resstock-identify-hp-customers, assign-utility-ny. Data is put on S3 separately; rate_design Justfiles do not invoke data pipelines. |
| utils/ | Cross-jurisdiction utilities split by run phase: utils/pre/ (tariff creation, scenario YAMLs, marginal-cost allocation, config validation), utils/mid/ (mid-run scripts consuming earlier CAIRO outputs: calibrated tariff promotion, subclass revenue requirements, seasonal discount derivation, output resolution), utils/post/ (post-run: LMI discount application). CAIRO helpers in utils/cairo.py. All runnable as CLI or imported by rate_design. |
| context/ | Reference docs and research notes for agents; see Reference context below and context/README.md for what lives where. |
| tests/ | Pytest tests; mirror utils/ and key rate_design behavior. |
| .devcontainer/ | Dev container and install scripts. CI uses a runner-native workflow (just install, then just check / just test); optional devcontainer for local/DevPod. |
| infra/ | Terraform and scripts for the EC2/dev environment (e.g. dev-setup, dev-teardown). |
We run BAT on ResStock and Cambium; key reference material lives in context/ so agents can use it without loading full PDFs or hunting through the repo. Treat these paths as first-class context (like the S3 input/output paths above).
Conventions:

- `context/sources/papers/` — Academic papers (e.g. Bill Alignment Test). Extracted from PDFs via the pdf-to-markdown command.
- `context/docs/` — Technical documentation (e.g. Cambium, ResStock dataset docs). Extracted from PDFs via the pdf-to-markdown command.
- `context/domain/` — General domain knowledge: policy explainers, program guides, regulatory and institutional background. Documents that answer "how does this work in the real world?" (rate design, LMI programs, bulk transmission, ECOS vs MCOS). Subdirs: `bat_mc_residual/` (fairness, ECOS/MCOS, residual allocation), `charges/` (LMI programs, gas heating rates), `marginal_costs/` (bulk transmission cost recovery).
- `context/methods/` — Methodology writeups: conceptual framing, formulas, literature, design choices that feed our methodology. Documents that answer "how do we justify and operationalize this?" (BAT and residual allocation, TOU design, deriving BAT inputs from MCOS). Subdirs: `bat_mc_residual/`, `tou_and_rates/`, `marginal_costs/`.
- `context/code/` — Implementation notes: how libraries and pipelines work, how to run and wire code. Documents that answer "how do I implement or run this?" (CAIRO behavior, orchestration, data sources, marginal-cost pipelines). Subdirs: `orchestration/`, `cairo/`, `data/`, `marginal_costs/`.
When working on marginal costs, ResStock metadata/loads, BAT/cross-subsidization, LMI logic, state-specific programs, or Census PUMS data or documentation, read the relevant file(s) in context/. In particular, read context/docs/ and context/sources/papers/ when working on Cambium, ResStock dataset semantics, or the Bill Alignment Test — these are core inputs to the platform. PUMS docs in context/docs/ are release-specific (1-year vs 5-year, by year); pick the file that matches your release. By using those docs, you may know more about the datasets than the team does; if you see code or assumptions that conflict with the ResStock or Cambium documentation, proactively flag them so we can correct them.
For the current list of files and when to use each, see context/README.md.
To add or refresh extracted PDF content: use the extract-pdf-to-markdown slash command (.cursor/commands/extract-pdf-to-markdown.md) and place output under context/docs/ or context/sources/papers/ as appropriate.
- Tasks: Use Just as the main interface. The root `Justfile` defines `install`, `check`, `test`, `check-deps`, and dev/DevPod targets. Shared rate design recipes live in `rate_design/hp_rates/Justfile`; state-specific thin wrappers in `rate_design/hp_rates/{ny,ri}/Justfile` (import the shared file, add state-only recipes). Data-specific tasks live in data subdirectories (e.g. `data/eia/hourly_loads/Justfile`, `data/resstock/Justfile`, `data/fred/cpi/Justfile`). Ad hoc scripts should typically be invoked via `just` recipes. Just syntax is tricky, especially for inline shell code. See the syntax here, and prefer external shell scripts to inline shell recipes once they grow from a command invocation into full-on scripts. Invocation patterns (from `rate_design/hp_rates/`): `just s <state> <recipe>` (dispatch recipe that sources `<state>/state.env`), `just -f <state>/Justfile <recipe>` (state wrapper directly), or `source <state>/state.env && just <recipe>` (manual env loading).
- Python: The project uses uv for dependency and env management (see `pyproject.toml`). The resulting virtualenv is created at the root of the project (`.venv/`) but is .gitignored. CAIRO is a private Git dependency; CI and the devcontainer rely on `GH_PAT` for cloning. Always use `uv run python` (never bare `python3` or `python`) — the system Python does not have project dependencies like pyyaml, polars, etc. Examples: `uv run python -m pytest tests/`, `uv run python utils/...`, `uv run python3 -c "import yaml; ..."`. This applies to shell scripts, Justfile recipes, and inline Python snippets alike. Use Python 3.12+.
- Data: Versioned inputs are under `rate_design/.../config/tariffs/electric/` and `.../tariffs/gas/` (JSON) and `.../config/tariff_maps/electric/` and `.../config/tariff_maps/gas/` (CSV). Don't commit large buildstock or CAIRO case outputs; use `.gitignore` and S3/local paths as in existing Justfiles.
- AWS authentication: we rely heavily on reading and writing data on S3. We use short-lived AWS SSO config; when it needs refreshing, run `just aws` in the root.
- Data scientists' laptops, usually Macs with Apple Silicon
- EC2 instances launched by Terraform scripts in `infra/`
- Devcontainers running on a laptop or on an instance using DevPod
- When relevant, be aware of which context you are in.
Match existing style: Ruff for formatting/lint, ty for type checking, dprint for Markdown formatting using .markdownlint.json, and shfmt for shell scripts. Keep new code consistent with current patterns in utils/ and rate_design/.
LaTeX in markdown: GitHub's MathJax renderer does not support escaped underscores inside \text{} (e.g. \text{avg\_mc\_peak} will fail). Use proper math symbols instead: \overline{MC}_{\text{peak}}, MC_h, L_h, etc. Bare subscripts and \text{} with simple words (no underscores) are fine.
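For instance, a hypothetical load-weighted peak marginal cost (using the symbols above) written so GitHub renders it:

```latex
% Renders correctly on GitHub: math symbols with proper subscripts.
\overline{MC}_{\text{peak}} = \frac{\sum_h MC_h \, L_h}{\sum_h L_h}

% Fails on GitHub's MathJax: escaped underscores inside \text{}.
% \text{avg\_mc\_peak}
```

The formula itself is illustrative, not one taken from our methodology docs.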
- Never write commit messages via a temp file (e.g. `/tmp/commit_msg.txt`). Pass the message directly with `-m "..."` or let the user commit manually.
- Never add co-author trailers (`Co-authored-by: ...`) or any other generated-by attribution to commit messages or PR bodies.
- For the `gh pr create` body: use `--body-file -` with a shell heredoc (stdin) to avoid attribution injection — do NOT use `--body "..."` with multi-line strings or `--body-file /tmp/...`. Example: `gh pr create --body-file - <<'PRBODY' ... PRBODY`
- Run `just check` — no linter errors, no type errors, no warnings.
- `just check` runs lock validation (`uv lock --locked`) and prek (ruff, formatting, type checking).
- Pre-commit hooks enforce: ruff-check, ruff-format, ty-check, trailing whitespace, end-of-file newline, YAML/JSON/TOML validation, no large files (>600KB), no merge conflict markers.
- Run `just test` — all tests pass. Add or extend tests in `tests/` for new or changed behavior.
- Path format: `s3://data.sb/<org>/<dataset>/<filename_YYYYMMDD.parquet>`
- Prefer Parquet format
- Filenames: lowercase with underscores, ending with `_YYYYMMDD` (download date)
- Use lazy evaluation (polars `scan_parquet` / arrow `open_dataset`) and filter before collecting
- Path variables: Any Just variable that holds a file or directory path should be named with a `path_` prefix (e.g. `path_project_root`, `path_output_dir`, `path_rateacuity_yaml`). This makes it clear which variables are paths and keeps naming consistent across Justfiles.
- Recipe args and script args: Parameter names in a Just recipe should match the script's CLI argument names they are wired to (e.g. recipe `path_yaml` and `path_output_dir` → script `--yaml` and `--output-dir`). Use the same naming convention (`path_` prefix for paths) in both so the wiring is obvious.
- Path arguments: CLI arguments that are file or directory paths should use a `path_` prefix in the argparse name (e.g. `path_yaml`, `path_output_dir`), or a long option that makes the path explicit (e.g. `--output-dir`). When the script is invoked from a Justfile, use the same names as the Just variables (`path_…`) so recipe and script stay in sync.
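A hypothetical argparse block following this convention (argparse turns `--path-yaml` into `args.path_yaml`, which lines up with the `path_` Just variable):

```python
import argparse

# Hypothetical script CLI: path arguments carry a path_ prefix or an
# explicit long option, mirroring the Just variables that feed them.
parser = argparse.ArgumentParser(description="Example rate-design script CLI.")
parser.add_argument("--path-yaml", required=True, help="Scenario YAML file.")
parser.add_argument("--output-dir", required=True, help="Directory for outputs.")

# Parsing a sample invocation (in a real script: parser.parse_args()).
args = parser.parse_args(["--path-yaml", "scenario.yaml", "--output-dir", "out/"])
```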
- Prefer LazyFrame: Use `scan_parquet` and lazy operations; only materialize (e.g. `.collect()` or `read_parquet`) when the operation cannot be done lazily (e.g. control flow that depends on data, or a library that requires a DataFrame). For more on laziness, when to collect, and how to handle runtime data-quality asserts without scattering collects, see `context/code/data/polars_laziness_and_validation.md`.
- LazyFrame vs DataFrame: Only `LazyFrame` has `.collect()`. A `DataFrame` from `read_parquet`, `.collect()`, or `df.join()` does not — calling `.collect()` on it will raise. Use `.group_by().agg()` on a DataFrame directly; no `.collect()`.
- Joins: With default `coalesce=True`, Polars keeps only the left join key column and drops the right. If you need both key columns in the result, use `coalesce=False` in the join; otherwise select/alias from the left key as needed.
- Prefer a single path for scan_parquet: Pass the hive-partition root (or directory) to `pl.scan_parquet(path, ...)` so Polars reads the dataset as one logical table; do not pre-list files with s3fs just to pass a list of paths unless you have confirmed the row identity or grouping is not in the data (e.g. only in the path).
- Always inspect the data before coding. When writing code that reads from S3 (or any data source), open the actual dataset — e.g. read one parquet and print schema and a few rows (`df.schema`, `df.head()`) — instead of assuming column names, presence of IDs, or file layout. Do not infer schema or row identity from file paths or other code alone.
- Check context/docs first. Before assuming a dataset's structure, look in `context/docs/` for data dictionaries, dataset docs, or release notes (e.g. ResStock, Cambium, EIA, PUMS). Use that as the source of truth; if docs and data disagree, note it.
- Parquet reads: local vs S3. S3 has ~50–100 ms overhead per GET regardless of payload size. ResStock load curves are one file per building (~33k files for NY). Whole state from S3: `scan_parquet` on the directory = ~28 min of overhead; prefer downloading locally first (e.g. `aws s3 sync`) or consolidating into fewer files. Single utility from S3: do NOT use `scan_parquet(dir).filter(bldg_id.is_in(...))` — it probes every file; instead, load `metadata_utility` for bldg_ids, construct paths `{base}/{bldg_id}-{upgrade}.parquet`, and pass the list to `scan_parquet`. On local disk, `scan_parquet` + filter is fine (overhead ~1 s). Full guide: `context/code/data/parquet_reads_local_vs_s3.md`.
- Use `uv add <package>` (updates pyproject.toml + uv.lock); never use `pip install`.
- Commit lock files (uv.lock) when adding dependencies.
When writing or modifying code that uses a library, use the Context7 MCP server to fetch up-to-date documentation for that library. Do not rely on training data for API signatures, function arguments, or usage patterns — always resolve against Context7 first.
When a task involves creating, updating, or referencing issues, use the Linear MCP server to interact with our Linear workspace directly. See the ticket conventions below.
All work is tracked with Linear issues (which sync to GitHub Issues automatically). When asked to create or update a ticket, use the Linear MCP tools. Every new issue MUST satisfy all of the following before it is created:
- Title follows the format: a brief description that starts with a verb (e.g., "Add winter peak analysis").
- `## What` is filled in: a concise, high-level description of what is being built, changed, or decided. Anyone should be able to understand the scope at a glance.
- `## Why` is filled in: context, importance, and value — why this matters, what problem it solves, and what it unblocks.
- `## How` is filled in (skip only when the What is self-explanatory and implementation is trivial) via numbered implementation steps, trade-offs, dependencies.
- `## Deliverables` lists concrete, verifiable outputs that define "done" — basically acceptance criteria. Code: "PR that adds …", "Tests for …", "Data in `s3://...`". Never vague ("Finish the analysis") or unmeasurable ("Make it better").
- Project is set; ask the user if unsure.
- Status is set. Default to Backlog. Options: Backlog, To Do, In Progress, Under Review, Done.
- Milestone is set when one applies (strongly encouraged — milestones are how we track progress toward major goals); ask the user if unclear.
- Assignee is set if the person doing the work is known.
Keep status updated as work progresses — this is critical for team visibility:
- Backlog → To Do: Picked for the current sprint
- To Do → In Progress: Work has started (branch created for code issues)
- In Progress → Under Review: PR ready for review, or findings documented
- Under Review → Done: PR merged (auto-closes), or reviewer approves and closes
This is a scientific-computing Python codebase. We make heavy use of polars and prefer it to pandas unless there's no other choice. (CAIRO is implemented in pandas, though.)
- Do not add intermediates to context: Agent plans, GitHub (or Linear) issue bodies, design drafts, and other working artifacts should not be added under `context/`. Do not commit issue-body or issue-template markdown files to the repo (not in `context/`, not in `.github/`). `context/` is for reference material only (see `context/README.md`).
- Prefer existing entrypoints: Add or use `just` recipes and `utils` CLIs rather than one-off scripts at the repo root.
- Respect data boundaries: Don't assume large data is in git; follow S3/local paths and env (e.g. AWS, `GH_PAT`) documented in Justfiles and CI.
- Data pipeline conventions: When creating or changing scripts or Justfiles under `data/`, read and follow `data/README.md` (recipe names, path variable naming, fetch→upload split, clean recipe, script naming). This keeps all data pipelines consistent.
- Update the context index: When adding or removing files under `context/`, update `context/README.md` so the index stays accurate.
- Type and style: Use type hints and Ruff; run `just check` before considering a change done.
- Always close the GitHub issue: Include `Closes #<number>` in the PR body so the GitHub issue is auto-closed when the PR is merged. Use the GitHub issue number, not the Linear issue identifier (e.g. use `Closes #263`, not `Closes RDP-126`). When work was tracked in Linear, look up the corresponding GitHub issue (e.g. `gh issue list` or the synced issue in the repo) and put that number in `Closes #<number>`.
Do not duplicate the issue in the PR body. Instead, write a concise description that gives the reviewer enough context to review without having to ask you questions:
- High-level overview of what the PR contains (a few sentences).
- Reviewer focus: Anything you want explicit feedback on (trade-offs, alternatives, design choices).
- Non-obvious implementation details and the "why" behind them (so the reviewer understands intent, not just the diff).
Keep it short. Do not add "Made with Cursor", "Generated by …", or any other LLM attribution.
- Install deps: `just install`
- Lint / format / typecheck: `just check`
- Tests: `just test`
- Dependency hygiene: `just check-deps`
- Project root (scripts): `utils.get_project_root()` or `git rev-parse --show-toplevel`