Skip to content

Conversation

@alexhyunminlee
Copy link
Collaborator

Summary

This PR adds python script and function that generates electrical tariff mapping for bldg_id's based on SB_analysis_scenario, electrical_utility, customer_class, etc.

Closes #134

@alexhyunminlee alexhyunminlee linked an issue Feb 6, 2026 that may be closed by this pull request
match SB_scenario_name:
# Default scenario
case "default_1" | "default_2" | "default_3":
tariff_key = "default"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tariff_key = f"{electric_utility}{SB_scenario_name}{HP or flat or default, etc.}"

@@ -0,0 +1,13 @@
# Resolve project root as absolute path from this Justfile's location (works for any clone)
project_root := absolute_path(justfile_directory() / ".." / ".." / ".." / ".." / "..")
data_base := "s3://data.sb/nrel/resstock/res_2024_amy2018_2/metadata"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a database, call it path_s3

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

# Example: just map-electric-tariff Coned default 1 00
map-electric-tariff electric_utility SB_scenario_name SB_scenario_year upgrade_id:
cd {{project_root}} && uv run python {{project_root}}/utils/electric_tariff_mapper.py \
--metadata_path "{{data_base}}/state=NY/upgrade={{upgrade_id}}/metadata-sb.parquet" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass the path_s3 only and use polars filters to select the state (which is already a param) and upgrade (which shouldn't vary?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit to do the following:

  1. Only the path_s3 is passed
  2. The python script builds the full metadata_path using the state and upgrade_id provided
  3. Reads in LazyFrame via scan_parquet
  4. Passes the LazyFrame directly to map_electric_tariff function

# For now, we will manually add the electric utility name column. Later on, the metadata parquet will be updated to include this column.
# Assign first ~1/3 to Coned, next ~1/3 to National Grid, last ~1/3 to NYSEG.
metadata_path = S3Path(args.metadata_path)
metadata_df = pl.read_parquet(io.BytesIO(metadata_path.read_bytes()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I've explained several times now, we need to use scan_parquet to get a pl.LazyFrame, not read_parquet to get a pl.DataFrame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you don't need to read out the bytes like this. Just pass the string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

temp_path = metadata_path.parent / f"{metadata_path.stem}_temp.parquet"
buf = io.BytesIO()
metadata_df.write_parquet(buf)
temp_path.write_bytes(buf.getvalue())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's all this BytesIO and write_bytes stuff about? You shouldn't need it, you can just use write_parquet

Copy link
Contributor

@jpvelez jpvelez Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More to the point, you don't need to save this to disk at all! You can refactor map_electric_tariff to accept a LazyFrame rather than a path, which is probably a better design anyway, and then you can just pass metadata_df to the function directly.

It's a better design because:

  • if map_electric_tariff is being used from the CLI, then electric_mapper_py can have the job of accepting the path and calling scan_parquet and passing the LazyFrame to the function.
  • if map_electric_tariff is being imported into another script and called directly from python code to serve a more complicated use case, then it offers more flexibility to accept a LazyFrame, rather than a path to a file on disk—who knows how the code calling this function is manufacturing a LazyFrame.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit


if (
args.SB_scenario_name != "default"
and args.SB_scenario_name != "seasonal"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use args.SB_scenario_name in <list>

You should really have an LLM check your code and make suggestions for improvement before you commit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

):
raise ValueError(f"Invalid SB scenario name: {args.SB_scenario_name}")

sb_scenario = SB_scenario(args.SB_scenario_name, args.SB_scenario_year)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're going to make a class for SB_scenario, you could move the validation into here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

@@ -0,0 +1,13 @@
# Resolve project root as absolute path from this Justfile's location (works for any clone)
project_root := absolute_path(justfile_directory() / ".." / ".." / ".." / ".." / "..")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolve it with project_root := git rev-parse --show-toplevel`

  • More elegant
  • Less brittle if you move the file around
  • Standard way to do this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit

# Usage: just map-electric-tariff [electric_utility] [SB_scenario_type] [SB_scenario_year] [upgrade_id]
# Example: just map-electric-tariff Coned default 1 00
map-electric-tariff electric_utility SB_scenario_name SB_scenario_year upgrade_id:
cd {{project_root}} && uv run python {{project_root}}/utils/electric_tariff_mapper.py \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to cd if project_root is an absolute path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit


# Usage: just map-electric-tariff [electric_utility] [SB_scenario_type] [SB_scenario_year] [upgrade_id]
# Example: just map-electric-tariff Coned default 1 00
map-electric-tariff electric_utility SB_scenario_name SB_scenario_year upgrade_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameterized task is useful, because it abstracts away from the verbose CLI invocation.

However, the job of the Justfile is to enumerate all of the different utility-scenario-year combos that we need for the analysis, so that we have version-controlled records of the tariff_maps we use, and an easy way to reproduce them. We don't want to invocations to get lost in the terminal of the user that calls this task.

Here's a way to build on your parameterized task to accomplish this:

 Convenience recipes - add more as needed
# These call the main recipe with specific arguments, keeping the abstraction
map-coned-default:
    just map-electric-tariff Coned default 1 00

map-coned-seasonal:
    just map-electric-tariff Coned seasonal 1 00

# Add more convenience recipes here following the pattern:
# map-<utility>-<scenario>:
#     just map-electric-tariff <utility> <scenario> <year> <upgrade_id>

from pathlib import Path

import polars as pl
from cloudpathlib import S3Path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you forget to add this to pyproject.toml and uv.lock? Come on now. That's the last time I should catch this type of oversight.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

metadata_df = pl.read_parquet(io.BytesIO(metadata_path.read_bytes()))
n = len(metadata_df)
metadata_df = (
metadata_df.with_row_index("_row_idx")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This algorithm doesn't work with LazyFrames, because len(metadata_df) requires materialization—think about it, you need to have the entire dataset in memory because you can count how long it is. If you're streaming a (potentially massive) DataFrame off disk, chunk by chunk, you need a streaming algorithm to accomplish this job:

# Instead of:
n = len(metadata_df)  # This materializes if LazyFrame
metadata_df = (
    metadata_df.with_row_index("_row_idx")
    .with_columns(...)
)

# Use this for LazyFrames:
metadata_df = metadata_df.with_columns(
    pl.when(pl.col("bldg_id").hash() % 3 == 0)
    .then(pl.lit("Coned"))
    .when(pl.col("bldg_id").hash() % 3 == 1)
    .then(pl.lit("National Grid"))
    .otherwise(pl.lit("NYSEG"))
    .alias("sb.electric_utility")
)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

def main():
parser = argparse.ArgumentParser(description="Map electrical tariff.")
parser.add_argument(
"--metadata_path", required=True, help="Base path for resstock data"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Absolute or s3 path to ResStock metadata"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit



def map_electric_tariff(
SB_metadata_path: S3Path,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work with either an local path or an S3 path. You can accomplish this using the AnyPath class from cloudpathlib:

from cloudpathlib import AnyPath

# Create an AnyPath instance for a local file
local_file = AnyPath("/path/to/local/file.txt")

# Create an AnyPath instance for an S3 object
s3_file = AnyPath("s3://bucket/path/to/object.txt")

# You can then perform operations on either, and the appropriate
# class methods will be used automatically.
if local_file.exists():
    print(f"Local file exists: {local_file.read_text()}")
s3_file.write_text("Hello, S3!")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit to accept pl.LazyFrame directly as discussed above ^

Copy link
Contributor

@jpvelez jpvelez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two broad issue with this:

  1. The I/O could be improved
  2. This should be refactored to work with LazyFrames, not DataFrame. Not strictly necessary in this case since the data is small, but it's a standard we should enforce, and a good learning opportunity.

if not SB_metadata_path.exists():
raise FileNotFoundError(f"SB metadata path {SB_metadata_path} does not exist")

metadata_df = pl.read_parquet(io.BytesIO(SB_metadata_path.read_bytes()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this should always be pl.scan_parquet, and you don't need that BytesIO stuff. Though the point is moot since we'll be moving the scan out of this function anyway (see other comment).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit to:

  1. Use pl.scan_parquet to load pl.LazyFrame outside of the function
  2. Pass the pl.LazyFrame scanned directly to this function
  3. No BytesIO stuff


#########################################################
# Remove the temp file that contains electric utility name column.
temp_path.unlink()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get rid of this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit

utility_metadata_df = metadata_df.filter(
pl.col("sb.electric_utility") == electric_utility
)
if utility_metadata_df.is_empty():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script should fail loudly if the utility is not found; you want the user to know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LazyFrames don't have is_empty(). This will raise an AttributeError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it work with LazyFrames, you need to materialize first using .collect(). But that means you stop the query plan right there, even though it's useful for the polars engine to know that it only needs to read sb.electric_utility, bldg_id, and postprocess_group.has_hp off of disk. So you can move the select() to before the .collect() and .is_empty(), like this:

# Step 1: Filter by utility (still lazy)
utility_metadata_df = metadata_df.filter(
    pl.col("sb.electric_utility") == electric_utility
)

# Step 2: Select only needed columns (still lazy, building query plan)
metadata_has_hp = (
    utility_metadata_df
    .select(pl.col("bldg_id", "postprocess_group.has_hp"))
    .collect()  # Step 3: Materialize (executes filter + select)
)

# Step 4: Check if empty (if empty, the filter found no matching utility)
if metadata_has_hp.is_empty():
    raise ValueError(...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit to:

  1. Filter by electric_utility
  2. Select columns bldg_id and postprocess_group.has_hp
  3. Only collect the first row of the LazyFrame and materialize it
  4. Checks if this first row exists (which means there's at least one match in the metadata-sb.parquet for the requested electric_utility

print(electrical_tariff_mapping_df.head(20))

output_path = (
RATE_DESIGN_DIR
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super brittle. That's why the CLI should accept an output path. For it's part, this function should return a DataFrame, not a CSV; the rest of the script can handle writing it to disk.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit. The output_path input is optional for the Justfile command. If it's provided, it saves the csv to that path. If not, it saves to rate_design/<state>/hp_rates/data/tariff_map/<electric_utliyt>_<SB_scenario>.csv. The map_electric_tariff function returns the LazyFrame and the outputting is done in the rest of the script.

SB_scenario: SB_scenario,
electric_utility: electric_utility,
) -> pl.DataFrame:
has_hp = metadata_has_hp["postprocess_group.has_hp"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an anti-pattern... you're materializing has_hp early. The idiomatic polars way to do this is define_electrical_tariff_key(..., pl.col("postprocess_group.has_hp")) # Pass Expr

pl.col("column") returns a pl.Expr that represents an operation in the query plan and is not executed yet. You're building up a big formula that doesn't get actually executed until you call collect. During execution, polars will compute on the Series, but you write an Expr to keep the overall query lazy and optimizable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit. The metadata_has_hp is passed as a LazyFrame and is not materialized until later.

def define_electrical_tariff_key(
SB_scenario: SB_scenario,
electric_utility: electric_utility,
has_hp: pl.Series,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to pl.Exp, see below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit


# Project root (rate-design-platform); independent of cwd or caller
_PROJECT_ROOT = Path(__file__).resolve().parent.parent
RATE_DESIGN_DIR = _PROJECT_ROOT / "rate_design"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is brittle; accept output_dir as a CLI arg instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit

utils/types.py Outdated
from typing import Literal


class SB_scenario:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per PEP 8, class names should be CamelCase (PascalCase). SB_scenario should be SBScenario

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit

return


def main():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a function you'd ever want to import from a different context... consequently, there's not need for this code to be wrapped in a function. It can literally just go below the if name == main

@alexhyunminlee alexhyunminlee merged commit 1a6f7df into main Feb 9, 2026
2 of 4 checks passed
@alexhyunminlee alexhyunminlee deleted the 134-mvp-function-to-generate-electrical-tariff-map branch February 9, 2026 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MVP function to generate electrical tariff map

2 participants