-
Notifications
You must be signed in to change notification settings - Fork 0
Add MVP function for electrical tariff mapping #137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MVP function for electrical tariff mapping #137
Conversation
utils/electric_tariff_mapper.py
Outdated
| match SB_scenario_name: | ||
| # Default scenario | ||
| case "default_1" | "default_2" | "default_3": | ||
| tariff_key = "default" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tariff_key = f"{electric_utility}{SB_scenario_name}{HP or flat or default, etc.}"
| @@ -0,0 +1,13 @@ | |||
| # Resolve project root as absolute path from this Justfile's location (works for any clone) | |||
| project_root := absolute_path(justfile_directory() / ".." / ".." / ".." / ".." / "..") | |||
| data_base := "s3://data.sb/nrel/resstock/res_2024_amy2018_2/metadata" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a database, call it path_s3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
| # Example: just map-electric-tariff Coned default 1 00 | ||
| map-electric-tariff electric_utility SB_scenario_name SB_scenario_year upgrade_id: | ||
| cd {{project_root}} && uv run python {{project_root}}/utils/electric_tariff_mapper.py \ | ||
| --metadata_path "{{data_base}}/state=NY/upgrade={{upgrade_id}}/metadata-sb.parquet" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass the path_s3 only and use polars filters to select the state (which is already a param) and upgrade (which shouldn't vary?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit to do the following:
- Only the path_s3 is passed
- The python script builds the full
metadata_pathusing the state and upgrade_id provided - Reads in
LazyFrameviascan_parquet - Passes the
LazyFramedirectly tomap_electric_tarifffunction
utils/electric_tariff_mapper.py
Outdated
| # For now, we will manually add the electric utility name column. Later on, the metadata parquet will be updated to include this column. | ||
| # Assign first ~1/3 to Coned, next ~1/3 to National Grid, last ~1/3 to NYSEG. | ||
| metadata_path = S3Path(args.metadata_path) | ||
| metadata_df = pl.read_parquet(io.BytesIO(metadata_path.read_bytes())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've explained several times now, we need to use scan_parquet to get a pl.LazyFrame, not read_parquet to get a pl.DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you don't need to read out the bytes like this. Just pass the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
utils/electric_tariff_mapper.py
Outdated
| temp_path = metadata_path.parent / f"{metadata_path.stem}_temp.parquet" | ||
| buf = io.BytesIO() | ||
| metadata_df.write_parquet(buf) | ||
| temp_path.write_bytes(buf.getvalue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's all this BytesIO and write_bytes stuff about? You shouldn't need it, you can just use write_parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More to the point, you don't need to save this to disk at all! You can refactor map_electric_tariff to accept a LazyFrame rather than a path, which is probably a better design anyway, and then you can just pass metadata_df to the function directly.
It's a better design because:
- if
map_electric_tariffis being used from the CLI, thenelectric_mapper_pycan have the job of accepting the path and callingscan_parquetand passing theLazyFrameto the function. - if
map_electric_tariffis being imported into another script and called directly from python code to serve a more complicated use case, then it offers more flexibility to accept aLazyFrame, rather than a path to a file on disk—who knows how the code calling this function is manufacturing aLazyFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit
utils/electric_tariff_mapper.py
Outdated
|
|
||
| if ( | ||
| args.SB_scenario_name != "default" | ||
| and args.SB_scenario_name != "seasonal" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use args.SB_scenario_name in <list>
You should really have an LLM check your code and make suggestions for improvement before you commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
utils/electric_tariff_mapper.py
Outdated
| ): | ||
| raise ValueError(f"Invalid SB scenario name: {args.SB_scenario_name}") | ||
|
|
||
| sb_scenario = SB_scenario(args.SB_scenario_name, args.SB_scenario_year) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're going to make a class for SB_scenario, you could move the validation into here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
| @@ -0,0 +1,13 @@ | |||
| # Resolve project root as absolute path from this Justfile's location (works for any clone) | |||
| project_root := absolute_path(justfile_directory() / ".." / ".." / ".." / ".." / "..") | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolve it with project_root := git rev-parse --show-toplevel`
- More elegant
- Less brittle if you move the file around
- Standard way to do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit
| # Usage: just map-electric-tariff [electric_utility] [SB_scenario_type] [SB_scenario_year] [upgrade_id] | ||
| # Example: just map-electric-tariff Coned default 1 00 | ||
| map-electric-tariff electric_utility SB_scenario_name SB_scenario_year upgrade_id: | ||
| cd {{project_root}} && uv run python {{project_root}}/utils/electric_tariff_mapper.py \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to cd if project_root is an absolute path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
|
|
||
| # Usage: just map-electric-tariff [electric_utility] [SB_scenario_type] [SB_scenario_year] [upgrade_id] | ||
| # Example: just map-electric-tariff Coned default 1 00 | ||
| map-electric-tariff electric_utility SB_scenario_name SB_scenario_year upgrade_id: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameterized task is useful, because it abstracts away from the verbose CLI invocation.
However, the job of the Justfile is to enumerate all of the different utility-scenario-year combos that we need for the analysis, so that we have version-controlled records of the tariff_maps we use, and an easy way to reproduce them. We don't want to invocations to get lost in the terminal of the user that calls this task.
Here's a way to build on your parameterized task to accomplish this:
Convenience recipes - add more as needed
# These call the main recipe with specific arguments, keeping the abstraction
map-coned-default:
just map-electric-tariff Coned default 1 00
map-coned-seasonal:
just map-electric-tariff Coned seasonal 1 00
# Add more convenience recipes here following the pattern:
# map-<utility>-<scenario>:
# just map-electric-tariff <utility> <scenario> <year> <upgrade_id>
| from pathlib import Path | ||
|
|
||
| import polars as pl | ||
| from cloudpathlib import S3Path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you forget to add this to pyproject.toml and uv.lock? Come on now. That's the last time I should catch this type of oversight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
utils/electric_tariff_mapper.py
Outdated
| metadata_df = pl.read_parquet(io.BytesIO(metadata_path.read_bytes())) | ||
| n = len(metadata_df) | ||
| metadata_df = ( | ||
| metadata_df.with_row_index("_row_idx") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This algorithm doesn't work with LazyFrames, because len(metadata_df) requires materialization—think about it, you need to have the entire dataset in memory because you can count how long it is. If you're streaming a (potentially massive) DataFrame off disk, chunk by chunk, you need a streaming algorithm to accomplish this job:
# Instead of:
n = len(metadata_df) # This materializes if LazyFrame
metadata_df = (
metadata_df.with_row_index("_row_idx")
.with_columns(...)
)
# Use this for LazyFrames:
metadata_df = metadata_df.with_columns(
pl.when(pl.col("bldg_id").hash() % 3 == 0)
.then(pl.lit("Coned"))
.when(pl.col("bldg_id").hash() % 3 == 1)
.then(pl.lit("National Grid"))
.otherwise(pl.lit("NYSEG"))
.alias("sb.electric_utility")
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
utils/electric_tariff_mapper.py
Outdated
| def main(): | ||
| parser = argparse.ArgumentParser(description="Map electrical tariff.") | ||
| parser.add_argument( | ||
| "--metadata_path", required=True, help="Base path for resstock data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Absolute or s3 path to ResStock metadata"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
utils/electric_tariff_mapper.py
Outdated
|
|
||
|
|
||
| def map_electric_tariff( | ||
| SB_metadata_path: S3Path, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work with either an local path or an S3 path. You can accomplish this using the AnyPath class from cloudpathlib:
from cloudpathlib import AnyPath
# Create an AnyPath instance for a local file
local_file = AnyPath("/path/to/local/file.txt")
# Create an AnyPath instance for an S3 object
s3_file = AnyPath("s3://bucket/path/to/object.txt")
# You can then perform operations on either, and the appropriate
# class methods will be used automatically.
if local_file.exists():
print(f"Local file exists: {local_file.read_text()}")
s3_file.write_text("Hello, S3!")There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit to accept pl.LazyFrame directly as discussed above ^
jpvelez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two broad issue with this:
- The I/O could be improved
- This should be refactored to work with
LazyFrames, notDataFrame. Not strictly necessary in this case since the data is small, but it's a standard we should enforce, and a good learning opportunity.
utils/electric_tariff_mapper.py
Outdated
| if not SB_metadata_path.exists(): | ||
| raise FileNotFoundError(f"SB metadata path {SB_metadata_path} does not exist") | ||
|
|
||
| metadata_df = pl.read_parquet(io.BytesIO(SB_metadata_path.read_bytes())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this should always be pl.scan_parquet, and you don't need that BytesIO stuff. Though the point is moot since we'll be moving the scan out of this function anyway (see other comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit to:
- Use
pl.scan_parquetto loadpl.LazyFrameoutside of the function - Pass the
pl.LazyFramescanned directly to this function - No
BytesIOstuff
utils/electric_tariff_mapper.py
Outdated
|
|
||
| ######################################################### | ||
| # Remove the temp file that contains electric utility name column. | ||
| temp_path.unlink() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get rid of this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
utils/electric_tariff_mapper.py
Outdated
| utility_metadata_df = metadata_df.filter( | ||
| pl.col("sb.electric_utility") == electric_utility | ||
| ) | ||
| if utility_metadata_df.is_empty(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script should fail loudly if the utility is not found; you want the user to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LazyFrames don't have is_empty(). This will raise an AttributeError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it work with LazyFrames, you need to materialize first using .collect(). But that means you stop the query plan right there, even though it's useful for the polars engine to know that it only needs to read sb.electric_utility, bldg_id, and postprocess_group.has_hp off of disk. So you can move the select() to before the .collect() and .is_empty(), like this:
# Step 1: Filter by utility (still lazy)
utility_metadata_df = metadata_df.filter(
pl.col("sb.electric_utility") == electric_utility
)
# Step 2: Select only needed columns (still lazy, building query plan)
metadata_has_hp = (
utility_metadata_df
.select(pl.col("bldg_id", "postprocess_group.has_hp"))
.collect() # Step 3: Materialize (executes filter + select)
)
# Step 4: Check if empty (if empty, the filter found no matching utility)
if metadata_has_hp.is_empty():
raise ValueError(...)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit to:
- Filter by
electric_utility - Select columns
bldg_idandpostprocess_group.has_hp - Only collect the first row of the
LazyFrameand materialize it - Checks if this first row exists (which means there's at least one match in the
metadata-sb.parquetfor the requestedelectric_utility
utils/electric_tariff_mapper.py
Outdated
| print(electrical_tariff_mapping_df.head(20)) | ||
|
|
||
| output_path = ( | ||
| RATE_DESIGN_DIR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super brittle. That's why the CLI should accept an output path. For it's part, this function should return a DataFrame, not a CSV; the rest of the script can handle writing it to disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit. The output_path input is optional for the Justfile command. If it's provided, it saves the csv to that path. If not, it saves to rate_design/<state>/hp_rates/data/tariff_map/<electric_utliyt>_<SB_scenario>.csv. The map_electric_tariff function returns the LazyFrame and the outputting is done in the rest of the script.
utils/electric_tariff_mapper.py
Outdated
| SB_scenario: SB_scenario, | ||
| electric_utility: electric_utility, | ||
| ) -> pl.DataFrame: | ||
| has_hp = metadata_has_hp["postprocess_group.has_hp"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an anti-pattern... you're materializing has_hp early. The idiomatic polars way to do this is define_electrical_tariff_key(..., pl.col("postprocess_group.has_hp")) # Pass Expr
pl.col("column") returns a pl.Expr that represents an operation in the query plan and is not executed yet. You're building up a big formula that doesn't get actually executed until you call collect. During execution, polars will compute on the Series, but you write an Expr to keep the overall query lazy and optimizable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit. The metadata_has_hp is passed as a LazyFrame and is not materialized until later.
utils/electric_tariff_mapper.py
Outdated
| def define_electrical_tariff_key( | ||
| SB_scenario: SB_scenario, | ||
| electric_utility: electric_utility, | ||
| has_hp: pl.Series, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to pl.Exp, see below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit
|
|
||
| # Project root (rate-design-platform); independent of cwd or caller | ||
| _PROJECT_ROOT = Path(__file__).resolve().parent.parent | ||
| RATE_DESIGN_DIR = _PROJECT_ROOT / "rate_design" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is brittle; accept output_dir as a CLI arg instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit
utils/types.py
Outdated
| from typing import Literal | ||
|
|
||
|
|
||
| class SB_scenario: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per PEP 8, class names should be CamelCase (PascalCase). SB_scenario should be SBScenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in the latest commit
utils/electric_tariff_mapper.py
Outdated
| return | ||
|
|
||
|
|
||
| def main(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a function you'd ever want to import from a different context... consequently, there's not need for this code to be wrapped in a function. It can literally just go below the if name == main
Summary
This PR adds python script and function that generates electrical tariff mapping for
bldg_id's based onSB_analysis_scenario,electrical_utility,customer_class, etc.Closes #134