Skip to content

Conversation

@cfreedman
Copy link
Contributor

@cfreedman cfreedman commented May 21, 2025

Reconcile the pipeline-integration branch with staging both of which contains the main changes with the switch over the updated pipeline, additional data validation checks, etc.

pipeline-integration contains the changes regarding removing postgres, switching over to caching for geoparquet files, and some other small changes. We should be able to work off of staging moving forward.

There were a number of previous tests relying or earlier postgres patching/old pipeline code still connected to postgres that have been ripped out or left with stubs pending updates with the new unit testing suite to support the new pipeline.

cfreedman and others added 4 commits May 10, 2025 14:04
…nsibilies from the data diffing. Additionally change some previous postgres stat generation to instead look at parquet file sizes
@vercel
Copy link

vercel bot commented May 21, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
vacant-lots-proj ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 29, 2025 5:15pm

Copy link
Contributor

@grahamalama grahamalama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Removing big dependencies like Postgres will really make reasoning about the project easier.

Because we ran the command together to completion IRL at a hack night, I didn't look too closely at the internals of each new or modified function, since there's quite a bit changing in this PR.

I did notice some commented-out code, explicitly passing tests, and tests missing for new code. I think it'd be good to square that stuff away before merging this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we explicitly passing all of these tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were largely dependent on the previous postgres implementation or some heavy mocking of it - I'm keeping them stubbed out for now until we fill them in with the planned unit testing PR coming after the FeatureLayer refactor, which will change how all the services are called.

Comment on lines 102 to 122
# if len(vacant_land_gdf) < 20000:
# print(
# "Vacant land data is below the threshold. Removing vacant land rows and loading backup data from GCS."
# )
# vacant_properties.gdf = vacant_properties.gdf[
# vacant_properties.gdf["parcel_type"] != "Land"
# ]

# # Attempt to load backup data from GCS
# backup_gdf = load_backup_data_from_gcs("vacant_indicators_land_06_2024.geojson")

# if backup_gdf:
# # Add parcel_type column to backup data
# backup_gdf["parcel_type"] = "Land"

# # Append backup data to the existing dataset
# print(f"Appending backup data ({len(backup_gdf)} rows) to the existing data.")
# vacant_properties.gdf = pd.concat(
# [vacant_properties.gdf, backup_gdf], ignore_index=True
# )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this commented-out code for any reason, or can we safely delete it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, need that put back in for correct loading of the vacancy data backups, thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an important class for the new ETL pipeline. Does it deserve any accompanying tests in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with @cfreedman on Slack, and we agreed to postpone tests for this class to unblock other work around this area. Could we create an issue so that we remember to come back and add tests for this?

Copy link
Contributor

@grahamalama grahamalama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I'm approving this PR, but before merging, let's create issues for things to come back to.

with io.BytesIO(tree_response.content) as f:
with zipfile.ZipFile(f, "r") as zip_ref:
zip_ref.extractall("tmp/")
zip_ref.extractall("storage/temp")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may want to make this storage/temp directory configurable with an environment variable. Perhaps something to add an issue for, but doesn't need to block this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be handled by the FileManager, which is a change coming in FeatureLayer refactor PR when I went over all of the different services, but yea we can figure out if we want to configure the storage directory even further.

vacant_properties.gdf = pd.concat(
[vacant_properties.gdf, backup_gdf], ignore_index=True
)
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to add an issue for -- we probably want to add an issue to refactor these except Exception as e lines. I think we generally want to try to be more specific with error catching, as it usually helps debug issues. Not something to address in this PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we can definitely circle back to this. I've made a issue to look into some more informative error handling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using pass, can we use @pytest.mark.skip for these tests? If we intend to come back to them, this will at least create a bit more noise in CI output than simply passing them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with @cfreedman on Slack, and we agreed to postpone tests for this class to unblock other work around this area. Could we create an issue so that we remember to come back and add tests for this?

cfreedman added 2 commits May 29, 2025 10:59
…move data validation from main run until it is ready, and some small fixes
@cfreedman cfreedman merged commit c2fa5f7 into staging May 30, 2025
13 checks passed
@nlebovits nlebovits deleted the pipeline-integration branch June 30, 2025 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants