Merging in feature pipeline-integration branch #1212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ng a file manager (#1189)
…nsibilies from the data diffing. Additionally change some previous postgres stat generation to instead look at parquet file sizes
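The switch from Postgres-backed stats to parquet file sizes mentioned in this commit could be sketched roughly as below. The helper name, cache layout, and return shape are assumptions for illustration, not code from the PR:

```python
from pathlib import Path


def parquet_size_stats(cache_dir: str) -> dict:
    """Summarize on-disk sizes of cached parquet files (hypothetical helper).

    Replaces a round-trip to Postgres with a cheap stat() of the local cache.
    """
    sizes = {p.name: p.stat().st_size for p in Path(cache_dir).glob("*.parquet")}
    return {"files": sizes, "total_bytes": sum(sizes.values())}
```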
grahamalama left a comment
Nice work! Removing big dependencies like Postgres will really make reasoning about the project easier.
Because we ran the command together to completion IRL at a hack night, I didn't look too closely at the internals of each new or modified function, since there's quite a bit changing in this PR.
I did notice some commented-out code, explicitly passing tests, and tests missing for new code. I think it'd be good to square that stuff away before merging this PR.
Why are we explicitly passing all of these tests?
These were largely dependent on the previous Postgres implementation or on heavy mocking of it. I'm keeping them stubbed out for now until we fill them in with the planned unit-testing PR coming after the FeatureLayer refactor, which will change how all the services are called.
    # if len(vacant_land_gdf) < 20000:
    #     print(
    #         "Vacant land data is below the threshold. Removing vacant land rows and loading backup data from GCS."
    #     )
    #     vacant_properties.gdf = vacant_properties.gdf[
    #         vacant_properties.gdf["parcel_type"] != "Land"
    #     ]
    #
    #     # Attempt to load backup data from GCS
    #     backup_gdf = load_backup_data_from_gcs("vacant_indicators_land_06_2024.geojson")
    #
    #     if backup_gdf:
    #         # Add parcel_type column to backup data
    #         backup_gdf["parcel_type"] = "Land"
    #
    #         # Append backup data to the existing dataset
    #         print(f"Appending backup data ({len(backup_gdf)} rows) to the existing data.")
    #         vacant_properties.gdf = pd.concat(
    #             [vacant_properties.gdf, backup_gdf], ignore_index=True
    #         )
Do we need this commented-out code for any reason, or can we safely delete it?
Yeah, that needs to be put back in for correct loading of the vacancy data backups, thanks.
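For reference, once uncommented, the restored fallback boils down to something like this sketch. The `pd.concat` call, the `parcel_type` tagging, and the 20,000-row threshold come from the commented-out block above; the function wrapper, constant name, and signature are assumptions for illustration only:

```python
from typing import Optional

import pandas as pd

VACANT_LAND_THRESHOLD = 20_000  # from the commented-out `len(vacant_land_gdf) < 20000` check


def restore_land_backup(
    gdf: pd.DataFrame, backup_gdf: Optional[pd.DataFrame]
) -> pd.DataFrame:
    """Drop below-threshold 'Land' rows and append the GCS backup (illustrative)."""
    land_rows = gdf[gdf["parcel_type"] == "Land"]
    if len(land_rows) >= VACANT_LAND_THRESHOLD or backup_gdf is None:
        return gdf  # enough live data, or no backup available
    gdf = gdf[gdf["parcel_type"] != "Land"]
    backup_gdf = backup_gdf.copy()
    backup_gdf["parcel_type"] = "Land"  # tag backup rows, as the original block does
    return pd.concat([gdf, backup_gdf], ignore_index=True)
```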
This seems like an important class for the new ETL pipeline. Does it deserve any accompanying tests in this PR?
Talked with @cfreedman on Slack, and we agreed to postpone tests for this class to unblock other work around this area. Could we create an issue so that we remember to come back and add tests for this?
grahamalama left a comment
Nice! I'm approving this PR, but before merging, let's create issues for things to come back to.
    with io.BytesIO(tree_response.content) as f:
        with zipfile.ZipFile(f, "r") as zip_ref:
    -       zip_ref.extractall("tmp/")
    +       zip_ref.extractall("storage/temp")
I think we may want to make this storage/temp directory configurable with an environment variable. Perhaps something to add an issue for, but doesn't need to block this PR.
This should be handled by the FileManager, a change coming in the FeatureLayer refactor PR where I went over all of the different services, but yeah, we can figure out whether we want to make the storage directory even more configurable.
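A minimal version of the env-var override suggested above might look like the sketch below. The variable name `CAGP_STORAGE_DIR` is made up for illustration; the project may pick something else, or route this through the FileManager instead:

```python
import os


def get_storage_dir(default: str = "storage/temp") -> str:
    """Resolve the extraction directory from an env var, falling back to `default`.

    CAGP_STORAGE_DIR is a hypothetical variable name, not something this PR defines.
    """
    path = os.environ.get("CAGP_STORAGE_DIR", default)
    os.makedirs(path, exist_ok=True)  # ensure the directory exists before extractall
    return path
```

Callers like `zip_ref.extractall(get_storage_dir())` would then work unchanged in local, CI, and production environments.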
    vacant_properties.gdf = pd.concat(
        [vacant_properties.gdf, backup_gdf], ignore_index=True
    )
    except Exception as e:
Something to add an issue for: we probably want to refactor these `except Exception as e` lines. We generally want to be more specific with error catching, since that usually helps when debugging. Not something to address in this PR, though.
Yeah, we can definitely circle back to this. I've made an issue to look into some more informative error handling.
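As a sketch of what narrower error handling could look like, the broad `except Exception as e` can be split into the failure modes the code actually expects. The exception types and helper below are guesses at the file-loading failure modes, not code from this repository:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def read_backup_file(path: str) -> Optional[bytes]:
    """Read a backup file, catching only the errors we expect (illustrative)."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        # A missing backup is an expected, recoverable case.
        logger.warning("Backup file %s is missing; skipping fallback.", path)
        return None
    except OSError as e:
        # Narrower than `except Exception`: only I/O and permission problems.
        logger.error("Could not read backup %s: %s", path, e)
        return None
```

Note that `FileNotFoundError` is a subclass of `OSError`, so the more specific handler has to come first; anything truly unexpected (a `TypeError`, say) still propagates and surfaces in logs with a full traceback.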
Instead of using `pass`, can we use `@pytest.mark.skip` for these tests? If we intend to come back to them, this will at least create a bit more noise in CI output than simply passing them.
Talked with @cfreedman on Slack, and we agreed to postpone tests for this class to unblock other work around this area. Could we create an issue so that we remember to come back and add tests for this?
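Following the `@pytest.mark.skip` suggestion above, a stubbed test could be marked like this (the test name and reason text are placeholders, not from the PR):

```python
import pytest


@pytest.mark.skip(
    reason="Depended on the removed Postgres layer; revisit after the FeatureLayer refactor"
)
def test_vacant_properties_service():
    ...
```

Unlike an empty `pass` body, which counts as a passing test, a skipped test shows up as `s` in pytest's progress output and in the summary counts, so it stays visible until it is filled in.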
…move data validation from main run until it is ready, and some small fixes
Reconcile the pipeline-integration branch with staging. staging contains the main changes with the switch over to the updated pipeline, additional data validation checks, etc., while pipeline-integration contains the changes around removing Postgres, switching over to caching for geoparquet files, and some other small changes. We should be able to work off of staging moving forward. There were a number of previous tests relying on earlier Postgres patching or old pipeline code still connected to Postgres; these have been ripped out or left with stubs pending updates with the new unit-testing suite to support the new pipeline.