Merging in feature pipeline-integration branch #1212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ng a file manager (#1189)
…nsibilies from the data diffing. Additionally change some previous postgres stat generation to instead look at parquet file sizes
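The switch from Postgres-backed stats to parquet file sizes mentioned in this commit could be sketched roughly as below. The helper name, cache layout, and return shape are assumptions for illustration, not code from the PR:

```python
from pathlib import Path


def parquet_size_stats(cache_dir: str) -> dict:
    """Summarize on-disk sizes of cached parquet files (hypothetical helper).

    Replaces a round-trip to Postgres with a cheap stat() of the local cache.
    """
    sizes = {p.name: p.stat().st_size for p in Path(cache_dir).glob("*.parquet")}
    return {"files": sizes, "total_bytes": sum(sizes.values())}
```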
grahamalama left a comment
Nice work! Removing big dependencies like Postgres will really make reasoning about the project easier.
Because we ran the command together to completion IRL at a hack night, I didn't look too closely at the internals of each new or modified function, since there's quite a bit changing in this PR.
I did notice some commented-out code, explicitly passing tests, and tests missing for new code. I think it'd be good to square that stuff away before merging this PR.
Why are we explicitly passing all of these tests?
These were largely dependent on the previous Postgres implementation or on heavy mocking of it. I'm keeping them stubbed out for now until we fill them in with the planned unit-testing PR coming after the FeatureLayer refactor, which will change how all the services are called.
    # if len(vacant_land_gdf) < 20000:
    #     print(
    #         "Vacant land data is below the threshold. Removing vacant land rows and loading backup data from GCS."
    #     )
    #     vacant_properties.gdf = vacant_properties.gdf[
    #         vacant_properties.gdf["parcel_type"] != "Land"
    #     ]
    #
    #     # Attempt to load backup data from GCS
    #     backup_gdf = load_backup_data_from_gcs("vacant_indicators_land_06_2024.geojson")
    #
    #     if backup_gdf:
    #         # Add parcel_type column to backup data
    #         backup_gdf["parcel_type"] = "Land"
    #
    #         # Append backup data to the existing dataset
    #         print(f"Appending backup data ({len(backup_gdf)} rows) to the existing data.")
    #         vacant_properties.gdf = pd.concat(
    #             [vacant_properties.gdf, backup_gdf], ignore_index=True
    #         )
Do we need this commented-out code for any reason, or can we safely delete it?
Yeah, that needs to be put back in for correct loading of the vacancy data backups, thanks.
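For reference, once uncommented, the restored fallback boils down to something like this sketch. The `pd.concat` call, the `parcel_type` tagging, and the 20,000-row threshold come from the commented-out block above; the function wrapper, constant name, and signature are assumptions for illustration only:

```python
from typing import Optional

import pandas as pd

VACANT_LAND_THRESHOLD = 20_000  # from the commented-out `len(vacant_land_gdf) < 20000` check


def restore_land_backup(
    gdf: pd.DataFrame, backup_gdf: Optional[pd.DataFrame]
) -> pd.DataFrame:
    """Drop below-threshold 'Land' rows and append the GCS backup (illustrative)."""
    land_rows = gdf[gdf["parcel_type"] == "Land"]
    if len(land_rows) >= VACANT_LAND_THRESHOLD or backup_gdf is None:
        return gdf  # enough live data, or no backup available
    gdf = gdf[gdf["parcel_type"] != "Land"]
    backup_gdf = backup_gdf.copy()
    backup_gdf["parcel_type"] = "Land"  # tag backup rows, as the original block does
    return pd.concat([gdf, backup_gdf], ignore_index=True)
```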
This seems like an important class for the new ETL pipeline. Does it deserve any accompanying tests in this PR?
Talked with @cfreedman on Slack, and we agreed to postpone tests for this class to unblock other work around this area. Could we create an issue so that we remember to come back and add tests for this?
grahamalama left a comment
Nice! I'm approving this PR, but before merging, let's create issues for things to come back to.
    with io.BytesIO(tree_response.content) as f:
        with zipfile.ZipFile(f, "r") as zip_ref:
    -       zip_ref.extractall("tmp/")
    +       zip_ref.extractall("storage/temp")
I think we may want to make this storage/temp directory configurable with an environment variable. Perhaps something to add an issue for, but doesn't need to block this PR.
This should be handled by the FileManager, a change coming in the FeatureLayer refactor PR where I went over all of the different services, but yeah, we can figure out whether we want to make the storage directory even more configurable.
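A minimal version of the env-var override suggested above might look like the sketch below. The variable name `CAGP_STORAGE_DIR` is made up for illustration; the project may pick something else, or route this through the FileManager instead:

```python
import os


def get_storage_dir(default: str = "storage/temp") -> str:
    """Resolve the extraction directory from an env var, falling back to `default`.

    CAGP_STORAGE_DIR is a hypothetical variable name, not something this PR defines.
    """
    path = os.environ.get("CAGP_STORAGE_DIR", default)
    os.makedirs(path, exist_ok=True)  # ensure the directory exists before extractall
    return path
```

Callers like `zip_ref.extractall(get_storage_dir())` would then work unchanged in local, CI, and production environments.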
    vacant_properties.gdf = pd.concat(
        [vacant_properties.gdf, backup_gdf], ignore_index=True
    )
    except Exception as e:
Something to add an issue for: we probably want to refactor these `except Exception as e` lines. We generally want to be more specific with error catching, since that usually helps when debugging. Not something to address in this PR, though.
Yeah, we can definitely circle back to this. I've made an issue to look into some more informative error handling.
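As a sketch of what narrower error handling could look like, the broad `except Exception as e` can be split into the failure modes the code actually expects. The exception types and helper below are guesses at the file-loading failure modes, not code from this repository:

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def read_backup_file(path: str) -> Optional[bytes]:
    """Read a backup file, catching only the errors we expect (illustrative)."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        # A missing backup is an expected, recoverable case.
        logger.warning("Backup file %s is missing; skipping fallback.", path)
        return None
    except OSError as e:
        # Narrower than `except Exception`: only I/O and permission problems.
        logger.error("Could not read backup %s: %s", path, e)
        return None
```

Note that `FileNotFoundError` is a subclass of `OSError`, so the more specific handler has to come first; anything truly unexpected (a `TypeError`, say) still propagates and surfaces in logs with a full traceback.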
Instead of using `pass`, can we use `@pytest.mark.skip` for these tests? If we intend to come back to them, this will at least create a bit more noise in CI output than simply passing them.
Talked with @cfreedman on Slack, and we agreed to postpone tests for this class to unblock other work around this area. Could we create an issue so that we remember to come back and add tests for this?
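Following the `@pytest.mark.skip` suggestion above, a stubbed test could be marked like this (the test name and reason text are placeholders, not from the PR):

```python
import pytest


@pytest.mark.skip(
    reason="Depended on the removed Postgres layer; revisit after the FeatureLayer refactor"
)
def test_vacant_properties_service():
    ...
```

Unlike an empty `pass` body, which counts as a passing test, a skipped test shows up as `s` in pytest's progress output and in the summary counts, so it stays visible until it is filled in.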
…move data validation from main run until it is ready, and some small fixes
Reconcile the pipeline-integration branch with staging. staging contains the main changes with the switch over to the updated pipeline, additional data validation checks, etc., while pipeline-integration contains the changes around removing Postgres, switching over to caching for geoparquet files, and some other small changes. We should be able to work off of staging moving forward. There were a number of previous tests relying on earlier Postgres patching or old pipeline code still connected to Postgres; these have been ripped out or left with stubs pending updates with the new unit-testing suite to support the new pipeline.