Library to Ingest Migration
We are migrating from `dcpy.library` to `dcpy.lifecycle.ingest` as our tool of choice for ingesting (extracting, minimally processing/transforming, then archiving) source datasets to edm-recipes, the bucket we use as a long-term data store.
This migration is being done dataset-by-dataset. Our GitHub actions that previously targeted library or ingest now hit `dcpy.lifecycle.scripts.ingest_with_library_fallback`, which runs ingest to archive a dataset if a template for it exists in `dcpy/lifecycle/ingest/templates`, and falls back to library if one doesn't. The CLI target is `python3 -m dcpy.cli lifecycle scripts ingest_or_library_archive { dataset } { ... }`.
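So, for example, archiving a single dataset by hand looks something like this (options elided; the exact ones you pass depend on the dataset):
```
python3 -m dcpy.cli lifecycle scripts ingest_or_library_archive dcp_specialpurpose
```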
For how datasets are being prioritized, see the project managing the migration: https://github.com/NYCPlanning/data-engineering/issues/1255.
The first step for a given dataset is creating its template in `dcpy/lifecycle/ingest/templates`. Then it's time to validate, and iterate as needed.
Columns are tough to define when starting, since you haven't looked at any of the data yet - leave them blank at first. Once the new template has been validated, there's a CLI target - get_columns - that will print out copy-and-paste-able yml to dump into the template. Definitely run ingest again after adding the columns field, to ensure that all fields are valid.
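For reference, a rough, hypothetical sketch of the shape of that yml - get_columns prints the real, correct version for your dataset, so treat the field names below as illustrative only:
```
columns:
# hypothetical column ids and field names - use get_columns output instead
- id: bbl
  data_type: text
- id: wkb_geometry
  data_type: geometry
```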
You'll be running ingest and library via the command line. gdal 3.9.x introduced breaking changes that affect library, so if you have a later version installed locally, you'll either need to downgrade or run in a docker container (such as the dev container). I've moved away from working inside the dev container, so for this, I built the dev container but did not have vs code run within it, and simply prefixed commands with `docker exec de ...` (the container is named "de" now) to run the commands below in the running container.
```
python3 -m dcpy.cli lifecycle scripts validate_ingest run_and_compare dcp_specialpurpose
```
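If you're running via the dev container as described above, the same command is just prefixed:
```
docker exec de python3 -m dcpy.cli lifecycle scripts validate_ingest run_and_compare dcp_specialpurpose
```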
Likely, you'll need to do more than just run this one command. The code to run both tools lives mostly in https://github.com/NYCPlanning/data-engineering/blob/main/dcpy/lifecycle/scripts/ingest_validation.py, and it's the targets from this file that get hit. That file contains logic to
- run the tool(s) - library and ingest - without pushing to s3, just keeping file outputs local
- load outputs to the `sandbox` database in `edm-data` (in a schema based on your env)
- run utilities from `dcpy.data.compare` to compare the two
The first two points are bundled into a few single commands:
```
python3 -m dcpy.cli lifecycle scripts validate_ingest run dcp_specialpurpose
python3 -m dcpy.cli lifecycle scripts validate_ingest run_single library dcp_specialpurpose
python3 -m dcpy.cli lifecycle scripts validate_ingest run_single ingest dcp_specialpurpose -v 20241001
```
The first runs both library and ingest; the second and third are targets to run just one of them. Typically, you'll start with the first (or the run-and-compare one-liner from up above), and then re-run ingest as needed. A version can be supplied as an option to either command.
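For example, to pin a run to a specific version (this assumes run accepts the same -v option shown for run_single above, per the note that version can be supplied to either command):
```
python3 -m dcpy.cli lifecycle scripts validate_ingest run dcp_specialpurpose -v 20241001
```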
The code to compare lives in https://github.com/NYCPlanning/data-engineering/tree/main/dcpy/data/compare.py, though we still access it only through the lifecycle script we've already been using.
```
python3 -m dcpy.cli lifecycle scripts validate_ingest compare dcp_specialpurpose
```
This has a couple of options, all of which you will likely use. With no options, it returns a SqlReport object that compares
- row count
- columns/schema
- data - a simple comparison (essentially `select * from left except select * from right`)
That might return empty dataframes for the data comparison. If so, great! We're done. But more likely, we have to make changes.
The most common error I've gotten is non-comparable column data types - str vs date, str vs int, etc. In this case, the most informative command is the same command with the -c option. This skips the data comparison and just prints the row count and column name/type comparisons, so that you can see what you need to do to "fix" the ingest template. Likely, you'll need to add a clean_column_names step, and maybe coerce_column_types as well.
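As a rough sketch of what that might look like in the template - the argument shape here is a guess, so check the step's actual signature in dcpy before copying:
```
processing_steps:
- name: coerce_column_types
  # hypothetical args - the column names and exact argument name may differ
  args: {"column_types": {"effective_date": "date", "lot_count": "int"}}
```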
It's tough to just compare full rows. But maybe now, looking at the output or querying the tables in dbeaver, you can see a little bit of structure - this table looks like it's one row per bbl, or sdlbl, or what have you. Great - we can now run compare with the -k option to specify a key for the table, so we can compare by key. I.e., for bbl '1000010001', what values have changed? It's a much more informative comparison.
If you have a compound key (say, boro, block, and lot), you can specify multiple keys - `python3 -m dcpy.cli lifecycle scripts validate_ingest compare dcp_zoningtaxlots -k bbl -k boro -k block -k lot` is what was needed to identify unique keys in ztl and then make comparisons. This keyed report first summarizes any keys present in one table but not the other, then compares the other columns for each key.
Hopefully at this point, you have some sort of informative comparison. So the data is different - what do we do?
There's one blanket change that we've decided is acceptable: `character varying` columns are now created as `text`.
Beyond that, you might need to do a variety of things. Particularly with geometry differences, you might need to query the database to see what's actually going on.
By default, library
- always lowercased column names and replaced spaces with underscores
- generally coerced polygons to multipolygons

v0 of ingest had these steps baked in, but they're being removed from ingest's code, meaning they'll likely need to be defined in your template:
```
processing_steps:
- name: clean_column_names
  args: {"replace": {" ": "_"}, "lower": True}
- name: multi
```
These aren't run by default for a couple of reasons:
- if they're not needed for a specific dataset, it's nice to run fewer steps
- not ALL geospatial datasets are actually coerced to multi by library - and when adding new datasets, we might prefer a different scheme for cleaning names (and would certainly prefer not to automatically turn polygons to multi)
However, both are probably needed for many datasets. For multi, a quick search of our codebase shows that several transformations rely on logic along the lines of `WHERE st_geometrytype(geom) = 'MultiPolygon'`. Unless you're certain that a dataset is fine without it, in general convert polygons to multi. It seems more common that we did not coerce to multi in the case of points (dcp_addresspoints, for one).
There's a chance you'll want to preprocess in some way, and the function for it doesn't exist yet! For example, dcp_dcmstreetcenterline is actually of POINTZ geometry - 3D points. However, after running library and ingest and querying the database, I found that all z values were zero. So, both to match library's output (and, as it turned out, without losing any actual data), the geometries needed to be coerced to 2D. There's actually a gpd GeoSeries function to do just that, and we had a preprocessor to call a generic pd Series function, so I extended it to also allow for GeoSeries function calls.
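As a hypothetical sketch of what invoking that looks like in a template - the step name, argument names, and column name below are illustrative, not the actual signature:
```
processing_steps:
# hypothetical step/argument names - check the actual preprocessor in dcpy
- name: pd_series_func
  args: {"column": "geom", "function_name": "force_2d"}
```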
Processing steps should be as generic and parameterizable as possible.
If you need to write a new processing function, you also need to write a test for it. Them's the rules.