Library to Ingest Migration
We are migrating from dcpy.library to dcpy.lifecycle.ingest as our tool of choice for ingesting (extracting, minimally processing/transforming, then archiving) source datasets to edm-recipes, the bucket we use as a long-term data store.
This migration is being done on a dataset-by-dataset basis. Our GitHub Actions that previously targeted library or ingest now hit dcpy.lifecycle.scripts.ingest_with_library_fallback, which runs ingest to archive a dataset if a template for it exists in dcpy/lifecycle/ingest/templates, and falls back to library if one doesn't. The CLI target is python3 -m dcpy.cli lifecycle scripts ingest_or_library_archive { dataset } { ... }.
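The fallback decision described above can be sketched roughly as follows. This is a minimal illustration only: the names `TEMPLATE_DIR` and `choose_tool`, and the `.yml` file naming, are assumptions for the sketch, not the actual dcpy API.

```python
from pathlib import Path

# Hypothetical sketch of ingest_with_library_fallback's decision; the
# directory constant, function name, and .yml suffix are illustrative
# assumptions, not the real dcpy implementation.
TEMPLATE_DIR = Path("dcpy/lifecycle/ingest/templates")

def choose_tool(dataset: str, template_dir: Path = TEMPLATE_DIR) -> str:
    """Archive with ingest if a template exists for the dataset,
    otherwise fall back to library."""
    template = template_dir / f"{dataset}.yml"
    return "ingest" if template.exists() else "library"
```

The point is simply that the presence of a template file is the switch: migrating a dataset means adding (and validating) its template.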
For how datasets are being prioritized in the migration, see the project tracking it: https://github.com/NYCPlanning/data-engineering/issues/1255.
{ template }
Then it's time to validate, and iterate as needed.
Columns are tough to define when starting, since you haven't looked at any of the data yet, so leave them blank at first. Once the new template has been validated, you can run this query (changing the first line) to generate a copy-and-pasteable list of columns to put in the template:
with setup(schema, table_name) as (select 'fvk_ztl_cm', 'dcp_dcmstreetcenterline'),
all_rows as (
select
'- id: ' || column_name as t,
ordinal_position as n
from information_schema."columns" c
inner join setup s
on c.table_name = s.table_name || '_ingest'
and c.table_schema = s.schema
where column_name not in ('ogc_fid', 'data_library_version')
union all
select
' data_type: ' || (
CASE
WHEN data_type = 'USER-DEFINED' THEN udt_name
when data_type = 'bigint' then 'integer'
when data_type = 'smallint' then 'integer'
when data_type = 'double precision' then 'decimal'
when data_type = 'timestamp with time zone' then 'datetime'
when data_type = 'timestamp without time zone' then 'datetime'
ELSE data_type
END
) as t,
ordinal_position + 0.5 as n
from information_schema."columns" c
inner join setup s
on c.table_name = s.table_name || '_ingest'
and c.table_schema = s.schema
where column_name not in ('ogc_fid', 'data_library_version')
)
select t from all_rows order by n

After adding the columns field, run ingest again to make sure all fields are valid.
You'll be running ingest and library via the command line. GDAL 3.9.x introduced changes that break library, so if you have a later version installed locally, you'll either need to downgrade or run in a Docker container (such as the dev container). I've moved away from working inside the dev container, so for this I built the dev container without having VS Code run within it, and simply prefixed commands with docker exec de ... (the container is now named "de") to run the commands below in the running container.
python3 -m dcpy.cli lifecycle scripts validate_ingest run_and_compare dcp_specialpurpose
Likely, you will need to do more than just run this one command.
The code to run both tools lives mostly in https://github.com/NYCPlanning/data-engineering/blob/main/dcpy/lifecycle/scripts/ingest_validation.py, and the CLI targets invoked here are defined there. That file contains logic to
- run the tool(s) - library and ingest - without pushing to s3, just keeping file outputs local
- load to the database
- load to the database sandbox in edm-data (in a schema based on your env)
- run utilities from dcpy.data.compare to compare the two
The first two points are bundled into a few single commands:
python3 -m dcpy.cli lifecycle scripts validate_ingest run dcp_specialpurpose
python3 -m dcpy.cli lifecycle scripts validate_ingest run_single library dcp_specialpurpose
python3 -m dcpy.cli lifecycle scripts validate_ingest run_single ingest dcp_specialpurpose -v 20241001
The first command runs both library and ingest; the run_single variants run just one of the two tools.
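Once both outputs are loaded, the comparison step boils down to diffing two tables. The sketch below shows the kind of report that's useful; it is not dcpy.data.compare's actual API, just an illustration (`compare_tables` is a made-up name, and rows are modeled as lists of dicts).

```python
# Illustrative only: compares two tables (as lists of row dicts) by row
# count and column sets, the basic checks you'd want when diffing the
# library and ingest outputs of the same dataset.
def compare_tables(left: list[dict], right: list[dict]) -> dict:
    left_cols = set(left[0]) if left else set()
    right_cols = set(right[0]) if right else set()
    return {
        "row_count_delta": len(right) - len(left),
        "columns_only_in_left": sorted(left_cols - right_cols),
        "columns_only_in_right": sorted(right_cols - left_cols),
    }
```

In practice you'd also want value-level diffs (the real utilities compare data, not just shape), but row counts and column sets catch the most common migration regressions first.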
The comparison utilities live in https://github.com/NYCPlanning/data-engineering/tree/main/dcpy/data.