
Library to Ingest Migration

fvankrieken edited this page Nov 19, 2024 · 25 revisions

We are migrating from dcpy.library to dcpy.lifecycle.ingest as our tool of choice for ingesting source datasets (extracting, minimally processing/transforming, then archiving) to edm-recipes, the bucket we use as a long-term data store.

This migration is being done on a dataset-by-dataset basis. Our GitHub Actions that previously targeted library or ingest now hit dcpy.lifecycle.scripts.ingest_with_library_fallback, which runs ingest to archive a dataset if a template for it exists in dcpy/lifecycle/ingest/templates, and falls back to library if it doesn't. The CLI target is python3 -m dcpy.cli lifecycle scripts ingest_or_library_archive { dataset } { ... }.
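The fallback rule described above can be sketched roughly as follows. This is an illustration, not the actual implementation in dcpy: the function name choose_tool is hypothetical, and the sketch assumes one template file per dataset named "<dataset>.yml", which may not match the real layout of the templates folder.

```python
from pathlib import Path

# Templates folder named in the text above.
TEMPLATE_DIR = Path("dcpy/lifecycle/ingest/templates")


def choose_tool(dataset: str, template_dir: Path = TEMPLATE_DIR) -> str:
    """Return "ingest" if a template exists for the dataset, else "library".

    Assumption: templates are stored as "<dataset>.yml". Adjust the
    filename pattern if the real folder uses a different convention.
    """
    template = template_dir / f"{dataset}.yml"
    return "ingest" if template.is_file() else "library"
```

With this rule, migrating a dataset amounts to adding its template: once the file exists, the same entry point picks ingest instead of library.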

For how datasets are being prioritized, see the project managing the migration: https://github.com/NYCPlanning/data-engineering/issues/1255.

Migrating a dataset

Writing the new template

{ template }

Then it's time to validate, and iterate as needed.

columns field

Columns are tough to define when starting, since you haven't looked at any of the data yet, so leave them blank at first. Once the new template has been validated, you can run this query, changing the first line to your schema and table name, to generate a copy-and-pasteable list of columns to put in the template.

with setup(schema, table_name) as (select 'fvk_ztl_cm', 'dcp_dcmstreetcenterline'),
all_rows as (
select 
	'- id: ' || column_name as t,
	ordinal_position as n
from information_schema."columns" c
inner join setup s 
	on c.table_name = s.table_name || '_ingest' 
	and c.table_schema = s.schema
where column_name not in ('ogc_fid', 'data_library_version')
union all
select 
	'  data_type: ' || (
        CASE 
            WHEN data_type = 'USER-DEFINED' THEN udt_name
            when data_type = 'bigint' then 'integer'
            when data_type = 'smallint' then 'integer'
            when data_type = 'double precision' then 'decimal'
            when data_type = 'timestamp with time zone' then 'datetime'
            when data_type = 'timestamp without time zone' then 'datetime'
            ELSE data_type
        END
	) as t, 
	ordinal_position + 0.5 as n
from information_schema."columns" c
inner join setup s 
	on c.table_name = s.table_name || '_ingest' 
	and c.table_schema = s.schema
where column_name not in ('ogc_fid', 'data_library_version')
)
select t from all_rows order by n
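For readers who find the interleaving trick in the query hard to follow, here is a rough Python equivalent of what it produces. It is only a sketch: the function name columns_yaml and the sample rows are made up for illustration, and the type names are the ones PostgreSQL reports in information_schema.columns (for example double precision, with a space).

```python
# Same mapping as the CASE expression in the query; anything not listed
# passes through unchanged.
TYPE_MAP = {
    "bigint": "integer",
    "smallint": "integer",
    "double precision": "decimal",
    "timestamp with time zone": "datetime",
    "timestamp without time zone": "datetime",
}

# Columns the query filters out.
SKIPPED = {"ogc_fid", "data_library_version"}


def columns_yaml(columns):
    """Build the template's column list.

    columns: iterable of (column_name, data_type, udt_name) tuples,
    already sorted by ordinal_position.
    """
    lines = []
    for name, data_type, udt_name in columns:
        if name in SKIPPED:
            continue
        if data_type == "USER-DEFINED":
            mapped = udt_name  # e.g. geometry
        else:
            mapped = TYPE_MAP.get(data_type, data_type)
        lines.append(f"- id: {name}")
        lines.append(f"  data_type: {mapped}")
    return "\n".join(lines)


# Hypothetical rows, for illustration only.
rows = [
    ("ogc_fid", "integer", "int4"),
    ("wkb_geometry", "USER-DEFINED", "geometry"),
    ("shape_leng", "double precision", "float8"),
]
print(columns_yaml(rows))
```

Each column yields two lines, an id line followed by an indented data_type line, which is what the query achieves by sorting on ordinal_position and ordinal_position + 0.5.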

Run ingest again after adding the columns field to confirm that every column in it is valid.

Requirements

You'll be running ingest and library from the command line. GDAL 3.9.x introduced changes that break library, so if you have 3.9 or later installed locally, you'll need to either downgrade or run in a docker container (such as the dev container). I have moved away from working inside the dev container, so for this I built the dev container without having VS Code run inside it, and simply prefix the commands below with docker exec de ... (the container is now named "de") to run them in the running container.
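If you are unsure whether your local GDAL is affected, a small check like the one below may help. It is a sketch under two assumptions: that any version from 3.9 onward is affected, and that the version string has the form printed by gdalinfo --version (for example "GDAL 3.8.4, released 2024/02/08"). The function names are hypothetical.

```python
import re


def gdal_version(version_output: str) -> tuple[int, int]:
    """Extract (major, minor) from text such as "GDAL 3.8.4, released ..."."""
    match = re.search(r"GDAL (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized GDAL version string: {version_output!r}")
    return int(match.group(1)), int(match.group(2))


def needs_container(version_output: str) -> bool:
    """True if library should be run in the container instead of locally.

    Assumption: every GDAL version >= 3.9 is affected.
    """
    return gdal_version(version_output) >= (3, 9)
```

The input can come from subprocess.run(["gdalinfo", "--version"], capture_output=True, text=True).stdout on a machine where GDAL's command line tools are installed.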

One-liner to run and compare

python3 -m dcpy.cli lifecycle scripts validate_ingest run_and_compare dcp_specialpurpose

You will likely need to do more than run this one command.

Running the tools

The code to run both tools lives mostly in https://github.com/NYCPlanning/data-engineering/blob/main/dcpy/lifecycle/scripts/ingest_validation.py, and the commands below hit targets defined there. That file contains logic to

  1. run the tool(s) - library and ingest - without pushing to s3, just keeping file outputs local
  2. load to the database sandbox in edm-data (in a schema based on your env)
  3. run utilities from dcpy.data.compare to compare the two

The first two steps are bundled into single commands:

  • python3 -m dcpy.cli lifecycle scripts validate_ingest run dcp_specialpurpose
  • python3 -m dcpy.cli lifecycle scripts validate_ingest run_single library dcp_specialpurpose
  • python3 -m dcpy.cli lifecycle scripts validate_ingest run_single ingest dcp_specialpurpose -v 20241001

The first runs both library and ingest. The second and third run only library or only ingest, respectively.
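When iterating on a template you may end up running these commands many times, so a small wrapper that assembles them can be handy. The sketch below only builds the argument list; build_command and its parameters are hypothetical names, the -v value is passed through as shown in the example above (it is assumed to select a version), and the optional container argument adds the docker exec prefix described under Requirements.

```python
BASE = ["python3", "-m", "dcpy.cli", "lifecycle", "scripts", "validate_ingest"]


def build_command(dataset, tool=None, version=None, container=None):
    """Assemble a validate_ingest command as a list of arguments.

    tool: None runs both tools ("run"); "library" or "ingest" runs one
          of them ("run_single").
    version: value passed to -v. Only the run_single form is shown with
             -v in this page, so it is rejected for "run".
    container: if given, prefix the command with "docker exec <container>".
    """
    if tool is None:
        if version is not None:
            raise ValueError("-v is only shown with run_single in this page")
        cmd = BASE + ["run", dataset]
    else:
        cmd = BASE + ["run_single", tool, dataset]
        if version is not None:
            cmd += ["-v", version]
    if container is not None:
        cmd = ["docker", "exec", container] + cmd
    return cmd


print(" ".join(build_command("dcp_specialpurpose", tool="ingest", version="20241001")))
```

The resulting list can be handed to subprocess.run(cmd, check=True) to execute it.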

https://github.com/NYCPlanning/data-engineering/tree/main/dcpy/data

Colum
