This Jupyter notebook converts a project written in dbt to a Google Dataform project. In this spreadsheet you can see details about the objects converted by the Python code, important notes about possible limitations, and the roadmap for future implementations.
- Make sure you already have Dataform installed on your machine; if it isn't, just follow this walkthrough.
- Make sure you have your dbt project's source repository on your local machine;
- Clone this repository into the same directory that contains your dbt project:
  - `gh repo clone datalakehouse/dbt-to-dataform`
- Copy the `.df_credentials.json` file that you generated during Dataform configuration to the same path. The directory structure should look like this:
  - dbt_project/
  - .df_credentials.json
  - dbt-to-dataform/notebook
- Make sure you have Python and Jupyter Notebook (or JupyterHub) installed on your machine;
- Read the spreadsheets below to make sure that each part of your code will be converted as expected:
- dbt-to-dataform - dataform install and config basics
- dbt-to-dataform - dbt and dataform syntax differences
- dbt-to-dataform conversion concept roadmap
Execute the `jupyter-notebook` command in your CLI to start Jupyter Notebook.
After starting Jupyter Notebook on your local machine, navigate to the web server page, typically http://localhost:8888/.
Navigate to the `dbt_dataform_converter.ipynb` file.
In this part of the code (see image), set the variables described below.
- `dbt_source_project_path`: the path of your source dbt project;
- `dataform_root_path`: the target Dataform path to be generated;
- `target_schema`: the name of the schema Dataform will create on the target data warehouse platform, e.g. Snowflake;
- `conversion_type`: whether the code will be converted to JS or SQLX in Dataform. If you want to create a Dataform package, you must use JS; otherwise, SQLX;
- `dlh_timestamp_field`: if your code has SCD snapshot files, Dataform requires a timestamp field to check when generating the snapshot. It must be a field on your model that tracks the last-update datetime for each record.
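For reference, the configuration cell might look something like the sketch below; all values are placeholders you should replace with your own paths and names.

```python
# Example configuration cell (placeholder values -- adjust to your environment)
dbt_source_project_path = "../dbt_project"    # path of the source dbt project
dataform_root_path = "../dataform_project"    # target Dataform project to generate
target_schema = "ANALYTICS"                   # schema Dataform creates on the warehouse (e.g. Snowflake)
conversion_type = "SQLX"                      # "JS" for a Dataform package, otherwise "SQLX"
dlh_timestamp_field = "updated_at"            # last-update timestamp column, required for SCD snapshots
```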
Run each cell of the `dbt_dataform_converter.ipynb` file.
The last cell should produce a return similar to this.
Make sure you have read the spreadsheet to understand the current limitations of the converter as they apply to your dbt code.
In the case of unit testing, based on the dlh_square_analytics project, below are the changes that were needed prior to running the Dataform code.
- The Square analytics project uses a full_name macro, which had to be rewritten for Dataform. Write your macro in a .js file and put that file inside the includes folder;
- Prefix each macro call with the name of the macro file it lives in;
- Execute the `dataform compile` command to make sure nothing will break at runtime;
- Change the default schema in the `dataform.json` file (see the sketch after this list);
- Execute `dataform run`.
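If you'd rather script that default-schema change than edit the file by hand, a minimal sketch is below. It assumes the standard `defaultSchema` key in the Dataform CLI's `dataform.json`; the path and schema name are placeholders.

```python
import json
from pathlib import Path

# Placeholder path: point this at the generated Dataform project.
config_path = Path("../dataform_project/dataform.json")

config = json.loads(config_path.read_text())
config["defaultSchema"] = "MY_TARGET_SCHEMA"  # the schema your models should build into
config_path.write_text(json.dumps(config, indent=2))
```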
The output for Dataform will be generated at the path set in the `dataform_root_path` variable.
When the code runs, it checks whether this directory already exists. If it does not exist, it is created; otherwise, it is deleted and created again.
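Conceptually, that check behaves like the following sketch (the function name and shutil-based implementation are illustrative, not the notebook's exact code):

```python
import shutil
from pathlib import Path

def reset_target_dir(dataform_root_path: str) -> None:
    """Delete the target Dataform directory if it exists, then recreate it empty."""
    target = Path(dataform_root_path)
    if target.exists():
        shutil.rmtree(target)  # wipe any previous conversion output
    target.mkdir(parents=True)
```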
Below are the steps run in sequence by the `dbt_dataform_converter` function:
- deletes the target repository if it exists;
- runs `dataform init` to create the new repository;
- copies the `.df_credentials.json` file to the new repository;
- edits the `packages.json` file with the target Dataform version and adds the dataform-scd package;
- runs `dataform install` to set up the target version and the SCD package;
- gets all yml files that contain sources in the models folder of the dbt source project;
- generates one .js file under definitions/sources in the Dataform project for each source table contained in the yml files;
- gets all .sql files in the dbt models directory (for the SQLX conversion);
- copies all files to the target Dataform definitions directory, changing the extension to .sqlx;
- replaces each header with Dataform's syntax using replace and regex substitution functions (see the sketch after this list);
- replaces dbt syntax patterns with Dataform's;
- removes unsupported dbt config header features;
- replaces the syntax pattern of incremental model macros for Dataform;
- gets all .sql files in the dbt models directory (for the JS conversion);
- copies all files to the target Dataform definitions directory, changing the extension to .js;
- replaces each header with Dataform's syntax using replace and regex substitution functions;
- replaces dbt syntax patterns with Dataform's;
- removes unsupported dbt config header features;
- replaces the syntax pattern of incremental model macros for Dataform;
- gets all .sql files in the dbt snapshots directory;
- copies all files to the target Dataform definitions/snapshots directory, changing the extension to .js;
- gets the name of the table used in the FROM clause of each snapshot file;
- replaces each file with Dataform's SCD JS pattern;
- gets all .yml test and schema definition files in the dbt models directory;
- gets the unique and not_null tests and their corresponding tables and columns;
- gets the descriptions for tables and/or columns present in the yml files;
- creates a Python dictionary with those tests and descriptions;
- writes assertions (tests) and descriptions into each model already present in Dataform's definitions folder.
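To illustrate the syntax-replacement steps above, here is a minimal sketch of the kind of regex substitution involved. It covers only the rewrite of dbt's `{{ ref('...') }}` into Dataform's `${ref("...")}`; the notebook's actual rules are more extensive.

```python
import re

def convert_refs(sql: str) -> str:
    """Rewrite dbt-style {{ ref('model') }} calls to Dataform's ${ref("model")}."""
    return re.sub(
        r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}",
        r'${ref("\1")}',
        sql,
    )

print(convert_refs("select * from {{ ref('stg_orders') }}"))
# -> select * from ${ref("stg_orders")}
```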
If you have any comments or questions, please consider joining our DataLakeHouse Slack Channel Community, where we discuss this project and other data engineering and analytics engineering topics: https://datalakehouse.slack.com/
We welcome any and all feedback and contributions to further the project. Please take a look at this project for how to contribute; we think their guidelines are pretty darn good: https://github.com/firstcontributions/first-contributions