How to effectively ingest many dbt files that are separated/run by model? #21989
Unanswered
nikita-sheremet-java-developer asked this question in Q&A
Replies: 2 comments
-
@OrionFanWeb1701 reach out to me by email, we have a lot of orders for you.
-
Hey @nikita-sheremet-java-developer, at night we move all files scheduled for ingestion into a new prefix, add the latest `manifest.json` and `catalog.json`, and ingest everything from that prefix.
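A minimal sketch of how such a nightly consolidation could look with boto3. The bucket name, staging prefix, and the idea of renaming each `run_results.json` per model are assumptions for illustration, not the commenter's exact setup or any DataHub/dbt-specific API:

```python
# Sketch: copy per-model run_results.json files into one nightly staging prefix,
# then add a single, current manifest.json and catalog.json next to them.
# Bucket name and prefixes are hypothetical.
import boto3
from datetime import date

BUCKET = "my_bucket"
SOURCE_PREFIX = "dbt/"                                      # per-model prefixes like dbt/mymodel/
STAGING_PREFIX = f"dbt_staging/{date.today():%Y-%m-%d}/"    # hypothetical nightly prefix

s3 = boto3.client("s3")

def consolidate_run_results():
    """Copy every per-model run_results.json into the staging prefix,
    keeping the model name in the file name so the copies do not collide."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith("run_results.json"):
                continue
            model = key.split("/")[-2]  # e.g. "mymodel" from dbt/mymodel/run_results.json
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": key},
                Key=f"{STAGING_PREFIX}{model}_run_results.json",
            )

def add_latest_manifest_and_catalog():
    """Place one current manifest.json and catalog.json in the staging prefix."""
    for name in ("manifest.json", "catalog.json"):
        s3.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": f"{SOURCE_PREFIX}{name}"},
            Key=f"{STAGING_PREFIX}{name}",
        )

if __name__ == "__main__":
    consolidate_run_results()
    add_latest_manifest_and_catalog()
```

The ingestion job would then be pointed at the staging prefix only, so the single manifest and catalog are read once per run instead of once per model.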
-
I have a project with 500+ dbt models (Trino + Iceberg). Each model is run by an Airflow DAG in order, and the files `manifest.json`, `catalog.json`, and `run_results.json` are then uploaded to an S3 prefix like `s3://my_bucket/dbt/mymodel/`. So I end up with ~1500 files: 500 x `manifest.json` + 500 x `catalog.json` + 500 x `run_results.json`.

This leads to the message `dbt: Processed 550000 records` (roughly 2 x 500 x 500, presumably because each `manifest.json` describes all ~500 models), and that is far too much. I have fewer tables in Trino than these records. I noticed that `manifest.json` (3.2 MB) is identical for all models, so when I removed the per-model copies and put it only once into `s3://my_bucket/dbt/`, processing dropped to about 10 minutes instead of 10 hours. But messages like `Manifest file not found at: dbt/mymodel` started to appear, which makes me nervous that something is going wrong or that I am losing some data or lineage.

Could somebody please clarify the proper way to ingest dbt data? Maybe put all files in one folder? But in that case the `catalog.json` and `run_results.json` files will override each other. Any ideas? Any secret settings? In the source code I saw that a collection of files can be supplied, but I have no idea how to organize them.

Off-topic: the dbt configuration does not have a threads parameter; any suggestion on how it could be added?
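If all artifacts had to live in one folder, the per-model `run_results.json` files would first need to be combined so they stop overriding each other. A minimal sketch of such a merge, assuming the standard dbt `run_results.json` layout with a top-level `results` list; the local paths are hypothetical, and whether the ingestion source accepts a merged file like this would need to be verified:

```python
# Sketch: merge per-model run_results.json files into a single file so that one
# folder can hold manifest.json, catalog.json and run_results.json side by side.
import json
from pathlib import Path

def merge_run_results(per_model_dir: str, output_path: str) -> None:
    merged = None
    for path in sorted(Path(per_model_dir).glob("*/run_results.json")):
        with path.open() as f:
            doc = json.load(f)
        if merged is None:
            merged = doc  # keep metadata/args from the first file
            continue
        merged["results"].extend(doc["results"])
        merged["elapsed_time"] = merged.get("elapsed_time", 0) + doc.get("elapsed_time", 0)
    if merged is not None:
        Path(output_path).write_text(json.dumps(merged))

# Example (hypothetical local copy of the S3 layout):
# merge_run_results("dbt_artifacts", "dbt_artifacts/run_results.json")
```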