Currently we merge each batch from source into temporary table and then fire CREATE OR REPLACE. However, since backfills for many source are APPEND-ONLY, or, in case of Synapse, can be pre-aggregated, an improvement can be done to reduce load on merge engine:
- Do not create staging tables except for backfill table
- Append data files to backfill table
- Modify backfill-overwrite query to perform pre-aggregation if needed (for Synapse, for example)
This will allow us to bypass query engine until the last stage, maximizing read throughput from source.