This repository was archived by the owner on Sep 26, 2023. It is now read-only.
Transforming via COPY INTO #101
Open
Description
Snowflake has some capabilities for transforming data during a load. From my (very basic) understanding of what the transformer does, it seems like much of its logic could be replaced with a transformation on load, which means we could also load data directly from the enriched files in S3.
Example test file in the enriched file format:
my_app {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.test/foo/jsonschema/1-0-0","data":{"hello":"world"}}}
my_app {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.test/bar/jsonschema/1-0-0","data":{"example":123}}}

I was able to load this into Snowflake with each type of unstructured event in the right format.
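Before loading, the staged enriched file can also be queried directly to sanity-check the rows (a sketch; the `tsv_format` file format name and the `@stage/test.tsv` path are assumptions matching the example below):

```
-- A named file format for the tab-separated enriched files (hypothetical name):
CREATE OR REPLACE FILE FORMAT tsv_format TYPE = 'CSV' FIELD_DELIMITER = '\t';

-- Peek at the staged file directly before loading:
SELECT
  t.$1 AS app_id
  , PARSE_JSON(t.$2) AS event
FROM @stage/test.tsv (FILE_FORMAT => 'tsv_format') t;
```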
CREATE TABLE my_table (
app_id VARCHAR
, unstruct_event_com_test_foo OBJECT
, unstruct_event_com_test_bar OBJECT
);
COPY INTO my_table
FROM (
SELECT
t.$1
, IFF(PARSE_JSON(t.$2):data:schema = 'iglu:com.test/foo/jsonschema/1-0-0', PARSE_JSON(t.$2):data:data, NULL)
, IFF(PARSE_JSON(t.$2):data:schema = 'iglu:com.test/bar/jsonschema/1-0-0', PARSE_JSON(t.$2):data:data, NULL)
FROM @stage/test.tsv t
)
FILE_FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = '\t'
);

Open questions
- Cross-batch deduping: a COPY transformation won't support this, but we could load into a temp table and dedupe on insert.
- New columns: it looks like the loader uses the transformed files to work out which columns to add. This might be trickier with only the enriched files?
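The temp-table approach for cross-batch deduping could look roughly like this (a sketch: the transient table, the `event_id` key, and the MERGE condition are all assumptions, since the COPY example above doesn't extract an event ID):

```
-- Stage the batch in a transient table with the same shape (names hypothetical):
CREATE OR REPLACE TRANSIENT TABLE my_table_tmp LIKE my_table;

-- ... COPY INTO my_table_tmp with the same transforming SELECT as above ...

-- Insert only events whose event_id was not loaded by a previous batch:
MERGE INTO my_table dst
USING my_table_tmp src
  ON dst.event_id = src.event_id
WHEN NOT MATCHED THEN
  INSERT (app_id, event_id, unstruct_event_com_test_foo, unstruct_event_com_test_bar)
  VALUES (src.app_id, src.event_id, src.unstruct_event_com_test_foo, src.unstruct_event_com_test_bar);
```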
If this is indeed possible, there could potentially just be a single loader step without the transformer Spark job at all?
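On the new-columns question, the enriched files themselves could in principle be scanned for schemas the table doesn't yet have columns for (a sketch, assuming a named TSV file format `tsv_format` exists and following the column-naming pattern used above):

```
-- Distinct self-describing event schemas present in the staged batch:
SELECT DISTINCT PARSE_JSON(t.$2):data:schema::VARCHAR AS event_schema
FROM @stage/test.tsv (FILE_FORMAT => 'tsv_format') t;

-- Any schema without a matching column would then be added before the COPY, e.g.:
ALTER TABLE my_table ADD COLUMN unstruct_event_com_test_baz OBJECT;
```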