Skip to content
This repository was archived by the owner on Sep 26, 2023. It is now read-only.

Transforming via COPY INTO #101

@dhuang

Description

@dhuang

Snowflake has some capabilities when it comes to transforming during a load. From my very basic understanding of what the transformer does, it seems like much of its logic replaced with a transformation on load, which means we can also directly load data from the enriched files in S3.

Example test file in the enriched file format:

my_app  {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.test/foo/jsonschema/1-0-0","data":{"hello":"world"}}}
my_app  {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.test/bar/jsonschema/1-0-0","data":{"example":123}}}

I was able to load this into Snowflake with each type of unstructured event in the right format.

CREATE TABLE my_table (
    app_id VARCHAR
    , unstruct_event_com_strava_foo OBJECT
    , unstruct_event_com_strava_bar OBJECT
);

COPY INTO my_table
FROM (
    SELECT 
        t.$1
        , IFF(PARSE_JSON(t.$2):data:schema = 'iglu:com.test/foo/jsonschema/1-0-0', PARSE_JSON(t.$2):data:data, NULL)
        , IFF(PARSE_JSON(t.$2):data:schema = 'iglu:com.test/bar/jsonschema/1-0-0', PARSE_JSON(t.$2):data:data, NULL) 
        FROM @stage/test.tsv t
)
FILE_FORMAT = (
    TYPE = 'CSV' 
    FIELD_DELIMITER = '\t'
);

Open questions

  • Cross batch deduping: Copy transformation won't support this, but we could load into a temp table and do the deduping on insert.
  • New columns: It looks like the loader uses the transformed files to see which columns it adds. This might be trickier with only the enriched files?

If this is indeed possible, there could potentially just be a single loader step without the transformer Spark job at all?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions