Transforming via COPY INTO

Snowflake has some capabilities when it comes to [transforming during a load](https://docs.snowflake.net/manuals/user-guide/data-load-transform.html). From my very basic understanding of what the [transformer](https://github.com/snowplow-incubator/snowplow-snowflake-loader/wiki#snowplow-snowflake-transformer) does, it seems like much of its logic replaced with a transformation on load, which means we can also directly load data from the enriched files in S3.

Example test file in the enriched file format:

```tsv
my_app  {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.test/foo/jsonschema/1-0-0","data":{"hello":"world"}}}
my_app  {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.test/bar/jsonschema/1-0-0","data":{"example":123}}}
```

I was able to load this into Snowflake with each type of unstructured event in the right format.


```sql
CREATE TABLE my_table (
    app_id VARCHAR
    , unstruct_event_com_strava_foo OBJECT
    , unstruct_event_com_strava_bar OBJECT
);

COPY INTO my_table
FROM (
    SELECT 
        t.$1
        , IFF(PARSE_JSON(t.$2):data:schema = 'iglu:com.test/foo/jsonschema/1-0-0', PARSE_JSON(t.$2):data:data, NULL)
        , IFF(PARSE_JSON(t.$2):data:schema = 'iglu:com.test/bar/jsonschema/1-0-0', PARSE_JSON(t.$2):data:data, NULL) 
        FROM @stage/test.tsv t
)
FILE_FORMAT = (
    TYPE = 'CSV' 
    FIELD_DELIMITER = '\t'
);
```

Open questions

* Cross batch deduping: Copy transformation won't support this, but we could load into a temp table and do the deduping on insert. 
* New columns: It looks like the [loader](https://github.com/snowplow-incubator/snowplow-snowflake-loader/wiki#snowplow-snowflake-loader) uses the transformed files to see which columns it adds. This might be trickier with only the enriched files?

If this is indeed possible, there could potentially just be a single loader step without the transformer Spark job at all?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transforming via COPY INTO #101

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Transforming via COPY INTO #101

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions