How not to ingest a file more than once? #355
Unanswered
alberto-lanfranco-storebrand
asked this question in
Q&A
Replies: 1 comment 3 replies
Yes, need to document this better.

Based on file timestamp:

```yaml
source: azure
target: my_db

defaults:
  update_key: _sling_loaded_at # <-- tells sling to use the file timestamp for comparison
  object: my_schema.{stream_file_name}

streams:
  "path/to/my/folder/*.csv":

env:
  SLING_LOADED_AT_COLUMN: unix
```

Based on a column in the file (not what you're asking: this will scan all files again, but stream only the rows after the latest `max(update_key)`):

```yaml
source: local
target: postgres

defaults:
  mode: incremental
  update_key: create_dt
  primary_key: id
  object: public.incremental_csv
  target_options:
    adjust_column_type: true

streams:
  cmd/sling/tests/files/test1.csv:
  cmd/sling/tests/files/test1.upsert.csv:
```
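For context, a minimal sketch of how such a replication config might be invoked with the Sling CLI, assuming it is installed and the YAML above has been saved to a file (the file name `replication.yaml` here is illustrative, not from the thread):

```shell
# Save one of the YAML configs above as replication.yaml (illustrative name),
# then run the replication with the Sling CLI:
sling run -r replication.yaml

# With update_key: _sling_loaded_at, subsequent runs compare each file's
# timestamp against the last loaded value, so only files newer than the
# previous execution are ingested.
```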
Hello,

I'm currently implementing Sling in my architecture. I'm ingesting from an Azure blob storage with a lot of big CSV files, but on each iteration I only want to ingest the files that are new since the last execution.

What is the best practice for implementing this in Sling? I have a feeling it might involve the `_SLING_STREAM_URL` column and `update_key` in the replication file, but I don't know exactly how to make it work.