Anything you'd like to share?
It can be helpful to track the load information for each row. These might be things like:
_LOADED_FROM: filename, if applicable_LOADED_AT: timestamp
"Give me the data from the most recent load"
CREATE TABLE my_table AS
SELECT
*,
current_localtimestamp() AS _LOADED_AT,
filename AS _LOADED_FROM
FROM read_csv('[path]', filename=true);INSERT INTO my_table
SELECT
*,
current_localtimestamp() AS _LOADED_AT,
filename AS _LOADED_FROM
FROM read_csv('[path]', filename=true);You can imagine the equivalent in a DataFrame.
SELECT *
FROM my_table
WHERE _LOADED_AT = (SELECT MAX(_LOADED_AT) FROM my_table);BEGIN TRANSACTION;
INSERT INTO ...;
COMMIT;- Append load
- Trunc(ate) and load
- Incremental load
Let's say you were given access to a random table that uses one of the three data loading methods above. How would you tell which it was?
The trick is avoiding duplicates. Your script might then need to say something like:
- What's the latest timestamp in the database?
- Pull data from the API that's more recent than that.