DuckLake's behaviour with clients' "batch insert" methods, where a separate Parquet file is written for each incoming record, can be confusing. There are at least two reports on this from users (1, 2).
It is suggested to add a note like the following somewhere in the docs:
### Bulk data import
When the input data is already stored in Parquet files with the same schema as the destination DuckLake table, it is advised to use `ducklake_add_data_files` to import this data.
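A minimal sketch of such a call, assuming a DuckLake catalog attached as `my_ducklake`, a destination table `tab1`, and an existing Parquet file `data.parquet` (all names here are illustrative):

```sql
-- Register an existing Parquet file with a DuckLake table;
-- the file is added to the table's metadata without being rewritten.
CALL ducklake_add_data_files('my_ducklake', 'tab1', 'data.parquet');
```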
For input data in other formats or from other sources, the most performant way to import it into DuckLake is to first insert the records into an in-memory "staging" table (for example, using the Appender interface) and then perform a single bulk insert into the DuckLake table:

```sql
INSERT INTO ducklake.tab1 FROM memory.tab1;
```

**Warning:** the "batch insert" functionality in DuckDB clients can be misleading when used with DuckLake. For example, when the Python DB-API method `executemany` or the JDBC method `executeBatch` is used with DuckLake, the specified INSERT query is prepared and then executed in a loop, once for every parameter tuple in the batch, resulting in a sequence like this:
```sql
PREPARE p AS INSERT INTO tab1 (col1, col2) VALUES (?, ?);
BEGIN TRANSACTION;
EXECUTE p(1, 'foo');
EXECUTE p(2, 'bar');
EXECUTE p(3, 'baz');
...
EXECUTE p(n, 'boo');
COMMIT;
```

From the DuckLake point of view, the incoming INSERT queries are separate and not part of any "batch", so a separate Parquet file is created for every record in the batch. This is unlikely to be desired, so using `executemany()`/`executeBatch()` directly with DuckLake is discouraged. These methods can still be used to insert the records into an in-memory "staging" table before copying it to a DuckLake table in bulk.
TODO: Polars data frame direct copy example.
@guillesd I wonder if you have a minute to look at this note: how to structure it better, and which section would be appropriate for it.