Skip to content

Add a note about executemany or executebatch inserts into DuckLakeΒ #290

@staticlibs

Description

@staticlibs

DuckLake behaviour with client's "batch insert" methods, when a separate Parquet file is written for each incoming record, may be confusing. There are at least 2 reports on this from users (1, 2).

It is suggested to add a note like the following somewhere to the docs:

Bulk data import

When input data is already stored in Parquet files with the same schema as the destination DuckLake table, then it is advised to use ducklake_add_data_files to import this data.

For input data from other formats/sources the most performant way to import it to DuckLake is to first insert the records into the in-memory "staging" table (for example, using the Appender interface) and then perform the bulk insert into the DuckLake table:

INSERT INTO ducklake.tab1 FROM memory.tab1

Warning:

"batch insert" functionality in DuckDB clients can be misleading when used with DuckLake. For example, when Python DB-API method executemany or JDBC method executeBatch are used with DuckLake, the specified INSERT query will be prepared and then executed in a loop for every parameter tuple in the batch, resulting in a sequence like this:

PREPARE p AS INSERT INTO tab1 (col1, col2) VALUES (?, ?);
BEGIN TRANSACTION;
EXECUTE p(1, 'foo');
EXECUTE p(2, 'bar');
EXECUTE p(3, 'baz');
...
EXECUTE p(n, 'boo');
COMMIT;

From the DuckLake point of view incoming INSERT queries are separate and not a part of any "batch", so a separate Parquet file will be created for every record in the batch. That is unlikely desired, so the usage of executemany()/executeBatch() directly with DuckLake is discouraged. These methods still can be used to insert the records into the in-memory "staging" table before copying it to a DuckLake table.

TODO: Polars data frame direct copy example.

@guillesd I wonder if you have a minute to look at this note - how to structure it better and in what section it is appropriate to put it into.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions