Writing to ducklake from spark via JDBC: each dataframe row becomes parquet file #491
I am planning to work with DuckLake from Apache Spark; I got my initial ideas from https://motherduck.com/blog/spark-ducklake-getting-started. My first simple write attempt (using get_jdbc_writer() from that post) raised this question. The write worked, but it produced 10 Parquet files in the target directory. I checked that this is not related to partitioning: rangeDf has a single partition, and if I change the range() parameter to e.g. 4, I get 4 Parquet files. With a bigger dataframe (several thousand rows) the process never finished; there were too many files in the directory to even list them.

This is easily reproducible; I initially hit it with older versions of DuckDB/DuckLake/JDBC as well. Is it a bug, or is it by design?

Current version info: DuckDB 1.4.0, DuckLake latest, JDBC driver 1.4.0, Spark 3.5.1
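For context, the failing write presumably looked something like the sketch below. This is a reconstruction, not the original snippet: the JDBC URL, table name, and app name are assumptions based on the blog post's pattern.

```python
def jdbc_options(url: str, table: str) -> dict:
    """Options for Spark's generic JDBC sink; org.duckdb.DuckDBDriver
    is the driver class shipped in the duckdb_jdbc artifact."""
    return {
        "url": url,
        "driver": "org.duckdb.DuckDBDriver",
        "dbtable": table,
    }

if __name__ == "__main__":
    # pyspark is imported lazily so the helper above stays importable
    # without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ducklake-jdbc-write").getOrCreate()
    df = spark.range(10).coalesce(1)  # a single partition still yields 10 files
    (df.write.format("jdbc")
       .options(**jdbc_options("jdbc:duckdb:/tmp/lake.db", "range_10"))  # assumed URL
       .mode("append")
       .save())
```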
Replies: 2 comments 14 replies
Hi! What is happening here is that Spark issues queries like:

INSERT INTO range_10 ("id") VALUES (?)

so each insert ends up as a separate Parquet file in DuckLake. Perhaps one of the following options can be used for DuckLake writing from Spark:
For options 2. and 3., if you can use the Java API, then using the JDBC Appender to DuckDB + …
For append-only (for now) bulk writing to DuckLake, try https://github.com/motherduckdb/ducklake-spark. It uses the Spark/Hadoop built-in Parquet writer for the data, and the DuckDB DuckLake extension to register the already-written Parquet files at commit time.
Batch inserts are supported by the JDBC driver; it is just that, coming from Spark's dataframe.save(), they result in the following sequence:

The problem is that, until Data Inlining is implemented for the Postgres catalog, DuckLake will write a Parquet file immediately on every EXECUTE call.

Besides ducklake_add_data_files and COPY mentioned above, this problem can also be avoided using:

But there seems to be no easy way to use this approach from Spark, so for now I think that
d…