Description
What happens?
DuckDBPyRelation::ToArrowCapsule consumes 2x the memory of ToArrowTableInternal: it fully materializes the result before constructing the Arrow table instead of leveraging a PhysicalArrowCollector.
Some rough test results (edit: for a 1GB Arrow Table):
| Test | Time | Peak | Table Size |
|---|---|---|---|
| Unchanged: ExecuteOrThrow(); | 5.818s | 2.36GB | 1.08GB |
| Streaming: ExecuteOrThrow(true); | 4.682s | 1.22GB | 1.08GB |
| PhysicalArrowCollector + Stream (code not shown) | 3.937s | 1.23GB | 1.08GB |
| FetchArrowTable (see ToArrowTableInternal) | 4.447s | 1.17GB | 1.08GB |
Streaming
The simplest change is to switch the materialization to streaming with ExecuteOrThrow(true); this alone halves the memory consumption.
duckdb-python/src/duckdb_py/pyrelation.cpp, line 967 in bbd4389:

```cpp
ExecuteOrThrow();
```
PhysicalArrowCollector
A more complete change would be to refactor ToArrowCapsule and ToArrowTableInternal to both use common PhysicalArrowCollector code. A light wrapper to construct an ArrowArrayStream would be needed in DuckDBPyResult::FetchArrowCapsule.
Related optimization - ToPolars
I noticed this when looking at ToPolars memory consumption in duckdb/duckdb#19356. If the above issue is solved, then Polars dataframe construction can be done directly without the ToArrowTableInternal step at all.
A simple change like the one below would significantly improve memory usage and performance:
duckdb-python/src/duckdb_py/pyrelation.cpp, lines 976 to 979 in bbd4389:

```cpp
if (!lazy) {
	auto arrow = ToArrowTableInternal(batch_size, true);
	return py::cast<PolarsDataFrame>(pybind11::module_::import("polars").attr("DataFrame")(arrow));
}
```
=>

```cpp
auto polars_module = pybind11::module_::import("polars");
return py::cast<PolarsDataFrame>(polars_module.attr("DataFrame")(*this));
```
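The reason passing the relation directly can work is the Arrow PyCapsule interface: consumers such as pyarrow (and, as I understand it, recent polars versions) accept any object exposing an __arrow_c_stream__ method, which DuckDBPyRelation provides via ToArrowCapsule. A minimal stdlib-only sketch of that duck typing, with illustrative names (FakeRelation and supports_arrow_stream are hypothetical, not part of any library):

```python
# Sketch of Arrow C-stream duck typing. FakeRelation stands in for an
# object like DuckDBPyRelation that implements the PyCapsule interface.

class FakeRelation:
    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation returns a PyCapsule wrapping an
        # ArrowArrayStream; a placeholder string is returned here.
        return "arrow_array_stream_capsule"

def supports_arrow_stream(obj) -> bool:
    # Consumers probe for this method before falling back to slower
    # construction paths (e.g. converting through a pyarrow Table).
    return hasattr(obj, "__arrow_c_stream__")

print(supports_arrow_stream(FakeRelation()))  # → True
print(supports_arrow_stream(object()))        # → False
```

Because the consumer pulls batches through the stream itself, no intermediate pyarrow Table needs to be materialized on the producer side.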
To Reproduce
```python
import time
from resource import getrusage, RUSAGE_SELF

import duckdb
import pyarrow as pa

query = "SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0"

start = time.perf_counter()
rel = duckdb.sql(query)
table = pa.table(rel)  # Direct import from Arrow C Stream capsule
elapsed = time.perf_counter() - start

peak_gb = getrusage(RUSAGE_SELF).ru_maxrss * 1024 / (1024**3)
table_gb = table.nbytes / (1024**3)
print(f"Time: {elapsed:.3f}s Peak: {peak_gb:.2f}GB Table: {table_gb:.2f}GB")
```
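One portability caveat for the script above: getrusage reports ru_maxrss in kilobytes on Linux but in bytes on macOS, so the peak calculation assumes Linux. A small stdlib-only helper that normalizes the units (peak_rss_gb is an illustrative name, not part of the repro):

```python
import sys
from resource import getrusage, RUSAGE_SELF

def peak_rss_gb() -> float:
    """Peak resident set size of this process in GiB.

    ru_maxrss is reported in kilobytes on Linux but in bytes on
    macOS ("darwin"), so scale accordingly.
    """
    rss = getrusage(RUSAGE_SELF).ru_maxrss
    scale = 1 if sys.platform == "darwin" else 1024
    return rss * scale / (1024 ** 3)

print(f"Peak: {peak_rss_gb():.2f}GB")
```

Note also that ru_maxrss is a high-water mark for the whole process, so run each variant in a fresh interpreter to compare them fairly.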
OS:
Linux
DuckDB Package Version:
1.4.1, 1.5.0+dev110
Python Version:
3.13
Full Name:
Paul T
Affiliation:
Iqmo
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a source build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration to reproduce the issue?
Yes, I have