Arrow pycapsules use 2x the memory of fetch_arrow_table #137

@paultiq

Description

What happens?

DuckDBPyRelation::ToArrowCapsule consumes 2x the memory of ToArrowTableInternal. It fully materializes the result before constructing the Arrow table, rather than streaming it through a PhysicalArrowCollector.

Edit 1: the ADBC driver shows the same behavior: ...adbc... `cur.execute(query).fetch_arrow_table()` consumes more than 2x the memory of `conn.sql(query).fetch_arrow_table()`.

Some rough test results (edit: for a 1GB Arrow Table):

| Test | Time | Peak Memory | Table Size |
| --- | --- | --- | --- |
| Unchanged - `ExecuteOrThrow();` | 5.818s | 2.36GB | 1.08GB |
| Streaming - `ExecuteOrThrow(true);` | 4.682s | 1.22GB | 1.08GB |
| `PhysicalArrowCollector` + stream (code not shown) | 3.937s | 1.23GB | 1.08GB |
| `FetchArrowTable` (see `ToArrowTableInternal`) | 4.447s | 1.17GB | 1.08GB |
Streaming

The simplest change is switching the materialization call from `ExecuteOrThrow();` to streaming with `ExecuteOrThrow(true);`; this alone halves peak memory consumption.

PhysicalArrowCollector

A more complete change would be to refactor ToArrowCapsule and ToArrowTableInternal so that both share common PhysicalArrowCollector code. A light wrapper to construct an ArrowArrayStream would be needed in DuckDBPyResult::FetchArrowCapsule.

Related optimization - ToPolars

I noticed this when looking at ToPolars memory consumption in duckdb/duckdb#19356. If the above issue is solved, then Polars dataframe construction can be done directly without the ToArrowTableInternal step at all.

A simple change like the one below would significantly improve memory use and performance:

```cpp
if (!lazy) {
	auto arrow = ToArrowTableInternal(batch_size, true);
	return py::cast<PolarsDataFrame>(pybind11::module_::import("polars").attr("DataFrame")(arrow));
}
```

=>

```cpp
auto polars_module = pybind11::module_::import("polars");
return py::cast<PolarsDataFrame>(polars_module.attr("DataFrame")(*this));
```

To Reproduce

```python
import time
from resource import getrusage, RUSAGE_SELF

import duckdb
import pyarrow as pa

query = "SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0"

start = time.perf_counter()
rel = duckdb.sql(query)
table = pa.table(rel)  # Direct import from Arrow C Stream capsule
elapsed = time.perf_counter() - start

# On Linux, ru_maxrss is reported in kilobytes.
peak_gb = getrusage(RUSAGE_SELF).ru_maxrss * 1024 / (1024**3)
table_gb = table.nbytes / (1024**3)

print(f"Time: {elapsed:.3f}s  Peak: {peak_gb:.2f}GB  Table: {table_gb:.2f}GB")
```

OS:

Linux

DuckDB Package Version:

1.4.1, 1.5.0+dev110

Python Version:

3.13

Full Name:

Paul T

Affiliation:

Iqmo

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have

Metadata

Labels: enhancement (New feature or request)