Commit 5a53633

Parent: 78f6c8a

Enhance documentation on zero-copy streaming to Arrow-based Python libraries, clarifying the protocol and adding implementation-agnostic notes.

File tree: 2 files changed (+22, -36 lines)

dev/changelog/50.0.0.md

Lines changed: 0 additions & 24 deletions
This file was deleted.

docs/source/user-guide/dataframe/index.rst

Lines changed: 22 additions & 12 deletions
@@ -149,15 +149,26 @@ To materialize the results of your DataFrame operations:

     # Count rows
     count = df.count()

-PyArrow Streaming
+Zero-copy streaming to Arrow-based Python libraries
 -----------------

 DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
-zero-copy streaming into libraries like `PyArrow <https://arrow.apache.org/>`_.
-Earlier versions eagerly converted the entire DataFrame when exporting to
-PyArrow, which could exhaust memory on large datasets. With streaming, batches
-are produced lazily so you can process arbitrarily large results without
-out-of-memory errors.
+zero-copy, lazy streaming into Arrow-based Python libraries. Earlier versions
+eagerly converted the entire DataFrame when exporting to Python Arrow APIs,
+which could exhaust memory on large results. With the streaming protocol,
+batches are produced on demand so you can process arbitrarily large results
+without out-of-memory errors.
+
+.. note::
+
+   The protocol is implementation-agnostic and works with any Python library
+   that understands the Arrow C streaming interface (for example, PyArrow
+   or other Arrow-compatible implementations). The sections below provide a
+   short PyArrow-specific example and general guidance for other
+   implementations.
+
+PyArrow
+-------

 .. code-block:: python
@@ -170,7 +181,7 @@ out-of-memory errors.

 DataFrames are also iterable, yielding :class:`datafusion.RecordBatch`
 objects lazily so you can loop over results directly without importing
-PyArrow:
+PyArrow::

 .. code-block:: python
@@ -179,24 +190,23 @@ PyArrow:

 Each batch exposes ``to_pyarrow()``, allowing conversion to a PyArrow
 table. ``pa.table(df)`` collects the entire DataFrame eagerly into a
-PyArrow table:
+PyArrow table::

 .. code-block:: python

     import pyarrow as pa
     table = pa.table(df)

 Asynchronous iteration is supported as well, allowing integration with
-``asyncio`` event loops:
+``asyncio`` event loops::

 .. code-block:: python

     async for batch in df:
         ...  # process each batch as it is produced

-To work with the stream directly, use
-``execute_stream()``, which returns a
-:class:`~datafusion.RecordBatchStream`:
+To work with the stream directly, use ``execute_stream()``, which returns a
+:class:`~datafusion.RecordBatchStream`::

 .. code-block:: python
