
Commit 7130138

[ARROW-197] Add support for Polars
* ARROW-197 First commit. Basic API, simple tests, and a number of questions and to-dos
* Raise ValueError if one writes a Polars DataFrame with no _id column
* Working start on TestExplicitPolarsApi.test_write_schema_validation
* Cast ExtensionTypes in Arrow.Table from find_arrow_all to base pyarrow.types (e.g. pa.string, pa.lib.FixedSizeBinaryType)
* Cleanup
* Cleanup formatting
* Add polars to pyproject deps
* Add Polars to benchmarks
* Additional Polars tests and todos
* Finished Polars tests
* Ruff cleanup
* ARROW-206 ARROW-204 Added temporary pytest filterwarnings for Pandas DeprecationWarnings
* Remove redundant check if pd.DataFrame is None
* ARROW-210 Initial commit for pyarrow large_list and large_string DataTypes
* Updated and added further datetime tests
* Added tests of large_list and large_string to test_arrow
* Added version numbers to changelog and docstrings
* Fixed merge typo
* [ARROW-214] Added tests of Arrow binary datatypes with expected failures. These will be good tests when support is added
* Updated FAQ. It was outdated, as Pandas is now required; Polars slipped in perfectly
* Update heading of comparison.html. We had two Quick Start pages
* Typo
* Updates to Quick Start
* Updates to index.rst and quickstart.rst
* Updated Data Types page
* Fix heading underline
* Added manylinux-aarch64-image
* Removed completed todo in benchmarks. Polars array IS being tested
* ARROW-217 Turned todos into jiras
* Place guards around polars imports
* Set polars as optional dependency in pyproject.toml
* Added test extras to benchmark env in tox.ini
1 parent 0fbe7e7 commit 7130138

12 files changed, +598 −40 lines changed

bindings/python/benchmarks/benchmarks.py

Lines changed: 29 additions & 4 deletions
@@ -18,6 +18,7 @@
 
 import numpy as np
 import pandas as pd
+import polars as pl
 import pyarrow as pa
 import pymongo
 from bson import BSON, Binary, Decimal128
@@ -27,6 +28,7 @@
     find_arrow_all,
     find_numpy_all,
     find_pandas_all,
+    find_polars_all,
     write,
 )
 from pymongoarrow.types import BinaryType, Decimal128Type
@@ -74,6 +76,9 @@ def time_insert_pandas(self):
     def time_insert_numpy(self):
         write(db.benchmark, self.numpy_arrays)
 
+    def time_insert_polars(self):
+        write(db.benchmark, self.polars_table)
+
     def peakmem_insert_arrow(self):
         self.time_insert_arrow()
 
@@ -86,6 +91,9 @@ def peakmem_insert_pandas(self):
     def peakmem_insert_numpy(self):
         self.time_insert_numpy()
 
+    def peakmem_insert_polars(self):
+        self.time_insert_polars()
+
 
 class Read(ABC):
     """
@@ -136,16 +144,25 @@ def time_to_pandas(self):
         c = db.benchmark
         find_pandas_all(c, {}, schema=self.schema, projection={"_id": 0})
 
+    def time_conventional_arrow(self):
+        c = db.benchmark
+        f = list(c.find({}, projection={"_id": 0}))
+        table = pa.Table.from_pylist(f)
+        self.exercise_table(table)
+
     def time_to_arrow(self):
         c = db.benchmark
         table = find_arrow_all(c, {}, schema=self.schema, projection={"_id": 0})
         self.exercise_table(table)
 
-    def time_conventional_arrow(self):
+    def time_conventional_polars(self):
+        collection = db.benchmark
+        cursor = collection.find(projection={"_id": 0})
+        _ = pl.DataFrame(list(cursor))
+
+    def time_to_polars(self):
         c = db.benchmark
-        f = list(c.find({}, projection={"_id": 0}))
-        table = pa.Table.from_pylist(f)
-        self.exercise_table(table)
+        find_polars_all(c, {}, schema=self.schema, projection={"_id": 0})
 
     def peakmem_to_numpy(self):
         self.time_to_numpy()
@@ -162,6 +179,12 @@ def peakmem_to_arrow(self):
     def peakmem_conventional_arrow(self):
         self.time_conventional_arrow()
 
+    def peakmem_to_polars(self):
+        self.time_to_polars()
+
+    def peakmem_conventional_polars(self):
+        self.time_conventional_polars()
+
 
 class ProfileReadArray(Read):
     schema = Schema(
@@ -364,6 +387,7 @@ def setup(self):
         self.arrow_table = find_arrow_all(db.benchmark, {}, schema=self.schema)
         self.pandas_table = find_pandas_all(db.benchmark, {}, schema=self.schema)
         self.numpy_arrays = find_numpy_all(db.benchmark, {}, schema=self.schema)
+        self.polars_table = find_polars_all(db.benchmark, {}, schema=self.schema)
 
 
 class ProfileInsertLarge(Insert):
@@ -383,3 +407,4 @@ def setup(self):
         self.arrow_table = find_arrow_all(db.benchmark, {}, schema=self.schema)
         self.pandas_table = find_pandas_all(db.benchmark, {}, schema=self.schema)
         self.numpy_arrays = find_numpy_all(db.benchmark, {}, schema=self.schema)
+        self.polars_table = find_polars_all(db.benchmark, {}, schema=self.schema)

bindings/python/docs/source/changelog.rst

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,11 @@
 Changelog
 =========
 
+Changes in Version 1.3.0
+------------------------
+- Support for Polars
+- Support for PyArrow.DataTypes: large_list, large_string
+
 Changes in Version 1.2.0
 ------------------------
 - Support for PyArrow 14.0.

bindings/python/docs/source/comparison.rst

Lines changed: 4 additions & 4 deletions
@@ -1,8 +1,8 @@
-Quick Start
-===========
+Comparing to PyMongo
+====================
 
-This tutorial is intended as a comparison between using just PyMongo, versus
-with **PyMongoArrow**. The reader is assumed to be familiar with basic
+This tutorial is intended as a comparison between using **PyMongoArrow**,
+versus just PyMongo. The reader is assumed to be familiar with basic
 `PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`_ and
 `MongoDB <https://docs.mongodb.com>`_ concepts.
 

bindings/python/docs/source/data_types.rst

Lines changed: 6 additions & 1 deletion
@@ -4,8 +4,12 @@ Data Types
 ==========
 
 PyMongoArrow supports a majority of the BSON types.
+As Arrow and Polars provide first-class support for Lists and Structs,
+this includes Embedded arrays and documents.
+
 Support for additional types will be added in subsequent releases.
 
+
 .. note:: For more information about BSON types, see the
    `BSON specification <http://bsonspec.org/spec.html>`_.
 
@@ -131,11 +135,12 @@ dataframe will be the appropriate ``bson`` type.
    >>> df["_id"][0]
    ObjectId('64408bf65ac9e208af220144')
 
+As of this writing, Polars does not support Extension Types.
 
 Null Values and Conversion to Pandas DataFrames
 -----------------------------------------------
 
-In Arrow, all Arrays are always nullable.
+In Arrow (and Polars), all Arrays are nullable.
 Pandas has experimental nullable data types as, e.g., "Int64" (note the capital "I").
 You can instruct Arrow to create a pandas DataFrame using nullable dtypes
 with the code below (taken from `here <https://arrow.apache.org/docs/python/pandas.html>`_)

bindings/python/docs/source/faq.rst

Lines changed: 5 additions & 5 deletions
@@ -3,13 +3,13 @@ Frequently Asked Questions
 
 .. contents::
 
-Why do I get ``ModuleNotFoundError: No module named 'pandas'`` when using PyMongoArrow
---------------------------------------------------------------------------------------
+Why do I get ``ModuleNotFoundError: No module named 'polars'`` when using PyMongoArrow?
+---------------------------------------------------------------------------------------
 
 This error is raised when an application attempts to use a PyMongoArrow API
-that returns query result sets as a :class:`pandas.DataFrame` instance without
-having ``pandas`` installed in the Python environment. Since ``pandas`` is not
+that returns query result sets as a :class:`polars.DataFrame` instance without
+having ``polars`` installed in the Python environment. Since ``polars`` is not
 a direct dependency of PyMongoArrow, it is not automatically installed when
 you install ``pymongoarrow`` and must be installed separately::
 
-   $ python -m pip install pandas
+   $ python -m pip install polars
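The commit message also mentions placing guards around the ``polars`` imports, which is what makes this deferred error possible. A sketch of one common guard pattern for an optional dependency (illustrative only, with a hypothetical helper name; not PyMongoArrow's actual internals):

```python
try:
    import polars as pl
except ImportError:  # polars is optional; defer the error until it is used
    pl = None


def to_polars(docs):
    """Hypothetical helper: build a polars DataFrame from documents."""
    if pl is None:
        raise ModuleNotFoundError(
            "No module named 'polars'. Install it with: "
            "python -m pip install polars"
        )
    return pl.DataFrame(docs)
```

Importing the package itself never fails this way; only calling a Polars-returning API without ``polars`` installed raises the error described above.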

bindings/python/docs/source/index.rst

Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,8 @@ Overview
 **PyMongoArrow** is a `PyMongo <http://pymongo.readthedocs.io/>`_ extension
 containing tools for loading `MongoDB <http://www.mongodb.org>`_ query result
 sets as `Apache Arrow <http://arrow.apache.org>`_ tables,
-`Pandas <https://pandas.pydata.org>`_ and `NumPy <https://numpy.org>`_ arrays.
+`NumPy <https://numpy.org>`_ arrays, and `Pandas <https://pandas.pydata.org>`_
+or `Polars <https://pola.rs/>`_ DataFrames.
 PyMongoArrow is the recommended way to materialize MongoDB query result sets as
 contiguous-in-memory, typed arrays suited for in-memory analytical processing
 applications. This documentation attempts to explain everything you need to

bindings/python/docs/source/quickstart.rst

Lines changed: 32 additions & 17 deletions
@@ -68,10 +68,16 @@ to type-specifiers, e.g.::
    schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
 
 
-Nested data (embedded documents) are also supported::
+PyMongoArrow offers first-class support for Nested data (embedded documents)::
 
    schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})
 
+Lists (and nested lists) are also supported::
+
+   from pyarrow import list_, string
+   schema = Schema({'txns': list_(string())})
+   polars_df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
+
 There are multiple permissible type-identifiers for each supported BSON type.
 For a full-list of data types and associated type-identifiers see
 :doc:`data_types`.
@@ -89,18 +95,16 @@ We can also load the same result set as a :class:`pyarrow.Table` instance::
 
    arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
 
-In the NumPy case, the return value is a dictionary where the keys are field
-names and values are corresponding :class:`numpy.ndarray` instances::
+a :class:`polars.DataFrame`::
 
-   ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
+   df = client.db.data.find_polars_all({'amount': {'$gt': 0}}, schema=schema)
 
+or as **Numpy arrays**::
 
-Arrays (and nested arrays) are also supported::
-
-   from pyarrow import list_, string
-   schema = Schema({'_id': int, 'amount': float, 'txns': list_(string())})
-   arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
+   ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
 
+In the NumPy case, the return value is a dictionary where the keys are field
+names and values are corresponding :class:`numpy.ndarray` instances.
 
 .. note::
    For all of the examples above, the schema can be omitted like so::
@@ -130,16 +134,18 @@ More information on aggregation pipelines can be found `here <https://www.mongod
 
 Writing to MongoDB
 -----------------------
-Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type, Pandas'
-:class:`~pandas.DataFrame` type, or NumPy's :class:`~numpy.ndarray` type can
+All of these types, Arrow's :class:`~pyarrow.Table`, Pandas'
+:class:`~pandas.DataFrame`, NumPy's :class:`~numpy.ndarray`, or :class:`~polars.DataFrame` can
 be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function::
 
-  from pymongoarrow.api import write
-  from pymongo import MongoClient
-  coll = MongoClient().db.my_collection
-  write(coll, df)
-  write(coll, arrow_table)
-  write(coll, ndarrays)
+   from pymongoarrow.api import write
+   from pymongo import MongoClient
+   coll = MongoClient().db.my_collection
+   write(coll, df)
+   write(coll, arrow_table)
+   write(coll, ndarrays)
+
+(Keep in mind that NumPy arrays are specified as ``dict[str, ndarray]``.)
 
 Writing to other formats
 ------------------------
@@ -157,6 +163,15 @@ referenced by the variable ``df`` to a CSV file ``out.csv``, for example, run::
 
    df.to_csv('out.csv', index=False)
 
+The Polars API is a mix of the two::
+
+
+   import polars as pl
+   df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
+   df.write_parquet('example.parquet')
+
+
+
 .. note::
 
    Nested data is supported for parquet read/write but is not well supported
