Commit 6b33ad1

blink1073 and juliusgeo authored
ARROW-129 Documentation should provide a comparison with using PyMongo directly (#158)
Co-authored-by: Julius Park <[email protected]>
1 parent 2fe34e0 commit 6b33ad1

File tree

2 files changed

+164
-0
lines changed

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
1+
Comparing to PyMongo
2+
====================
3+
4+
This tutorial compares using PyMongo on its own with using
5+
**PyMongoArrow**. The reader is assumed to be familiar with basic
6+
`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`_ and
7+
`MongoDB <https://docs.mongodb.com>`_ concepts.
8+
9+
10+
Reading Data
11+
~~~~~~~~~~~~
12+
The most basic way to read data using PyMongo is:
13+
14+
.. code-block:: python
15+
16+
coll = db.benchmark
17+
f = list(coll.find({}, projection={"_id": 0}))
18+
table = pyarrow.Table.from_pylist(f)
19+
20+
This works, but we have to exclude the "_id" field because otherwise we get this error::
21+
22+
pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type
23+
24+
The workaround gets ugly (especially if you're using more than ObjectIds):
25+
26+
.. code-block:: pycon
27+
28+
>>> f = list(coll.find({}))
29+
>>> for doc in f:
30+
... doc["_id"] = str(doc["_id"])
31+
...
32+
>>> table = pyarrow.Table.from_pylist(f)
33+
>>> print(table)
34+
pyarrow.Table
35+
_id: string
36+
x: int64
37+
y: double
38+
39+
Even though this avoids the error, an unfortunate drawback is that Arrow cannot identify that it is an ObjectId,
40+
as noted by the schema showing "_id" is a string.
41+
The primary benefit that PyMongoArrow gives is support for BSON types through Arrow/Pandas Extension Types. This allows you to avoid the ugly workaround:
42+
43+
.. code-block:: pycon
44+
45+
>>> from pymongoarrow.api import Schema, find_arrow_all
>>> from pymongoarrow.types import ObjectIdType
46+
>>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
47+
>>> table = find_arrow_all(coll, {}, schema=schema)
48+
>>> print(table)
49+
pyarrow.Table
50+
_id: extension<arrow.py_extension_type<ObjectIdType>>
51+
x: int64
52+
y: double
53+
54+
And it also lets Arrow correctly identify the type! This is limited in utility for non-numeric extension types,
55+
but if you want to sort datetimes, for example, it avoids unnecessary casting:
56+
57+
.. code-block:: python
58+
59+
f = list(coll.find({}, projection={"_id": 0, "x": 0}))
60+
naive_table = pyarrow.Table.from_pylist(f)
61+
62+
schema = Schema({"time": pyarrow.timestamp("ms")})
63+
table = find_arrow_all(coll, {}, schema=schema)
64+
65+
assert (
66+
table.sort_by([("time", "ascending")])["time"]
67+
== naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
68+
)
69+
70+
Additionally, PyMongoArrow supports Pandas extension types.
71+
With PyMongo, a Decimal128 value behaves as follows:
72+
73+
.. code-block:: python
74+
75+
coll = client.test.test
76+
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
77+
cursor = coll.find({})
78+
df = pd.DataFrame(list(cursor))
79+
print(df.dtypes)
80+
# _id object
81+
# value object
82+
83+
The equivalent in PyMongoArrow would be:
84+
85+
.. code-block:: python
86+
87+
from pymongoarrow.api import find_pandas_all
88+
89+
coll = client.test.test
90+
coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
91+
df = find_pandas_all(coll, {})
92+
print(df.dtypes)
93+
# _id bson_PandasObjectId
94+
# value bson_PandasDecimal128
95+
96+
In both cases the underlying values are the bson class type:
97+
98+
.. code-block:: python
99+
100+
print(repr(df["value"][0]))
100+
# Decimal128('0')
102+
103+
104+
Writing Data
105+
~~~~~~~~~~~~
106+
107+
Writing data from an Arrow table using PyMongo looks like the following:
108+
109+
.. code-block:: python
110+
111+
data = arrow_table.to_pylist()
112+
db.collname.insert_many(data)
113+
114+
The equivalent in PyMongoArrow is:
115+
116+
.. code-block:: python
117+
118+
from pymongoarrow.api import write
119+
120+
write(db.collname, arrow_table)
121+
122+
As of PyMongoArrow 1.0, the main advantage to using the ``write`` function
123+
is that it will iterate over the Arrow table, data frame, or NumPy array
124+
and not convert the entire object to a list.
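The difference between converting a whole object to a list and iterating over it can be sketched in plain Python. This is only an illustration of batched iteration, not PyMongoArrow's actual implementation; the ``iter_batches`` helper and the batch size of 4 are hypothetical, and a generator stands in for rows read lazily from a table.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for rows that are produced lazily.
rows = ({"x": i, "y": i / 2} for i in range(10))

batch_sizes = []
for batch in iter_batches(rows, batch_size=4):
    # Only one batch is held in memory at a time; with PyMongo,
    # each batch could be handed to insert_many at this point.
    batch_sizes.append(len(batch))

print(batch_sizes)  # [4, 4, 2]
```

Only one batch is materialized at a time, so peak memory depends on the batch size and not on the total number of rows.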
125+
126+
127+
Benchmarks
128+
~~~~~~~~~~
129+
130+
The following measurements were taken with PyMongoArrow 1.0 and PyMongo 4.4.
131+
For insertions, the library performs about the same as when using PyMongo
132+
(conventional) and uses about the same amount of memory::
133+
134+
ProfileInsertSmall.peakmem_insert_conventional 107M
135+
ProfileInsertSmall.peakmem_insert_arrow 108M
136+
ProfileInsertSmall.time_insert_conventional 202±0.8ms
137+
ProfileInsertSmall.time_insert_arrow 181±0.4ms
138+
139+
ProfileInsertLarge.peakmem_insert_arrow 127M
140+
ProfileInsertLarge.peakmem_insert_conventional 125M
141+
ProfileInsertLarge.time_insert_arrow 425±1ms
142+
ProfileInsertLarge.time_insert_conventional 440±1ms
143+
144+
For reads, the library is somewhat slower for small documents and nested
145+
documents, but faster for large documents. It uses less memory in all cases::
146+
147+
ProfileReadSmall.peakmem_conventional_arrow 85.8M
148+
ProfileReadSmall.peakmem_to_arrow 83.1M
149+
ProfileReadSmall.time_conventional_arrow 38.1±0.3ms
150+
ProfileReadSmall.time_to_arrow 60.8±0.3ms
151+
152+
ProfileReadLarge.peakmem_conventional_arrow 138M
153+
ProfileReadLarge.peakmem_to_arrow 106M
154+
ProfileReadLarge.time_conventional_ndarray 243±20ms
155+
ProfileReadLarge.time_to_arrow 186±0.8ms
156+
157+
ProfileReadDocument.peakmem_conventional_arrow 209M
158+
ProfileReadDocument.peakmem_to_arrow 152M
159+
ProfileReadDocument.time_conventional_arrow 865±7ms
160+
ProfileReadDocument.time_to_arrow 937±1ms
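To make the read comparison easier to check, the ratios can be computed from the figures above. This is a small sketch that only copies the reported numbers; the dictionary names and the ``ratio`` helper are hypothetical, and the ``time_conventional_ndarray`` row is assumed to be the conventional baseline for large documents.

```python
# (conventional, PyMongoArrow) pairs copied from the read benchmarks above.
read_time_ms = {
    "small": (38.1, 60.8),
    "large": (243, 186),  # baseline row is labelled time_conventional_ndarray
    "document": (865, 937),
}
# Peak memory, in the "M" units reported by the benchmark output.
read_peak_mem = {
    "small": (85.8, 83.1),
    "large": (138, 106),
    "document": (209, 152),
}


def ratio(pair):
    """Return PyMongoArrow / conventional; values below 1 favour PyMongoArrow."""
    conventional, arrow = pair
    return round(arrow / conventional, 2)


time_ratio = {name: ratio(pair) for name, pair in read_time_ms.items()}
mem_ratio = {name: ratio(pair) for name, pair in read_peak_mem.items()}

print(time_ratio)  # {'small': 1.6, 'large': 0.77, 'document': 1.08}
print(mem_ratio)   # {'small': 0.97, 'large': 0.77, 'document': 0.73}
```

A time ratio above 1 means PyMongoArrow was slower (small and nested documents), while every memory ratio is below 1, which matches the summary given before the figures.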

bindings/python/docs/source/index.rst

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,9 @@ know to use **PyMongoArrow**.
2121
:doc:`data_types`
2222
Data type support with PyMongoArrow.
2323

24+
:doc:`comparison`
25+
Comparison of using PyMongoArrow versus using PyMongo directly.
26+
2427
:doc:`faq`
2528
Frequently asked questions.
2629

@@ -86,6 +89,7 @@ Indices and tables
8689
installation
8790
quickstart
8891
data_types
92+
comparison
8993
faq
9094
api/index
9195
changelog
