Quick Start
===========

This tutorial compares using PyMongo alone with using **PyMongoArrow**. The
reader is assumed to be familiar with basic
`PyMongo <https://pymongo.readthedocs.io/en/stable/tutorial.html>`_ and
`MongoDB <https://docs.mongodb.com>`_ concepts.


Reading Data
~~~~~~~~~~~~

The most basic way to read data using PyMongo is:

.. code-block:: python

   coll = db.benchmark
   f = list(coll.find({}, projection={"_id": 0}))
   table = pyarrow.Table.from_pylist(f)

This works, but we have to exclude the "_id" field because otherwise we get
this error::

   pyarrow.lib.ArrowInvalid: Could not convert ObjectId('642f2f4720d92a85355671b3') with type ObjectId: did not recognize Python value type when inferring an Arrow data type

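
Nothing about ObjectId is special here: PyArrow's type inference fails the
same way for any Python class it does not recognize. A minimal stand-in that
reproduces the error without a MongoDB connection (the ``Opaque`` class is
purely illustrative):

.. code-block:: python

   import pyarrow

   class Opaque:
       """Stands in for any Python type Arrow cannot infer, e.g. bson.ObjectId."""

   try:
       pyarrow.Table.from_pylist([{"_id": Opaque(), "x": 1}])
   except pyarrow.lib.ArrowInvalid:
       # The same "did not recognize Python value type" error as above.
       print("Arrow could not infer a type for Opaque")
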
The workaround gets ugly (especially if you're using more than ObjectIds):

.. code-block:: pycon

   >>> f = list(coll.find({}))
   >>> for doc in f:
   ...     doc["_id"] = str(doc["_id"])
   ...
   >>> table = pyarrow.Table.from_pylist(f)
   >>> print(table)
   pyarrow.Table
   _id: string
   x: int64
   y: double

Even though this avoids the error, an unfortunate drawback is that Arrow
cannot tell that the value is an ObjectId: the schema shows "_id" as a plain
string.

The primary benefit of PyMongoArrow is support for BSON types through
Arrow/Pandas extension types, which lets you avoid this workaround entirely:

.. code-block:: pycon

   >>> from pymongoarrow.types import ObjectIdType
   >>> schema = Schema({"_id": ObjectIdType(), "x": pyarrow.int64(), "y": pyarrow.float64()})
   >>> table = find_arrow_all(coll, {}, schema=schema)
   >>> print(table)
   pyarrow.Table
   _id: extension<arrow.py_extension_type<ObjectIdType>>
   x: int64
   y: double

This also lets Arrow correctly identify the type! That is of limited utility
for non-numeric extension types, but if you wanted to, for example, sort
datetimes, it avoids unnecessary casting:

.. code-block:: python

   f = list(coll.find({}, projection={"_id": 0, "x": 0}))
   naive_table = pyarrow.Table.from_pylist(f)

   schema = Schema({"time": pyarrow.timestamp("ms")})
   table = find_arrow_all(coll, {}, schema=schema)

   assert (
       table.sort_by([("time", "ascending")])["time"]
       == naive_table["time"].cast(pyarrow.timestamp("ms")).sort()
   )

Additionally, PyMongoArrow supports Pandas extension types.
With PyMongo, a Decimal128 value behaves as follows:

.. code-block:: python

   coll = client.test.test
   coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
   cursor = coll.find({})
   df = pd.DataFrame(list(cursor))
   print(df.dtypes)
   # _id      object
   # value    object

The equivalent in PyMongoArrow would be:

.. code-block:: python

   from pymongoarrow.api import find_pandas_all

   coll = client.test.test
   coll.insert_many([{"value": Decimal128(str(i))} for i in range(200)])
   df = find_pandas_all(coll, {})
   print(df.dtypes)
   # _id      bson_PandasObjectId
   # value    bson_PandasDecimal128

In both cases, the underlying values are the BSON class type:

.. code-block:: python

   print(df["value"][0])
   # Decimal128("0")

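
Because the values are genuine ``bson.Decimal128`` objects, they can be
converted to Python's ``decimal.Decimal`` for arithmetic via the standard
``Decimal128.to_decimal()`` method. A small sketch (the values here are
illustrative, not taken from the collection above):

.. code-block:: python

   from decimal import Decimal

   from bson import Decimal128

   # Decimal128 itself does not support arithmetic; convert first.
   value = Decimal128("0.1")
   assert value.to_decimal() + Decimal("0.2") == Decimal("0.3")
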

Writing Data
~~~~~~~~~~~~

Writing data from an Arrow table using PyMongo looks like the following:

.. code-block:: python

   data = arrow_table.to_pylist()
   db.collname.insert_many(data)

The equivalent in PyMongoArrow is:

.. code-block:: python

   from pymongoarrow.api import write

   write(db.collname, arrow_table)

As of PyMongoArrow 1.0, the main advantage of the ``write`` function is that
it iterates over the Arrow table / data frame / NumPy array and does not
convert the entire object to a list.
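
The difference can be sketched with PyArrow alone: ``to_pylist`` materializes
every document at once, whereas iterating the table's record batches yields
documents incrementally. The ``iter_docs`` helper below is hypothetical, not a
PyMongoArrow API:

.. code-block:: python

   import pyarrow

   arrow_table = pyarrow.table({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

   # Conventional route: the entire table becomes one list of dicts in memory.
   all_docs = arrow_table.to_pylist()

   # Batch-wise route: documents are produced one record batch at a time,
   # so the full list never needs to exist at once.
   def iter_docs(table, batch_size=1000):
       for batch in table.to_batches(max_chunksize=batch_size):
           yield from batch.to_pylist()

   # PyMongo's insert_many accepts a generator, so this would stream:
   # db.collname.insert_many(iter_docs(arrow_table))
   assert list(iter_docs(arrow_table)) == all_docs
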

Benchmarks
~~~~~~~~~~

The following measurements were taken with PyMongoArrow 1.0 and PyMongo 4.4.
For insertions, the library performs about the same as conventional PyMongo
and uses the same amount of memory::

   ProfileInsertSmall.peakmem_insert_conventional   107M
   ProfileInsertSmall.peakmem_insert_arrow          108M
   ProfileInsertSmall.time_insert_conventional      202±0.8ms
   ProfileInsertSmall.time_insert_arrow             181±0.4ms

   ProfileInsertLarge.peakmem_insert_arrow          127M
   ProfileInsertLarge.peakmem_insert_conventional   125M
   ProfileInsertLarge.time_insert_arrow             425±1ms
   ProfileInsertLarge.time_insert_conventional      440±1ms

For reads, the library is somewhat slower for small documents and nested
documents, but faster for large documents. It uses less memory in all cases::
| 146 | + |
| 147 | + ProfileReadSmall.peakmem_conventional_arrow 85.8M |
| 148 | + ProfileReadSmall.peakmem_to_arrow 83.1M |
| 149 | + ProfileReadSmall.time_conventional_arrow 38.1±0.3ms |
| 150 | + ProfileReadSmall.time_to_arrow 60.8±0.3ms |
| 151 | + |
| 152 | + ProfileReadLarge.peakmem_conventional_arrow 138M |
| 153 | + ProfileReadLarge.peakmem_to_arrow 106M |
| 154 | + ProfileReadLarge.time_conventional_ndarray 243±20ms |
| 155 | + ProfileReadLarge.time_to_arrow 186±0.8ms |
| 156 | + |
| 157 | + ProfileReadDocument.peakmem_conventional_arrow 209M |
| 158 | + ProfileReadDocument.peakmem_to_arrow 152M |
| 159 | + ProfileReadDocument.time_conventional_arrow 865±7ms |
| 160 | + ProfileReadDocument.time_to_arrow 937±1ms |