ARROW-150 Improve documentation about limitations surrounding schemas and nested data structures (#130)

juliusgeo · web-flow · commit 722df95df79f · 2023-02-21T14:28:52.000-08:00
diff --git a/bindings/python/docs/source/index.rst b/bindings/python/docs/source/index.rst
@@ -30,6 +30,9 @@ know to use **PyMongoArrow**.
 :doc:`api/index`
   The complete API documentation, organized by module.
 
+:doc:`schemas`
+  Important notes about the usage of PyMongoArrow Schemas.
+
 Getting Help
 ------------
 If you're having trouble or have questions about PyMongoArrow, ask your question on
@@ -91,3 +94,4 @@ Indices and tables
    api/index
    changelog
    developer/index
+   schemas
diff --git a/bindings/python/docs/source/schemas.rst b/bindings/python/docs/source/schemas.rst
@@ -0,0 +1,95 @@
+.. _schema usage:
+
+Schema Examples
+===============
+
+The following are a few examples of usage of PyMongoArrow Schemas in common situations.
+
+
+Nested Data With Schema
+-----------------------
+
+With aggregate or find methods, you can provide a schema for nested data using the struct object. Note that there can be conflicting
+names in sub-documents compared to their parent documents.
+
+.. code-block:: pycon
+
+   >>> from pymongo import MongoClient
+   ... from pymongoarrow.api import Schema, find_arrow_all
+   ... from pyarrow import struct, field, int32
+   ... coll = MongoClient().db.coll
+   ... coll.insert_many(
+   ...     [
+   ...         {"start": "string", "prop": {"name": "foo", "start": 0}},
+   ...         {"start": "string", "prop": {"name": "bar", "start": 10}},
+   ...     ]
+   ... )
+   ... arrow_table = find_arrow_all(
+   ...     coll, {}, schema=Schema({"start": str, "prop": struct([field("start", int32())])})
+   ... )
+   ... print(arrow_table)
+   pyarrow.Table
+   start: string
+   prop: struct<start: int32>
+     child 0, start: int32
+   ----
+   start: [["string","string"]]
+   prop: [
+     -- is_valid: all not null
+     -- child 0 type: int32
+   [0,10]]
+
+For Pandas and NumPy you can do the same exact thing:
+
+.. code-block:: pycon
+
+   >>> df = find_pandas_all(
+   ...     coll, {}, schema=Schema({"start": str, "prop": struct([field("start", int32())])})
+   ... )
+   ... print(df)
+       start           prop
+   0  string   {'start': 0}
+   1  string  {'start': 10}
+
+
+Nested Data With Projections
+----------------------------
+
+One can also use projections to flatten the data prior to ingesting into PyMongoArrow.
+The following example illustrates how to do it with a very simple nested document structure.
+
+.. code-block:: pycon
+
+   >>> df = find_pandas_all(
+   ...     coll,
+   ...     {
+   ...         "prop.start": {
+   ...             "$gte": 0,
+   ...             "$lte": 10,
+   ...         }
+   ...     },
+   ...     projection={"propName": "$prop.name", "propStart": "$prop.start"},
+   ...     schema=Schema({"_id": ObjectIdType(), "propStart": int, "propName": str}),
+   ... )
+   ... print(df)
+                                    _id  propStart propName
+   0  b'c\xec2\x98R(\xc9\x1e@#\xcc\xbb'          0      foo
+   1  b'c\xec2\x98R(\xc9\x1e@#\xcc\xbc'         10      bar
+
+
+For aggregate you can flatten the fields using the `$project` stage, like so:
+
+.. code-block:: pycon
+
+   >>> df = aggregate_pandas_all(
+   ...     coll,
+   ...     pipeline=[
+   ...         {"$match": {"prop.start": {"$gte": 0, "$lte": 10}}},
+   ...         {
+   ...             "$project": {
+   ...                 "propStart": "$prop.start",
+   ...                 "propName": "$prop.name",
+   ...             }
+   ...         },
+   ...     ],
+   ... )
diff --git a/bindings/python/pymongoarrow/schema.py b/bindings/python/pymongoarrow/schema.py
@@ -27,6 +27,8 @@ class Schema:
     Each key in ``schema`` is a field name and its corresponding value
     is the expected type of the data contained in the named field.
 
+    For more examples, see :ref:`schema usage`.
+
     Data types can be specified as pyarrow type instances (e.g.
     an instance of :class:`pyarrow.int64`), bson types (e.g.
     :class:`bson.Int64`), or python type-identifiers (e.g. ``int``,