Commit 722df95

ARROW-150 Improve documentation about limitations surrounding schemas and nested data structures (#130)
1 parent 745ff0c commit 722df95

3 files changed, 101 additions and 0 deletions

bindings/python/docs/source/index.rst

Lines changed: 4 additions & 0 deletions
@@ -30,6 +30,9 @@ know to use **PyMongoArrow**.
 :doc:`api/index`
   The complete API documentation, organized by module.
 
+:doc:`schemas`
+  Important notes about the usage of PyMongoArrow Schemas.
+
 Getting Help
 ------------
 If you're having trouble or have questions about PyMongoArrow, ask your question on
@@ -91,3 +94,4 @@ Indices and tables
    api/index
    changelog
    developer/index
+   schemas
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+.. _schema usage:
+
+Schema Examples
+===============
+
+The following examples show how to use PyMongoArrow schemas in common situations.
+
+
+Nested Data With Schema
+-----------------------
+
+With the aggregate or find methods, you can provide a schema for nested data using the struct object. A sub-document
+field may share its name with a field in the parent document; in the example below, ``start`` appears at both levels.
+
+.. code-block:: pycon
+
+   >>> from pymongo import MongoClient
+   >>> from pymongoarrow.api import Schema, find_arrow_all
+   >>> from pyarrow import struct, field, int32
+   >>> coll = MongoClient().db.coll
+   >>> coll.insert_many(
+   ...     [
+   ...         {"start": "string", "prop": {"name": "foo", "start": 0}},
+   ...         {"start": "string", "prop": {"name": "bar", "start": 10}},
+   ...     ]
+   ... )
+   >>> arrow_table = find_arrow_all(
+   ...     coll, {}, schema=Schema({"start": str, "prop": struct([field("start", int32())])})
+   ... )
+   >>> print(arrow_table)
+   pyarrow.Table
+   start: string
+   prop: struct<start: int32>
+     child 0, start: int32
+   ----
+   start: [["string","string"]]
+   prop: [
+     -- is_valid: all not null
+     -- child 0 type: int32
+   [0,10]]
+
+The same schema works with Pandas and NumPy. For example, using ``find_pandas_all`` from ``pymongoarrow.api``:
+
+.. code-block:: pycon
+
+   >>> df = find_pandas_all(
+   ...     coll, {}, schema=Schema({"start": str, "prop": struct([field("start", int32())])})
+   ... )
+   >>> print(df)
+       start           prop
+   0  string   {'start': 0}
+   1  string  {'start': 10}
+
+
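In the examples above, the struct passed to ``Schema`` names only ``start``, and the printed ``prop`` column contains only that child even though the inserted sub-documents also carry ``name``. The following pure-Python sketch mimics that selection on plain dicts. It is not PyMongoArrow's implementation: ``select_fields`` and the dict-based ``spec`` format are hypothetical names introduced only for illustration.

```python
from typing import Any, Dict

# Hypothetical spec format (not a PyMongoArrow API): each value is either
# True (keep the field) or a nested dict (keep only the listed sub-fields).
Spec = Dict[str, Any]


def select_fields(doc: Dict[str, Any], spec: Spec) -> Dict[str, Any]:
    """Keep only the fields of ``doc`` that are named in ``spec``."""
    selected: Dict[str, Any] = {}
    for name, sub_spec in spec.items():
        value = doc.get(name)
        if isinstance(sub_spec, dict) and isinstance(value, dict):
            # Names are resolved per nesting level, so the "start" inside
            # "prop" never collides with the top-level "start".
            selected[name] = select_fields(value, sub_spec)
        else:
            selected[name] = value
    return selected


docs = [
    {"start": "string", "prop": {"name": "foo", "start": 0}},
    {"start": "string", "prop": {"name": "bar", "start": 10}},
]
spec = {"start": True, "prop": {"start": True}}
rows = [select_fields(doc, spec) for doc in docs]
print(rows)
```

Each resulting row keeps the top-level ``start`` string and a ``prop`` sub-document reduced to its own ``start`` integer, which matches the shape of the Pandas output shown above.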
+Nested Data With Projections
+----------------------------
+
+You can also use projections to flatten the data before it is ingested by PyMongoArrow. The following example
+does this for a very simple nested document structure; ``ObjectIdType`` is imported from ``pymongoarrow.types``.
+
+.. code-block:: pycon
+
+   >>> df = find_pandas_all(
+   ...     coll,
+   ...     {
+   ...         "prop.start": {
+   ...             "$gte": 0,
+   ...             "$lte": 10,
+   ...         }
+   ...     },
+   ...     projection={"propName": "$prop.name", "propStart": "$prop.start"},
+   ...     schema=Schema({"_id": ObjectIdType(), "propStart": int, "propName": str}),
+   ... )
+   >>> print(df)
+                                    _id  propStart  propName
+   0  b'c\xec2\x98R(\xc9\x1e@#\xcc\xbb'          0       foo
+   1  b'c\xec2\x98R(\xc9\x1e@#\xcc\xbc'         10       bar
+
+
+For aggregate, you can flatten the fields using the ``$project`` stage (``aggregate_pandas_all`` is imported from ``pymongoarrow.api``):
+
+.. code-block:: pycon
+
+   >>> df = aggregate_pandas_all(
+   ...     coll,
+   ...     pipeline=[
+   ...         {"$match": {"prop.start": {"$gte": 0, "$lte": 10}}},
+   ...         {
+   ...             "$project": {
+   ...                 "propStart": "$prop.start",
+   ...                 "propName": "$prop.name",
+   ...             }
+   ...         },
+   ...     ],
+   ... )
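
Both projection examples above rename nested paths (``$prop.name``, ``$prop.start``) to top-level fields. As a rough illustration of the effect, the sketch below applies the same mapping to plain Python dicts. The server performs the real projection; ``resolve_path`` and ``flatten_with_projection`` are hypothetical helpers, the integer ``_id`` values are made up, and only simple ``$a.b`` field paths are handled.

```python
from typing import Any, Dict

_MISSING = object()  # sentinel for a path that does not exist in the document


def resolve_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as "prop.start" through nested dicts."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def flatten_with_projection(
    doc: Dict[str, Any], projection: Dict[str, str]
) -> Dict[str, Any]:
    """Build a flat document from {"newName": "$dotted.path"} entries."""
    flat: Dict[str, Any] = {}
    if "_id" in doc:
        # MongoDB returns _id unless the projection excludes it explicitly.
        flat["_id"] = doc["_id"]
    for new_name, expr in projection.items():
        if not expr.startswith("$"):
            raise ValueError(f"expected a '$'-prefixed field path, got {expr!r}")
        value = resolve_path(doc, expr[1:])
        if value is not _MISSING:
            # Assumption: a missing path simply omits the output field.
            flat[new_name] = value
    return flat


docs = [
    {"_id": 1, "start": "string", "prop": {"name": "foo", "start": 0}},
    {"_id": 2, "start": "string", "prop": {"name": "bar", "start": 10}},
]
projection = {"propName": "$prop.name", "propStart": "$prop.start"}
flat_rows = [flatten_with_projection(doc, projection) for doc in docs]
print(flat_rows)
```

Because the flattened rows contain only scalar top-level fields, a schema over them needs no struct type, which is why the ``find_pandas_all`` example can declare ``propStart`` as ``int`` and ``propName`` as ``str``.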

bindings/python/pymongoarrow/schema.py

Lines changed: 2 additions & 0 deletions
@@ -27,6 +27,8 @@ class Schema:
     Each key in ``schema`` is a field name and its corresponding value
     is the expected type of the data contained in the named field.
 
+    For more examples, see :ref:`schema usage`.
+
     Data types can be specified as pyarrow type instances (e.g.
     an instance of :class:`pyarrow.int64`), bson types (e.g.
     :class:`bson.Int64`), or python type-identifiers (e.g. ``int``,
