|
| 1 | +.. _schema usage: |
| 2 | + |
| 3 | +Schema Examples |
| 4 | +=============== |
| 5 | + |
| 6 | +This section shows how to use PyMongoArrow schemas in a few common situations. |
| 7 | + |
| 8 | + |
| 9 | +Nested Data With Schema |
| 10 | +----------------------- |
| 11 | + |
| 12 | +When you call the find or aggregate functions, you can describe nested data in the schema with a PyArrow ``struct`` type. A field in a |
| 13 | +sub-document may share its name with a field in the parent document, as ``start`` does in this example. |
| 14 | + |
| 15 | +.. code-block:: pycon |
| 16 | +
|
| 17 | + >>> from pymongo import MongoClient |
| 18 | + >>> from pymongoarrow.api import Schema, find_arrow_all |
| 19 | + >>> from pyarrow import struct, field, int32 |
| 20 | + >>> coll = MongoClient().db.coll |
| 21 | + >>> coll.insert_many( |
| 22 | + ... [ |
| 23 | + ... {"start": "string", "prop": {"name": "foo", "start": 0}}, |
| 24 | + ... {"start": "string", "prop": {"name": "bar", "start": 10}}, |
| 25 | + ... ] |
| 26 | + ... ) |
| 27 | + >>> arrow_table = find_arrow_all( |
| 28 | + ... coll, {}, schema=Schema({"start": str, "prop": struct([field("start", int32())])}) |
| 29 | + ... ) |
| 30 | + >>> print(arrow_table) |
| 31 | + pyarrow.Table |
| 32 | + start: string |
| 33 | + prop: struct<start: int32> |
| 34 | + child 0, start: int32 |
| 35 | + ---- |
| 36 | + start: [["string","string"]] |
| 37 | + prop: [ |
| 38 | + -- is_valid: all not null |
| 39 | + -- child 0 type: int32 |
| 40 | + [0,10]] |
| 41 | +
|
| 42 | +You can pass the same schema to ``find_pandas_all`` and ``find_numpy_all``, both imported from ``pymongoarrow.api``: |
| 43 | + |
| 44 | +.. code-block:: pycon |
| 45 | +
|
| 46 | + >>> df = find_pandas_all( |
| 47 | + ... coll, {}, schema=Schema({"start": str, "prop": struct([field("start", int32())])}) |
| 48 | + ... ) |
| 49 | + >>> print(df) |
| 50 | + start prop |
| 51 | + 0 string {'start': 0} |
| 52 | + 1 string {'start': 10} |
| 53 | +
|
| 54 | +
|
| 55 | +Nested Data With Projections |
| 56 | +---------------------------- |
| 57 | + |
| 58 | +You can also use a projection to flatten nested fields before PyMongoArrow loads the data. |
| 59 | +The following example does this for a simple nested document structure; ``ObjectIdType`` is imported from ``pymongoarrow.types``. |
| 60 | + |
| 61 | +.. code-block:: pycon |
| 62 | +
|
| 63 | + >>> df = find_pandas_all( |
| 64 | + ... coll, |
| 65 | + ... { |
| 66 | + ... "prop.start": { |
| 67 | + ... "$gte": 0, |
| 68 | + ... "$lte": 10, |
| 69 | + ... } |
| 70 | + ... }, |
| 71 | + ... projection={"propName": "$prop.name", "propStart": "$prop.start"}, |
| 72 | + ... schema=Schema({"_id": ObjectIdType(), "propStart": int, "propName": str}), |
| 73 | + ... ) |
| 74 | + >>> print(df) |
| 75 | + _id propStart propName |
| 76 | + 0 b'c\xec2\x98R(\xc9\x1e@#\xcc\xbb' 0 foo |
| 77 | + 1 b'c\xec2\x98R(\xc9\x1e@#\xcc\xbc' 10 bar |
| 78 | +
|
| 79 | +
|
| 80 | +With aggregate, you can flatten the fields in a ``$project`` stage (``aggregate_pandas_all`` is imported from ``pymongoarrow.api``): |
| 81 | + |
| 82 | +.. code-block:: pycon |
| 83 | +
|
| 84 | + >>> df = aggregate_pandas_all( |
| 85 | + ... coll, |
| 86 | + ... pipeline=[ |
| 87 | + ... {"$match": {"prop.start": {"$gte": 0, "$lte": 10}}}, |
| 88 | + ... { |
| 89 | + ... "$project": { |
| 90 | + ... "propStart": "$prop.start", |
| 91 | + ... "propName": "$prop.name", |
| 92 | + ... } |
| 93 | + ... }, |
| 94 | + ... ], |
| 95 | + ... ) |