@@ -44,17 +44,17 @@ e.g. :meth:`~pymongoarrow.api.find_pandas_all`.
Test data
^^^^^^^^^
- Before we begein, we must first add some data to our cluster that we can
+ Before we begin, we must first add some data to our cluster that we can
query. We can do so using **PyMongo**::

from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.insert_many([
- {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': "Customer1", 'account_number': 1}}, "txns": [1, 2, 3]},
- {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': "Customer2", 'account_number': 2}}, "txns": [1, 2, 3]},
- {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': "Customer3", 'account_number': 3}}, "txns": [1, 2, 3]},
- {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': "Customer4", 'account_number': 4}}, "txns": [1, 2, 3]}])
+ {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
+ {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
+ {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
+ {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}])

Defining the schema
-------------------
@@ -67,28 +67,14 @@ to type-specifiers, e.g.::

from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})

- There are multiple permissible type-identifiers for each supported BSON type.
- For a full-list of data types and associated type-identifiers see
- :doc:`data_types`.

Nested data (embedded documents) are also supported::

- from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})
- Arrays (and nested arrays) are also supported::
-
- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'txns': list_(int32())})
-
- .. note::
-
- For all of the examples below, the schema can be omitted like so::
-
- arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
-
- In this case, PyMongoArrow will try to automatically apply a schema based on
- the data contained in the first batch.
+ There are multiple permissible type-identifiers for each supported BSON type.
+ For a full list of data types and associated type-identifiers, see
+ :doc:`data_types`.
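+
+ For instance, the following two schemas are meant to be interchangeable, one
+ using plain Python types and the other their PyArrow counterparts (a sketch;
+ see :doc:`data_types` for the authoritative mapping)::
+
+ from pyarrow import int64, float64, timestamp
+ # Python type-identifiers...
+ schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
+ # ...and the equivalent PyArrow type-identifiers (BSON dates have millisecond resolution)
+ schema = Schema({'_id': int64(), 'amount': float64(), 'last_updated': timestamp('ms')})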
Find operations
@@ -103,76 +89,75 @@ We can also load the same result set as a :class:`pyarrow.Table` instance::
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
- Or as :class:`numpy.ndarray` instances::
+ In the NumPy case, the return value is a dictionary where the keys are field
+ names and the values are the corresponding :class:`numpy.ndarray` instances::
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
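
For example, with the test data above one could pull out a single column (a
minimal sketch, assuming the schema shown)::

print(ndarrays['amount'])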

- In the NumPy case, the return value is a dictionary where the keys are field
- names and values are the corresponding arrays.

- Nested data (embedded documents) are also supported::
+ Arrays (and nested arrays) are also supported::

- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})
+ from pyarrow import list_, string
+ schema = Schema({'_id': int, 'amount': float, 'txns': list_(string())})
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
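
Nested arrays compose the same way; as a sketch, a list of lists of strings
could be declared as::

schema = Schema({'_id': int, 'txns': list_(list_(string()))})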

- Arrays (and nested arrays) are also supported::
- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'txns': list_(int32())})
- arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)

+ .. note::
+ For all of the examples above, the schema can be omitted like so::
+
+ arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
+
+ In this case, PyMongoArrow will try to automatically apply a schema based on
+ the data contained in the first batch.
+

Aggregate operations
--------------------
- Running ``aggregate`` operations is similar to ``find``. Here is an example of
- an aggregation that loads all records with an ``amount`` less than 10::
+ Running an ``aggregate`` operation is similar to ``find``, but it takes a sequence of operations to perform.
+ Here is a simple example of ``aggregate_pandas_all`` that outputs a new dataframe
+ in which all ``_id`` values are grouped together and their ``amount`` values summed::
- # pandas
- df = client.db.data.aggregate_pandas_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
- # arrow
- arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
- # numpy
- ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
+ df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': {'$sum': '$amount'}}}])
- Nested data (embedded documents) are also supported::
+ Nested data (embedded documents) are also supported.
+ In this more complex example, we unwind values in the nested ``txns`` field, count the number of each,
+ then return the results as numpy ndarrays, sorted in decreasing order of count::
- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})
- arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
- arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
+ pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {'count': -1}}]
+ ndarrays = client.db.data.aggregate_numpy_all(pipeline)
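+
+ With the test data above, 'A' appears in four documents, 'B' in three, 'C' in
+ two, and 'D' in one, so the counts come back in that order.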
+ More information on aggregation pipelines can be found `here <https://www.mongodb.com/docs/manual/core/aggregation-pipeline/>`_.
+
+ Writing to MongoDB
+ -----------------------
+ Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type, Pandas'
+ :class:`~pandas.DataFrame` type, or NumPy's :class:`~numpy.ndarray` type can
+ be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function::
+
+ from pymongoarrow.api import write
+ from pymongo import MongoClient
+ coll = MongoClient().db.my_collection
+ write(coll, df)
+ write(coll, arrow_table)
+ write(coll, ndarrays)
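+
+ As a quick sanity check, one could count the documents that landed in the
+ collection afterwards (a minimal sketch using PyMongo's ``count_documents``)::
+
+ print(coll.count_documents({}))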
Writing to other formats
------------------------
- Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type can
- be easily written to one of the formats supported by
- `PyArrow <https://arrow.apache.org/docs/python/index.html>`_. For example,
- to write the table referenced by the variable ``arrow_table`` to a Parquet
+ Once result sets have been loaded, one can then write them to any format that the package supports.
+
+ For example, to write the table referenced by the variable ``arrow_table`` to a Parquet
file ``example.parquet``, run::
import pyarrow.parquet as pq
pq.write_table(arrow_table, 'example.parquet')
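
The table can then be read back with PyArrow's ``read_table``, e.g.::

arrow_table = pq.read_table('example.parquet')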
Pandas also supports writing :class:`~pandas.DataFrame` instances to a variety
- of formats including CSV, and HDF. For example, to write the data frame
- referenced by the variable ``df`` to a CSV file ``out.csv``, run::
+ of formats including CSV and HDF. To write the data frame
+ referenced by the variable ``df`` to a CSV file ``out.csv``, for example, run::
df.to_csv('out.csv', index=False)
.. note::
Nested data is supported for Parquet read/write but is not well supported
by Arrow or Pandas for CSV read/write.
-
- Writing back to MongoDB
- -----------------------
- Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type, Pandas'
- :class:`~pandas.DataFrame` type, or NumPy's :class:`~numpy.ndarray` type can
- be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function::
-
- from pymongoarrow.api import write
- from pymongo import MongoClient
- coll = MongoClient().db.my_collection
- write(coll, df)
- write(coll, arrow_table)
- write(coll, ndarrays)