
Commit 684c2ed

ARROW-200 Cleans up Getting Started in docs (#186)
* ARROW-200 Fix typos and ambiguities in Getting Started
* Added meat to the Aggregate Operations section
* Added Schema.__repr__. Helps disambiguate Arrow, Pandas, Bson objects
* Replaced deprecated pkg_resources with packaging.version
* Changed order of sections so that it flows, while not drawing attention to strange behavior with list_ in pandas and numpy
* Corrected schema for ListType example in Getting Started.
1 parent 5b73be2 commit 684c2ed

File tree

2 files changed (+52, -64 lines)

bindings/python/docs/source/quickstart.rst

Lines changed: 49 additions & 64 deletions
@@ -44,17 +44,17 @@ e.g. :meth:`~pymongoarrow.api.find_pandas_all`.
 
 Test data
 ^^^^^^^^^
-Before we begein, we must first add some data to our cluster that we can
+Before we begin, we must first add some data to our cluster that we can
 query. We can do so using **PyMongo**::
 
     from datetime import datetime
     from pymongo import MongoClient
     client = MongoClient()
     client.db.data.insert_many([
-        {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': { 'name': "Customer1", 'account_number': 1}}, "txns": [1, 2, 3]},
-        {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': { 'name': "Customer2", 'account_number': 2}}, "txns": [1, 2, 3]},
-        {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': { 'name': "Customer3", 'account_number': 3}}, "txns": [1, 2, 3]},
-        {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': { 'name': "Customer4", 'account_number': 4}}, "txns": [1, 2, 3]}])
+        {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
+        {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
+        {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
+        {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}])
 
 Defining the schema
 -------------------
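The corrected test data above gives each customer a ``txns`` list of a different length, which is what makes the later ``list_`` schema example meaningful. A standalone sanity check of the documents' shape (plain Python, no MongoDB server required; the documents are copied from the hunk above):

```python
from datetime import datetime

# The same four documents the quickstart inserts via PyMongo,
# reproduced here only to inspect their shape.
docs = [
    {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1),
     'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
    {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11),
     'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
    {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9),
     'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
    {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31),
     'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']},
]

# Variable-length lists are why the schema later needs a list type specifier.
txn_lengths = [len(d['txns']) for d in docs]
print(txn_lengths)  # [1, 2, 3, 4]
```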
@@ -67,28 +67,14 @@ to type-specifiers, e.g.::
     from pymongoarrow.api import Schema
     schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
 
-There are multiple permissible type-identifiers for each supported BSON type.
-For a full-list of data types and associated type-identifiers see
-:doc:`data_types`.
 
 Nested data (embedded documents) are also supported::
 
-    from pymongoarrow.api import Schema
     schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})
 
-Arrays (and nested arrays) are also supported::
-
-    from pymongoarrow.api import Schema
-    schema = Schema({'_id': int, 'amount': float, 'txns': list_(int32())})
-
-.. note::
-
-    For all of the examples below, the schema can be omitted like so::
-
-        arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
-
-    In this case, PyMongoArrow will try to automatically apply a schema based on
-    the data contained in the first batch.
+There are multiple permissible type-identifiers for each supported BSON type.
+For a full-list of data types and associated type-identifiers see
+:doc:`data_types`.
 
 
 Find operations
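To illustrate the "multiple permissible type-identifiers" point in the hunk above: several spellings can name the same underlying type. The table below is a hypothetical sketch of such a normalization (it is not PyMongoArrow's actual internals; the real canonical names are listed in its data_types documentation):

```python
# Hypothetical normalization table mapping Python builtins and string
# aliases to one canonical type name (illustrative only).
_CANONICAL = {
    int: 'int64',
    'int64': 'int64',
    float: 'float64',
    'double': 'float64',
    str: 'string',
    'string': 'string',
}

def normalize(type_identifier):
    """Map any permissible identifier to a single canonical name."""
    try:
        return _CANONICAL[type_identifier]
    except KeyError:
        raise TypeError(f'unsupported type identifier: {type_identifier!r}')

print(normalize(int), normalize('double'))  # int64 float64
```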
@@ -103,76 +89,75 @@ We can also load the same result set as a :class:`pyarrow.Table` instance::
 
     arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
 
-Or as :class:`numpy.ndarray` instances::
+In the NumPy case, the return value is a dictionary where the keys are field
+names and values are corresponding :class:`numpy.ndarray` instances::
 
     ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
 
-In the NumPy case, the return value is a dictionary where the keys are field
-names and values are the corresponding arrays.
 
-Nested data (embedded documents) are also supported::
+Arrays (and nested arrays) are also supported::
 
-    from pymongoarrow.api import Schema
-    schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})
+    from pyarrow import list_, string
+    schema = Schema({'_id': int, 'amount': float, 'txns': list_(string())})
     arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
 
-Arrays (and nested arrays) are also supported::
 
-    from pymongoarrow.api import Schema
-    schema = Schema({'_id': int, 'amount': float, 'txns': list_(int32())})
-    arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
+.. note::
+    For all of the examples above, the schema can be omitted like so::
+
+        arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
+
+    In this case, PyMongoArrow will try to automatically apply a schema based on
+    the data contained in the first batch.
+
 
 Aggregate operations
 --------------------
-Running ``aggregate`` operations is similar to ``find``. Here is an example of
-an aggregation that loads all records with an ``amount`` less than 10::
+Running an ``aggregate`` operation is similar to ``find``, but it takes a sequence of operations to perform.
+Here is a simple example of ``aggregate_pandas_all`` that outputs a new dataframe
+in which all ``_id`` values are grouped together and their ``amount`` values summed::
 
-    # pandas
-    df = client.db.data.aggregate_pandas_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
-    # arrow
-    arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
-    # numpy
-    ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
+    df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': { '$sum': '$amount' }}}])
 
-Nested data (embedded documents) are also supported::
+Nested data (embedded documents) are also supported.
+In this more complex example, we unwind values in the nested ``txns`` field, count the number of each,
+then return as a list of numpy ndarrays sorted in decreasing order::
 
-    from pymongoarrow.api import Schema
-    schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})
-    arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
-    arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
+    pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {"count": -1}}]
+    ndarrays = client.db.data.aggregate_numpy_all(pipeline)
 
+More information on aggregation pipelines can be found `here <https://www.mongodb.com/docs/manual/core/aggregation-pipeline/>`_.
+
+Writing to MongoDB
+------------------
+Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type, Pandas'
+:class:`~pandas.DataFrame` type, or NumPy's :class:`~numpy.ndarray` type can
+be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function::
+
+    from pymongoarrow.api import write
+    from pymongo import MongoClient
+    coll = MongoClient().db.my_collection
+    write(coll, df)
+    write(coll, arrow_table)
+    write(coll, ndarrays)
 
 Writing to other formats
 ------------------------
-Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type can
-be easily written to one of the formats supported by
-`PyArrow <https://arrow.apache.org/docs/python/index.html>`_. For example,
-to write the table referenced by the variable ``arrow_table`` to a Parquet
+Once result sets have been loaded, one can then write them to any format that the package supports.
+
+For example, to write the table referenced by the variable ``arrow_table`` to a Parquet
 file ``example.parquet``, run::
 
     import pyarrow.parquet as pq
     pq.write_table(arrow_table, 'example.parquet')
 
 Pandas also supports writing :class:`~pandas.DataFrame` instances to a variety
-of formats including CSV, and HDF. For example, to write the data frame
-referenced by the variable ``df`` to a CSV file ``out.csv``, run::
+of formats including CSV, and HDF. To write the data frame
+referenced by the variable ``df`` to a CSV file ``out.csv``, for example, run::
 
     df.to_csv('out.csv', index=False)
 
 .. note::
 
     Nested data is supported for parquet read/write but is not well supported
     by Arrow or Pandas for CSV read/write.
-
-Writing back to MongoDB
------------------------
-Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type, Pandas'
-:class:`~pandas.DataFrame` type, or NumPy's :class:`~numpy.ndarray` type can
-be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function::
-
-    from pymongoarrow.api import write
-    from pymongo import MongoClient
-    coll = MongoClient().db.my_collection
-    write(coll, df)
-    write(coll, arrow_table)
-    write(coll, ndarrays)
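The ``$unwind``/``$group``/``$sort`` pipeline in the hunk above can be emulated in plain Python to show what result to expect from the quickstart's four test documents (whose ``txns`` lists are ``['A']``, ``['A', 'B']``, ``['A', 'B', 'C']``, ``['A', 'B', 'C', 'D']``). No server is required for this sketch:

```python
from collections import Counter

# txns values from the quickstart's four test documents.
txns_per_doc = [['A'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'B', 'C', 'D']]

# $unwind: one output element per (document, txn) pair.
unwound = [t for txns in txns_per_doc for t in txns]

# $group with {'$sum': 1} is a frequency count; $sort with {'count': -1}
# orders the groups by decreasing count.
counts = sorted(Counter(unwound).items(), key=lambda kv: -kv[1])
print(counts)  # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
```

The real ``aggregate_numpy_all`` call returns the same ``_id``/``count`` pairs as NumPy arrays keyed by field name.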

bindings/python/pymongoarrow/schema.py

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ def __init__(self, schema):
     def __iter__(self):
         yield from self.typemap
 
+    def __repr__(self):
+        return f"<{self.__class__.__name__} {self.typemap!r}>"
+
     @staticmethod
     def _normalize_mapping(mapping):
         normed = {}
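The new ``__repr__`` makes it obvious at the prompt which library's schema object you are holding (Arrow, Pandas, and BSON all have similarly named types). A minimal standalone sketch using a stand-in class (not the real ``pymongoarrow.api.Schema``) shows the output format:

```python
class Schema:
    """Stand-in class carrying the same __repr__ the commit adds."""

    def __init__(self, typemap):
        self.typemap = typemap

    def __repr__(self):
        return f"<{self.__class__.__name__} {self.typemap!r}>"

print(repr(Schema({'_id': int})))  # <Schema {'_id': <class 'int'>}>
```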
