@@ -44,17 +44,17 @@ e.g. :meth:`~pymongoarrow.api.find_pandas_all`.
Test data
^^^^^^^^^
- Before we begein, we must first add some data to our cluster that we can
+ Before we begin, we must first add some data to our cluster that we can
query. We can do so using **PyMongo**::

from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.insert_many([
- {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': "Customer1", 'account_number': 1}}, "txns": [1, 2, 3]},
- {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': "Customer2", 'account_number': 2}}, "txns": [1, 2, 3]},
- {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': "Customer3", 'account_number': 3}}, "txns": [1, 2, 3]},
- {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': "Customer4", 'account_number': 4}}, "txns": [1, 2, 3]}])
+ {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'account': {'name': 'Customer1', 'account_number': 1}, 'txns': ['A']},
+ {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11), 'account': {'name': 'Customer2', 'account_number': 2}, 'txns': ['A', 'B']},
+ {'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9), 'account': {'name': 'Customer3', 'account_number': 3}, 'txns': ['A', 'B', 'C']},
+ {'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31), 'account': {'name': 'Customer4', 'account_number': 4}, 'txns': ['A', 'B', 'C', 'D']}])

Defining the schema
-------------------
@@ -67,28 +67,14 @@ to type-specifiers, e.g.::

from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})

- There are multiple permissible type-identifiers for each supported BSON type.
- For a full-list of data types and associated type-identifiers see
- :doc:`data_types`.

Nested data (embedded documents) are also supported::

- from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})
- Arrays (and nested arrays) are also supported::
-
- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'txns': list_(int32())})
-
- .. note::
-
- For all of the examples below, the schema can be omitted like so::
-
- arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
-
- In this case, PyMongoArrow will try to automatically apply a schema based on
- the data contained in the first batch.
+ There are multiple permissible type-identifiers for each supported BSON type.
+ For a full list of data types and associated type-identifiers, see
+ :doc:`data_types`.
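+
+ For instance, the following two schemas are meant to be interchangeable, one
+ using plain Python types and the other their PyArrow counterparts (a sketch;
+ see :doc:`data_types` for the authoritative mapping)::
+
+ from pyarrow import int64, float64, timestamp
+ # Python type-identifiers...
+ schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
+ # ...and the equivalent PyArrow type-identifiers (BSON dates have millisecond resolution)
+ schema = Schema({'_id': int64(), 'amount': float64(), 'last_updated': timestamp('ms')})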
Find operations
@@ -103,76 +89,75 @@ We can also load the same result set as a :class:`pyarrow.Table` instance::
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
- Or as :class:`numpy.ndarray` instances::
+ In the NumPy case, the return value is a dictionary where the keys are field
+ names and the values are the corresponding :class:`numpy.ndarray` instances::
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
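
For example, with the test data above one could pull out a single column (a
minimal sketch, assuming the schema shown)::

print(ndarrays['amount'])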

- In the NumPy case, the return value is a dictionary where the keys are field
- names and values are the corresponding arrays.

- Nested data (embedded documents) are also supported::
+ Arrays (and nested arrays) are also supported::

- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})
+ from pyarrow import list_, string
+ schema = Schema({'_id': int, 'amount': float, 'txns': list_(string())})
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
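
Nested arrays compose the same way; as a sketch, a list of lists of strings
could be declared as::

schema = Schema({'_id': int, 'txns': list_(list_(string()))})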

- Arrays (and nested arrays) are also supported::
- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'txns': list_(int32())})
- arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)

+ .. note::
+ For all of the examples above, the schema can be omitted like so::
+
+ arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}})
+
+ In this case, PyMongoArrow will try to automatically apply a schema based on
+ the data contained in the first batch.
+

Aggregate operations
--------------------
- Running ``aggregate`` operations is similar to ``find``. Here is an example of
- an aggregation that loads all records with an ``amount`` less than 10::
+ Running an ``aggregate`` operation is similar to ``find``, but it takes a sequence of operations to perform.
+ Here is a simple example of ``aggregate_pandas_all`` that outputs a new dataframe
+ in which all ``_id`` values are grouped together and their ``amount`` values summed::
- # pandas
- df = client.db.data.aggregate_pandas_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
- # arrow
- arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
- # numpy
- ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
+ df = client.db.data.aggregate_pandas_all([{'$group': {'_id': None, 'total_amount': {'$sum': '$amount'}}}])
- Nested data (embedded documents) are also supported::
+ Nested data (embedded documents) are also supported.
+ In this more complex example, we unwind values in the nested ``txns`` field, count the number of each,
+ then return the results as numpy ndarrays, sorted in decreasing order of count::
- from pymongoarrow.api import Schema
- schema = Schema({'_id': int, 'amount': float, 'account': {'name': str, 'account_number': int}})
- arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
- arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
+ pipeline = [{'$unwind': '$txns'}, {'$group': {'_id': '$txns', 'count': {'$sum': 1}}}, {'$sort': {'count': -1}}]
+ ndarrays = client.db.data.aggregate_numpy_all(pipeline)
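+
+ With the test data above, 'A' appears in four documents, 'B' in three, 'C' in
+ two, and 'D' in one, so the counts come back in that order.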
+ More information on aggregation pipelines can be found `here <https://www.mongodb.com/docs/manual/core/aggregation-pipeline/>`_.
+
+ Writing to MongoDB
+ -----------------------
+ Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type, Pandas'
+ :class:`~pandas.DataFrame` type, or NumPy's :class:`~numpy.ndarray` type can
+ be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function::
+
+ from pymongoarrow.api import write
+ from pymongo import MongoClient
+ coll = MongoClient().db.my_collection
+ write(coll, df)
+ write(coll, arrow_table)
+ write(coll, ndarrays)
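+
+ As a quick sanity check, one could count the documents that landed in the
+ collection afterwards (a minimal sketch using PyMongo's ``count_documents``)::
+
+ print(coll.count_documents({}))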
Writing to other formats
------------------------
- Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type can
- be easily written to one of the formats supported by
- `PyArrow <https://arrow.apache.org/docs/python/index.html>`_. For example,
- to write the table referenced by the variable ``arrow_table`` to a Parquet
+ Once result sets have been loaded, one can then write them to any format that the package supports.
+
+ For example, to write the table referenced by the variable ``arrow_table`` to a Parquet
file ``example.parquet``, run::
import pyarrow.parquet as pq
pq.write_table(arrow_table, 'example.parquet')
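
The table can then be read back with PyArrow's ``read_table``, e.g.::

arrow_table = pq.read_table('example.parquet')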
Pandas also supports writing :class:`~pandas.DataFrame` instances to a variety
- of formats including CSV, and HDF. For example, to write the data frame
- referenced by the variable ``df`` to a CSV file ``out.csv``, run::
+ of formats including CSV and HDF. To write the data frame
+ referenced by the variable ``df`` to a CSV file ``out.csv``, for example, run::
df.to_csv('out.csv', index=False)
.. note::
Nested data is supported for Parquet read/write but is not well supported
by Arrow or Pandas for CSV read/write.
-
- Writing back to MongoDB
- -----------------------
- Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type, Pandas'
- :class:`~pandas.DataFrame` type, or NumPy's :class:`~numpy.ndarray` type can
- be easily written to your MongoDB database using the :meth:`~pymongoarrow.api.write` function::
-
- from pymongoarrow.api import write
- from pymongo import MongoClient
- coll = MongoClient().db.my_collection
- write(coll, df)
- write(coll, arrow_table)
- write(coll, ndarrays)