Commit b9c5164

ARROW-9 BSON subdocument support (#104)
1 parent 32bdd6e commit b9c5164

13 files changed: +558 -142 lines changed

.github/workflows/benchmark.yml

Lines changed: 4 additions & 4 deletions
@@ -38,10 +38,6 @@ jobs:
       - name: Install Python dependencies
         run: |
           python -m pip install -U pip
-      - name: Install pymongoarrow
-        run: |
-          # Install the library
-          LIBBSON_INSTALL_DIR=$(pwd)/libbson python -m pip install -vvv -e ".[test]"
       - name: Run tests
         run: |
           set -eu
@@ -63,10 +59,14 @@ jobs:
           # the current target that this PR will be merged into is HEAD^1.
           git update-ref refs/bm/merge-target $(git log -n 1 --pretty=format:"%H" main --)
           git checkout --force refs/bm/pr --
+          # Install the library
+          LIBBSON_INSTALL_DIR=$(pwd)/libbson python -m pip install -vvv -e ".[test]"
           run_asv
 
 
           git checkout --force refs/bm/merge-target --
+          # Install the library
+          LIBBSON_INSTALL_DIR=$(pwd)/libbson python -m pip install -vvv -e ".[test]"
           run_asv
 
           asv compare refs/bm/merge-target refs/bm/pr --

bindings/python/docs/source/changelog.rst

Lines changed: 4 additions & 1 deletion
@@ -1,13 +1,16 @@
 Changelog
 =========
 
+Changes in Version 0.7.0
+------------------------
+- Added support for BSON Embedded Document type.
+
 Changes in Version 0.6.3
 ------------------------
 
 - Added wheels for Linux AArch64 and Python 3.11.
 - Fixed handling of time zones in schema auto-discovery.
 
-
 Changes in Version 0.6.2
 ------------------------
 Note: We did not publish 0.6.0 or 0.6.1 due to technical difficulties.

bindings/python/docs/source/quickstart.rst

Lines changed: 25 additions & 0 deletions
@@ -71,6 +71,12 @@ There are multiple permissible type-identifiers for each supported BSON type.
 For a full-list of supported types and associated type-identifiers see
 :doc:`supported_types`.
 
+Nested data (embedded documents) are also supported::
+
+  from pymongoarrow.api import Schema
+  schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})
+
+
 .. note::
 
    For all of the examples below, the schema can be omitted like so::
@@ -80,6 +86,7 @@ For a full-list of supported types and associated type-identifiers see
    In this case, PyMongoArrow will try to automatically apply a schema based on
    the data contained in the first batch.
 
+
 Find operations
 ---------------
 We are now ready to query our data. Let's start by running a ``find``
@@ -99,6 +106,12 @@ Or as :class:`numpy.ndarray` instances::
 In the NumPy case, the return value is a dictionary where the keys are field
 names and values are the corresponding arrays.
 
+Nested data (embedded documents) are also supported::
+
+  from pymongoarrow.api import Schema
+  schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})
+  arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
+
 Aggregate operations
 --------------------
 Running ``aggregate`` operations is similar to ``find``. Here is an example of
@@ -111,6 +124,14 @@ an aggregation that loads all records with an ``amount`` less than 10::
   # numpy
   ndarrays = client.db.data.aggregate_numpy_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
 
+Nested data (embedded documents) are also supported::
+
+  from pymongoarrow.api import Schema
+  schema = Schema({'_id': int, 'amount': float, 'account': { 'name': str, 'account_number': int}})
+  arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
+  arrow_table = client.db.data.aggregate_arrow_all([{'$match': {'amount': {'$lte': 10}}}], schema=schema)
+
+
 Writing to other formats
 ------------------------
 Result sets that have been loaded as Arrow's :class:`~pyarrow.Table` type can
@@ -128,6 +149,10 @@ referenced by the variable ``df`` to a CSV file ``out.csv``, run::
 
   df.to_csv('out.csv', index=False)
 
+.. note::
+
+   Nested data is supported for parquet read/write but is not well supported
+   by Arrow or Pandas for CSV read/write.
 
 Writing back to MongoDB
 -----------------------
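
Note on the CSV caveat added to quickstart.rst: one possible workaround, not part of this commit, is to flatten embedded documents into dotted column names before writing CSV. The sketch below uses only the Python standard library. The helper names `flatten` and `rows_to_csv` and the sample documents are hypothetical, and plain dicts stand in for query results.

```python
import csv
import io


def flatten(doc, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the embedded document, carrying the key prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def rows_to_csv(docs):
    """Render an iterable of (possibly nested) dicts as CSV text."""
    rows = [flatten(d) for d in docs]
    # Columns appear in first-seen order across all rows.
    fieldnames = list(dict.fromkeys(k for row in rows for k in row))
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data shaped like the quickstart schema.
docs = [
    {"_id": 1, "amount": 21.5, "account": {"name": "Customer1", "account_number": 1}},
    {"_id": 2, "amount": 7.0, "account": {"name": "Customer2", "account_number": 2}},
]
print(rows_to_csv(docs))
```

The embedded `account` document becomes the columns `account.name` and `account.account_number`. Reading the file back yields flat columns, so the nesting would have to be rebuilt by the caller.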

bindings/python/docs/source/supported_types.rst

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ Support for additional types will be added in subsequent releases.
      - Type Identifiers
    * - String
      - :class:`py.str`, an instance of :class:`pyarrow.string`
+   * - Embedded document
+     - :class:`py.dict`, and instance of :class:`pyarrow.struct`
    * - ObjectId
      - :class:`py.bytes`, :class:`bson.ObjectId`, an instance of :class:`pymongoarrow.types.ObjectIdType`, an instance of :class:`pyarrow.FixedSizeBinaryScalar`
    * - Boolean
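
Note on the new "Embedded document" row: a `dict` type identifier can nest further type identifiers, so resolving a schema is naturally recursive. The following is a minimal sketch of that idea in plain Python. It is not PyMongoArrow's implementation: the `TYPE_NAMES` table and the `describe` helper are hypothetical, and the Arrow-style type names are used only as labels.

```python
# Hypothetical mapping from Python type identifiers to Arrow-style names.
TYPE_NAMES = {int: "int64", float: "float64", str: "string", bool: "bool"}


def describe(type_spec):
    """Return a readable description of a (possibly nested) type spec."""
    if isinstance(type_spec, dict):
        # An embedded document: describe each field, then wrap as a struct.
        fields = ", ".join(
            f"{name}: {describe(sub)}" for name, sub in type_spec.items()
        )
        return f"struct<{fields}>"
    try:
        return TYPE_NAMES[type_spec]
    except (KeyError, TypeError):
        raise ValueError(f"unsupported type identifier: {type_spec!r}") from None


spec = {"_id": int, "amount": float, "account": {"name": str, "account_number": int}}
print(describe(spec))
```

Each nested `dict` produces one `struct<...>` level, which mirrors how the quickstart schema nests `account` inside the top-level document.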

bindings/python/pymongoarrow/context.py

Lines changed: 11 additions & 3 deletions
@@ -16,6 +16,7 @@
 from pymongoarrow.lib import (
     BoolBuilder,
     DatetimeBuilder,
+    DocumentBuilder,
     DoubleBuilder,
     Int32Builder,
     Int64Builder,
@@ -33,6 +34,7 @@
     _BsonArrowTypes.decimal128_str: StringBuilder,
     _BsonArrowTypes.string: StringBuilder,
     _BsonArrowTypes.bool: BoolBuilder,
+    _BsonArrowTypes.document: DocumentBuilder,
 }
 
 
@@ -68,16 +70,22 @@ def from_schema(cls, schema, codec_options=DEFAULT_CODEC_OPTIONS):
             return cls(schema, {}, codec_options)
 
         builder_map = {}
+        tzinfo = codec_options.tzinfo
+
         str_type_map = _get_internal_typemap(schema.typemap)
         for fname, ftype in str_type_map.items():
             builder_cls = _TYPE_TO_BUILDER_CLS[ftype]
             encoded_fname = fname.encode("utf-8")
+
             # special-case initializing builders for parameterized types
             if builder_cls == DatetimeBuilder:
                 arrow_type = schema.typemap[fname]
-                if codec_options.tzinfo is not None and arrow_type.tz is None:
-                    arrow_type = timestamp(arrow_type.unit, tz=codec_options.tzinfo)
-                builder_map[encoded_fname] = builder_cls(dtype=arrow_type)
+                if tzinfo is not None and arrow_type.tz is None:
+                    arrow_type = timestamp(arrow_type.unit, tz=tzinfo)
+                builder_map[encoded_fname] = DatetimeBuilder(dtype=arrow_type)
+            elif builder_cls == DocumentBuilder:
+                arrow_type = schema.typemap[fname]
+                builder_map[encoded_fname] = DocumentBuilder(arrow_type, tzinfo)
             else:
                 builder_map[encoded_fname] = builder_cls()
         return cls(schema, builder_map)
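
Note on the `from_schema` change: the dispatch pattern in this hunk can be exercised in isolation. The sketch below is a simplified stand-in, not the real code. All classes here are hypothetical pure-Python substitutes for the `pymongoarrow.lib` builders and the Arrow types, and `pick_builder_cls` replaces the `_TYPE_TO_BUILDER_CLS` lookup. It only illustrates that parameterized builders receive the field's type (and, for documents, the time zone), while all other builders are constructed without arguments.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TimestampType:  # stand-in for an Arrow timestamp type
    unit: str
    tz: Optional[str] = None


@dataclass(frozen=True)
class StructType:  # stand-in for an Arrow struct type
    fields: tuple  # tuple of (name, type) pairs


class Int64Builder:  # stand-in for any builder with no constructor arguments
    pass


class DatetimeBuilder:
    def __init__(self, dtype):
        self.dtype = dtype


class DocumentBuilder:
    def __init__(self, dtype, tzinfo):
        self.dtype = dtype
        self.tzinfo = tzinfo


def pick_builder_cls(ftype):
    """Hypothetical replacement for the type-to-builder lookup table."""
    if isinstance(ftype, TimestampType):
        return DatetimeBuilder
    if isinstance(ftype, StructType):
        return DocumentBuilder
    return Int64Builder


def build_map(typemap, tzinfo=None):
    builder_map = {}
    for fname, ftype in typemap.items():
        builder_cls = pick_builder_cls(ftype)
        encoded_fname = fname.encode("utf-8")
        # Special-case builders for parameterized types.
        if builder_cls is DatetimeBuilder:
            if tzinfo is not None and ftype.tz is None:
                # Naive timestamps inherit the configured time zone.
                ftype = replace(ftype, tz=tzinfo)
            builder_map[encoded_fname] = DatetimeBuilder(dtype=ftype)
        elif builder_cls is DocumentBuilder:
            # Documents get the time zone so nested datetimes can use it.
            builder_map[encoded_fname] = DocumentBuilder(ftype, tzinfo)
        else:
            builder_map[encoded_fname] = builder_cls()
    return builder_map


builders = build_map(
    {
        "_id": "int64",
        "when": TimestampType("ms"),
        "account": StructType((("name", "string"),)),
    },
    tzinfo="UTC",
)
print({name: type(b).__name__ for name, b in builders.items()})
```

Hoisting `tzinfo` out of the loop, as the commit does, lets both the datetime and document branches share the same value without repeating the `codec_options` lookup.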
