Description
When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key holds a different data type in a later document, that value is returned as null.
MongoDB documents
[
  {
    "name": "test",
    "code": "1"
  },
  {
    "name": "test",
    "code": 1
  }
]
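The effect on these two documents can be imitated in plain Python. The sketch below is a toy model of first-document inference, not pymongoarrow's actual code; the helper names infer_schema and apply_schema are hypothetical and exist only for illustration.

```python
# Toy model of first-document schema inference.
# Assumption: this only imitates the behaviour described in this issue;
# it is NOT how pymongoarrow is implemented.
docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]


def infer_schema(documents):
    """Map each key of the first document to the Python type of its value."""
    return {key: type(value) for key, value in documents[0].items()}


def apply_schema(documents, schema):
    """Keep values that match the inferred type; replace mismatches with None."""
    return [
        {
            key: doc.get(key) if isinstance(doc.get(key), expected) else None
            for key, expected in schema.items()
        }
        for doc in documents
    ]


schema = infer_schema(docs)       # "code" is inferred as str from the first document
rows = apply_schema(docs, schema)  # the integer in the second document becomes None
print(rows)
```

The second row loses its value because the integer 1 does not match the str type taken from the first document, which mirrors the null shown in the output further down.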
Current implementation
from pymongoarrow.api import find_polars_all
query_result_df = find_polars_all(
    collection=client,
    query=query
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘
For known discrepancies like this, where the first document has a pyarrow.string() value and subsequent documents have pyarrow.int*() values, the column could be inferred as pyarrow.string() by adding an optional parameter coerce_number_to_str to all find_* APIs.
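As a rough illustration of the coercion being requested, the following sketch converts numeric values to strings on plain Python dictionaries. The function stringify_numbers is hypothetical and is not part of pymongoarrow; it only shows the intended semantics, assuming that booleans should be left untouched.

```python
from numbers import Number


def stringify_numbers(documents, keys):
    """Return copies of `documents` with numeric values under `keys` cast to str.

    Hypothetical helper for illustration only; not a pymongoarrow API.
    """
    coerced = []
    for doc in documents:
        new_doc = dict(doc)  # shallow copy so the input is not mutated
        for key in keys:
            value = new_doc.get(key)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, Number) and not isinstance(value, bool):
                new_doc[key] = str(value)
        coerced.append(new_doc)
    return coerced


docs = [
    {"name": "test", "code": "1"},
    {"name": "test", "code": 1},
]
result = stringify_numbers(docs, keys=["code"])
print(result)
```

With the sample documents, both rows end up with the string "1" under code, which is the outcome the proposed parameter is meant to produce.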
Expected implementation
from pymongoarrow.api import find_polars_all
query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# └─────────────────────────────────┴──────┴──────┘
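Until such a parameter exists, one possible workaround is to cast the field on the server side with MongoDB's $toString operator inside an aggregation pipeline. The sketch below only builds the pipeline; the helper to_string_stage is hypothetical, and the commented-out call assumes that aggregate_polars_all is available in the installed pymongoarrow version and that collection is a pymongo collection.

```python
def to_string_stage(fields):
    """Build an $addFields stage that casts each named field with $toString.

    Hypothetical helper for illustration; not part of pymongoarrow.
    """
    return {"$addFields": {field: {"$toString": f"${field}"} for field in fields}}


pipeline = [to_string_stage(["code"])]
print(pipeline)

# Assumed usage (needs a running MongoDB and a pymongo collection, not run here):
# from pymongoarrow.api import aggregate_polars_all
# query_result_df = aggregate_polars_all(collection, pipeline)
```

Because every document then carries a string under code, first-document inference and later documents should agree on the type. This moves the cast into the query instead of the driver, so it has to be repeated for each affected field.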
Reference: coerce_numbers_to_str in https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field