Feature auto infer orient as table or split in read_json #60969

Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -70,6 +70,7 @@ Other enhancements
- :meth:`Series.map` can now accept kwargs to pass on to func (:issue:`59814`)
- :meth:`Series.str.get_dummies` now accepts a ``dtype`` parameter to specify the dtype of the resulting DataFrame (:issue:`47872`)
- :meth:`pandas.concat` will raise a ``ValueError`` when ``ignore_index=True`` and ``keys`` is not ``None`` (:issue:`59274`)
- :meth:`pandas.read_json` now automatically infers the ``orient`` parameter when it is not explicitly specified, based on the structure of the input JSON. Inference only applies when the JSON matches the ``split`` or ``table`` layout (:issue:`52713`)
- :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`)
- Errors occurring during SQL I/O will now throw a generic :class:`.DatabaseError` instead of the raw Exception type from the underlying driver manager library (:issue:`60748`)
- Implemented :meth:`Series.str.isascii` and :meth:`Series.str.isascii` (:issue:`59091`)
21 changes: 21 additions & 0 deletions pandas/io/json/_json.py
@@ -6,6 +6,8 @@
)
from collections import abc
from itertools import islice
import json
import os
from typing import (
    TYPE_CHECKING,
    Any,
@@ -559,6 +561,12 @@ def read_json(
- ``'values'`` : just the values array
- ``'table'`` : dict like ``{{'schema': {{schema}}, 'data': {{data}}}}``

**Automatic Orient Inference for split or table**:
If the `orient` parameter is not specified, this function infers it
from the top-level keys of the JSON document. Inference succeeds only
when the document matches the ``'table'`` or ``'split'`` layout, for
example when it was created using ``to_json`` with ``orient='split'``
or ``orient='table'``.

The allowed and default values depend on the value
of the `typ` parameter.

@@ -768,6 +776,19 @@ def read_json(
0 0 1 2.5 True a 1577.2
1 1 <NA> 4.5 False b 1577.1
"""
    if orient is None:
        if isinstance(path_or_buf, (str, bytes, os.PathLike)):
            with open(path_or_buf, encoding="utf-8") as f:
                json_data = json.load(f)
        else:
            json_data = json.load(path_or_buf)
Member

So this is loading the entire JSON document twice now? Isn't that going to at least double the runtime?

Author

Hi @WillAyd

Thanks a lot for your feedback. Much appreciated.
Yeah, I did not fully consider the performance aspect of this.

So, if the user does not pass orient explicitly, this will read the document one extra time, as you have mentioned.

But I couldn't think of any other way to validate the schema of the JSON file and automatically infer an appropriate orient.
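
For illustration only, here is a minimal sketch of one way to keep the key check while reading the input from disk or from a buffer just once. The helper names `infer_orient` and `read_text_once` are hypothetical and are not part of pandas or of this PR. The text is held in memory so that the same string can be parsed again later without touching the file or exhausting the buffer.

```python
import io
import json


def infer_orient(parsed):
    """Hypothetical helper: return 'table', 'split', or None from top-level keys."""
    if isinstance(parsed, dict):
        if "schema" in parsed and "data" in parsed:
            return "table"
        if "columns" in parsed and "data" in parsed:
            return "split"
    return None


def read_text_once(path_or_buf):
    """Hypothetical helper: read the JSON text a single time.

    Returns the raw text together with the inferred orient, so the caller
    can reuse the text instead of reopening the file or rewinding the buffer.
    """
    if hasattr(path_or_buf, "read"):
        text = path_or_buf.read()
    else:
        with open(path_or_buf, encoding="utf-8") as f:
            text = f.read()
    return text, infer_orient(json.loads(text))


text, orient = read_text_once(io.StringIO('{"columns": ["A"], "data": [[1], [2]]}'))
print(orient)  # split
```

This sketch still parses the text once for the check, so any later parse by the real reader would be a second parse. It only removes the second file read and the consumed-buffer problem, not the parsing cost raised in the review.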


        if isinstance(json_data, dict):
            if "schema" in json_data and "data" in json_data:
                orient = "table"
            elif "columns" in json_data and "data" in json_data:
                orient = "split"

    if orient == "table" and dtype:
        raise ValueError("cannot pass both dtype and orient='table'")
    if orient == "table" and convert_axes:
32 changes: 32 additions & 0 deletions pandas/tests/io/json/test_pandas.py
@@ -2283,3 +2283,35 @@ def test_large_number():
    )
    expected = Series([9999999999999999])
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
    "json_data, should_fail",
    [
        (
            json.dumps(
                {
                    "schema": {"fields": [{"name": "A", "type": "integer"}]},
                    "data": [{"A": 1}, {"A": 2}, {"A": 3}],
                }
            ),
            False,
        ),
        (json.dumps({"columns": ["A"], "data": [[1], [2], [3]]}), False),
    ],
)
def test_read_json_auto_infer_orient_table_split(json_data, should_fail, tmp_path):
    """Test pd.read_json auto-infers 'table' and 'split' formats."""

    # Use tmp_path to create a temporary file
    temp_file = tmp_path / "test_read_json.json"

    # Write the json_data to the temporary file
    with open(temp_file, "w") as f:
        f.write(json_data)

    if should_fail:
        with pytest.raises(ValueError, match=".*expected.*"):
            read_json(temp_file)
    else:
        read_json(temp_file)
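
The parametrized cases above only cover inputs that should be detected. As a hedged sketch (an illustration, not part of the PR's test suite), the snippet below uses the existing `DataFrame.to_json` API to show which top-level keys the `split` and `table` orients write, and one edge case a key-only check could misread: a frame whose column labels happen to be `"columns"` and `"data"`, written with the default orient. The variable names are hypothetical.

```python
import json

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Top-level keys written by the two orients the inference targets.
split_keys = set(json.loads(df.to_json(orient="split")))
table_keys = set(json.loads(df.to_json(orient="table")))
print(sorted(split_keys))  # ['columns', 'data', 'index']
print(sorted(table_keys))  # ['data', 'schema']

# Edge case: column labels named "columns" and "data", written with the
# default DataFrame orient ("columns"), produce the same top-level keys
# that the split check looks for.
tricky = pd.DataFrame({"columns": [1, 2], "data": [3, 4]})
tricky_keys = set(json.loads(tricky.to_json()))
print(sorted(tricky_keys))  # ['columns', 'data']
```

A test case built from `tricky` would document whether such input is meant to be treated as `split` or left to the default orient.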