29 changes: 29 additions & 0 deletions doc/source/user_guide/categorical.rst
@@ -1178,3 +1178,32 @@ Use ``copy=True`` to prevent such a behaviour or simply don't reuse ``Categorica
This also happens in some cases when you supply a NumPy array instead of a ``Categorical``:
using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behavior, while using
a string array (e.g. ``np.array(["a","b","c","a"])``) will not.

.. note::

When constructing a :class:`pandas.Categorical` from a pandas :class:`Series` or
:class:`Index` with ``dtype='object'``, the dtype of the categories will be
preserved as ``object``. When constructing from a NumPy array
with ``dtype='object'`` or a raw Python sequence, pandas will infer the most
specific dtype for the categories (for example, ``str`` if all elements are strings).

.. ipython:: python

pd.options.future.infer_string = True
ser = pd.Series(["foo", "bar", "baz"], dtype="object")
idx = pd.Index(["foo", "bar", "baz"], dtype="object")
arr = np.array(["foo", "bar", "baz"], dtype="object")
pylist = ["foo", "bar", "baz"]

cat_from_ser = pd.Categorical(ser)
cat_from_idx = pd.Categorical(idx)
cat_from_arr = pd.Categorical(arr)
cat_from_list = pd.Categorical(pylist)

# Series/Index with object dtype: preserve object dtype
assert cat_from_ser.categories.dtype == "object"
assert cat_from_idx.categories.dtype == "object"

# Numpy array or list: infer string dtype
assert cat_from_arr.categories.dtype == "str"
assert cat_from_list.categories.dtype == "str"
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v3.0.0.rst
@@ -690,7 +690,7 @@ Categorical
- Bug in :meth:`Categorical.astype` where ``copy=False`` would still trigger a copy of the codes (:issue:`62000`)
- Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
- Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)
-
- Bug in :class:`Categorical` where constructing from a pandas :class:`Series` or :class:`Index` with ``dtype='object'`` did not preserve the categories' dtype as ``object``; the dtype is now preserved as ``object`` in these cases, while NumPy arrays with ``dtype='object'`` and raw Python sequences continue to infer the most specific dtype (for example, ``str`` if all elements are strings).

Datetimelike
^^^^^^^^^^^^
13 changes: 10 additions & 3 deletions pandas/core/arrays/categorical.py
@@ -457,6 +457,11 @@ def __init__(
codes = arr.indices.to_numpy()
dtype = CategoricalDtype(categories, values.dtype.pyarrow_dtype.ordered)
else:
# Check for pandas Series/Index with object dtype
preserve_object_dtpe = False
if isinstance(values, (ABCSeries, ABCIndex)):
if getattr(values.dtype, "name", None) == "object":
Member

you can just check values.dtype == object
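
A minimal sketch of the simplification suggested above, assuming the public `pd.Series` and `pd.Index` classes stand in for the internal `ABCSeries`/`ABCIndex` used in the diff. The helper name `is_object_dtype_container` is hypothetical and only illustrates that a direct `values.dtype == object` comparison is enough:

```python
import pandas as pd


def is_object_dtype_container(values) -> bool:
    """Hypothetical helper: True only for a Series/Index with object dtype."""
    # Comparing the dtype to the builtin `object` avoids the
    # getattr(values.dtype, "name", None) == "object" indirection.
    return isinstance(values, (pd.Series, pd.Index)) and values.dtype == object


obj_ser = pd.Series(["foo", "bar"], dtype=object)
obj_idx = pd.Index(["foo", "bar"], dtype=object)
int_ser = pd.Series([1, 2, 3])
cat_ser = pd.Series(["foo", "bar"], dtype="category")

# Only the two object-dtype pandas containers should be flagged.
results = [
    is_object_dtype_container(v)
    for v in (obj_ser, obj_idx, int_ser, cat_ser, ["foo"])
]
print(results)
```

The `isinstance` guard runs first, so plain lists and NumPy arrays never reach the dtype comparison.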

preserve_object_dtpe = True
if not isinstance(values, ABCIndex):
# in particular RangeIndex xref test_index_equal_range_categories
values = sanitize_array(values, None)
@@ -465,15 +470,17 @@ def __init__(
except TypeError as err:
codes, categories = factorize(values, sort=False)
if dtype.ordered:
# raise, as we don't have a sortable data structure and so
# the user should give us one by specifying categories
Member

why is this removed?

Contributor Author

I felt the comments were redundant, since the TypeError message already explains the situation clearly. The new logic detects whether the input values are a pandas Series or Index with "object" dtype and, if so, forces the categories to use object dtype.

Contributor Author

Although I do not have any strong preference, I am happy to add it back.

raise TypeError(
"'values' is not ordered, please "
"explicitly specify the categories order "
"by passing in a categories argument."
) from err

# we're inferring from values
# If we should prserve object dtype, force categories to object dtype
Member

typo prserve -> preserve

if preserve_object_dtpe:
from pandas import Index

categories = Index(categories, dtype=object, copy=False)
dtype = CategoricalDtype(categories, dtype.ordered)

elif isinstance(values.dtype, CategoricalDtype):
25 changes: 25 additions & 0 deletions pandas/tests/extension/test_categorical.py
@@ -180,6 +180,31 @@ def test_array_repr(self, data, size):
def test_groupby_extension_agg(self, as_index, data_for_grouping):
super().test_groupby_extension_agg(as_index, data_for_grouping)

def test_categorical_preserve_object_dtype_from_pandas(self):
Member

this will probably go in tests.arrays.categorical.test_constructors or something similar

import numpy as np

import pandas as pd
Member

these imports go at the top of the file


pd.options.future.infer_string = True
Member

use tm.option_context for this

Contributor Author

done


ser = pd.Series(["foo", "bar", "baz"], dtype="object")
idx = pd.Index(["foo", "bar", "baz"], dtype="object")
arr = np.array(["foo", "bar", "baz"], dtype="object")
pylist = ["foo", "bar", "baz"]

cat_from_ser = Categorical(ser)
cat_from_idx = Categorical(idx)
cat_from_arr = Categorical(arr)
cat_from_list = Categorical(pylist)

# Series/Index with object dtype: preserve object dtype
assert cat_from_ser.categories.dtype == "object"
assert cat_from_idx.categories.dtype == "object"

# Numpy array or list: infer string dtype
assert cat_from_arr.categories.dtype == "str"
assert cat_from_list.categories.dtype == "str"


class Test2DCompat(base.NDArrayBacked2DTests):
def test_repr_2d(self, data):