Description
Because of the new string dtype, we also implicitly changed the representation of the unique categories in the Categorical dtype repr (aside from the object -> str change for the dtype):
>>> pd.options.future.infer_string = False
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, object): ['a', 'b', 'c']
>>> pd.options.future.infer_string = True
>>> pd.Categorical(list("abca"))
['a', 'b', 'c', 'a']
Categories (3, str): [a, b, c]
So the actual array values are always quoted, but the list of unique categories in the dtype repr goes from ['a', 'b', 'c'] to [a, b, c].
Brock already fixed a bunch of xfails in the tests because of this in pandas-dev#61727. And we also run into this issue for the failing doctests (pandas-dev#61886).
@jbrockmendel mentioned there:
It isn't 100% obvious that the new repr for Categoricals is an improvement, but it's non-crazy.
I agree with that, and I also have no strong opinion either way.
But before we go on to fix the doctests as well, let's confirm that we are OK with this change. If we don't have a strong opinion that it is an improvement, we could also leave the repr as it was originally, which would avoid some breakage for downstream projects or users (e.g. those who also have doctests).
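To illustrate the kind of workaround a downstream project with doctests might need if the change stays, here is a minimal sketch that normalises the categories line so both forms compare equal. It only assumes the two repr forms shown above; `normalize_categories_line` is a hypothetical helper, not a pandas API, and the naive split assumes category labels contain no commas or quote characters.

```python
import re

# Matches the dtype summary line of a Categorical repr, e.g.
# "Categories (3, object): ['a', 'b', 'c']" or "Categories (3, str): [a, b, c]".
_CATEGORIES_LINE = re.compile(r"^Categories \((\d+), (\w+)\): \[(.*)\]$")


def normalize_categories_line(line: str) -> str:
    """Hypothetical helper: drop the dtype name and quotes from a categories line."""
    match = _CATEGORIES_LINE.match(line)
    if match is None:
        return line  # not a categories line, leave it untouched
    count, _dtype, body = match.groups()
    # Naive split: assumes labels contain no commas or quote characters.
    items = [item.strip().strip("'\"") for item in body.split(",")] if body else []
    return f"Categories ({count}): [{', '.join(items)}]"


old_repr = "Categories (3, object): ['a', 'b', 'c']"
new_repr = "Categories (3, str): [a, b, c]"

# Both forms reduce to the same normalised line.
print(normalize_categories_line(old_repr) == normalize_categories_line(new_repr))
```

Having to carry something like this in every affected project is the breakage we would avoid by keeping the quoted form.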