ENH: Consistent naming conventions for string dtype aliases #58141

@WillAyd

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Right now the string aliases for our types are inconsistent:

>>> import pandas as pd
>>> pd.Series(range(3), dtype="int8")  # NumPy type
>>> pd.Series(range(3), dtype="Int8")  # Pandas extension type
>>> pd.Series(range(3), dtype="int8[pyarrow]") # Arrow type

Strings have a similar inconsistency with "string", "string[pyarrow]", and "string[pyarrow_numpy]".

Feature Description

I think we should create "int8[numpy]" and "int8[pandas]" aliases to stay consistent with pyarrow. This also has the advantage of decoupling "int8" from NumPy, so perhaps in the future we can let a backend setting determine whether NumPy or pyarrow types are returned.

The pattern thus becomes "data_type[backend]", with the exception of "string[pyarrow_numpy]", which combines the backend and nullability semantics. I am less sure what to do in that case; maybe even that should be called "string[pyarrow, numpy]", where the second argument is nullability?

In any case, I am just hoping we can start to detach the logical type from the physical storage / nullability semantics with a well-defined pattern.
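
To make the "data_type[backend]" pattern a bit more concrete, here is a rough sketch of how such aliases could be split into a logical type, a backend, and an optional nullability argument. Note that parse_dtype_alias and ParsedAlias are hypothetical names made up for illustration; nothing here is existing pandas API, and the two-argument form is only the "string[pyarrow, numpy]" idea floated above, not a decided syntax.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper, not part of pandas. It only splits the alias text;
# it does not check that the logical type or backend actually exists.
_ALIAS_RE = re.compile(
    r"^(?P<logical>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?P<args>[^\[\]]*)\])?$"
)


class ParsedAlias(NamedTuple):
    logical_type: str
    backend: Optional[str]      # None when the alias has no [..] part
    nullability: Optional[str]  # None unless a second argument is given


def parse_dtype_alias(alias: str) -> ParsedAlias:
    """Split "data_type[backend]" or "data_type[backend, nullability]"."""
    match = _ALIAS_RE.match(alias.strip())
    if match is None:
        raise ValueError(f"not a valid dtype alias: {alias!r}")

    raw_args = match.group("args")
    if raw_args is None:
        # Bare alias such as "int8": the backend is left unspecified so a
        # (hypothetical) global setting could pick NumPy or pyarrow later.
        return ParsedAlias(match.group("logical"), None, None)

    args = [part.strip() for part in raw_args.split(",")]
    if not 1 <= len(args) <= 2 or not all(args):
        raise ValueError(f"expected one or two arguments in {alias!r}")

    backend = args[0]
    nullability = args[1] if len(args) == 2 else None
    return ParsedAlias(match.group("logical"), backend, nullability)
```

With this sketch, "int8[pyarrow]" splits into the logical type "int8" and backend "pyarrow", while the existing "string[pyarrow_numpy]" alias would be read as a single backend name. That shows why it does not fit the pattern cleanly.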

@phofl

Alternative Solutions

n/a

Additional Context

No response
