[Python] Unable to import arrow table to pandas if it has categorical columns with index types of unsigned ints #47022

@dweih

Describe the bug, including details regarding any error messages, version, and platform.

Our code primarily uses Polars, but some external tools use pandas. When those tools import Parquet files containing categorical columns whose dictionary index types are unsigned integers (uint16 or uint32), we get the following error:

ArrowTypeError: Converting unsigned dictionary indices to pandas not yet supported, index type: uint32

Simple repro below.

import polars as pl
import pyarrow as pa

n = 100
cat_values = [f"cat_{i}" for i in range(n)]
df = pl.DataFrame({
    "cat": cat_values,
    "val": list(range(n))
})
arrow_table = df.to_arrow()

dict_type = pa.dictionary(index_type=pa.uint16(), value_type=pa.string())
arrow_table = arrow_table.set_column(
    arrow_table.schema.get_field_index("cat"),
    "cat",
    arrow_table.column("cat").cast(dict_type)
)

print("Arrow schema:", arrow_table.schema)


try:
    # Conversion fails here: the "cat" column has uint16 dictionary indices.
    pdf = arrow_table.to_pandas()
    print("Loaded into pandas successfully.")
except Exception as e:
    print("Failed to load into pandas:")
    print(e)

try:
    pol_df = pl.from_arrow(arrow_table)
    print("Loaded into Polars successfully.")
except Exception as e:
    print("Failed to load into Polars:")
    print(e)

Finally, I wasn't sure whether to file this as a feature request or a bug report, since the functionality is missing rather than incorrect.

Component(s)

Python
