Contributor

@johnnygreco johnnygreco commented Dec 12, 2025

closed #128

The issue here is that we use the pyarrow backend of pandas to infer the data types we report (pyarrow distinguishes between different kinds of objects, whereas pandas labels strings, lists, and dicts all as "object" type). However, we might sometimes generate a column with multiple data types (whether intentional or not), particularly if the column is generated by an LLM. pyarrow cannot convert such a column, so the conversion raises an error.

This PR makes it so that if we can't convert the dataframe backend to pyarrow, (1) the dataset generation does not fail and (2) in the analysis report, we fall back to reporting the data type of each column's first non-null element.

Comment on lines -32 to -37
@model_validator(mode="after")
def ensure_pyarrow_backend(self) -> Self:
    if not all(isinstance(dtype, pd.ArrowDtype) for dtype in self.df.dtypes):
        self.df = pa.Table.from_pandas(self.df).to_pandas(types_mapper=pd.ArrowDtype)
    return self

Contributor Author

this validator is what was causing generation jobs to fail at the profiling step

@johnnygreco johnnygreco force-pushed the johnny/bug/128-profiling-mixed-dtype-bug branch from b77e803 to c8cafcf December 12, 2025 21:51
Comment on lines 108 to 112
logger.warning(
    "⚠️ Unable to convert the dataset to a PyArrow backend. This is often due to at least "
    "one column having mixed data types. As a result, the reported data types "
    "will be inferred from the type of the first non-null value of each column."
)
Contributor Author

here's the clue to the user that their dataset has at least one column with mixed data types

Contributor

Not blocking this needed bugfix, but is it possible to notify the user about which column caused the issue?

Contributor Author

good question, probably can grab it from the exception message. let me check that

Contributor Author

new message looks like this

[17:19:19] [WARNING] ⚠️ Unable to convert the dataset to a PyArrow backend
[17:19:19] [WARNING]   |-- Conversion Error Message: Conversion failed for column 'nano_response' with type object
[17:19:19] [WARNING]   |-- This is often due to at least one column having mixed data types
[17:19:19] [WARNING]   |-- Note: Reported data types will be inferred from the first non-null value of each column
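
A rough sketch of how a multi-line warning like the one above could be assembled from the caught exception. It assumes the column-specific detail is carried as the last element of the exception's `args`. The logger name and the functions `build_conversion_warning` and `warn_conversion_failed` are hypothetical, and this is not necessarily how the PR extracts the message.

```python
import logging

logger = logging.getLogger("profiler_sketch")  # hypothetical logger name


def build_conversion_warning(error: Exception) -> list[str]:
    """Build the warning lines; the column detail is assumed to be error.args[-1]."""
    detail = str(error.args[-1]) if error.args else str(error)
    return [
        "⚠️ Unable to convert the dataset to a PyArrow backend",
        f"  |-- Conversion Error Message: {detail}",
        "  |-- This is often due to at least one column having mixed data types",
        "  |-- Note: Reported data types will be inferred from the first non-null value of each column",
    ]


def warn_conversion_failed(error: Exception) -> None:
    """Emit each line as its own WARNING record so the log stays aligned."""
    for line in build_conversion_warning(error):
        logger.warning(line)
```

Logging one record per line keeps the timestamp and level prefix on every row, matching the layout shown in the message above.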

@johnnygreco johnnygreco merged commit 6e65b10 into main Dec 15, 2025
13 checks passed

Development

Successfully merging this pull request may close these issues.

ArrowTypeError in Profiler due to implicit dict casting of JSON-like LLM responses

4 participants