fix: analysis report when there is a column with mixed data types #131
    @model_validator(mode="after")
    def ensure_pyarrow_backend(self) -> Self:
        if not all(isinstance(dtype, pd.ArrowDtype) for dtype in self.df.dtypes):
            self.df = pa.Table.from_pandas(self.df).to_pandas(types_mapper=pd.ArrowDtype)
        return self
this validator is what was causing generation jobs to fail at the profiling step
b77e803 to c8cafcf
    logger.warning(
        "⚠️ Unable to convert the dataset to a PyArrow backend. This is often due to at least "
        "one column having mixed data types. As a result, the reported data types "
        "will be inferred from the type of the first non-null value of each column."
    )
here's the clue to the user that their dataset has at least one column with mixed data types
Not blocking this needed bugfix, but is it possible to tell the user which column caused the issue?
good question, probably can grab it from the exception message. let me check that
new message looks like this
[17:19:19] [WARNING] ⚠️ Unable to convert the dataset to a PyArrow backend
[17:19:19] [WARNING] |-- Conversion Error Message: Conversion failed for column 'nano_response' with type object
[17:19:19] [WARNING] |-- This is often due to at least one column having mixed data types
[17:19:19] [WARNING] |-- Note: Reported data types will be inferred from the first non-null value of each column
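If the column name ever needs to be pulled out of that message programmatically, a small regex is probably enough. This is only a sketch: `extract_failed_column` is a hypothetical helper, not something in this PR, and the pattern is based solely on the wording shown in the log above, which may differ between pyarrow versions. It tolerates the column name appearing with or without quotes.

```python
import re
from typing import Optional

# Hypothetical pattern, derived from the log line above; the wording is an
# assumption and may not match every pyarrow version.
_COLUMN_PATTERN = re.compile(r"Conversion failed for column '?(.+?)'? with type")


def extract_failed_column(error_message: str) -> Optional[str]:
    """Return the column name mentioned in a conversion error, if any."""
    match = _COLUMN_PATTERN.search(error_message)
    return match.group(1) if match else None


message = "Conversion failed for column 'nano_response' with type object"
print(extract_failed_column(message))  # prints: nano_response
```

When the message does not match, the helper returns `None`, so the caller can fall back to logging the raw message.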
closed #128
The issue here is that we use the pyarrow backend of pandas to infer the data types we report, because it distinguishes between different kinds of objects, whereas pandas labels strings, lists, and dicts all as "object" type. However, sometimes we might generate a column with multiple data types (whether it is intentional or not), particularly if the column is generated by an LLM. PyArrow cannot convert such a column, so the conversion raises an error.
This PR makes it so that if we can't convert the dataframe backend to pyarrow, (1) the dataset generation does not fail and (2) we fall back to using the data type of the first non-null element in the analysis report.