fix: analysis report when there is a column with mixed data types #131

johnnygreco · 2025-12-12T21:43:37Z

closed #128

The issue here is that we use the pyarrow backend of pandas to infer the data types we report: pyarrow distinguishes between different kinds of objects, whereas pandas labels strings, lists, and dicts all as "object". However, a generated column can sometimes contain multiple data types (intentionally or not), particularly if the column is generated by an LLM. pyarrow cannot convert such a column, so the conversion raises an error.

This PR makes it so that if we can't convert the dataframe backend to pyarrow, (1) the dataset generation does not fail and (2) we fall back to using the data type of the first non-null element in the analysis report.

johnnygreco · 2025-12-12T21:44:22Z

src/data_designer/engine/analysis/column_profilers/base.py

-    @model_validator(mode="after")
-    def ensure_pyarrow_backend(self) -> Self:
-        if not all(isinstance(dtype, pd.ArrowDtype) for dtype in self.df.dtypes):
-            self.df = pa.Table.from_pandas(self.df).to_pandas(types_mapper=pd.ArrowDtype)
-        return self
-


this validator is what was causing generation jobs to fail at the profiling step

johnnygreco · 2025-12-12T21:53:33Z

src/data_designer/engine/analysis/dataset_profiler.py

+                logger.warning(
+                    "⚠️ Unable to convert the dataset to a PyArrow backend. This is often due to at least "
+                    "one column having mixed data types. As a result, the reported data types "
+                    "will be inferred from the type of the first non-null value of each column."
+                )


here's the clue to the user that their dataset has at least one column with mixed data types

Not blocking this needed bugfix: but is it possible to notify the user about which column generated the issue?

good question, probably can grab it from the exception message. let me check that

new message looks like this

[17:19:19] [WARNING] ⚠️ Unable to convert the dataset to a PyArrow backend
[17:19:19] [WARNING] |-- Conversion Error Message: Conversion failed for column 'nano_response' with type object
[17:19:19] [WARNING] |-- This is often due to at least one column having mixed data types
[17:19:19] [WARNING] |-- Note: Reported data types will be inferred from the first non-null value of each column
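A warning of this shape could be assembled along these lines. This is only a sketch using the standard `logging` module: the logger name and the helpers `build_conversion_warning` and `warn_conversion_failure` are hypothetical, and the project's real logger setup (timestamps, formatting) is not shown in this thread.

```python
import logging
from typing import List

logger = logging.getLogger("profiler_sketch")  # hypothetical logger name


def build_conversion_warning(error_message: str) -> List[str]:
    """Build the warning lines, embedding the conversion error so the column is named."""
    return [
        "⚠️ Unable to convert the dataset to a PyArrow backend",
        f"  |-- Conversion Error Message: {error_message}",
        "  |-- This is often due to at least one column having mixed data types",
        "  |-- Note: Reported data types will be inferred from the first non-null value of each column",
    ]


def warn_conversion_failure(exc: Exception) -> None:
    """Emit one warning record per line of the message."""
    for line in build_conversion_warning(str(exc)):
        logger.warning(line)


lines = build_conversion_warning(
    "Conversion failed for column 'nano_response' with type object"
)
```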

johnnygreco requested a review from nabinchha December 12, 2025 21:43

johnnygreco assigned andreatgretel Dec 12, 2025

johnnygreco commented Dec 12, 2025


johnnygreco unassigned andreatgretel Dec 12, 2025

johnnygreco requested a review from andreatgretel December 12, 2025 21:45

johnnygreco added 3 commits December 12, 2025 16:51

column config -> column name when possible

20d12c8

fallback to dtype of first non-null element

c724834

add unit tests

c8cafcf

johnnygreco force-pushed the johnny/bug/128-profiling-mixed-dtype-bug branch from b77e803 to c8cafcf Compare December 12, 2025 21:51

johnnygreco commented Dec 12, 2025


johnnygreco added 2 commits December 12, 2025 17:22

add error message info to warning

690c1cb

catch str_ too

df003d8

eric-tramel approved these changes Dec 13, 2025


johnnygreco merged commit 6e65b10 into main Dec 15, 2025
13 checks passed

4 participants
