fix: analysis report when there is a column with mixed data types #131
Changes from 3 commits
20d12c8
c724834
c8cafcf
690c1cb
df003d8
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,6 +6,7 @@ | |
| from functools import cached_property | ||
|
|
||
| import pandas as pd | ||
| import pyarrow as pa | ||
| from pydantic import Field, field_validator | ||
|
|
||
| from data_designer.config.analysis.column_profilers import ColumnProfilerConfigT | ||
|
|
@@ -19,10 +20,8 @@ | |
| from data_designer.engine.analysis.column_profilers.base import ColumnConfigWithDataFrame, ColumnProfiler | ||
| from data_designer.engine.analysis.column_statistics import get_column_statistics_calculator | ||
| from data_designer.engine.analysis.errors import DatasetProfilerConfigurationError | ||
| from data_designer.engine.dataset_builders.multi_column_configs import ( | ||
| DatasetBuilderColumnConfigT, | ||
| MultiColumnConfig, | ||
| ) | ||
| from data_designer.engine.analysis.utils.column_statistics_calculations import has_pyarrow_backend | ||
| from data_designer.engine.dataset_builders.multi_column_configs import DatasetBuilderColumnConfigT, MultiColumnConfig | ||
| from data_designer.engine.registry.data_designer_registry import DataDesignerRegistry | ||
| from data_designer.engine.resources.resource_provider import ResourceProvider | ||
|
|
||
|
|
@@ -68,6 +67,7 @@ def profile_dataset( | |
| logger.info("📐 Measuring dataset column statistics:") | ||
|
|
||
| self._validate_schema_consistency(list(dataset.columns)) | ||
| dataset = self._convert_to_pyarrow_backend_if_needed(dataset) | ||
|
|
||
| column_statistics = [] | ||
| for c in self.config.column_configs: | ||
|
|
@@ -100,6 +100,18 @@ def profile_dataset( | |
| column_profiles=column_profiles if column_profiles else None, | ||
| ) | ||
|
|
||
| def _convert_to_pyarrow_backend_if_needed(self, dataset: pd.DataFrame) -> pd.DataFrame: | ||
| if not has_pyarrow_backend(dataset): | ||
| try: | ||
| dataset = pa.Table.from_pandas(dataset).to_pandas(types_mapper=pd.ArrowDtype) | ||
| except Exception: | ||
| logger.warning( | ||
| "⚠️ Unable to convert the dataset to a PyArrow backend. This is often due to at least " | ||
| "one column having mixed data types. As a result, the reported data types " | ||
| "will be inferred from the type of the first non-null value of each column." | ||
| ) | ||
|
Comment on lines 119 to 121
Contributor (Author): Here's the clue to the user that their dataset has at least one column with mixed data types.
Contributor: Not blocking this needed bugfix, but is it possible to notify the user about which column caused the issue?
Contributor (Author): Good question. We can probably grab it from the exception message. Let me check that.
Contributor (Author): The new message looks like this: |
||
| return dataset | ||
|
|
||
| def _create_column_profiler(self, profiler_config: ColumnProfilerConfigT) -> ColumnProfiler: | ||
| return self.registry.column_profilers.get_for_config_type(type(profiler_config))( | ||
| config=profiler_config, resource_provider=self.resource_provider | ||
|
|
||
this validator is what was causing generation jobs to fail at the profiling step
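
For readers following the review thread about the mixed-type warning, here is a minimal, dependency-free sketch of two ideas it raises: reporting which columns hold more than one Python type, and the fallback of inferring a column's type from its first non-null value. The helper names (`is_null`, `find_mixed_type_columns`, `infer_type_from_first_non_null`) and the dict-of-lists table are hypothetical and are not part of `data_designer`; the PR itself works on a pandas DataFrame, and the author mentions reading the column name from the exception message instead.

```python
from typing import Any, Optional


def is_null(value: Any) -> bool:
    """Treat None and float NaN as null (NaN is the only value unequal to itself)."""
    return value is None or (isinstance(value, float) and value != value)


def find_mixed_type_columns(table: dict[str, list[Any]]) -> dict[str, list[str]]:
    """Map each column that holds more than one non-null Python type to its type names."""
    mixed: dict[str, list[str]] = {}
    for name, values in table.items():
        type_names = sorted({type(v).__name__ for v in values if not is_null(v)})
        if len(type_names) > 1:
            mixed[name] = type_names
    return mixed


def infer_type_from_first_non_null(values: list[Any]) -> Optional[str]:
    """Return the type name of the first non-null value, or None if all values are null."""
    for value in values:
        if not is_null(value):
            return type(value).__name__
    return None


# Hypothetical example data: "zip_code" mixes integers and strings.
table = {
    "age": [31, None, 47],
    "zip_code": [None, 94103, "10001"],
    "notes": [None, None],
}

mixed = find_mixed_type_columns(table)
first_types = {name: infer_type_from_first_non_null(vals) for name, vals in table.items()}

for name, type_names in mixed.items():
    print(f"Column {name!r} has mixed data types: {', '.join(type_names)}")
```

Note that the first-non-null fallback reports `zip_code` as `int` here even though the column also contains a string, which is why naming the offending column in the warning is useful.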