26 changes: 6 additions & 20 deletions Orange/widgets/data/owcsvimport.py
@@ -1627,33 +1627,19 @@ def guess_data_type(col: pd.Series) -> pd.Series:
     -------
     Data column with correct dtype
     """
-    def parse_dates(s):
-        """
-        This is an extremely fast approach to datetime parsing.
-        For large data, the same dates are often repeated. Rather than
-        re-parse these, we store all unique dates, parse them, and
-        use a lookup to convert all dates.
-        """
-        try:
-            dates = {date: pd.to_datetime(date) for date in s.unique()}
-        except ValueError:
-            return None
-        return s.map(dates)
-
     if pdtypes.is_numeric_dtype(col):
         unique_values = col.unique()
         if len(unique_values) <= 2 and (
                 len(np.setdiff1d(unique_values, [0, 1])) == 0
                 or len(np.setdiff1d(unique_values, [1, 2])) == 0):
             return col.astype("category")
     else:  # object
-        # try parse as date - if None not a date
-        parsed_col = parse_dates(col)
-        if parsed_col is not None:
-            return parsed_col
-        unique_values = col.unique()
-        if len(unique_values) < 100 and len(unique_values) < len(col)**0.7:
-            return col.astype("category")
+        try:
+            return pd.to_datetime(col)
Contributor

Would specifying a format here and retrying on ParserErrors be a valid fix for #6499?

Contributor Author

It should work. We could first try all the formats we currently support and fall back to the default parser if none of them works (dateutil supports more formats than we do). We would need to test how time-consuming that is, but it is a solution. It would not solve the problem in #6499, though, since the d/m/y format is currently not in the list.

An even better solution would be to allow users to specify the format.
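
A minimal sketch of the idea discussed above, not code from this PR: try a list of explicit formats first, then fall back to the default `pd.to_datetime` parsing. The helper name `parse_dates_with_formats` and the `CANDIDATE_FORMATS` tuple are hypothetical; the tuple is not the list of formats Orange supports, and `%d/%m/%Y` is included only to show how a caller-supplied format would slot in.

```python
import pandas as pd

# Hypothetical candidate formats, for illustration only.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_dates_with_formats(col, formats=CANDIDATE_FORMATS):
    """Try each explicit format in order, then the default parser.

    Returns the parsed column, or None if the column cannot be
    interpreted as dates at all.
    """
    for fmt in formats:
        try:
            # With the default errors="raise", a value that does not
            # match the format raises ValueError.
            return pd.to_datetime(col, format=fmt)
        except (ValueError, TypeError):
            continue
    try:
        # Fallback: let pandas guess the format on its own.
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        return None
```

Passing `formats=("%d/%m/%Y",)` would correspond to letting the user specify the format. The cost of the extra attempts on large columns is not measured here and would still need to be tested, as noted in the comment.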

+        except ValueError:
+            unique_values = col.unique()
+            if len(unique_values) < 100 and len(unique_values) < len(col)**0.7:
+                return col.astype("category")
     return col

