[FIX] CSV Import - Change datetime format parsing#6539
Merged
noahnovsak merged 1 commit intobiolab:masterfrom Aug 25, 2023
Merged
[FIX] CSV Import - Change datetime format parsing#6539noahnovsak merged 1 commit intobiolab:masterfrom
noahnovsak merged 1 commit intobiolab:masterfrom
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #6539 +/- ##
==========================================
- Coverage 87.68% 87.68% -0.01%
==========================================
Files 321 321
Lines 69406 69400 -6
==========================================
- Hits 60860 60853 -7
- Misses 8546 8547 +1 |
noahnovsak
approved these changes
Aug 25, 2023
| if len(unique_values) < 100 and len(unique_values) < len(col)**0.7: | ||
| return col.astype("category") | ||
| try: | ||
| return pd.to_datetime(col) |
Contributor
There was a problem hiding this comment.
Would specifying a format here and retrying on ParserErrors be a valid fix for #6499?
Contributor
Author
There was a problem hiding this comment.
It should work. Maybe we can try with all formats that we currently support, and then we fall back to default if None works (since date utils support more formats that we do). We would need to test how time-consuming it is, but it is a solution. It would not solve the problem in #6499 since the d/m/y format currently doesn't exist in the list.
Even a better solution would be to allow users to specify the format.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue
Addresses but doesn't fix #6499
The datetime format parsing in the Import CSV widget was implemented to work faster when the same times repeat, but it may cause problems since each time is parsed separately.
Description of changes
Since Pandas improved datetime parsing speeds, I suggest not parsing unique times separately but parsing them in one call of pd.to_datetime. Doing this way, Pandas try to guess the format of times in a column and then parse them with the same format.
It may solve issues with some formats that Pandas can recognize but will only solve some problems. E.g. when Pandas cannot recognize the format, they fall back to
dateutilimplementation, and in this case, dates are still parsed separately, which can cause different parsing between dates in the same column. It happens in case #6499, which means that this issue is not solved yet.Includes