-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
PERF: fix performance regression from #62542 #62623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes this faster than the original code? Seems like we've only added instructions to the conversion function(s), so I'm worried we are overlooking something
uint64_t uint_max, int *error, char tsep); | ||
int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max, | ||
int *error, char tsep); | ||
int64_t str_to_int64(const char *p_item, char decimal_separator, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should add this argument to the int conversion functions - that repurposes these functions in a way that's not really clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it. Additionally, I am no longer checking explicitly if it's a float, just assigning the error code for invalid char.
pandas/_libs/src/parser/tokenizer.c
Outdated
} else { | ||
*error = ERROR_OVERFLOW; | ||
return 0; | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we try to keep these are immediate returns? This opens up the door to ambiguous behavior of the parser in case of "multiple" failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of the changes in #62542 was because of this branch, where big numbers that indicate float were cast to string due to overflow.
Can I still check posterior characters to change the error code? If not, I don't think there is anything to do in this PR, and it's best to revert the problematic commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put back the immediate return, but added an inline function before it to check if it's not an integer after the overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks - that's helpful. I somewhat disagree with the premise of that change to cast to float even if its a lossy operation. I understand that in some cases there is a desire for numeric operations on numbers like that, but its unclear that should take precedence over the string cast, which is in some sense more "value preserving".
The larger issue is that pandas does not have native support for Decimal precision types
I don't give much value for the performance increase that I reported on the first edit. I still need to update the description after all these changes. Just waiting for all the changes that I made are considered correct. Additionally, The changes should only apply for integer parsing, everything else can be considered noise. |
@WillAyd I updated the benchmark results on the description. There were no significant changes. |
pandas/_libs/src/parser/tokenizer.c
Outdated
return self->seen_uint && (self->seen_sint || self->seen_null); | ||
} | ||
|
||
static inline void check_for_invalid_char(const char *p_item, int *error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document what this function does? The name check_for_invalid_char
is a bit too vague - this is better described as something like cast_char_p_as_float
no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also suggest that you either return int and drop the int *
argument, or return something useful (ex: return the parsed float value) and then set the pointer value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a Doxygen comment. It's also returning the pointer to the last verified character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept the error pointer in the function to prevent code duplication, and also because the main purpose of the function is just to assign a value to it. Considering that it's now returning the position of the last verified character, it's possible to change the error value outside the function, but I think it's more clean the way it is.
p_item++; | ||
} | ||
|
||
while (*p_item != '\0' && isspace_ascii(*p_item)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be combined with the previous loop? Is there a reason for trailing whitespace to be handed specially, or is there a reason at all to allow whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs a separate loop because this case should be invalid "7890123 1351713789"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we allow trailing white space though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is permitted below too if an overflow doesn't occur. I added it in this function to make it consistent.
pandas/_libs/src/parser/tokenizer.c
Outdated
} else { | ||
*error = ERROR_OVERFLOW; | ||
return 0; | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK thanks - that's helpful. I somewhat disagree with the premise of that change to cast to float even if its a lossy operation. I understand that in some cases there is a desire for numeric operations on numbers like that, but its unclear that should take precedence over the string cast, which is in some sense more "value preserving".
The larger issue is that pandas does not have native support for Decimal precision types
read_csv()
fails to detect floats larger than 2.0^64 #51295 (Replace xxxx with the GitHub issue number)doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Benchmarks
asv continuous -f 1.1 -E virtualenv:3.13 "db31f6a38353a311cc471eb98506470b39c676d8~" HEAD -b io.csv
asv compare db31f6a38353a311cc471eb98506470b39c676d8~ HEAD
IO.csv benchmarks:
cc @WillAyd