Contributor

@Alvaro-Kothe Alvaro-Kothe commented Oct 8, 2025


Benchmarks

asv continuous -f 1.1 -E virtualenv:3.13 "db31f6a38353a311cc471eb98506470b39c676d8~" HEAD -b io.csv

asv compare db31f6a38353a311cc471eb98506470b39c676d8~ HEAD

IO.csv benchmarks:

Change Before [d8b3ff3] <main~12> After [4c8d770] <perf/read-csv> Ratio Benchmark (Parameter)
18.2±0.06ms 17.7±0.8ms 0.97 io.csv.ParseDateComparison.time_read_csv_dayfirst(False)
3.50±0.04ms 3.30±0.09ms 0.94 io.csv.ParseDateComparison.time_read_csv_dayfirst(True)
19.6±0.2ms 19.5±0.1ms 1.00 io.csv.ParseDateComparison.time_to_datetime_dayfirst(False)
3.63±0.08ms 3.49±0.08ms 0.96 io.csv.ParseDateComparison.time_to_datetime_dayfirst(True)
19.2±0.2ms 19.2±0.1ms 1.00 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(False)
3.50±0.09ms 3.36±0.07ms 0.96 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(True)
6.1G 6.05G 0.99 io.csv.ReadCSVCParserLowMemory.peakmem_over_2gb_input
905±3μs 904±5μs 1.00 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'c')
1.50±0ms 1.49±0.01ms 1.00 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'python')
922±4μs 913±7μs 0.99 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'c')
1.52±0.01ms 1.51±0.02ms 0.99 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'python')
25.1±0.3ms 24.6±0.6ms 0.98 io.csv.ReadCSVCategorical.time_convert_direct('c')
231±7ms 222±2ms 0.96 io.csv.ReadCSVCategorical.time_convert_direct('python')
61.4±0.5ms 60.5±2ms 0.99 io.csv.ReadCSVCategorical.time_convert_post('c')
152±1ms 144±1ms 0.95 io.csv.ReadCSVCategorical.time_convert_post('python')
35.3±1ms 35.2±0.6ms 1.00 io.csv.ReadCSVComment.time_comment('c')
35.6±0.9ms 34.9±0.2ms 0.98 io.csv.ReadCSVComment.time_comment('python')
20.6±0.5ms 20.5±0.4ms 1.00 io.csv.ReadCSVConcatDatetime.time_read_csv
9.90±0.4ms 10.0±0.4ms 1.01 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
7.00±0.5ms 6.72±0.2ms 0.96 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
11.6±0.4ms 11.1±0.3ms 0.96 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
3.68±0.01ms 3.53±0.2ms 0.96 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('custom')
1.10±0ms 1.07±0.05ms 0.97 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('iso8601')
895±9μs 863±30μs 0.96 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('ymd')
938±40μs 940±30μs 1.00 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(None)
4.31±0.05ms 4.20±0.2ms 0.97 io.csv.ReadCSVDatePyarrowEngine.time_read_csv_index_col
47.3M 47.1M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('c')
63.8M 63.8M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('pyarrow')
217M 217M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('python')
10.0±0.5ms 9.39±0.2ms 0.94 io.csv.ReadCSVEngine.time_read_bytescsv('c')
6.83±0.3ms 6.92±0.4ms 1.01 io.csv.ReadCSVEngine.time_read_bytescsv('pyarrow')
279±3ms 278±20ms 1.00 io.csv.ReadCSVEngine.time_read_bytescsv('python')
10.0±0.4ms 9.63±0.1ms 0.96 io.csv.ReadCSVEngine.time_read_stringcsv('c')
7.61±0.2ms 7.60±0.2ms 1.00 io.csv.ReadCSVEngine.time_read_stringcsv('pyarrow')
275±5ms 276±4ms 1.00 io.csv.ReadCSVEngine.time_read_stringcsv('python')
788±7μs 771±30μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
1.83±0.02ms 1.79±0.04ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
788±10μs 775±30μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
1.10±0.01ms 1.09±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
1.10±0.01ms 1.13±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
1.09±0ms 1.08±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
787±10μs 772±30μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
1.83±0.02ms 1.80±0.04ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
782±10μs 766±20μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
1.09±0.01ms 1.07±0.04ms 0.97 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'high')
1.09±0ms 1.07±0.03ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
1.09±0.01ms 1.08±0.03ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
2.52±0.1ms 2.61±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
2.52±0.1ms 2.60±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
2.53±0.04ms 2.59±0.05ms 1.02 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', None)
2.14±0.03ms 2.13±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'high')
2.12±0.02ms 2.14±0ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'round_trip')
2.06±0.07ms 2.13±0.02ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', None)
2.61±0.02ms 2.60±0.02ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
2.63±0.03ms 2.62±0.02ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'round_trip')
2.59±0.02ms 2.61±0.04ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', None)
2.13±0.02ms 2.14±0.01ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'high')
2.15±0.01ms 2.13±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
2.14±0.03ms 2.13±0.01ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', None)
5.28±0.2ms 4.95±0.1ms 0.94 io.csv.ReadCSVIndexCol.time_read_csv_index_col
73.6±1ms 68.1±1ms 0.92 io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('c')
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('python')
804±30μs 781±4μs 0.97 io.csv.ReadCSVParseDates.time_baseline('c')
956±30μs 922±3μs 0.96 io.csv.ReadCSVParseDates.time_baseline('python')
3.05±0.01ms 2.94±0.06ms 0.96 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'c')
8.58±0.1ms 8.32±0.3ms 0.97 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'python')
7.07±0.08ms 6.66±0.06ms 0.94 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'c')
24.8±0.1ms 23.6±0.1ms 0.95 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'python')
3.43±0.02ms 3.24±0.1ms 0.95 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'c')
9.56±0.1ms 9.27±0.3ms 0.97 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'python')
9.22±0.09ms 9.44±0.2ms 1.02 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'c')
3.52±0.1ms 3.54±0.09ms 1.01 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'pyarrow')
38.7±0.3ms 38.6±0.3ms 1.00 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'python')
14.6±0.1ms 14.3±0.3ms 0.98 io.csv.ReadCSVSkipRows.time_skipprows(None, 'c')
3.54±0.2ms 3.53±0.06ms 1.00 io.csv.ReadCSVSkipRows.time_skipprows(None, 'pyarrow')
57.0±0.8ms 56.0±1ms 0.98 io.csv.ReadCSVSkipRows.time_skipprows(None, 'python')
10.1±0.6ms 11.1±0.9ms 1.10 io.csv.ReadCSVThousands.time_thousands(',', ',', 'c')
127±2ms 126±0.8ms 0.99 io.csv.ReadCSVThousands.time_thousands(',', ',', 'python')
9.56±0.1ms 10.0±0.1ms 1.05 io.csv.ReadCSVThousands.time_thousands(',', None, 'c')
53.8±1ms 53.8±0.3ms 1.00 io.csv.ReadCSVThousands.time_thousands(',', None, 'python')
10.2±0.5ms 10.9±0.9ms 1.07 io.csv.ReadCSVThousands.time_thousands('
126±5ms 120±3ms 0.95 io.csv.ReadCSVThousands.time_thousands('
9.63±0.4ms 10.0±0.1ms 1.04 io.csv.ReadCSVThousands.time_thousands('
53.3±2ms 51.8±0.9ms 0.97 io.csv.ReadCSVThousands.time_thousands('
1.30±0.05ms 1.32±0.06ms 1.02 io.csv.ReadUint64Integers.time_read_uint64
3.69±0.1ms 3.69±0.1ms 1.00 io.csv.ReadUint64Integers.time_read_uint64_na_values
3.43±0.09ms 3.45±0.08ms 1.00 io.csv.ReadUint64Integers.time_read_uint64_neg_values
127±0.4ms 126±0.8ms 0.99 io.csv.ToCSV.time_frame('long')
15.8±0.6ms 16.2±0.03ms 1.03 io.csv.ToCSV.time_frame('mixed')
117±0.4ms 116±3ms 0.99 io.csv.ToCSV.time_frame('wide')
9.25±0.3ms 9.51±0.07ms 1.03 io.csv.ToCSVDatetime.time_frame_date_formatting
3.63±0.1ms 3.74±0.02ms 1.03 io.csv.ToCSVDatetimeBig.time_frame(1000)
34.1±0.8ms 34.3±0.2ms 1.01 io.csv.ToCSVDatetimeBig.time_frame(10000)
344±4ms 338±10ms 0.98 io.csv.ToCSVDatetimeBig.time_frame(100000)
500±10ms 500±10ms 1.00 io.csv.ToCSVDatetimeIndex.time_frame_date_formatting_index
148±1ms 148±2ms 1.00 io.csv.ToCSVDatetimeIndex.time_frame_date_no_format_index
731±7ms 732±20ms 1.00 io.csv.ToCSVFloatFormatVariants.time_callable_format
808±20ms 794±3ms 0.98 io.csv.ToCSVFloatFormatVariants.time_new_style_brace_format
865±6ms 884±20ms 1.02 io.csv.ToCSVFloatFormatVariants.time_new_style_thousands_format
870±10ms 896±20ms 1.03 io.csv.ToCSVFloatFormatVariants.time_old_style_percent_format
659±9ms 664±10ms 1.01 io.csv.ToCSVIndexes.time_head_of_multiindex
658±10ms 664±20ms 1.01 io.csv.ToCSVIndexes.time_multiindex
670±9ms 656±2ms 0.98 io.csv.ToCSVIndexes.time_standard_index
191±5ms 192±1ms 1.00 io.csv.ToCSVMultiIndexUnusedLevels.time_full_frame
16.3±0.4ms 16.3±0.2ms 1.00 io.csv.ToCSVMultiIndexUnusedLevels.time_single_index_frame
18.7±0.5ms 18.9±0.1ms 1.01 io.csv.ToCSVMultiIndexUnusedLevels.time_sliced_frame
3.73±0.02ms 3.77±0.09ms 1.01 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'D')
3.65±0.1ms 3.76±0.02ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'h')
34.6±1ms 35.6±0.09ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'D')
35.2±1ms 35.8±0.3ms 1.02 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'h')
1.10±0.03ms 1.13±0ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(1000, 'D')
1.36±0.04ms 1.40±0.01ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(1000, 'h')
9.59±0.3ms 9.83±0.01ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(10000, 'D')
12.2±0.4ms 12.5±0.06ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(10000, 'h')
1.16±0.03ms 1.14±0ms 0.98 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(1000, 'D')
1.39±0.03ms 1.42±0.01ms 1.02 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(1000, 'h')
9.91±0.02ms 9.93±0.03ms 1.00 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(10000, 'D')
12.6±0.04ms 12.6±0.03ms 1.00 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(10000, 'h')
5.43±0.03ms 5.35±0.02ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'D')
5.46±0.04ms 5.25±0.1ms 0.96 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'h')
50.8±0.3ms 50.2±0.3ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'D')
50.9±0.5ms 50.3±0.2ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'h')
1.06±0ms 1.13±0ms 1.06 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(1000, 'D')
1.36±0.06ms 1.40±0ms 1.03 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(1000, 'h')
9.03±0.3ms 9.25±0.02ms 1.03 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(10000, 'D')
12.0±0.1ms 12.0±0.04ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(10000, 'h')
2.71±0.02ms 2.68±0.02ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(1000, 'D')
3.01±0.02ms 2.97±0.02ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(1000, 'h')
23.8±0.3ms 23.3±0.1ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(10000, 'D')
26.7±0.3ms 26.2±0.06ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(10000, 'h')

cc @WillAyd

@mroeschke mroeschke requested a review from WillAyd October 8, 2025 15:48
Member

@WillAyd WillAyd left a comment

What makes this faster than the original code? Seems like we've only added instructions to the conversion function(s), so I'm worried we are overlooking something

uint64_t uint_max, int *error, char tsep);
int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max,
int *error, char tsep);
int64_t str_to_int64(const char *p_item, char decimal_separator,
Member

I don't think we should add this argument to the int conversion functions - that repurposes these functions in a way that's not really clear.

Contributor Author

I removed it. Additionally, I am no longer checking explicitly if it's a float, just assigning the error code for invalid char.

} else {
*error = ERROR_OVERFLOW;
return 0;
break;
Member

Can we try to keep these as immediate returns? This opens up the door to ambiguous behavior of the parser in case of "multiple" failures.

Contributor Author

The changes in #62542 were motivated by this branch, where large numbers that look like floats were cast to string because of the overflow.

Can I still check the subsequent characters to change the error code? If not, I don't think there is anything to do in this PR, and it's best to revert the problematic commit.

Contributor Author

I put back the immediate return, but added an inline function before it to check if it's not an integer after the overflow.
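A minimal sketch of the shape described here, for illustration only: the overflow branch still returns immediately, but first calls a small helper that looks at the remaining characters and picks the error code. All `sketch_` names and the error values are assumptions made for this example, not the code in the PR, and signs, whitespace and thousands separators are left out.

```c
#include <ctype.h>
#include <stdint.h>

/* Placeholder error codes for this sketch; the values are assumptions. */
enum {
  SKETCH_OK = 0,
  SKETCH_ERROR_NO_DIGITS = 1,
  SKETCH_ERROR_OVERFLOW = 2,
  SKETCH_ERROR_INVALID_CHARS = 3
};

/* Hypothetical helper: called once an overflow has been detected.
 * Skips the remaining digits; if anything else follows (for example '.'
 * or 'e'), the text is not an integer, so report invalid chars instead
 * of overflow. */
static inline void sketch_error_after_overflow(const char *p, int *error) {
  while (isdigit((unsigned char)*p)) {
    p++;
  }
  *error = (*p == '\0') ? SKETCH_ERROR_OVERFLOW : SKETCH_ERROR_INVALID_CHARS;
}

/* Parse a non-negative decimal integer. */
static int64_t sketch_str_to_int64(const char *p, int *error) {
  int64_t number = 0;
  *error = SKETCH_OK;

  if (!isdigit((unsigned char)*p)) {
    *error = SKETCH_ERROR_NO_DIGITS;
    return 0;
  }
  while (isdigit((unsigned char)*p)) {
    int digit = *p - '0';
    /* number * 10 + digit would exceed INT64_MAX */
    if (number > (INT64_MAX - digit) / 10) {
      sketch_error_after_overflow(p, error);
      return 0; /* still an immediate return: exactly one error is set */
    }
    number = number * 10 + digit;
    p++;
  }
  if (*p != '\0') {
    *error = SKETCH_ERROR_INVALID_CHARS;
    return 0;
  }
  return number;
}
```

With this shape, each failing input sets a single error code before returning, so the "multiple failures" case resolves to one well-defined result.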

Member

Ah OK thanks - that's helpful. I somewhat disagree with the premise of that change, which casts to float even if it's a lossy operation. I understand that in some cases there is a desire for numeric operations on numbers like that, but it's unclear that this should take precedence over the string cast, which is in some sense more "value preserving".

The larger issue is that pandas does not have native support for Decimal precision types
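To make the "value preserving" point concrete, here is a small standalone sketch (not pandas code; the function name is hypothetical) that checks whether integer text survives a round trip through a `double`. It uses only the standard `strtod` and `snprintf`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the decimal integer text is reproduced exactly after being
 * converted to double and printed back without a fractional part,
 * 0 otherwise. */
static int sketch_roundtrips_through_double(const char *text) {
  char buf[64];
  double value = strtod(text, NULL);
  snprintf(buf, sizeof buf, "%.0f", value);
  return strcmp(buf, text) == 0;
}
```

For example, `"9007199254740992"` (2^53) round-trips, while `"9223372036854775809"` (2^63 + 1) does not, because the nearest `double` is 2^63. Keeping such a value as a string preserves every digit; casting it to float does not.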

@Alvaro-Kothe
Contributor Author

What makes this faster than the original code?

I don't put much weight on the performance increase that I reported in the first edit. I still need to update the description after all these changes; I'm just waiting until the changes I made are considered correct.

Additionally, the changes should only affect integer parsing; everything else can be considered noise.

@Alvaro-Kothe Alvaro-Kothe requested a review from WillAyd October 8, 2025 19:20
@Alvaro-Kothe
Contributor Author

@WillAyd I updated the benchmark results in the description. There were no significant changes.

return self->seen_uint && (self->seen_sint || self->seen_null);
}

static inline void check_for_invalid_char(const char *p_item, int *error) {
Member

Can you document what this function does? The name check_for_invalid_char is a bit too vague - this is better described as something like cast_char_p_as_float no?

Member

I'd also suggest that you either return int and drop the int * argument, or return something useful (ex: return the parsed float value) and then set the pointer value

Contributor Author

I added a Doxygen comment. It's also returning the pointer to the last verified character.

Contributor Author

I kept the error pointer in the function to prevent code duplication, and also because the main purpose of the function is just to assign a value to it. Since it now returns the position of the last verified character, it would be possible to change the error value outside the function, but I think it's cleaner the way it is.
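As a rough illustration of that interface (a documented helper that both sets the error through a pointer and returns where scanning stopped), a sketch could look like the following. The name `sketch_scan_digits`, the error value, and the exact stopping position are assumptions for this example; the real helper in the PR may differ.

```c
#include <ctype.h>

/* Placeholder error codes for this sketch; the values are assumptions. */
enum { SKETCH_OK = 0, SKETCH_ERROR_INVALID_CHARS = 3 };

/**
 * @brief Scan a run of ASCII digits and flag anything else as invalid.
 *
 * @param p_item NUL-terminated text to scan.
 * @param error  Set to SKETCH_ERROR_INVALID_CHARS if a non-digit character
 *               is found; left untouched otherwise.
 * @return Pointer to the position where scanning stopped (the terminating
 *         NUL when the whole string consists of digits).
 */
static inline const char *sketch_scan_digits(const char *p_item, int *error) {
  while (isdigit((unsigned char)*p_item)) {
    p_item++;
  }
  if (*p_item != '\0') {
    *error = SKETCH_ERROR_INVALID_CHARS;
  }
  return p_item;
}
```

A caller that prefers the alternative raised in the review could ignore the `error` argument and derive the result from the returned pointer instead, by testing whether it points at `'\0'`.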

p_item++;
}

while (*p_item != '\0' && isspace_ascii(*p_item)) {
Member

Can this be combined with the previous loop? Is there a reason for trailing whitespace to be handled specially, or is there a reason at all to allow whitespace?

Contributor Author

It needs a separate loop because this case should be invalid: "7890123 1351713789".

Member

Why do we allow trailing white space though?

Contributor Author

It is permitted below too if an overflow doesn't occur. I added it in this function to make it consistent.
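The two-loop structure discussed in this thread can be sketched as follows. This is an illustrative standalone function, not the PR's code: the name is hypothetical, and the standard `isspace` stands in for the `isspace_ascii` helper seen in the snippet.

```c
#include <ctype.h>

/* Returns 1 when the text is a run of digits optionally followed by
 * trailing whitespace only, 0 otherwise. */
static int sketch_is_integer_text(const char *p) {
  if (!isdigit((unsigned char)*p)) {
    return 0;
  }
  /* Loop 1: consume the digits. */
  while (isdigit((unsigned char)*p)) {
    p++;
  }
  /* Loop 2: consume trailing whitespace. Once this loop starts, digits are
   * no longer accepted. */
  while (*p != '\0' && isspace((unsigned char)*p)) {
    p++;
  }
  /* Anything left over (such as a second group of digits) is invalid. */
  return *p == '\0';
}
```

A single merged loop that accepts "digit or whitespace" would run to the end of "7890123 1351713789" and accept it, which is why the loops stay separate here.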
