Alvaro-Kothe
Contributor


Benchmarks

asv continuous -f 1.1 -E virtualenv:3.13 "db31f6a38353a311cc471eb98506470b39c676d8~" HEAD -b io.csv

Performance increase:

Change Before [d8b3ff3] <main~12> After [be21b2e] <perf/read-csv> Ratio Benchmark (Parameter)
- 10.1±0.2ms 8.78±0.2ms 0.87 io.csv.ReadCSVThousands.time_thousands(',', None, 'c')
- 10.0±0.2ms 8.31±0.3ms 0.83 io.csv.ReadCSVThousands.time_thousands('

asv compare db31f6a38353a311cc471eb98506470b39c676d8~ HEAD

All benchmarks:

Change Before [d8b3ff3] <main~12> After [be21b2e] <perf/read-csv> Ratio Benchmark (Parameter)
17.4±0.3ms 17.1±0.1ms 0.98 io.csv.ParseDateComparison.time_read_csv_dayfirst(False)
3.45±0.02ms 3.39±0.01ms 0.98 io.csv.ParseDateComparison.time_read_csv_dayfirst(True)
18.5±0.2ms 18.2±0.1ms 0.98 io.csv.ParseDateComparison.time_to_datetime_dayfirst(False)
3.58±0.05ms 3.54±0.03ms 0.99 io.csv.ParseDateComparison.time_to_datetime_dayfirst(True)
18.2±0.1ms 18.1±0.1ms 0.99 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(False)
3.46±0.06ms 3.35±0.06ms 0.97 io.csv.ParseDateComparison.time_to_datetime_format_DD_MM_YYYY(True)
4.9G 4.71G 0.96 io.csv.ReadCSVCParserLowMemory.peakmem_over_2gb_input
897±8μs 895±5μs 1.00 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'c')
1.50±0.02ms 1.49±0.01ms 0.99 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False, 'python')
1.10±0.2ms 913±3μs ~0.83 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'c')
1.76±0.3ms 1.52±0.01ms ~0.86 io.csv.ReadCSVCachedParseDates.time_read_csv_cached(True, 'python')
25.5±1ms 24.0±0.7ms 0.94 io.csv.ReadCSVCategorical.time_convert_direct('c')
227±6ms 222±3ms 0.98 io.csv.ReadCSVCategorical.time_convert_direct('python')
61.7±0.5ms 60.5±0.5ms 0.98 io.csv.ReadCSVCategorical.time_convert_post('c')
150±2ms 153±1ms 1.02 io.csv.ReadCSVCategorical.time_convert_post('python')
37.5±1ms 36.6±0.3ms 0.98 io.csv.ReadCSVComment.time_comment('c')
38.0±0.9ms 36.6±0.3ms 0.96 io.csv.ReadCSVComment.time_comment('python')
20.5±0.7ms 21.3±0.2ms 1.04 io.csv.ReadCSVConcatDatetime.time_read_csv
10.1±0.1ms 10.0±0.02ms 0.99 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('')
7.19±0.3ms 6.76±0.05ms 0.94 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('0')
11.6±0.1ms 11.2±0.1ms 0.96 io.csv.ReadCSVConcatDatetimeBadDateValue.time_read_csv('nan')
3.56±0.2ms 3.53±0.02ms 0.99 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('custom')
1.11±0.01ms 1.11±0.01ms 1.00 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('iso8601')
895±6μs 901±20μs 1.01 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv('ymd')
950±30μs 963±3μs 1.01 io.csv.ReadCSVDInferDatetimeFormat.time_read_csv(None)
4.32±0.07ms 4.40±0.2ms 1.02 io.csv.ReadCSVDatePyarrowEngine.time_read_csv_index_col
47.2M 47.4M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('c')
63.5M 63.6M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('pyarrow')
217M 216M 1.00 io.csv.ReadCSVEngine.peakmem_read_csv('python')
11.2±1ms 9.76±0.1ms ~0.87 io.csv.ReadCSVEngine.time_read_bytescsv('c')
6.62±0.4ms 6.50±0.3ms 0.98 io.csv.ReadCSVEngine.time_read_bytescsv('pyarrow')
288±2ms 296±4ms 1.03 io.csv.ReadCSVEngine.time_read_bytescsv('python')
12.1±1ms 10.1±0.2ms ~0.84 io.csv.ReadCSVEngine.time_read_stringcsv('c')
7.58±0.2ms 7.64±0.5ms 1.01 io.csv.ReadCSVEngine.time_read_stringcsv('pyarrow')
288±3ms 290±4ms 1.01 io.csv.ReadCSVEngine.time_read_stringcsv('python')
797±20μs 790±8μs 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
1.85±0.01ms 1.87±0.01ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'round_trip')
810±10μs 791±6μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', None)
1.11±0.02ms 1.11±0ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
1.12±0.02ms 1.10±0ms 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'round_trip')
1.12±0.01ms 1.11±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', None)
794±20μs 785±8μs 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'high')
1.85±0.01ms 1.86±0.01ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', 'round_trip')
798±10μs 784±10μs 0.98 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '.', None)
1.10±0.01ms 1.10±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'high')
1.12±0.01ms 1.10±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'round_trip')
1.11±0.01ms 1.10±0.01ms 0.99 io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', None)
2.52±0.1ms 2.65±0.02ms 1.05 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
2.51±0.09ms 2.63±0.03ms 1.05 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'round_trip')
2.59±0.03ms 2.66±0.04ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', None)
2.04±0.08ms 2.13±0.01ms 1.04 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'high')
2.04±0.07ms 2.15±0.02ms 1.05 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', 'round_trip')
2.13±0.07ms 2.14±0.02ms 1.00 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '_', None)
2.58±0.02ms 2.65±0.01ms 1.03 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'high')
2.57±0.01ms 2.62±0.03ms 1.02 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', 'round_trip')
2.61±0.01ms 2.65±0.05ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '.', None)
2.13±0.02ms 2.16±0.03ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'high')
2.13±0.02ms 2.14±0.03ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', 'round_trip')
2.11±0.01ms 2.13±0.01ms 1.01 io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(';', '_', None)
6.50±1ms 4.94±0.1ms ~0.76 io.csv.ReadCSVIndexCol.time_read_csv_index_col
68.6±1ms 70.7±1ms 1.03 io.csv.ReadCSVMemMapUTF8.time_read_memmapped_utf8
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('c')
0 0 n/a io.csv.ReadCSVMemoryGrowth.mem_parser_chunks('python')
802±30μs 837±20μs 1.04 io.csv.ReadCSVParseDates.time_baseline('c')
967±50μs 971±10μs 1.00 io.csv.ReadCSVParseDates.time_baseline('python')
2.98±0.1ms 3.04±0ms 1.02 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'c')
8.32±0.2ms 8.70±0.2ms 1.04 io.csv.ReadCSVParseSpecialDate.time_read_special_date('hm', 'python')
6.99±0.3ms 7.20±0.08ms 1.03 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'c')
24.2±0.9ms 25.1±0.2ms 1.04 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mY', 'python')
3.33±0.1ms 3.42±0.02ms 1.03 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'c')
9.19±0.2ms 9.36±0.08ms 1.02 io.csv.ReadCSVParseSpecialDate.time_read_special_date('mdY', 'python')
9.60±0.03ms 9.12±0.1ms 0.95 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'c')
3.53±0.08ms 3.59±0.09ms 1.02 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'pyarrow')
38.4±0.2ms 39.0±1ms 1.02 io.csv.ReadCSVSkipRows.time_skipprows(10000, 'python')
14.9±0.2ms 13.9±0.08ms 0.94 io.csv.ReadCSVSkipRows.time_skipprows(None, 'c')
3.59±0.06ms 3.51±0.06ms 0.98 io.csv.ReadCSVSkipRows.time_skipprows(None, 'pyarrow')
56.8±0.7ms 57.7±1ms 1.02 io.csv.ReadCSVSkipRows.time_skipprows(None, 'python')
10.4±0.6ms 9.54±1ms 0.91 io.csv.ReadCSVThousands.time_thousands(',', ',', 'c')
126±1ms 119±0.7ms 0.95 io.csv.ReadCSVThousands.time_thousands(',', ',', 'python')
- 10.1±0.2ms 8.78±0.2ms 0.87 io.csv.ReadCSVThousands.time_thousands(',', None, 'c')
55.1±2ms 55.2±0.6ms 1.00 io.csv.ReadCSVThousands.time_thousands(',', None, 'python')
10.5±0.2ms 10.0±0.4ms 0.95 io.csv.ReadCSVThousands.time_thousands('
126±4ms 120±3ms 0.96 io.csv.ReadCSVThousands.time_thousands('
- 10.0±0.2ms 8.31±0.3ms 0.83 io.csv.ReadCSVThousands.time_thousands('
55.0±0.3ms 50.9±1ms 0.93 io.csv.ReadCSVThousands.time_thousands('
1.40±0.03ms 1.38±0.05ms 0.98 io.csv.ReadUint64Integers.time_read_uint64
3.71±0.1ms 3.67±0.1ms 0.99 io.csv.ReadUint64Integers.time_read_uint64_na_values
3.66±0.1ms 3.52±0.08ms 0.96 io.csv.ReadUint64Integers.time_read_uint64_neg_values
125±0.3ms 128±2ms 1.02 io.csv.ToCSV.time_frame('long')
15.7±0.5ms 16.2±0.04ms 1.03 io.csv.ToCSV.time_frame('mixed')
116±3ms 113±5ms 0.97 io.csv.ToCSV.time_frame('wide')
9.36±0.1ms 9.15±0.04ms 0.98 io.csv.ToCSVDatetime.time_frame_date_formatting
3.73±0.01ms 3.73±0.01ms 1.00 io.csv.ToCSVDatetimeBig.time_frame(1000)
34.4±0.1ms 34.3±0.09ms 1.00 io.csv.ToCSVDatetimeBig.time_frame(10000)
343±5ms 335±10ms 0.97 io.csv.ToCSVDatetimeBig.time_frame(100000)
491±6ms 460±9ms 0.94 io.csv.ToCSVDatetimeIndex.time_frame_date_formatting_index
155±1ms 146±4ms 0.95 io.csv.ToCSVDatetimeIndex.time_frame_date_no_format_index
718±20ms 716±10ms 1.00 io.csv.ToCSVFloatFormatVariants.time_callable_format
805±20ms 833±20ms 1.03 io.csv.ToCSVFloatFormatVariants.time_new_style_brace_format
893±30ms 861±30ms 0.96 io.csv.ToCSVFloatFormatVariants.time_new_style_thousands_format
888±7ms 852±2ms 0.96 io.csv.ToCSVFloatFormatVariants.time_old_style_percent_format
667±9ms 638±20ms 0.96 io.csv.ToCSVIndexes.time_head_of_multiindex
670±10ms 682±20ms 1.02 io.csv.ToCSVIndexes.time_multiindex
670±10ms 636±2ms 0.95 io.csv.ToCSVIndexes.time_standard_index
187±1ms 183±0.9ms 0.98 io.csv.ToCSVMultiIndexUnusedLevels.time_full_frame
15.8±0.06ms 15.7±0.5ms 1.00 io.csv.ToCSVMultiIndexUnusedLevels.time_single_index_frame
18.7±0.6ms 18.0±0.1ms 0.96 io.csv.ToCSVMultiIndexUnusedLevels.time_sliced_frame
3.55±0.02ms 3.52±0.03ms 0.99 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'D')
3.63±0.1ms 3.66±0.1ms 1.01 io.csv.ToCSVPeriod.time_frame_period_formatting(1000, 'h')
34.5±0.9ms 34.2±2ms 0.99 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'D')
34.9±1ms 34.3±1ms 0.98 io.csv.ToCSVPeriod.time_frame_period_formatting(10000, 'h')
1.10±0.03ms 1.13±0.01ms 1.03 io.csv.ToCSVPeriod.time_frame_period_formatting_default(1000, 'D')
1.37±0.04ms 1.36±0.05ms 0.99 io.csv.ToCSVPeriod.time_frame_period_formatting_default(1000, 'h')
9.49±0.2ms 9.84±0.01ms 1.04 io.csv.ToCSVPeriod.time_frame_period_formatting_default(10000, 'D')
12.1±0.3ms 12.2±0.2ms 1.01 io.csv.ToCSVPeriod.time_frame_period_formatting_default(10000, 'h')
1.16±0.03ms 1.11±0.04ms 0.96 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(1000, 'D')
1.43±0.05ms 1.42±0.05ms 1.00 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(1000, 'h')
9.92±0.09ms 9.62±0.3ms 0.97 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(10000, 'D')
12.7±0.2ms 12.2±0.4ms 0.96 io.csv.ToCSVPeriod.time_frame_period_formatting_default_explicit(10000, 'h')
5.38±0.03ms 5.41±0.03ms 1.01 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'D')
5.44±0.06ms 5.38±0.01ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(1000, 'h')
50.8±0.3ms 50.0±0.04ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'D')
51.1±1ms 50.4±0.1ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index(10000, 'h')
1.13±0.01ms 1.13±0ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(1000, 'D')
1.40±0.01ms 1.39±0ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(1000, 'h')
9.43±0.2ms 9.31±0.01ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(10000, 'D')
12.0±0.06ms 11.9±0.04ms 1.00 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default(10000, 'h')
2.70±0.02ms 2.67±0.01ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(1000, 'D')
2.99±0.05ms 2.95±0.02ms 0.99 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(1000, 'h')
23.8±0.2ms 23.2±0.2ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(10000, 'D')
26.6±0.09ms 26.1±0.1ms 0.98 io.csv.ToCSVPeriodIndex.time_frame_period_formatting_index_default_explicit(10000, 'h')

cc @WillAyd

mroeschke requested a review from WillAyd on October 8, 2025 15:48
Member

WillAyd left a comment

What makes this faster than the original code? It seems like we've only added instructions to the conversion function(s), so I'm worried we are overlooking something.

uint64_t uint_max, int *error, char tsep);
int64_t str_to_int64(const char *p_item, int64_t int_min, int64_t int_max,
int *error, char tsep);
int64_t str_to_int64(const char *p_item, char decimal_separator,
Member

I don't think we should add this argument to the int conversion functions - that repurposes these functions in a way that's not really clear.

Contributor Author

I removed it. Additionally, I no longer check explicitly whether the value is a float; I just assign the error code for an invalid character.

} else {
*error = ERROR_OVERFLOW;
return 0;
break;
Member

Can we try to keep these as immediate returns? This opens the door to ambiguous parser behavior in the case of "multiple" failures.

Contributor Author

The changes in #62542 were motivated by this branch, where large numbers that indicate a float were cast to string because of the overflow.

Can I still check the subsequent characters to change the error code? If not, I don't think there is anything to do in this PR, and it's best to revert the problematic commit.

Contributor Author

I put back the immediate return, but added an inline function before it that checks whether the value is not an integer after the overflow.

@Alvaro-Kothe
Contributor Author

> What makes this faster than the original code?

I don't place much value on the performance increase that I reported in the first edit. I still need to update the description after all these changes; I'm just waiting until the changes I made are considered correct.

Additionally, the changes should only affect integer parsing; everything else can be considered noise.

Alvaro-Kothe requested a review from WillAyd on October 8, 2025 19:20