gh-138213: Make csv.reader up to 2x faster #138214
Conversation
We need to consider not only the gain, but also the cost/benefit ratio. This change significantly complicates already complex code.

The benefit is not unconditional either. In the IN_FIELD state, PyUnicode_FindChar() is called 4 times. This can actually slow down the code for long non-ASCII lines.
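For intuition, here is a rough Python analogue of that concern (the helper names and the special-character set are illustrative, not the PR's actual C code): searching for each special character separately can traverse a long run of ordinary characters several times, while a per-character loop inspects each position only once.

```python
# Illustration of the review comment above, not the C implementation.
SPECIALS = (",", '"', "\r", "\n")

def next_special_multi_find(line, start):
    # One find() per special character: up to four passes over the
    # remaining text before the earliest hit is known.
    hits = [p for p in (line.find(ch, start) for ch in SPECIALS) if p != -1]
    return min(hits, default=len(line))

def next_special_single_scan(line, start):
    # A single left-to-right scan: each character is inspected once.
    for i in range(start, len(line)):
        if line[i] in SPECIALS:
            return i
    return len(line)
```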
Thank you for the review!

I agree that it increases complexity. I added an explanation before the relevant code; there are only two building blocks: a jump with PyUnicode_FindChar() and a slice copy. My thinking is that conceptually it's simple: process the whole field at once.

I fully agree there are scenarios where it worsens performance. My hunch is that those are CSV files where the parser spends significantly more time in states other than IN_FIELD. I'm not sure what the best benchmarking strategy is here, or what the best way of sharing the results is. I tried measuring a long non-ASCII line and still observed a significant benefit:
The benchmark, similar to the above:

```python
import csv
import io
import os
import pyperf
import random

NUM_ROWS = (1_000, 10_000)
NUM_COLS = (5, 10)
FIELD_LENGTH = (300, 1000)

CASES = [
    # (label, field_chars, delimiter, escapechar)
    (
        "nonascii_no_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        None,
    ),
    (
        "nonascii_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        "\\",
    ),
]


def generate_csv_data(rows, cols, field_len, ch, delim):
    # random.choices() so we're not cache-friendly
    field = "".join(random.choices(ch, k=field_len))
    row = delim.join([field] * cols)
    return os.linesep.join([row] * rows) + os.linesep


def benchmark_csv_reader(csv_data, delim, escapechar):
    rdr = csv.reader(
        io.StringIO(csv_data), delimiter=delim, escapechar=escapechar
    )
    for _ in rdr:
        pass


runner = pyperf.Runner()
for rows in NUM_ROWS:
    for cols in NUM_COLS:
        for field_len in FIELD_LENGTH:
            for label, ch, delim, esc in CASES:
                csv_data = generate_csv_data(rows, cols, field_len, ch, delim)
                runner.bench_func(
                    f"csv_reader({rows},{cols},{field_len})[{label}]",
                    benchmark_csv_reader,
                    csv_data,
                    delim,
                    esc,
                )
```

I updated the benchmark in the description to be more comprehensive.
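(As a usage note: a pyperf script like this can be run once per build with -o to write JSON results, e.g. main.json for the baseline and patched.json for this branch, and the two files can then be compared with python -m pyperf compare_to main.json patched.json; the file names are just an example.)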
The basic observation is that there's no need to process character by character and call the state machine while we're in a field (IN_FIELD, IN_QUOTED_FIELD). Most characters are ordinary (i.e. they're not delimiters, escapes, quotes, etc.), so we can find the next interesting character and copy the whole slice in between.
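To make that concrete, here is a minimal Python sketch of the unquoted-field case (illustrative only: the names are made up, and quoting, escaping and doublequote handling are ignored; the actual change lives in the C implementation of the csv module):

```python
def read_unquoted_field(line, start, delimiter=",", quotechar='"'):
    # Find the next "interesting" character; everything before it is ordinary
    # and can be copied in one slice instead of one state-machine step per
    # character.
    candidates = [line.find(ch, start) for ch in (delimiter, quotechar, "\r", "\n")]
    hits = [p for p in candidates if p != -1]
    end = min(hits, default=len(line))
    return line[start:end], end  # (field text, index of the next special character)

# read_unquoted_field("spam,eggs\r\n", 0) -> ('spam', 4)
```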
This is my very first C change in cpython, so I'm more than happy to pair with someone.

Benchmark
There's no pyperformance benchmark for csv.reader.

The script:
The results:
I observe similar results with real CSV files.
The environment: sudo ./python -m pyperf system tune ensured.

csv.reader calls the state machine for every character needlessly #138213

📚 Documentation preview 📚: https://cpython-previews--138214.org.readthedocs.build/