
Conversation

maurycy
Contributor

@maurycy maurycy commented Aug 28, 2025

The basic observation is that there's no need to process the input character by character, calling the state machine for every character, while we're in a field (IN_FIELD, IN_QUOTED_FIELD).

Most characters are ordinary (i.e., not delimiters, escapes, quotes, etc.), so we can find the next interesting character and copy the whole slice in between.
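
In Python terms, the approach looks roughly like this (a sketch of the idea only; the actual change is in Modules/_csv.c and operates on the C level):

# Sketch: instead of one state-machine step per character, find the
# next interesting character and copy the ordinary run as one slice.
def split_simple_line(line, delim=","):
    fields = []
    pos = 0
    while pos <= len(line):
        end = line.find(delim, pos)   # next interesting character
        if end == -1:
            end = len(line)
        fields.append(line[pos:end])  # copy the whole slice at once
        pos = end + 1
    return fields

# split_simple_line("aaa,bbb,ccc") -> ["aaa", "bbb", "ccc"]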

This is my very first C change in CPython, so I'm more than happy to pair with someone.

Benchmark

There's no pyperformance benchmark for csv.reader.

The script:
import csv
import io
import os
import pyperf
import random

NUM_ROWS = (1_000, 10_000)
NUM_COLS = (5, 10)
FIELD_LENGTH = (300, 1000)

CASES = [
    # (label, field_chars, delimiter, escapechar)
    (
        "ascii",
        "a",
        None,
        None,
    ),
    (
        "nonascii_no_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        None,
    ),
    (
        "nonascii_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        "\\",
    ),
]


def generate_csv_data(rows, cols, field_len, ch, delim):
    # random.choices() so the data is not too cache-friendly
    field = "".join(random.choices(ch, k=field_len))
    actual_delim = delim if delim is not None else ","
    row = actual_delim.join([field] * cols)
    return os.linesep.join([row] * rows) + os.linesep


def benchmark_csv_reader(csv_data, delim, escapechar):
    kwargs = {"delimiter": delim, "escapechar": escapechar}
    active_kwargs = {
        key: value for key, value in kwargs.items() if value is not None
    }
    rdr = csv.reader(io.StringIO(csv_data), **active_kwargs)
    for _ in rdr:
        pass


runner = pyperf.Runner()

for rows in NUM_ROWS:
    for cols in NUM_COLS:
        for field_len in FIELD_LENGTH:
            for label, ch, delim, esc in CASES:
                csv_data = generate_csv_data(rows, cols, field_len, ch, delim)
                runner.bench_func(
                    f"csv_reader({rows},{cols},{field_len})[{label}]",
                    benchmark_csv_reader,
                    csv_data,
                    delim,
                    esc,
                )

The results:

| Benchmark | bench_csv_reader.main | bench_csv_reader.csv-read-chunks |
|---|---|---|
| csv_reader(1000,5,300)[ascii] | 5.45 ms | 2.05 ms: 2.66x faster |
| csv_reader(1000,5,300)[nonascii_no_escape] | 5.68 ms | 2.67 ms: 2.13x faster |
| csv_reader(1000,5,300)[nonascii_escape] | 5.55 ms | 2.60 ms: 2.13x faster |
| csv_reader(1000,5,1000)[ascii] | 17.0 ms | 6.65 ms: 2.55x faster |
| csv_reader(1000,5,1000)[nonascii_no_escape] | 18.2 ms | 7.97 ms: 2.29x faster |
| csv_reader(1000,5,1000)[nonascii_escape] | 17.8 ms | 7.80 ms: 2.29x faster |
| csv_reader(1000,10,300)[ascii] | 10.4 ms | 3.89 ms: 2.66x faster |
| csv_reader(1000,10,300)[nonascii_no_escape] | 10.9 ms | 5.08 ms: 2.15x faster |
| csv_reader(1000,10,300)[nonascii_escape] | 11.2 ms | 5.81 ms: 1.94x faster |
| csv_reader(1000,10,1000)[ascii] | 38.2 ms | 17.5 ms: 2.18x faster |
| csv_reader(1000,10,1000)[nonascii_no_escape] | 40.8 ms | 21.0 ms: 1.94x faster |
| csv_reader(1000,10,1000)[nonascii_escape] | 40.6 ms | 22.1 ms: 1.84x faster |
| csv_reader(10000,5,300)[ascii] | 60.9 ms | 28.1 ms: 2.17x faster |
| csv_reader(10000,5,300)[nonascii_no_escape] | 64.8 ms | 33.7 ms: 1.93x faster |
| csv_reader(10000,5,300)[nonascii_escape] | 64.7 ms | 34.6 ms: 1.87x faster |
| csv_reader(10000,5,1000)[ascii] | 193 ms | 90.7 ms: 2.13x faster |
| csv_reader(10000,5,1000)[nonascii_no_escape] | 206 ms | 102 ms: 2.01x faster |
| csv_reader(10000,5,1000)[nonascii_escape] | 207 ms | 106 ms: 1.96x faster |
| csv_reader(10000,10,300)[ascii] | 119 ms | 54.1 ms: 2.20x faster |
| csv_reader(10000,10,300)[nonascii_no_escape] | 129 ms | 69.5 ms: 1.86x faster |
| csv_reader(10000,10,300)[nonascii_escape] | 128 ms | 72.4 ms: 1.76x faster |
| csv_reader(10000,10,1000)[ascii] | 383 ms | 177 ms: 2.17x faster |
| csv_reader(10000,10,1000)[nonascii_no_escape] | 408 ms | 215 ms: 1.90x faster |
| csv_reader(10000,10,1000)[nonascii_escape] | 409 ms | 222 ms: 1.84x faster |
| Geometric mean | (ref) | 2.09x faster |

I observe similar results with real CSV files.

The environment:

% ./python -c "import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))"
'--enable-lto' '--with-optimizations'

System tuning was ensured with sudo ./python -m pyperf system tune.
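
The comparison tables can be reproduced with the standard pyperf workflow, roughly along these lines (the JSON file names are illustrative; each file comes from running the script above against the corresponding build):

% ./python bench_csv_reader.py -o bench_csv_reader.main.json
% ./python bench_csv_reader.py -o bench_csv_reader.csv-read-chunks.json
% ./python -m pyperf compare_to bench_csv_reader.main.json bench_csv_reader.csv-read-chunks.json --table --table-format md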


📚 Documentation preview 📚: https://cpython-previews--138214.org.readthedocs.build/

@maurycy maurycy marked this pull request as ready for review August 28, 2025 01:27
@maurycy maurycy requested a review from AA-Turner as a code owner August 28, 2025 01:27
Member

@serhiy-storchaka serhiy-storchaka left a comment


We need to consider not only the gain, but also the cost/benefit ratio. This change significantly complicates already complex code.

The benefit is not unconditional either. In the IN_FIELD state, PyUnicode_FindChar() is called 4 times. This can actually slow down the code for long non-ASCII lines.

@maurycy
Contributor Author

maurycy commented Aug 28, 2025

@serhiy-storchaka

Thank you for the review!

We need to consider not only the gain, but also the cost/benefit ratio. This change significantly complicates an already complex code.

I agree that it increases complexity.

I added an explanation before the switch block, and two macros:

https://github.com/python/cpython/pull/138214/files#diff-38fcce6bb475616052f5c9a0973eefd49489a4dff719f30e407534258e2a3ec3R1030-R1080

There are only two building blocks: jump ahead with PyUnicode_FindChar, or parse each character with PyUnicode_READ_CHAR. This might have been obscured by duplication before.
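
In Python terms, the two building blocks map roughly onto the following (a sketch only; the C code works on the underlying buffer with the macros above):

# Building block 1: jump ahead to the next interesting character and
# treat everything before it as one ordinary chunk (PyUnicode_FindChar
# on the C level, one call per special character).
def jump(line, pos, interesting):
    found = [p for c in interesting if (p := line.find(c, pos)) != -1]
    nxt = min(found, default=len(line))
    return line[pos:nxt], nxt

# Building block 2: read one character and advance by one, as the old
# loop did for every position (PyUnicode_READ_CHAR on the C level).
def read_one(line, pos):
    return line[pos], pos + 1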

My thinking is that csv.reader is a global hot path, one of the most common uses of Python. Tuning this code has real positive impact, and it's an explainable algorithmic improvement (without relying on low-level SIMD magic).

I believe that conceptually it's simple: process the whole field at once.

The benefit is not unconditional either. In the IN_FIELD state, PyUnicode_FindChar() is called 4 times. This can actually slow down the code for long non-ASCII lines.

The ideal would be PyUnicode_strcspn(). :-)
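
That is, something with the semantics of C's strcspn(), sketched below in Python; such a helper is hypothetical and doesn't exist in the C API today:

def unicode_strcspn(s, reject, start=0):
    # Length of the run from `start` containing none of the rejected
    # characters, i.e. how far one combined scan could jump instead of
    # one PyUnicode_FindChar call per special character.
    for i in range(start, len(s)):
        if s[i] in reject:
            return i - start
    return len(s) - start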

I fully agree there are scenarios where it worsens performance. My hunch is that it would worsen for CSV files with hardly any field content (e.g., where the parser spends significantly more time in states other than IN_FIELD and IN_QUOTED_FIELD). How important are they?

What's the best way of sharing results here? I already feel that all these compare_to --table --table-format md outputs are overwhelming.

I'm not sure what the best benchmarking strategy is here. Unfortunately, all the combinations from the benchmark with --rigorous already take around 30 minutes (despite NVMe, an i9-12900K, and 128G of memory), so I cannot sample the parameter ranges randomly. I will think about how to approach this.

I tried measuring a long non-ASCII line and still observed a significant benefit:

| Benchmark | bench_csv_reader.main | bench_csv_reader.csv-read-chunks |
|---|---|---|
| csv_reader(1000,5,300)[nonascii_no_escape] | 5.74 ms | 2.72 ms: 2.11x faster |
| csv_reader(1000,5,300)[nonascii_escape] | 5.50 ms | 2.60 ms: 2.12x faster |
| csv_reader(1000,5,1000)[nonascii_no_escape] | 18.1 ms | 7.90 ms: 2.30x faster |
| csv_reader(1000,5,1000)[nonascii_escape] | 17.8 ms | 7.75 ms: 2.29x faster |
| csv_reader(1000,10,300)[nonascii_no_escape] | 11.3 ms | 5.38 ms: 2.11x faster |
| csv_reader(1000,10,300)[nonascii_escape] | 10.9 ms | 5.45 ms: 2.00x faster |
| csv_reader(1000,10,1000)[nonascii_no_escape] | 40.8 ms | 21.1 ms: 1.93x faster |
| csv_reader(1000,10,1000)[nonascii_escape] | 40.8 ms | 22.3 ms: 1.83x faster |
| csv_reader(10000,5,300)[nonascii_no_escape] | 65.0 ms | 33.7 ms: 1.93x faster |
| csv_reader(10000,5,300)[nonascii_escape] | 64.8 ms | 35.5 ms: 1.83x faster |
| csv_reader(10000,5,1000)[nonascii_no_escape] | 208 ms | 103 ms: 2.02x faster |
| csv_reader(10000,5,1000)[nonascii_escape] | 208 ms | 109 ms: 1.90x faster |
| csv_reader(10000,10,300)[nonascii_no_escape] | 128 ms | 70.6 ms: 1.81x faster |
| csv_reader(10000,10,300)[nonascii_escape] | 128 ms | 73.2 ms: 1.74x faster |
| csv_reader(10000,10,1000)[nonascii_no_escape] | 407 ms | 213 ms: 1.91x faster |
| csv_reader(10000,10,1000)[nonascii_escape] | 406 ms | 226 ms: 1.80x faster |
| Geometric mean | (ref) | 1.97x faster |

The benchmark, similar to the one above:
import csv
import io
import os
import pyperf
import random

NUM_ROWS = (1_000, 10_000)
NUM_COLS = (5, 10)
FIELD_LENGTH = (300, 1000)

CASES = [
    # (label, field_chars, delimiter, escapechar)
    (
        "nonascii_no_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        None,
    ),
    (
        "nonascii_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        "\\",
    ),
]


def generate_csv_data(rows, cols, field_len, ch, delim):
    # random.choices() so the data is not too cache-friendly
    field = "".join(random.choices(ch, k=field_len))
    row = delim.join([field] * cols)
    return os.linesep.join([row] * rows) + os.linesep


def benchmark_csv_reader(csv_data, delim, escapechar):
    rdr = csv.reader(
        io.StringIO(csv_data), delimiter=delim, escapechar=escapechar
    )
    for _ in rdr:
        pass


runner = pyperf.Runner()

for rows in NUM_ROWS:
    for cols in NUM_COLS:
        for field_len in FIELD_LENGTH:
            for label, ch, delim, esc in CASES:
                csv_data = generate_csv_data(rows, cols, field_len, ch, delim)
                runner.bench_func(
                    f"csv_reader({rows},{cols},{field_len})[{label}]",
                    benchmark_csv_reader,
                    csv_data,
                    delim,
                    esc,
                )

I updated the benchmark in the description to be more comprehensive.

@maurycy maurycy changed the title gh-138213: Make csv.reader 1.4x faster gh-138213: Make csv.reader 1.4-2x faster Aug 28, 2025
@maurycy maurycy changed the title gh-138213: Make csv.reader 1.4-2x faster gh-138213: Make csv.reader 2x faster Aug 28, 2025
@maurycy maurycy changed the title gh-138213: Make csv.reader 2x faster gh-138213: Make csv.reader up to 2x faster Aug 28, 2025