
Conversation

maurycy
Contributor

@maurycy maurycy commented Aug 28, 2025

The basic observation is that there's no need to process the input character by character, calling the state machine for every character, while we're in a field (IN_FIELD, IN_QUOTED_FIELD).

Most characters are ordinary (i.e., not delimiters, escapes, quotes, etc.), so we can find the next interesting character and copy the whole slice in between.
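
In Python terms, the approach looks roughly like this (a sketch of the idea only; the actual change is in Modules/_csv.c and operates on the C level):

# Sketch: instead of one state-machine step per character, find the
# next interesting character and copy the ordinary run as one slice.
def split_simple_line(line, delim=","):
    fields = []
    pos = 0
    while pos <= len(line):
        end = line.find(delim, pos)   # next interesting character
        if end == -1:
            end = len(line)
        fields.append(line[pos:end])  # copy the whole slice at once
        pos = end + 1
    return fields

# split_simple_line("aaa,bbb,ccc") -> ["aaa", "bbb", "ccc"]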

This is my very first C change in CPython, so I'm more than happy to pair with someone.

Benchmark

There's no pyperformance benchmark for csv.reader.

The script:
import csv
import io
import os
import pyperf
import random

NUM_ROWS = (1_000, 10_000)
NUM_COLS = (5, 10)
FIELD_LENGTH = (300, 1000)

CASES = [
    # (label, field_chars, delimiter, escapechar)
    (
        "ascii",
        "a",
        None,
        None,
    ),
    (
        "nonascii_no_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        None,
    ),
    (
        "nonascii_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        "\\",
    ),
]


def generate_csv_data(rows, cols, field_len, ch, delim):
    # random.choices() so the data is not too cache-friendly
    field = "".join(random.choices(ch, k=field_len))
    actual_delim = delim if delim is not None else ","
    row = actual_delim.join([field] * cols)
    return os.linesep.join([row] * rows) + os.linesep


def benchmark_csv_reader(csv_data, delim, escapechar):
    kwargs = {"delimiter": delim, "escapechar": escapechar}
    active_kwargs = {
        key: value for key, value in kwargs.items() if value is not None
    }
    rdr = csv.reader(io.StringIO(csv_data), **active_kwargs)
    for _ in rdr:
        pass


runner = pyperf.Runner()

for rows in NUM_ROWS:
    for cols in NUM_COLS:
        for field_len in FIELD_LENGTH:
            for label, ch, delim, esc in CASES:
                csv_data = generate_csv_data(rows, cols, field_len, ch, delim)
                runner.bench_func(
                    f"csv_reader({rows},{cols},{field_len})[{label}]",
                    benchmark_csv_reader,
                    csv_data,
                    delim,
                    esc,
                )

The results:

| Benchmark | bench_csv_reader.main | bench_csv_reader.csv-read-chunks |
|---|---|---|
| csv_reader(1000,5,300)[ascii] | 5.45 ms | 2.05 ms: 2.66x faster |
| csv_reader(1000,5,300)[nonascii_no_escape] | 5.68 ms | 2.67 ms: 2.13x faster |
| csv_reader(1000,5,300)[nonascii_escape] | 5.55 ms | 2.60 ms: 2.13x faster |
| csv_reader(1000,5,1000)[ascii] | 17.0 ms | 6.65 ms: 2.55x faster |
| csv_reader(1000,5,1000)[nonascii_no_escape] | 18.2 ms | 7.97 ms: 2.29x faster |
| csv_reader(1000,5,1000)[nonascii_escape] | 17.8 ms | 7.80 ms: 2.29x faster |
| csv_reader(1000,10,300)[ascii] | 10.4 ms | 3.89 ms: 2.66x faster |
| csv_reader(1000,10,300)[nonascii_no_escape] | 10.9 ms | 5.08 ms: 2.15x faster |
| csv_reader(1000,10,300)[nonascii_escape] | 11.2 ms | 5.81 ms: 1.94x faster |
| csv_reader(1000,10,1000)[ascii] | 38.2 ms | 17.5 ms: 2.18x faster |
| csv_reader(1000,10,1000)[nonascii_no_escape] | 40.8 ms | 21.0 ms: 1.94x faster |
| csv_reader(1000,10,1000)[nonascii_escape] | 40.6 ms | 22.1 ms: 1.84x faster |
| csv_reader(10000,5,300)[ascii] | 60.9 ms | 28.1 ms: 2.17x faster |
| csv_reader(10000,5,300)[nonascii_no_escape] | 64.8 ms | 33.7 ms: 1.93x faster |
| csv_reader(10000,5,300)[nonascii_escape] | 64.7 ms | 34.6 ms: 1.87x faster |
| csv_reader(10000,5,1000)[ascii] | 193 ms | 90.7 ms: 2.13x faster |
| csv_reader(10000,5,1000)[nonascii_no_escape] | 206 ms | 102 ms: 2.01x faster |
| csv_reader(10000,5,1000)[nonascii_escape] | 207 ms | 106 ms: 1.96x faster |
| csv_reader(10000,10,300)[ascii] | 119 ms | 54.1 ms: 2.20x faster |
| csv_reader(10000,10,300)[nonascii_no_escape] | 129 ms | 69.5 ms: 1.86x faster |
| csv_reader(10000,10,300)[nonascii_escape] | 128 ms | 72.4 ms: 1.76x faster |
| csv_reader(10000,10,1000)[ascii] | 383 ms | 177 ms: 2.17x faster |
| csv_reader(10000,10,1000)[nonascii_no_escape] | 408 ms | 215 ms: 1.90x faster |
| csv_reader(10000,10,1000)[nonascii_escape] | 409 ms | 222 ms: 1.84x faster |
| Geometric mean | (ref) | 2.09x faster |

I observe similar results with real CSV files.

The environment:

% ./python -c "import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))"
'--enable-lto' '--with-optimizations'

System tuning was ensured with sudo ./python -m pyperf system tune.
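
The comparison tables can be reproduced with the standard pyperf workflow, roughly along these lines (the JSON file names are illustrative; each file comes from running the script above against the corresponding build):

% ./python bench_csv_reader.py -o bench_csv_reader.main.json
% ./python bench_csv_reader.py -o bench_csv_reader.csv-read-chunks.json
% ./python -m pyperf compare_to bench_csv_reader.main.json bench_csv_reader.csv-read-chunks.json --table --table-format md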


📚 Documentation preview 📚: https://cpython-previews--138214.org.readthedocs.build/

@maurycy maurycy marked this pull request as ready for review August 28, 2025 01:27
@maurycy maurycy requested a review from AA-Turner as a code owner August 28, 2025 01:27
Member

@serhiy-storchaka serhiy-storchaka left a comment


We need to consider not only the gain, but also the cost/benefit ratio. This change significantly complicates already complex code.

The benefit is not unconditional either. In the IN_FIELD state, PyUnicode_FindChar() is called 4 times. This can actually slow down the code for long non-ASCII lines.

@maurycy
Contributor Author

maurycy commented Aug 28, 2025

@serhiy-storchaka

Thank you for the review!

We need to consider not only the gain, but also the cost/benefit ratio. This change significantly complicates an already complex code.

I agree that it increases complexity.

I added an explanation before the switch block, and two macros:

https://github.com/python/cpython/pull/138214/files#diff-38fcce6bb475616052f5c9a0973eefd49489a4dff719f30e407534258e2a3ec3R1030-R1080

There are only two building blocks: jump ahead with PyUnicode_FindChar, or parse each character with PyUnicode_READ_CHAR. This might have been obscured by duplication before.
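
In Python terms, the two building blocks map roughly onto the following (a sketch only; the C code works on the underlying buffer with the macros above):

# Building block 1: jump ahead to the next interesting character and
# treat everything before it as one ordinary chunk (PyUnicode_FindChar
# on the C level, one call per special character).
def jump(line, pos, interesting):
    found = [p for c in interesting if (p := line.find(c, pos)) != -1]
    nxt = min(found, default=len(line))
    return line[pos:nxt], nxt

# Building block 2: read one character and advance by one, as the old
# loop did for every position (PyUnicode_READ_CHAR on the C level).
def read_one(line, pos):
    return line[pos], pos + 1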

My thinking is that csv.reader is a global hot path, one of the most common uses of Python. Tuning this code has real positive impact, and it's an explainable algorithmic improvement (without relying on low-level SIMD magic).

I believe that conceptually it's simple: process the whole field at once.

The benefit is not unconditional either. In the IN_FIELD state, PyUnicode_FindChar() is called 4 times. This can actually slow down the code for long non-ASCII lines.

The ideal would be PyUnicode_strcspn(). :-)
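
That is, something with the semantics of C's strcspn(), sketched below in Python; such a helper is hypothetical and doesn't exist in the C API today:

def unicode_strcspn(s, reject, start=0):
    # Length of the run from `start` containing none of the rejected
    # characters, i.e. how far one combined scan could jump instead of
    # one PyUnicode_FindChar call per special character.
    for i in range(start, len(s)):
        if s[i] in reject:
            return i - start
    return len(s) - start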

I fully agree there are scenarios where it worsens performance. My hunch is that it would worsen for CSV files with hardly any field content (e.g., where the parser spends significantly more time in states other than IN_FIELD and IN_QUOTED_FIELD). How important are they?

What's the best way of sharing results here? I already feel that all these compare_to --table --table-format md outputs are overwhelming.

I'm not sure what the best benchmarking strategy is here. Unfortunately, all the combinations from the benchmark with --rigorous already take around 30 minutes (despite NVMe, an i9-12900K, and 128G of memory), so I cannot sample the parameter ranges randomly. I will think about how to approach this.

I tried measuring a long non-ASCII line and still observed a significant benefit:

| Benchmark | bench_csv_reader.main | bench_csv_reader.csv-read-chunks |
|---|---|---|
| csv_reader(1000,5,300)[nonascii_no_escape] | 5.74 ms | 2.72 ms: 2.11x faster |
| csv_reader(1000,5,300)[nonascii_escape] | 5.50 ms | 2.60 ms: 2.12x faster |
| csv_reader(1000,5,1000)[nonascii_no_escape] | 18.1 ms | 7.90 ms: 2.30x faster |
| csv_reader(1000,5,1000)[nonascii_escape] | 17.8 ms | 7.75 ms: 2.29x faster |
| csv_reader(1000,10,300)[nonascii_no_escape] | 11.3 ms | 5.38 ms: 2.11x faster |
| csv_reader(1000,10,300)[nonascii_escape] | 10.9 ms | 5.45 ms: 2.00x faster |
| csv_reader(1000,10,1000)[nonascii_no_escape] | 40.8 ms | 21.1 ms: 1.93x faster |
| csv_reader(1000,10,1000)[nonascii_escape] | 40.8 ms | 22.3 ms: 1.83x faster |
| csv_reader(10000,5,300)[nonascii_no_escape] | 65.0 ms | 33.7 ms: 1.93x faster |
| csv_reader(10000,5,300)[nonascii_escape] | 64.8 ms | 35.5 ms: 1.83x faster |
| csv_reader(10000,5,1000)[nonascii_no_escape] | 208 ms | 103 ms: 2.02x faster |
| csv_reader(10000,5,1000)[nonascii_escape] | 208 ms | 109 ms: 1.90x faster |
| csv_reader(10000,10,300)[nonascii_no_escape] | 128 ms | 70.6 ms: 1.81x faster |
| csv_reader(10000,10,300)[nonascii_escape] | 128 ms | 73.2 ms: 1.74x faster |
| csv_reader(10000,10,1000)[nonascii_no_escape] | 407 ms | 213 ms: 1.91x faster |
| csv_reader(10000,10,1000)[nonascii_escape] | 406 ms | 226 ms: 1.80x faster |
| Geometric mean | (ref) | 1.97x faster |

The benchmark, similar to the one above:
import csv
import io
import os
import pyperf
import random

NUM_ROWS = (1_000, 10_000)
NUM_COLS = (5, 10)
FIELD_LENGTH = (300, 1000)

CASES = [
    # (label, field_chars, delimiter, escapechar)
    (
        "nonascii_no_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        None,
    ),
    (
        "nonascii_escape",
        "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικ",
        "λ",
        "\\",
    ),
]


def generate_csv_data(rows, cols, field_len, ch, delim):
    # random.choices() so the data is not too cache-friendly
    field = "".join(random.choices(ch, k=field_len))
    row = delim.join([field] * cols)
    return os.linesep.join([row] * rows) + os.linesep


def benchmark_csv_reader(csv_data, delim, escapechar):
    rdr = csv.reader(
        io.StringIO(csv_data), delimiter=delim, escapechar=escapechar
    )
    for _ in rdr:
        pass


runner = pyperf.Runner()

for rows in NUM_ROWS:
    for cols in NUM_COLS:
        for field_len in FIELD_LENGTH:
            for label, ch, delim, esc in CASES:
                csv_data = generate_csv_data(rows, cols, field_len, ch, delim)
                runner.bench_func(
                    f"csv_reader({rows},{cols},{field_len})[{label}]",
                    benchmark_csv_reader,
                    csv_data,
                    delim,
                    esc,
                )

I updated the benchmark in the description to be more comprehensive.

@maurycy maurycy changed the title gh-138213: Make csv.reader 1.4x faster gh-138213: Make csv.reader 1.4-2x faster Aug 28, 2025
@maurycy maurycy changed the title gh-138213: Make csv.reader 1.4-2x faster gh-138213: Make csv.reader 2x faster Aug 28, 2025
@maurycy maurycy changed the title gh-138213: Make csv.reader 2x faster gh-138213: Make csv.reader up to 2x faster Aug 28, 2025