maurycy
Contributor

@maurycy maurycy commented Aug 30, 2025

The purpose of this PR is not performance but adopting the modern PyUnicodeWriter API (https://docs.python.org/dev/c-api/unicode.html#c.PyUnicodeWriter), similarly to gh-125196.

There's a risk that the code becomes slower, as happened in gh-133968. I'd prefer to optimize it after getting an ack that this is the right direction.

Similarly to #138214 (comment), I'm not sure what the best benchmarking strategy is, beyond a simple snippet. Perhaps we need something like https://github.com/nineteendo/jsonyx-performance-tests, but for CSV.
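
For illustration, a "simple snippet" of the kind mentioned above could be as small as the following. This is only a sketch using the stdlib timeit module; the helper names write_rows and time_writerows and the repeat/number values are made up for this example, and it is no substitute for a proper pyperf run.

```python
import csv
import io
import timeit


def write_rows(rows):
    """Write rows to an in-memory buffer and return the CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def time_writerows(rows, repeat=5, number=20):
    """Return the best per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(lambda: write_rows(rows), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    rows = [list(range(10))] * 1_000
    best = time_writerows(rows)
    print(f"writerows, 1000 integer rows: {best * 1e6:.1f} us per call")
```

Running the same file under the main build and under the PR build gives a rough first comparison; taking the minimum over several repeats reduces noise but does not replace pyperf's calibration.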

I believe that csv.reader (ReaderObj) could also use PyUnicodeWriter. If my reasoning is sound, there's interest, and this code is OK, I can handle that as well.

@maurycy maurycy marked this pull request as ready for review August 30, 2025 21:58
@maurycy
Contributor Author

maurycy commented Aug 30, 2025

cc @vstinner

Member

@picnixz picnixz left a comment

Please:

  • don't add comments for self-explanatory code;
  • follow PEP 7 for C code;
  • revert unrelated changes;
  • provide benchmarks to show whether this speeds things up or not.

c == dialect->escapechar ||
c == dialect->quotechar) {
if (dialect->escapechar == NOT_SET) {
PyErr_SetString(self->error_obj, "need to escape, but no escapechar set");
Member

Don't change this.
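
For context, the error path in the snippet above can be reached from Python. The sketch below assumes the default excel dialect with quoting disabled; the helper name try_write is made up for this example.

```python
import csv
import io


def try_write(row, **dialect_options):
    """Return the CSV text for one row, or the csv.Error message on failure."""
    buf = io.StringIO()
    try:
        csv.writer(buf, **dialect_options).writerow(row)
    except csv.Error as exc:
        return f"csv.Error: {exc}"
    return buf.getvalue()


# QUOTE_NONE forbids quoting, so the delimiter inside the field would have to
# be escaped, but no escapechar is configured: the writer raises csv.Error.
print(try_write(["a,b"], quoting=csv.QUOTE_NONE))

# With an escapechar set, the same row is written with the delimiter escaped.
print(repr(try_write(["a,b"], quoting=csv.QUOTE_NONE, escapechar="\\")))
```

Any refactoring of the writer should keep both outcomes unchanged, including the error text.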

Comment on lines +1244 to +1246
bool first_field_was_empty_like = false;
bool first_field_was_none = false;
bool first_field_was_quoted_in_loop = false;
Member

Why are those now needed?
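
One plausible reason, stated here as an assumption and not taken from the diff: csv.writer treats a record consisting of a single empty field as a special case, and the existing code detects it through the record buffer being empty. The behavior itself is observable from Python (the helper name render_row is made up for this example):

```python
import csv
import io


def render_row(row):
    """Return the text csv.writer emits for a single row (default dialect)."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    return buf.getvalue()


print(repr(render_row([])))        # '\r\n'    (no fields: just the terminator)
print(repr(render_row([""])))      # '""\r\n'  (one empty field is quoted)
print(repr(render_row(["", ""])))  # ',\r\n'   (two empty fields need no quotes)
```

The quoted form keeps a one-field empty record distinguishable from a blank line, so whatever replaces the record buffer has to preserve that distinction.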

@bedevere-app

bedevere-app bot commented Aug 30, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@picnixz
Member

picnixz commented Aug 30, 2025

And yes, it could be meaningful to use PyUnicodeWriter instead of manual buffer construction, but we need to check whether this really improves things before deciding whether we can do it.

@maurycy
Contributor Author

maurycy commented Sep 1, 2025

Marking this as a draft.

There's a performance regression:

| Benchmark                            | bench_csv_writer.main | bench_csv_writer.csv-writer-pyunicodewriter |
|--------------------------------------|-----------------------|---------------------------------------------|
| writerows 10 integer rows            | 10.6 us               | 12.1 us: 1.14x slower                       |
| writerows 10 complex string rows     | 6.64 us               | 7.98 us: 1.20x slower                       |
| writerows 1000 integer rows          | 964 us                | 1.11 ms: 1.15x slower                       |
| writerows 1000 complex string rows   | 575 us                | 699 us: 1.22x slower                        |
| writerows 10000 integer rows         | 9.76 ms               | 11.1 ms: 1.14x slower                       |
| writerows 10000 complex string rows  | 5.69 ms               | 6.95 ms: 1.22x slower                       |
| Geometric mean                       | (ref)                 | 1.18x slower                                |

The script:
import csv
import io
import pyperf

runner = pyperf.Runner()

INT_ROW = list(range(10))
COMPLEX_STRING_ROW = ['a,b', 'c"d', 'e\nf'] * 3 + ['ghi']

def write_the_rows(rows):
    f = io.StringIO()
    writer = csv.writer(f)
    writer.writerows(rows)

for num_rows in (10, 1_000, 10_000):
    int_rows = [INT_ROW] * num_rows
    complex_rows = [COMPLEX_STRING_ROW] * num_rows

    runner.bench_func(
        f'writerows {num_rows} integer rows',
        write_the_rows,
        int_rows
    )

    runner.bench_func(
        f'writerows {num_rows} complex string rows',
        write_the_rows,
        complex_rows
    )
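
As a sanity check on what the two row types exercise, the snippet below renders one row of each kind. It reuses INT_ROW and COMPLEX_STRING_ROW from the script; the helper name render_one is made up for this example, and the default excel dialect is assumed.

```python
import csv
import io

INT_ROW = list(range(10))
COMPLEX_STRING_ROW = ['a,b', 'c"d', 'e\nf'] * 3 + ['ghi']


def render_one(row):
    """Return the text csv.writer emits for a single row (default dialect)."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    return buf.getvalue()


# Integer rows never need quoting: each value is converted with str().
print(repr(render_one(INT_ROW)))

# The string row covers the delimiter, an embedded quote (doubled on output)
# and a newline, so nine of its ten fields go through the quoting path.
print(repr(render_one(COMPLEX_STRING_ROW)))
```

The integer rows mostly measure conversion and plain copying, while the string rows measure the quoting path.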

There are two obvious issues:

  • the code naively calls PyUnicodeWriter_WriteChar() for every character, even where a single PyUnicodeWriter_WriteSubstring() call would be possible;
  • each field is still processed in two passes (although this is also true of the current version).
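
To illustrate the first point, here is a Python analogy, not the C implementation: out.write(ch) stands in for PyUnicodeWriter_WriteChar() and out.write(field[start:end]) for PyUnicodeWriter_WriteSubstring(). The sketch assumes a field that is always quoted, with doublequote enabled; both function names are made up for this example.

```python
import io


def write_quoted_per_char(out, field, quotechar='"'):
    """Append a quoted field one character at a time."""
    out.write(quotechar)
    for ch in field:
        if ch == quotechar:
            out.write(quotechar)  # double an embedded quote character
        out.write(ch)
    out.write(quotechar)


def write_quoted_by_runs(out, field, quotechar='"'):
    """Append a quoted field, copying each run between quote chars at once."""
    out.write(quotechar)
    start = 0
    while True:
        pos = field.find(quotechar, start)
        if pos < 0:
            out.write(field[start:])  # trailing run with no special chars
            break
        # Copy the run up to and including the quote, then double the quote.
        out.write(field[start:pos + 1])
        out.write(quotechar)
        start = pos + 1
    out.write(quotechar)


for sample in ['ghi', 'c"d', '"', '']:
    a, b = io.StringIO(), io.StringIO()
    write_quoted_per_char(a, sample)
    write_quoted_by_runs(b, sample)
    print(repr(sample), repr(a.getvalue()), a.getvalue() == b.getvalue())
```

The run-based variant makes one write call per run plus one per embedded quote, instead of one per character. Whether that recovers the regression in C would still have to be measured.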

I need some time to address them, perhaps with jumping similar to gh-138214.

@maurycy maurycy marked this pull request as draft September 1, 2025 01:13
/* grow record buffer if necessary */
if (!join_check_rec_size(self, self->rec_len + terminator_len))
return 0;
if (PyUnicodeWriter_WriteChar(writer, c) < 0) {
Member

It would be interesting to try calling WriteSubstring() once for a whole run of characters rather than writing characters one by one.
