feat: add `separator` argument in `read_csv`/`scan_csv` #2989

raisadz · 2025-08-14T09:27:21Z

Closes #2930

What type of PR is this? (check all applicable)

Related issues

Related issue enh: add separator argument to read_csv / scan_csv #2930
Closes #<issue number>

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below

FBruzzesi

Thanks @raisadz - I left a comment regarding the special check for pyarrow. I am afraid that it would not fully achieve the goal.

Maybe what we could do is:

check that separator is not None at the top of the function and raise otherwise
then at each backend level, check if the separator was passed with its backend-specific argument name and, if so, raise a more informative error telling the user to pass separator instead of sep|parse_options.
Disclaimer: I am expecting no one to use this feature so far. Yet the problem is that this is actually a regression: parse_options (together with read_options) is the only way for pyarrow to specify arguments in read_csv. Therefore we would be allowing separator to be passed in a standard way, but basically disallowing any other argument.
The long way to do this is something along the following lines I think

elif impl is Implementation.PYARROW:
    if "parse_options" in kwargs:
        passed_options = kwargs.pop("parse_options")
        fields = (
            'quote_char',
            'double_quote',
            'escape_char',
            'newlines_in_values',
            'ignore_empty_lines',
            'invalid_row_handler',
        )
        parse_options = csv.ParseOptions(
            delimiter=separator, **{field: getattr(passed_options, field) for field in fields}
        )
    else:
        parse_options = csv.ParseOptions(delimiter=separator)
            
    native_frame = csv.read_csv(
        source, parse_options=parse_options, **kwargs
    )
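A self-contained sketch of the field-copying idea above. So that it runs without pyarrow, it uses a hypothetical stand-in dataclass, `StubParseOptions`, with only three of the fields and with defaults assumed to mirror pyarrow's; `rebuild_parse_options` and `COPIED_FIELDS` are illustrative names, not part of the PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StubParseOptions:
    """Hypothetical stand-in for pyarrow's csv.ParseOptions (subset of fields)."""

    delimiter: str = ","
    double_quote: bool = True
    ignore_empty_lines: bool = True


# Fields copied from the user-supplied options; `delimiter` is deliberately excluded
# because it is always taken from `separator`.
COPIED_FIELDS = ("double_quote", "ignore_empty_lines")


def rebuild_parse_options(
    separator: str, passed: StubParseOptions | None
) -> StubParseOptions:
    """Return options that use `separator` but keep every other user setting."""
    if passed is None:
        return StubParseOptions(delimiter=separator)
    copied = {name: getattr(passed, name) for name in COPIED_FIELDS}
    return StubParseOptions(delimiter=separator, **copied)


user = StubParseOptions(ignore_empty_lines=False)
merged = rebuild_parse_options(";", user)
print(merged)
```

The trade-off is the one raised in this thread: the tuple of copied field names has to be kept in sync with the real class by hand.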


FBruzzesi · 2025-08-16T20:03:26Z

narwhals/functions.py

+        if separator is not None and "parse_options" in kwargs:
+            msg = "Can't pass both `separator` and `parse_options`."
+            raise TypeError(msg)
        from pyarrow import csv  # ignore-banned-import

-        native_frame = csv.read_csv(source, **kwargs)
+        native_frame = csv.read_csv(
+            source, parse_options=csv.ParseOptions(delimiter=separator), **kwargs
+        )


I think this is a bit odd:

separator is not typed to be None

Even if that was the case, the following would not error in line 607:

nw.read_csv(..., separator=None, parse_options=csv.ParseOptions(...), backend=nw.Implementation.PYARROW)

However, then in line 613, we would call

csv.read_csv( source, parse_options=csv.ParseOptions(delimiter=None), parse_options=parse_options, ... )

which would end up raising an exception at this point
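The failure mode described here is plain Python behaviour and can be reproduced without pyarrow. In the sketch below, `read_csv_stub` is a hypothetical stand-in for the backend reader, not pyarrow's real function; it only shows that passing the same keyword both explicitly and through `**kwargs` raises a `TypeError` at call time.

```python
from typing import Any


def read_csv_stub(source: str, parse_options: Any = None, **kwargs: Any) -> str:
    """Hypothetical stand-in for a backend reader; it only echoes its arguments."""
    return f"read {source} with {parse_options!r}"


def call_with_duplicate_keyword() -> str:
    """Pass `parse_options` explicitly and again through **kwargs."""
    kwargs = {"parse_options": "options passed by the user"}
    try:
        read_csv_stub("file.csv", parse_options="options built from separator", **kwargs)
    except TypeError as exc:
        # Python rejects the call before the function body ever runs.
        return str(exc)
    return "no error"


print(call_with_duplicate_keyword())
```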

Should we handle the same for other backends? i.e. pandas-like check that sep is not passed, and below for lazy backends


raisadz · 2025-08-18T13:07:56Z

@FBruzzesi thank you for the review! I agree about the separators validation. I added some support functions that should check the passed kwargs now. In PyArrow, maybe we shouldn't type all the fields:

fields = (
    'quote_char',
    'double_quote',
    'escape_char',
    'newlines_in_values',
    'ignore_empty_lines',
    'invalid_row_handler',
)

as they might go out of date if PyArrow changes them and we just need to check delimiter? Please, let me know what you think

narwhals/functions.py

FBruzzesi

Thanks @raisadz - I am taking a closer look.

I am still not the biggest fan of the hassle for pyarrow users (which again, I don't expect to be many).

One part of me leans towards suggesting to completely remove **kwargs and allow only for explicit parameters we can have full control over. I am honestly not sure


FBruzzesi · 2025-08-18T18:38:44Z

narwhals/functions.py

+        validate_separator(separator, "delimiter", **kwargs)
+        validate_separator(separator, "delim", **kwargs)


Oh wow! TIL: duckdb and pyspark have two ways to pass the separator
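For readers without the diff at hand, here is a guess at what a check of this shape could look like. It is not the PR's actual implementation: the function name `check_separator_matches` is hypothetical, and only the call pattern (separator, native keyword name, `**kwargs`) is taken from the lines above.

```python
from typing import Any


def check_separator_matches(separator: str, native_name: str, **kwargs: Any) -> None:
    """Raise if a backend-specific keyword was passed and disagrees with `separator`."""
    if native_name in kwargs and kwargs[native_name] != separator:
        msg = (
            f"`separator` and `{native_name}` do not match: "
            f"`separator`={separator} and `{native_name}`={kwargs[native_name]}."
        )
        raise TypeError(msg)


# Backends with two spellings need one check per spelling.
user_kwargs = {"delim": "\t", "header": True}
check_separator_matches("\t", "delimiter", **user_kwargs)  # keyword absent: passes
check_separator_matches("\t", "delim", **user_kwargs)  # keyword matches: passes
```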

FBruzzesi · 2025-08-18T19:01:20Z

narwhals/functions.py

+        return kwargs
+    from pyarrow import csv  # ignore-banned-import
+
+    return {"parse_options": csv.ParseOptions(delimiter=separator)}


Nevermind I completely misread this

Fake panic review

The issue I have with this is that if any other argument was provided in parse_options then it will be silently ignored.

Say someone is calling the following:

nw.read_csv(file, separator=",", parse_options=csv.ParseOptions(ignore_empty_lines=False))

Then at the end of validate_separator_pyarrow, we will end up with csv.ParseOptions(delimiter=separator, ignore_empty_lines=True) (i.e. the default value), silently.

I agree that hardcoding fields as suggested in #2989 (review) is not ideal, yet pyarrow does not provide much else we can use. We could dynamically look up its __dir__ or use inspect.getmembers and exclude dunder methods, but then we would still end up with, for example, validate and equals, which are not attributes that can be set at instantiation.

Unsuccessful attempts:

inspect.signature

from inspect import signature
from pyarrow import csv

print(signature(csv.ParseOptions.__init__))

(self, /, *args, **kwargs)

dataclasses.fields

From the stubs I got tricked into thinking that it is a dataclass:

from dataclasses import fields
from pyarrow import csv

print(fields(csv.ParseOptions))

TypeError: must be called with a dataclass type or instance

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

Thanks @FBruzzesi!

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

A user won't need to pass both separator and delimiter: if "parse_options" not in kwargs, then we
return {"parse_options": csv.ParseOptions(delimiter=separator)}

right, so someone would only need to pass both separator and delimiter if they were specifying another parse option (like double_quote)

tbh I think this is fine

Since our default is:

separator: str = ","

and matches pyarrow's default:

delimiter: str = ","

dangotbanned · 2025-08-25

Alternative

ParseOptions.delimiter has higher precedence unless separator overrides the default.

In either case - every other argument is respected

Show definitions

from __future__ import annotations

from pyarrow import csv
from typing import Any


def merge_options(separator: str = ",", **kwargs: Any) -> dict[str, Any]:
    DEFAULT = ","  # noqa: N806
    if separator != DEFAULT:
        if opts := kwargs.pop("parse_options", None):
            opts.delimiter = separator
        else:
            opts = csv.ParseOptions(delimiter=separator)
        kwargs["parse_options"] = opts
    return kwargs


def display_merge(result: dict[str, Any]) -> None:
    if result and (options := result.pop("parse_options", None)):
        print(f"{options.delimiter=}\n{options.double_quote=}")
        if result:
            print(f"Remaining: {result!r}")
    elif result:
        print(f"Unrelated: {result!r}")
    else:
        print(f"Empty: {result!r}")

Would this behavior not be more ideal?

# NOTE: `double_quote` default is `True`
user_options = csv.ParseOptions(delimiter="\t", double_quote=False)

>>> display_merge(merge_options(parse_options=user_options))
options.delimiter='\t'
options.double_quote=False

>>> display_merge(merge_options(",", parse_options=user_options))
options.delimiter='\t'
options.double_quote=False

>>> display_merge(merge_options("?", parse_options=user_options))
options.delimiter='?'
options.double_quote=False

>>> display_merge(merge_options())
Empty: {}

>>> display_merge(merge_options("\t"))
options.delimiter='\t'
options.double_quote=True

>>> display_merge(
...     merge_options(
...         "?",
...         parse_options=csv.ParseOptions(double_quote=False),
...         read_options=csv.ReadOptions(),
...     )
... )
options.delimiter='?'
options.double_quote=False
Remaining: {'read_options': <pyarrow._csv.ReadOptions object at 0x000001F29413AD40>}

Although it is cython, the important part is they're all properties with setters
https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/python/pyarrow/_csv.pyx#L435-L543

ParseOptions.delimiter has higher precedence unless separator overrides the default.

hmmm yes, that does sound better actually, thanks!

Thanks Marco

with (#2989 (comment)) in mind ...

Not sure if this is on duckdb or sqlframe, but sep has higher precedence than delim

from pathlib import Path
from typing import Any, Mapping

import polars as pl
from sqlframe.duckdb import DuckDBSession

data: Mapping[str, Any] = {"a": [1, 2, 3], "b": [4.5, 6.7, 8.9], "z": ["x", "y", "w"]}
fp = Path.cwd() / "data" / "file.csv"
# Write a tab-separated file, then read it back with conflicting `sep` and `delim`.
pl.DataFrame(data).write_csv(fp, separator="\t")
session = DuckDBSession.builder.getOrCreate()

>>> session.read.format("csv").load(str(fp), sep="\t", delim="?").collect()
[Row(a=1, b=4.5, z='x'), Row(a=2, b=6.7, z='y'), Row(a=3, b=8.9, z='w')]

Personally, I think we're best off just defining rule(s) and documenting what we do for each backend if needed.

So instead of

>>> nw.scan_csv("...", backend="sqlframe", separator=",", sep="?", delim="\t", delimiter="!")
TypeError: `separator` and `sep` do not match: `separator`=, and `sep`=?.

We either:

pick one and replace it - leaving everything else unchanged

say we'll pick ... then ... and then ...

If any backend raises on non-matching arguments - I say let them - as it saves us the hassle 😅
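One way to read the "pick one and replace it" rule is sketched below. The helper name `replace_native_separator` is hypothetical, and which native keyword gets picked for each backend is an assumption left to the caller; the point is only that the chosen keyword is overwritten and nothing else in `**kwargs` is touched.

```python
from typing import Any


def replace_native_separator(
    separator: str, native_name: str, **kwargs: Any
) -> dict[str, Any]:
    """Set `native_name` to `separator`, leaving every other keyword unchanged."""
    return {**kwargs, native_name: separator}


# `sep` is overwritten; other spellings such as `delim` are passed through as-is,
# so any conflict between them is left for the backend to resolve or reject.
result = replace_native_separator("\t", "sep", sep="?", delim="!", header=True)
print(result)
```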


MarcoGorelli · 2025-10-17T10:35:17Z

Hi - do you still have time / interest to work on this? If so, I think Dan's suggestion in #2989 (comment) is good, and it may even simplify the implementation
If that's addressed, I think we can get it in?

dangotbanned · 2025-10-18T11:36:34Z

The main source of conflicts will be coming from this PR

test: Simplify read_scan_test, spark session #3024

100% recommend checking that out first, since the tests for the whole file have been rewritten (sorry 😅)

MarcoGorelli · 2025-10-20T08:34:43Z

Looking at this again, I think I'd misunderstood @dangotbanned's request. I'm not sure I agree with

unless separator overrides the default

This is Dan's suggestion, which I'd find surprising:

nw.scan_csv(..., separator=',', parse_options=ParseOptions(delim='\t')) uses '\t' as separator
nw.scan_csv(..., separator='?', parse_options=ParseOptions(delim='\t')) uses '?' as separator

This would be the "native arguments always take precedence" approach, which would also be surprising:

nw.scan_csv(..., separator=',', parse_options=ParseOptions(double_quote=False)) uses ',' as separator
nw.scan_csv(..., separator='?', parse_options=ParseOptions(double_quote=False)) uses ',' as separator, because the parse options include delim=',' as the default

What this PR suggests, on the other hand, is:

nw.scan_csv(..., separator=',', parse_options=ParseOptions(double_quote=False)) uses ',' as separator
nw.scan_csv(..., separator='?', parse_options=ParseOptions(double_quote=False)) raises, and the user is forced to write nw.scan_csv(..., separator='?', parse_options=ParseOptions(double_quote=False, delim='?'))

It's true that the current PR's approach is more verbose, but I think it's also the safest and least surprising
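The three behaviours compared above can be written as small pure functions to make the differences concrete. This is only an illustration: the function names are made up, and `native` stands for whatever delimiter ends up stored in the user's parse options (which is the default "," when the user did not set one).

```python
from __future__ import annotations

DEFAULT_SEPARATOR = ","  # default for both `separator` and the native delimiter


def override_unless_default(separator: str, native: str) -> str:
    """`separator` wins only when it differs from the default."""
    return separator if separator != DEFAULT_SEPARATOR else native


def native_takes_precedence(separator: str, native: str) -> str:
    """The delimiter stored in the native options always wins."""
    return native


def strict(separator: str, native: str) -> str:
    """Raise unless both values agree, so the user must spell out both."""
    if separator != native:
        msg = f"`separator`={separator!r} does not match native delimiter {native!r}."
        raise TypeError(msg)
    return separator


print(override_unless_default(",", "\t"))
print(native_takes_precedence("?", ","))
```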

Sorry for the conflicting requests here - I'd suggest that if we resolve the merge conflicts then we can move forwards

dangotbanned · 2025-10-20T09:00:07Z

#2989 (comment)

@MarcoGorelli I'm gonna have to push back a little here.

Vs some I gave in (#2989 (comment)), your examples seem to be missing the fact that , is the default both for us and pyarrow (IIRC all the others as well?)

I think a user is more likely to encounter the error by our default getting in the way, when writing something like:

nw.scan_csv(..., parse_options=ParseOptions(another_option=..., delim='\t'))

Rather than somehow specifying this:

nw.scan_csv(..., separator=',', parse_options=ParseOptions(delim='\t'))

I also don't like that (#2989 (comment)) can break currently working code and would set a bad precedent (:wink:) for any time in the future we want to standardise on other **kwargs.
#2930 gave me the impression that this is something you may be interested in elsewhere

MarcoGorelli · 2025-10-20T09:24:19Z

your examples seem to be missing the fact that , is the default both for us and pyarrow

Not sure what you mean here, I wrote "because the parse options include delim=',' as the default"

I also don't like that (#2989 (comment)) can break currently working code

True, but you can do a search of its usage on github: https://github.com/search?q=%22nw.scan_csv%22&type=code&p=5. There are few enough cases of nw.scan_csv and nw.read_csv that it's possible to read through all of them, and none of them include ParseOptions

our default getting in the way

I think this is OK. We can raise an informative error message for PyArrow that makes it very clear what the user is expected to do. For now this is the strictest and safest option, and I have a general preference for starting strict and potentially relaxing later if necessary

If we eventually had Narwhals equivalents of all of quote_char, double_quote, newlines_in_values, ignore_empty_lines, then that would mostly obviate the need for users to pass in ParseOptions themselves, so this issue would go away anyway

raisadz added 6 commits August 14, 2025 10:00

feat: add separator argument to read_csv / scan_csv

409dd4b

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

8143ae3

add stable api

9d6e850

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

9000f88

add coverage

b99dfcd

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

6b90890

raisadz added the pyspark Issue is related to pyspark backend label Aug 14, 2025

add session for sqlframe for coverage

c4ff1c6

raisadz marked this pull request as ready for review August 14, 2025 13:37

FBruzzesi reviewed Aug 16, 2025


raisadz and others added 6 commits August 17, 2025 17:24

Update narwhals/functions.py

00f0bc2

Co-authored-by: Francesco Bruzzesi <[email protected]>

add separator validation

af21d2f

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

59a5b6b

fix merge

d0c7283

modify kwargs for pyarrow

ff68327

restore header that was there before

b7cb02c

dangotbanned reviewed Aug 18, 2025


FBruzzesi reviewed Aug 18, 2025


raisadz and others added 3 commits August 19, 2025 14:16

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

7cfae8f

Update narwhals/functions.py

126c5c4

Co-authored-by: Francesco Bruzzesi <[email protected]>

Merge remote-tracking branch 'origin/feat/add-separator-arg' into fea…

cf7c67d

…t/add-separator-arg

dangotbanned mentioned this pull request Aug 19, 2025

Establish safe patterns using Implementation.UNKNOWN #2786

Open

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

8ace0f9

MarcoGorelli reviewed Aug 23, 2025


raisadz added 2 commits August 23, 2025 09:38

make validate support functions private

512c529

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

ec12904

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

bf4c269

raisadz added 3 commits October 20, 2025 10:29

readd tests

003d3e7

Merge remote-tracking branch 'upstream/main' into feat/add-separator-arg

4fd93fc

add pyarrow parse_options for coverage

7629ce7
