
Conversation

@raisadz (Contributor) commented Aug 14, 2025

Closes #2930

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@raisadz raisadz added the pyspark Issue is related to pyspark backend label Aug 14, 2025
@raisadz raisadz marked this pull request as ready for review August 14, 2025 13:37
@FBruzzesi (Member) left a comment

Thanks @raisadz - I left a comment regarding the special check for pyarrow. I am afraid that it would not fully achieve the goal.

Maybe what we could do is:

  1. check that separator is not None at the top of the function and raise otherwise
  2. then at each backend level, check if the separator was passed with its specific backend argument name and, if that's the case, raise a more informative error specifying to use separator instead of sep|parse_options.

Disclaimer: I am expecting no one to use this feature so far. Yet the only problem with that is that this is actually a regression: parse_options (together with read_options) is the only way for pyarrow to specify arguments in read_csv. Therefore we are enabling passing separator in a standard way but basically disallowing passing any other argument.

The long way to do this is something along the following lines, I think:
elif impl is Implementation.PYARROW:
    if "parse_options" in kwargs:
        passed_options = kwargs.pop("parse_options")
        fields = (
            "quote_char",
            "double_quote",
            "escape_char",
            "newlines_in_values",
            "ignore_empty_lines",
            "invalid_row_handler",
        )
        parse_options = csv.ParseOptions(
            delimiter=separator,
            **{field: getattr(passed_options, field) for field in fields},
        )
    else:
        parse_options = csv.ParseOptions(delimiter=separator)

    native_frame = csv.read_csv(source, parse_options=parse_options, **kwargs)
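Steps 1 and 2 above could be sketched roughly as follows. This is only an illustration of the suggested shape, not the PR's actual implementation; the helper name `validate_separator` does appear later in this thread, but the body and error messages here are guesses.

```python
# Hypothetical sketch of steps 1 and 2: validate `separator` up front,
# then reject a backend-specific alias with a pointer to `separator`.
# Not the actual narwhals code; messages are illustrative.

def validate_separator(separator, backend_arg_name, **kwargs):
    # Step 1: reject a missing separator at the top of the function.
    if separator is None:
        msg = "`separator` must be a string, got None."
        raise TypeError(msg)
    # Step 2: reject the backend-specific spelling with a more
    # informative error telling the user to use `separator` instead.
    if backend_arg_name in kwargs:
        msg = (
            f"Received backend-specific argument `{backend_arg_name}`; "
            "use `separator` instead."
        )
        raise TypeError(msg)
```

Each backend would then call this with its own alias (e.g. `"sep"` for pandas-like, `"delimiter"` for pyarrow) before dispatching to the native reader.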

Comment on lines 607 to 614

+ if separator is not None and "parse_options" in kwargs:
+     msg = "Can't pass both `separator` and `parse_options`."
+     raise TypeError(msg)
  from pyarrow import csv  # ignore-banned-import

- native_frame = csv.read_csv(source, **kwargs)
+ native_frame = csv.read_csv(
+     source, parse_options=csv.ParseOptions(delimiter=separator), **kwargs
+ )
Member commented:

I think this is a bit odd:

  1. separator is not typed to be None

  2. Even if that was the case, the following would not error in line 607:

     nw.read_csv(..., separator=None, parse_options=csv.ParseOptions(...), backend=nw.Implementation.PYARROW)

     However, then in line 613, we would call

     csv.read_csv(
         source, parse_options=csv.ParseOptions(delimiter=None), parse_options=parse_options, ...
     )

     which would end up raising an exception at this point

  3. Should we handle the same for other backends? i.e. pandas-like check that sep is not passed, and below for lazy backends

@raisadz (Contributor, Author) commented Aug 18, 2025

@FBruzzesi thank you for the review! I agree about the separator validation. I added some support functions that should check the passed kwargs now. In PyArrow, maybe we shouldn't hardcode all the fields:

fields = (
    "quote_char",
    "double_quote",
    "escape_char",
    "newlines_in_values",
    "ignore_empty_lines",
    "invalid_row_handler",
)

as they might go out of date if PyArrow changes them, and we just need to check delimiter? Please let me know what you think.
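The delimiter-only check could look roughly like the sketch below. `FakeParseOptions` is a stand-in used here so the example runs without pyarrow installed (the real object would be `csv.ParseOptions`), and the body is a guess at the shape of such a helper, not the PR's actual `validate_separator_pyarrow`.

```python
# Sketch of checking only `delimiter` on a user-supplied options object,
# instead of copying a hardcoded tuple of fields that could go stale.
# `FakeParseOptions` is a stand-in for pyarrow's csv.ParseOptions.

class FakeParseOptions:
    def __init__(self, delimiter=",", double_quote=True):
        self.delimiter = delimiter
        self.double_quote = double_quote

def validate_separator_pyarrow(separator, **kwargs):
    # Hypothetical helper: if no options were passed, build them from
    # `separator`; otherwise only cross-check the one field we
    # standardise on, leaving every other option untouched.
    if "parse_options" not in kwargs:
        return {"parse_options": FakeParseOptions(delimiter=separator), **kwargs}
    if kwargs["parse_options"].delimiter != separator:
        msg = "`separator` and `parse_options.delimiter` do not match."
        raise TypeError(msg)
    return kwargs
```

This way the helper never has to enumerate pyarrow's option fields, so a new field added upstream passes through untouched.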

@FBruzzesi (Member) left a comment

Thanks @raisadz - I am taking a closer look.

I am still not the biggest fan of the hassle for pyarrow users (of which, again, I don't expect there to be many).

One part of me leans towards suggesting to completely remove **kwargs and allow only explicit parameters we can have full control over. I am honestly not sure.

Comment on lines 718 to 719
validate_separator(separator, "delimiter", **kwargs)
validate_separator(separator, "delim", **kwargs)
Member commented:

Oh wow! TIL: duckdb and pyspark have two ways to pass the separator

return kwargs
from pyarrow import csv # ignore-banned-import

return {"parse_options": csv.ParseOptions(delimiter=separator)}
@FBruzzesi (Member) commented Aug 18, 2025

Never mind, I completely misread this.

Fake panic review

The issue I have with this is that if any other argument was provided in parse_options then it will be silently ignored.

Say someone is calling the following:

nw.read_csv(file, separator=",", parse_options=csv.ParseOptions(ignore_empty_lines=False)

Then at the end of validate_separator_pyarrow, we will end up with csv.ParseOptions(delimiter=separator, ignore_empty_lines=True) (i.e. the default value), silently.

I agree that hardcoding fields as suggested in #2989 (review) is not ideal, yet pyarrow does not provide much else we can use. We could dynamically look up its __dir__ or use inspect.getmembers and exclude dunder methods, but for example we would end up with validate and equals, which are not attributes to set at instantiation.
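The dynamic-lookup problem described above can be reproduced without pyarrow installed, using a stand-in class with the same shape (settable fields plus `validate`/`equals` methods). This is only an illustration of why filtering dunders from `dir()`/`inspect.getmembers` is not enough:

```python
import inspect

# `FakeParseOptions` mirrors the relevant shape of pyarrow's
# csv.ParseOptions: two settable fields plus `validate` and `equals`
# methods, which are NOT constructor arguments.

class FakeParseOptions:
    def __init__(self, delimiter=",", double_quote=True):
        self.delimiter = delimiter
        self.double_quote = double_quote

    def validate(self):
        pass

    def equals(self, other):
        return vars(self) == vars(other)

# Dynamic lookup: drop dunders, keep everything else.
names = [
    name
    for name, _ in inspect.getmembers(FakeParseOptions())
    if not name.startswith("_")
]
# `names` mixes real fields (delimiter, double_quote) with methods
# (validate, equals) that must not be passed to __init__.
```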

Unsuccessful attempts I tried:

inspect.signature

from inspect import signature
from pyarrow import csv

print(signature(csv.ParseOptions.__init__))

(self, /, *args, **kwargs)

dataclasses.fields

From the stubs I got tricked into thinking that it is a dataclass:

from dataclasses import fields
from pyarrow import csv

print(fields(csv.ParseOptions))

TypeError: must be called with a dataclass type or instance

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

@raisadz (Contributor, Author) commented:

Thanks @FBruzzesi!

I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)

A user won't need to pass both separator and delimiter: if "parse_options" is not in kwargs, then we
return {"parse_options": csv.ParseOptions(delimiter=separator)}

Member commented:

right, so someone would only need to pass both separator and delimiter if they were specifying another parse option (like double_quote)

tbh I think this is fine

@dangotbanned (Member) commented Aug 25, 2025

Since our default is:

separator: str = ","

and matches pyarrow's default:

delimiter: str = ","

Alternative

ParseOptions.delimiter has higher precedence unless separator overrides the default.

In either case - every other argument is respected


from __future__ import annotations

from pyarrow import csv
from typing import Any


def merge_options(separator: str = ",", **kwargs: Any) -> dict[str, Any]:
    DEFAULT = ","  # noqa: N806
    if separator != DEFAULT:
        if opts := kwargs.pop("parse_options", None):
            opts.delimiter = separator
        else:
            opts = csv.ParseOptions(delimiter=separator)
        kwargs["parse_options"] = opts
    return kwargs


def display_merge(result: dict[str, Any]) -> None:
    if result and (options := result.pop("parse_options", None)):
        print(f"{options.delimiter=}\n{options.double_quote=}")
        if result:
            print(f"Remaining: {result!r}")
    elif result:
        print(f"Unrelated: {result!r}")
    else:
        print(f"Empty: {result!r}")

Would this behavior not be more ideal?

# NOTE: `double_quote` default is `True`
user_options = csv.ParseOptions(delimiter="\t", double_quote=False)
>>> display_merge(merge_options(parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options(",", parse_options=user_options))
options.delimiter='\t'
options.double_quote=False
>>> display_merge(merge_options("?", parse_options=user_options))
options.delimiter='?'
options.double_quote=False
>>> display_merge(merge_options())
Empty: {}
>>> display_merge(merge_options("\t"))
options.delimiter='\t'
options.double_quote=True
>>> display_merge(
    merge_options(
        "?",
        parse_options=csv.ParseOptions(double_quote=False),
        read_options=csv.ReadOptions(),
    )
)
options.delimiter='?'
options.double_quote=False
Remaining: {'read_options': <pyarrow._csv.ReadOptions object at 0x000001F29413AD40>}

Although it is Cython, the important part is that they're all properties with setters:
https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/python/pyarrow/_csv.pyx#L435-L543

Member commented:

ParseOptions.delimiter has higher precedence unless separator overrides the default.

hmmm yes, that does sound better actually, thanks!

Member commented:

Thanks Marco

with (#2989 (comment)) in mind ...

Not sure if this is on duckdb or sqlframe, but sep has higher precedence than delim

from collections.abc import Mapping
from pathlib import Path
from typing import Any

import polars as pl
from sqlframe.duckdb import DuckDBSession


data: Mapping[str, Any] = {"a": [1, 2, 3], "b": [4.5, 6.7, 8.9], "z": ["x", "y", "w"]}
fp = Path.cwd() / "data" / "file.csv"

pl.DataFrame(data).write_csv(fp, separator="\t")

session = DuckDBSession.builder.getOrCreate()
>>> session.read.format("csv").load(str(fp), sep="\t", delim="?").collect()
[Row(a=1, b=4.5, z='x'), Row(a=2, b=6.7, z='y'), Row(a=3, b=8.9, z='w')]

Personally, I think we're best off just defining rule(s) and documenting what we do for each backend if needed.

So instead of

>>> nw.scan_csv("...", backend="sqlframe", separator=",", sep="?", delim="\t", delimiter="!")
TypeError: `separator` and `sep` do not match: `separator`=, and `sep`=?.

We either:

  • pick one and replace it - leaving everything else unchanged
  • say we'll pick ... then ... and then ...

If any backend raises on non-matching arguments - I say let them - as it saves us the hassle πŸ˜…
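The "pick one and replace it" option might be sketched as below. The helper name, the alias tuple, and the choice of `sep` as the canonical spelling are all hypothetical; the alias names are taken from the duckdb/sqlframe discussion above.

```python
# Hypothetical sketch of "pick one and replace it": drop every alias
# spelling a backend accepts, then set the single argument we
# standardise on. Everything else in kwargs is left unchanged.

SEPARATOR_ALIASES = ("sep", "delim", "delimiter")

def apply_separator(separator, **kwargs):
    for alias in SEPARATOR_ALIASES:
        kwargs.pop(alias, None)  # silently replaced, never compared
    kwargs["sep"] = separator
    return kwargs
```

Under this rule there is nothing to raise on: conflicting aliases are simply overwritten by `separator`, and any backend that still rejects a leftover argument does so itself.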

@MarcoGorelli (Member) commented Oct 17, 2025

Hi - do you still have time / interest to work on this? If so, I think Dan's suggestion in #2989 (comment) is good, and it may even simplify the implementation
If that's addressed, I think we can get it in?

@dangotbanned (Member) commented:

The main source of conflicts will be coming from this PR

100% recommend checking that out first, since the tests for the whole file have been rewritten (sorry πŸ˜…)

@MarcoGorelli (Member) commented:

Looking at this again, I think I'd misunderstood @dangotbanned 's request, I'm not sure I agree with

unless separator overrides the default

This is Dan's suggestion, which I'd find surprising:

  • nw.scan_csv(..., separator=',', parse_options=ParseOptions(delim='\t')) uses '\t' as separator
  • nw.scan_csv(..., separator='?', parse_options=ParseOptions(delim='\t')) uses '?' as separator

This would be the "native arguments always take precedence" approach, which would also be surprising:

  • nw.scan_csv(..., separator=',', parse_options=ParseOptions(double_quote=False)) uses ',' as separator
  • nw.scan_csv(..., separator='?', parse_options=ParseOptions(double_quote=False)) uses ',' as separator, because the parse options include delim=',' as the default

What this PR suggests, on the other hand, is:

  • nw.scan_csv(..., separator=',', parse_options=ParseOptions(double_quote=False)) uses ',' as separator
  • nw.scan_csv(..., separator='?', parse_options=ParseOptions(double_quote=False)) raises, and the user is forced to write nw.scan_csv(..., separator='?', parse_options=ParseOptions(double_quote=False, delim='?'))

It's true that the current PR's approach is more verbose, but I think it's also the safest and least surprising
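The two policies compared above can be put side by side in a dependency-free sketch. `Opts` stands in for pyarrow's `csv.ParseOptions`, and both helpers are illustrative restatements of the discussion, not narwhals code:

```python
# Side-by-side sketch of the two policies under discussion.
# `Opts` is a stand-in for pyarrow's csv.ParseOptions.

class Opts:
    def __init__(self, delimiter=",", double_quote=True):
        self.delimiter = delimiter
        self.double_quote = double_quote

def precedence_policy(separator=",", parse_options=None):
    # Dan's suggestion: parse_options wins unless `separator` was set
    # to something other than the shared default ",".
    opts = parse_options if parse_options is not None else Opts()
    if separator != ",":
        opts.delimiter = separator
    return opts

def strict_policy(separator=",", parse_options=None):
    # This PR: any disagreement raises; the user must be explicit.
    if parse_options is not None and parse_options.delimiter != separator:
        raise TypeError("`separator` conflicts with `parse_options`.")
    return parse_options if parse_options is not None else Opts(delimiter=separator)
```

For example, `precedence_policy(",", Opts(delimiter="\t"))` keeps `"\t"`, while `strict_policy("?", Opts(double_quote=False))` raises because the options carry the default delimiter `","`.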

Sorry for the conflicting requests here - I'd suggest that if we resolve the merge conflicts then we can move forwards

@dangotbanned
Copy link
Member

#2989 (comment)

@MarcoGorelli I'm gonna have to push back a little here.

Versus the ones I gave in (#2989 (comment)), your examples seem to be missing the fact that , is the default both for us and for pyarrow (IIRC for all the others as well?)

I think a user is more likely to encounter the error by our default getting in the way, when writing something like:

nw.scan_csv(..., parse_options=ParseOptions(another_option=..., delim='\t'))

Rather than somehow specifying this:

nw.scan_csv(..., separator=',', parse_options=ParseOptions(delim='\t'))

I also don't like that (#2989 (comment)) can break currently working code and would set a bad precedent (:wink:) for any time in the future we want to standardise on other **kwargs.
#2930 gave me the impression that this is something you may be interested in elsewhere

@MarcoGorelli
Copy link
Member

your examples seem to be missing the fact that , is the default both for us and pyarrow

Not sure what you mean here, I wrote "because the parse options include delim=',' as the default"

I also don't like that (#2989 (comment)) can break currently working code

True, but you can do a search of its usage on github: https://github.com/search?q=%22nw.scan_csv%22&type=code&p=5. There's few enough cases of nw.scan_csv and nw.read_csv that it's possible to read though all of them, and none of them include ParseOptions

our default getting in the way

I think this is OK. We can raise an informative error message for PyArrow that can make it very clear what the user is expected to do. For now this is the strictest and safest option, and I have general preference for starting strict and potentially relaxing later if necessary

If we eventually had Narwhals equivalents of all of quote_char, double_quote, newlines_in_values, ignore_empty_lines, then that would mostly obviate the need for users to pass in ParseOptions themselves, so this issue would go away.


Labels

pyspark Issue is related to pyspark backend


Development

Successfully merging this pull request may close these issues.

enh: add separator argument to read_csv / scan_csv

4 participants