feat: add separator argument in read_csv/scan_csv#2989
feat: add separator argument in read_csv/scan_csv#2989MarcoGorelli merged 27 commits intonarwhals-dev:mainfrom
separator argument in read_csv/scan_csv#2989Conversation
FBruzzesi
left a comment
There was a problem hiding this comment.
Thanks @raisadz - I left a comment regarding the special check for pyarrow. I am afraid that it would not fully achieve the goal.
Maybe what we could do is:
- check that separator is not None at the top of the function and raise otherwise
- then at each backend level, check if the separator was passed with it's specific backend argument name and if that's the case raise a more informative error specifying to use
separatorinstead ofsep|parse_options. - Disclaimer: I am expecting no one to use this feature so far. Yet the only problem with that is that this is actually a regression:
parse_options(together withread_options) is the only way for pyarrow to specify arguments inread_csv. Therefore we are enabling to pass separator in a standard way but basically disallowing to pass any other argument. - The long way to do this is something along the following lines I think
elif impl is Implementation.PYARROW:
if "parse_options" in kwargs:
passed_options = kwargs.pop("parse_options")
fields = (
'quote_char',
'double_quote',
'escape_char',
'newlines_in_values',
'ignore_empty_lines',
'invalid_row_handler',
)
parse_options = csv.ParseOptions(
delimiter=separator, **{field: getattr(passed_options, field) for field in fields}
)
else:
parse_options = csv.ParseOptions(delimiter=separator)
native_frame = csv.read_csv(
source, parse_options=parse_options, **kwargs
)
narwhals/functions.py
Outdated
| if separator is not None and "parse_options" in kwargs: | ||
| msg = "Can't pass both `separator` and `parse_options`." | ||
| raise TypeError(msg) | ||
| from pyarrow import csv # ignore-banned-import | ||
|
|
||
| native_frame = csv.read_csv(source, **kwargs) | ||
| native_frame = csv.read_csv( | ||
| source, parse_options=csv.ParseOptions(delimiter=separator), **kwargs | ||
| ) |
There was a problem hiding this comment.
I think this is a bit odd:
-
separatoris not typed to beNone -
Even if that was the case, the following would not error in line 607:
nw.read_csv(..., separator=None, parse_option=csv.ParseOptions(...), backend=nw.Implementation.PYARROW)
However, then in line 613, we would call
csv.read_csv( source, parse_options=csv.ParseOptions(delimiter=None), parse_options=parse_options, ... )
which would end up raising an exception at this point
-
Should we handle the same for other backends? i.e. pandas-like check that
sepis not passed, and below for lazy backends
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
|
@FBruzzesi thank you for the review! I agree about the separators validation. I added some support functions that should check the passed as they might go out of date if PyArrow changes them and we just need to check |
There was a problem hiding this comment.
Thanks @raisadz - I am taking a closer look.
I am still not the biggest fan of the hustle for pyarrow users (which again, I don't expect to be many).
One part of me leans towards suggesting to completely remove **kwargs and allow only for explicit parameters we can have full control over. I am honestly not sure
narwhals/functions.py
Outdated
| validate_separator(separator, "delimiter", **kwargs) | ||
| validate_separator(separator, "delim", **kwargs) |
There was a problem hiding this comment.
Oh wow! TIL: duckdb and pyspark have two ways to pass the separator
| return kwargs | ||
| from pyarrow import csv # ignore-banned-import | ||
|
|
||
| return {"parse_options": csv.ParseOptions(delimiter=separator)} |
There was a problem hiding this comment.
Nevermind I completely misread this
Fake panic review
The issue I have with this is that if any other argument was provided in parse_options then it will be silently ignored.
Say someone is calling the following:
nw.read_csv(file, separator=",", parse_options=csv.ParseOptions(ignore_empty_lines=False)Then at the end of validate_separator_pyarrow, we will end up with csv.ParseOptions(delimiter=separator, ignore_empty_lines=True) (i.e. the default value), silently.
I agree that hardcoding fields as suggested in #2989 (review) is not ideal, yet pyarrow does not provide much else we can use. We could dynamically lookup its __dir__ or use inspect.get_members, exclude dunder methods, but for example we would end up with validate and equals, which are not attribute to set at instantiation.
Unsuccessful tentatives I tried:
inspect.signature
from inspect import signature
from pyarrow import csv
print(signature(csv.ParseOptions.__init__))(self, /, *args, **kwargs)
dataclasses.fields
From the stubs I got tricked into thinking that is a dataclass:
from dataclasses import fields
from pyarrow import csv
print(fields(csv.ParseOptions))TypeError: must be called with a dataclass type or instance
I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)
There was a problem hiding this comment.
Thanks @FBruzzesi!
I have mixed feelings as now a pyarrow user should pass both separator=xyz, parse_options=csv.ParseOptions(delimiter=xyz)
A user won't need to pass both separator and delimiter as if "parse_options" not in kwargs: then we
return {"parse_options": csv.ParseOptions(delimiter=separator)}
There was a problem hiding this comment.
right, so someone would only need to pass both separator and delimiter if they were specifying another parse option (like double_quote)
tbh I think this is fine
There was a problem hiding this comment.
Since our default is:
separator: str = ","and matches pyarrow's default:
delimiter: str = ","Alternative
ParseOptions.delimiter has higher precedence unless separator overrides the default.
In either case - every other argument is respected
Show definitions
from __future__ import annotations
from pyarrow import csv
from typing import Any
def merge_options(separator: str = ",", **kwargs: Any) -> dict[str, Any]:
DEFAULT = "," # noqa: N806
if separator != DEFAULT:
if opts := kwargs.pop("parse_options", None):
opts.delimiter = separator
else:
opts = csv.ParseOptions(delimiter=separator)
kwargs["parse_options"] = opts
return kwargs
def display_merge(result: dict[str, Any]) -> None:
if result and (options := result.pop("parse_options", None)):
print(f"{options.delimiter=}\n{options.double_quote=}")
if result:
print(f"Remaining: {result!r}")
elif result:
print(f"Unrelated: {result!r}")
else:
print(f"Empty: {result!r}")Would this behavior not be more ideal?
# NOTE: `double_quote` default is `True`
user_options = csv.ParseOptions(delimiter="\t", double_quote=False)>>> display_merge(merge_options(parse_options=user_options))
options.delimiter='\t'
options.double_quote=False>>> display_merge(merge_options(",", parse_options=user_options))
options.delimiter='\t'
options.double_quote=False>>> display_merge(merge_options("?", parse_options=user_options))
options.delimiter='?'
options.double_quote=False>>> display_merge(merge_options())
Empty: {}>>> display_merge(merge_options("\t"))
options.delimiter='\t'
options.double_quote=True>>> display_merge(
merge_options(
"?",
parse_options=csv.ParseOptions(double_quote=False),
read_options=csv.ReadOptions(),
)
)
options.delimiter='?'
options.double_quote=False
Remaining: {'read_options': <pyarrow._csv.ReadOptions object at 0x000001F29413AD40>}Although it is cython, the important part is they're all properties with setters
https://github.com/apache/arrow/blob/f8b20f131a072ef423e81b8a676f42a82255f4ec/python/pyarrow/_csv.pyx#L435-L543
There was a problem hiding this comment.
ParseOptions.delimiter has higher precedence unless separator overrides the default.
hmmm yes, that does sound better actually, thanks!
There was a problem hiding this comment.
Thanks Marco
with (#2989 (comment)) in mind ...
Not sure if this is on duckdb or sqlframe, but sep has higher precedence than delim
from sqlframe.duckdb import DuckDBSession
import polars as pl
from pathlib import Path
data: Mapping[str, Any] = {"a": [1, 2, 3], "b": [4.5, 6.7, 8.9], "z": ["x", "y", "w"]}
fp = Path.cwd() / "data" / "file.csv"
pl.DataFrame(data).write_csv(fp, separator="\t")
session = DuckDBSession.builder.getOrCreate()
>>> session.read.format("csv").load(str(fp), sep="\t", delim="?").collect()
[Row(a=1, b=4.5, z='x'), Row(a=2, b=6.7, z='y'), Row(a=3, b=8.9, z='w')]Personally, I think we're best off just defining rule(s) and documenting what we do for each backend if needed.
So instead of
>>> nw.scan_csv("...", backend="sqlframe", separator=",", sep="?", delim="\t", delimiter="!")
TypeError: `separator` and `sep` do not match: `separator`=, and `sep`=?.We either:
- pick one and replace it - leaving everything else unchanged
- say we'll pick ... then ... and then ...
If any backend raises on non-matching arguments - I say let them - as it saves us the hassle 😅
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
…t/add-separator-arg
|
Hi - do you still have time / interest to work on this? If so, I think Dan's suggestion in #2989 (comment) is good, and it may even simplify the implementation |
|
The main source of conflicts will be coming from this PR 100% recommend checking that out first, since the tests for the whole file have been rewritten (sorry 😅) |
|
Looking at this again, I think I'd misunderstood @dangotbanned 's request, I'm not sure I agree with
This is Dan's suggestion, which I'd find surprising:
This would the "native arguments always take precedence" approach, which would also be surprising:
What this PR suggests, on the other hand, is:
It's true that the current PR's approach is more verbose, but I think it's also the safest and least surprising Sorry for the conflicting requests here - I'd suggest that if we resolve the merge conflicts then we can move forwards |
|
@MarcoGorelli I'm gonna have to push back a little here. Vs some I gave in (#2989 (comment)), your examples seem to be missing the fact that I think a user is more likely to encounter the error by our default getting in the way, when writing something like: nw.scan_csv(..., parse_options=ParseOptions(another_option=..., delim='\t'))Rather than somehow specifying this: nw.scan_csv(..., separator=',', parse_options=ParseOptions(delim='\t'))I also don't like that (#2989 (comment)) can break currently working code and would set a bad precedent (:wink:) for any time in the future we want to standardise on other |
Not sure what you mean here, I wrote "because the parse options include delim=',' as the default"
True, but you can do a search of its usage on github: https://github.com/search?q=%22nw.scan_csv%22&type=code&p=5. There's few enough cases of
I think this is OK. We can raise an informative error message for PyArrow that can make it very clear what the user is expected to do. For now this is the strictest and safest option, and I have general preference for starting strict and potentially relaxing later if necessary If we eventually had Narwhals equivalents of all of |
|
thinking about this again, i'm not wild on precedence being value-dependent (i.e. based on whether some default is matched or not) i'm fairly confident that this won't break anyone's code, and that it offers the least surprising behaviour |
narwhals/functions.py
Outdated
| _validate_separator(separator, "delimiter", **kwargs) | ||
| _validate_separator(separator, "delim", **kwargs) |
There was a problem hiding this comment.
in light on the discussion, we also need to check against 'sep' here
maybe _validate_separator should take a tuple of native delimiters?
| getattr(nw, scan_method)("unused.csv", backend=backend) | ||
|
|
||
|
|
||
| def test_read_csv_raise_sep_multiple_lazy(csv_path: FileSource) -> None: |
There was a problem hiding this comment.
to cut down on test times, shall we just use a single csv_path here rather than looping over csv_path? csv_path is already checked in other tests anyway
|
Thanks @MarcoGorelli ! I addressed your comments |
MarcoGorelli
left a comment
There was a problem hiding this comment.
thanks @raisadz , and all for comments!
|
thanks all for comments, much appreciated. at this point i'm still inclined enough to go along with it, i think the risk is minimal and that this is the easiest-to-teach behaviour |

Closes #2930
What type of PR is this? (check all applicable)
Related issues
separatorargument toread_csv/scan_csv#2930Checklist
If you have comments or can explain your changes, please do so below