-
Notifications
You must be signed in to change notification settings - Fork 170
fix: Always accept {Expr,Series}.is_in(other: Iterable)
#3207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 7 commits
234a091
26ae2eb
ce8e591
98cc8bd
2cd832c
38d6d5d
633a06d
0c66432
a6921ea
5409c21
b08ffe0
ffc848c
e318bdf
e9eb08e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -9,8 +9,14 @@ | |||
| ExprMetadata, | ||||
| apply_n_ary_operation, | ||||
| combine_metadata, | ||||
| is_series, | ||||
| ) | ||||
| from narwhals._utils import ( | ||||
| _validate_rolling_arguments, | ||||
| ensure_type, | ||||
| flatten, | ||||
| iterable_to_sequence, | ||||
| ) | ||||
| from narwhals._utils import _validate_rolling_arguments, ensure_type, flatten | ||||
| from narwhals.dtypes import _validate_dtype | ||||
| from narwhals.exceptions import ComputeError, InvalidOperationError | ||||
| from narwhals.expr_cat import ExprCatNamespace | ||||
|
|
@@ -19,7 +25,6 @@ | |||
| from narwhals.expr_name import ExprNameNamespace | ||||
| from narwhals.expr_str import ExprStringNamespace | ||||
| from narwhals.expr_struct import ExprStructNamespace | ||||
| from narwhals.translate import to_native | ||||
|
|
||||
| if TYPE_CHECKING: | ||||
| from typing import NoReturn, TypeVar | ||||
|
|
@@ -991,10 +996,9 @@ def is_in(self, other: Any) -> Self: | |||
| ββββββββββββββββββββ | ||||
| """ | ||||
| if isinstance(other, Iterable) and not isinstance(other, (str, bytes)): | ||||
| other = other.to_native() if is_series(other) else iterable_to_sequence(other) | ||||
| return self._with_elementwise( | ||||
| lambda plx: self._to_compliant_expr(plx).is_in( | ||||
| to_native(other, pass_through=True) | ||||
| ) | ||||
| lambda plx: self._to_compliant_expr(plx).is_in(other) | ||||
|
||||
| def is_in(self, other: Any) -> Self: ... |
I don't understand why we were allowing any kind of Native* to be passed to every backend? π€
Gonna add another test for at least the more common case of a Series from the wrong backend
Updated
test: Add test_expr_is_in_series_wrong_backend
I'm not sure of a safe way to keep the same behavior (if it is desired).
Since we don't know the backend at this stage, the options I see are:
- Unconditionally convert to
list | tuple - Raise elsewhere when a
Seriesis passed to a lazy backend - Disallow
Expr.is_in(other: Series) - Do nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli @FBruzzesi
do you guys have any preference on a path forward here?
This case is a bit different to some of the other places we'd reject nw.Series for lazy backends - since the length isn't an issue.
But the safer option of converting all nw.Series is gonna be less performant than the currently unsafe version (which only works on a matching eager implementation)
Everything seems like a tradeoff to me π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved out of draft for visibility (#3207 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @dangotbanned I will try to take a look during the weekend, but I am not sure I can manage.
My thoughts from the context in the messages here (I didn't open the code changes just yet):
- Realistically I don't think it's too common to pass one series from a different backend but we should not exclude such possibility
- My opinion would be to:
- Use the native series if
isinstance(other, Series) and expr._implementation == other._implementation) - Otherwise convert it to a list (we can do that with native methods thankfully), but warn that such conversion is happening with a
UserWarning. This case includes a series passed to a lazy backend
- Use the native series if
Update: Regarding native series, then I would prefer to raise an exception in such case
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,18 +2,30 @@ | |
|
|
||
| import os | ||
| import uuid | ||
| from collections import deque | ||
| from copy import deepcopy | ||
| from functools import lru_cache | ||
| from importlib.util import find_spec | ||
| from typing import TYPE_CHECKING, Any, Callable, cast | ||
|
|
||
| import pytest | ||
|
|
||
| from narwhals._utils import Implementation, generate_temporary_column_name | ||
| from narwhals._utils import ( | ||
| Implementation, | ||
| generate_temporary_column_name, | ||
| qualified_type_name, | ||
| ) | ||
| from tests.utils import ID_PANDAS_LIKE, PANDAS_VERSION, pyspark_session, sqlframe_session | ||
|
|
||
| if TYPE_CHECKING: | ||
| from collections.abc import Sequence | ||
| from collections.abc import ( | ||
| Generator, | ||
| Iterable, | ||
| Iterator, | ||
| KeysView, | ||
| Sequence, | ||
| ValuesView, | ||
| ) | ||
|
|
||
| import duckdb | ||
| import ibis | ||
|
|
@@ -27,7 +39,7 @@ | |
| from narwhals._spark_like.dataframe import SQLFrameDataFrame | ||
| from narwhals._typing import EagerAllowed | ||
| from narwhals.typing import NativeDataFrame, NativeLazyFrame | ||
| from tests.utils import Constructor, ConstructorEager, ConstructorLazy | ||
| from tests.utils import Constructor, ConstructorEager, ConstructorLazy, IntoIterable | ||
|
|
||
| Data: TypeAlias = "dict[str, list[Any]]" | ||
|
|
||
|
|
@@ -317,7 +329,109 @@ def eager_backend(request: pytest.FixtureRequest) -> EagerAllowed: | |
| return request.param # type: ignore[no-any-return] | ||
|
|
||
|
|
||
| @pytest.fixture(params=[el for el in TEST_EAGER_BACKENDS if not isinstance(el, str)]) | ||
| @pytest.fixture( | ||
| params=[el for el in TEST_EAGER_BACKENDS if not isinstance(el, str)], scope="session" | ||
| ) | ||
| def eager_implementation(request: pytest.FixtureRequest) -> EagerAllowed: | ||
| """Use if a test is heavily parametric, skips `str` backend.""" | ||
| return request.param # type: ignore[no-any-return] | ||
|
|
||
|
|
||
| class UserDefinedIterable: | ||
| def __init__(self, iterable: Iterable[Any]) -> None: | ||
| self.iterable: Iterable[Any] = iterable | ||
|
|
||
| def __iter__(self) -> Iterator[Any]: | ||
| yield from self.iterable | ||
|
|
||
|
|
||
| def generator_function(iterable: Iterable[Any]) -> Generator[Any, Any, None]: | ||
| yield from iterable | ||
|
|
||
|
|
||
| def generator_expression(iterable: Iterable[Any]) -> Generator[Any, None, None]: | ||
| return (element for element in iterable) | ||
|
|
||
|
|
||
| def dict_keys(iterable: Iterable[Any]) -> KeysView[Any]: | ||
| return dict.fromkeys(iterable).keys() | ||
|
|
||
|
|
||
| def dict_values(iterable: Iterable[Any]) -> ValuesView[Any]: | ||
| return dict(enumerate(iterable)).values() | ||
|
|
||
|
|
||
| def chunked_array(iterable: Any) -> Iterable[Any]: | ||
| import pyarrow as pa | ||
|
|
||
| return pa.chunked_array([iterable]) | ||
|
|
||
|
|
||
| def _ids_into_iter(obj: Any) -> str: | ||
| module: str = "" | ||
| if (obj_module := obj.__module__) and obj_module != __name__: | ||
| module = obj.__module__ | ||
| name = qualified_type_name(obj) | ||
| if name in {"function", "builtin_function_or_method"} or "_cython" in name: | ||
| return f"{module}.{obj.__qualname__}" if module else obj.__qualname__ | ||
| return name.removeprefix(__name__).strip(".") | ||
|
|
||
|
|
||
| def _build_into_iter() -> Iterator[IntoIterable]: # pragma: no cover | ||
| yield from ( | ||
| # 1-4: should cover `Iterable`, `Sequence`, `Iterator` | ||
| list, | ||
| tuple, | ||
| iter, | ||
| deque, | ||
| # 5-6: cover `Generator` | ||
| generator_function, | ||
| generator_expression, | ||
| # 7-8: `Iterable`, but quite commonly cause issues upstream as they are `Sized` but not `Sequence` | ||
| dict_keys, | ||
| dict_values, | ||
| # 9: duck typing | ||
| UserDefinedIterable, | ||
| ) | ||
| # 10: 1D numpy | ||
| if find_spec("numpy"): | ||
| import numpy as np | ||
|
|
||
| yield np.array | ||
| # 11-13: 1D pandas | ||
| if find_spec("pandas"): | ||
| import pandas as pd | ||
|
|
||
| yield from (pd.Index, pd.array, pd.Series) | ||
| # 14: 1D polars | ||
| if find_spec("polars"): | ||
| import polars as pl | ||
|
|
||
| yield pl.Series | ||
| # 15-16: 1D pyarrow | ||
| if find_spec("pyarrow"): | ||
| import pyarrow as pa | ||
|
|
||
| yield from (pa.array, chunked_array) | ||
|
|
||
|
|
||
| def _into_iter_selector() -> Callable[[int], Iterator[IntoIterable]]: | ||
| callables = tuple(_build_into_iter()) | ||
|
|
||
| def pick(n: int, /) -> Iterator[IntoIterable]: | ||
| yield from callables[:n] | ||
|
|
||
| return pick | ||
|
|
||
|
|
||
| _into_iter: Callable[[int], Iterator[IntoIterable]] = _into_iter_selector() | ||
| """`into_iter` fixtures use the suffix `_<n>` to denote the maximum number of constructors. | ||
|
|
||
| Anything greater than **10** may return less depending on available dependencies. | ||
| """ | ||
|
|
||
|
|
||
| @pytest.fixture(params=_into_iter(16), scope="session", ids=_ids_into_iter) | ||
| def into_iter_16(request: pytest.FixtureRequest) -> IntoIterable: | ||
| function: IntoIterable = request.param | ||
| return function | ||
|
Comment on lines
+482
to
+492
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Kinda outdated now since (633a06d) |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I hate this as well π
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a better idea would be to ...
Rename
_CanTo_List->ToList, and move to_translate.pyalongside:narwhals/narwhals/_translate.py
Lines 124 to 125 in f05011c
Move these as well, but rename to reflect they their naming originates from
numpyandpyarrow(respectively):narwhals/narwhals/_utils.py
Lines 2131 to 2132 in 0c66432
narwhals/narwhals/_utils.py
Lines 2135 to 2136 in 0c66432
The names of the guards can still stay the same, since their implementations will (after updating the protocol names) the link between origin, protocol, method name:
narwhals/narwhals/_utils.py
Lines 2139 to 2144 in 0c66432
narwhals/narwhals/_utils.py
Lines 2147 to 2148 in 0c66432
narwhals/narwhals/_utils.py
Lines 2151 to 2154 in 0c66432
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated my comment (#3207 (comment)) - I would argue that native series are not ok, while numpy 1d arrays are.
My thought process for this is that if someone is doing something along the lines of:
then the function is clearly not agnostic and polars would be required in this case.
A different case would be if a narwhals series with a different backend is provided. This could mean that the function is agnostic but a user is "mixin" backends:
This is the case I suggested to yell at the user with a warning.
From our side, I think it would greatly simplify (read as, get rid of) most of the protocols here, the type guards as well as
iterable_to_sequencefunction.