test: div by zero returns nan #2636
@FBruzzesi, given that the division by zero causes the pyarrow backend to error, should I incorporate that behavior into the tests? The original description of the issue stated that all divisions by zero returned |
|
Hey @jrw34 here is how I would go about this:
|
|
@FBruzzesi, I avoided using I will also document all of the inconsistencies and create a new issue after this PR is closed. Thank you very much for walking me through all of this, it has been a great experience thus far! |
FBruzzesi
left a comment
Hey @jrw34 , sorry for the slow feedback, I have been sick in the last few days.
I left a couple of comments in the code. The new approach seems a bit too convoluted in my opinion but we are not too far from the end goal.
Regarding collect-ing, I left a suggestion for how to use assert_data_equal, which internally takes care of collecting whenever necessary
truediv_result = df["a"] / df["b"]  # truediv
floordiv_result = df["a"] // df["b"]  # floordiv
assert truediv_result[0] in div_by_zero_results or truediv_result.is_nan()[0]
assert floordiv_result[0] in div_by_zero_results or floordiv_result.is_nan()[0]
You can skip any operation and check if a branch is xfailing
assert truediv_result[0] in div_by_zero_results or truediv_result.is_nan()[0]
assert floordiv_result[0] in div_by_zero_results or floordiv_result.is_nan()[0]
What about having an expected parameter together with left and right so that you can specify the exact expected output for each pair instead of having it generic? Then you can use assert_data_equal with manually constructed dicts:
assert_data_equal({"x": truediv_result}, {"x": [expected]})
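The suggestion above (one expected value per `left`/`right` pair, compared through dicts) can be sketched without narwhals or pytest so that it runs standalone. Everything named here is an assumption made for illustration: `ieee_truediv` is a hypothetical stand-in for a backend that follows IEEE 754 float semantics, and `values_match` / `assert_columns_match` are hypothetical helpers, not the project's own test utilities. The point is only that the comparison has to treat NaN as equal to NaN, because `nan == nan` is false in Python.

```python
import math


def ieee_truediv(left, right):
    """Hypothetical stand-in for a backend: IEEE 754 style true division."""
    left, right = float(left), float(right)
    if right != 0.0:
        return left / right
    if left == 0.0 or math.isnan(left):
        return math.nan  # 0 / 0 has no defined value
    # Non-zero numerator: infinity whose sign combines numerator and divisor signs.
    return math.copysign(math.inf, left) * math.copysign(1.0, right)


def values_match(actual, expected):
    """Compare two scalars, treating NaN as equal to NaN and None as equal to None."""
    if actual is None or expected is None:
        return actual is expected
    if isinstance(actual, float) and isinstance(expected, float):
        if math.isnan(actual) or math.isnan(expected):
            return math.isnan(actual) and math.isnan(expected)
    return actual == expected


def assert_columns_match(result, expected):
    """Compare two dicts of column name -> list of values, element by element."""
    for name, expected_values in expected.items():
        actual_values = list(result[name])
        assert len(actual_values) == len(expected_values), name
        for actual, wanted in zip(actual_values, expected_values):
            assert values_match(actual, wanted), f"{name}: {actual!r} != {wanted!r}"


# Each case carries its own expected output instead of a shared set of allowed values.
CASES = [
    (-2, 0, float("-inf")),
    (0, 0, float("nan")),
    (2, 0, float("inf")),
]

for left, right, expected in CASES:
    truediv_result = [ieee_truediv(left, right)]
    assert_columns_match({"x": truediv_result}, {"x": [expected]})
```

In a pytest suite the `CASES` list would typically become the argument of `@pytest.mark.parametrize(("left", "right", "expected"), ...)`, with the loop body as the test function.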
@pytest.mark.parametrize(("left", "right"), [(-2, 0), (0, 0), (2, 0)])
def test_div_by_zero(
Same ideas applied here, and using assert_data_equal should be even simpler since you can simply pass a frame as first object and that will take care of collecting when necessary
That drastically simplified everything, thanks for all the comments. I just pushed my changes and ended up separating the tests by their respective operation.
|
Hey @jrw34 there are a few tests failing, from what I can see there are many edge cases and different behaviors, would you be interested in trying to align the behavior with the latest polars version? I know it's a shift from the original issue and PR, but I feel somehow uneasy to know that (almost) each backend has a different output |
|
@FBruzzesi, the tests were all passing on my machine before I pushed this so I thought I was in the clear. I will resolve all of the failures before the next push. For the polars behavior, I will add that in. I also share the same sense of uneasiness pertaining to the discrepancies in backend behavior. It is interesting that floor division is causing all of the errors but not true division. I am currently documenting all of the differences for each backend and will include that in a new issue. I imagine down the road it would be a nice feature to standardize the behavior of division by zero for all backends once they are ingested into Narwhals (everything either returns |
|
@FBruzzesi, I included the polars specific behavior in the floordiv tests. Should I rebase down to a single commit before this gets merged? |
|
@FBruzzesi, how should I go about running the tests for all versions and platforms? When I run the test locally it passes but fails on many of the windows tests when it is on github. The failures are coming from the polars floordiv. To provide a sanity check, I am attaching what it looks like when I run it locally below. |
|
@FBruzzesi, I will delve into running multiple versions/packages. It looks like skipping some pandas versions dwindled down the number of failures, so we are on the right track. This has been an unexpected adventure after the very welcoming issue title, but it has definitely enhanced my appreciation for floor division by zero. Also, would it be acceptable to add a conditional for I feel as though it will add even more bulk to the test but it would help rule out more of the edge cases, let me know your thoughts! |
I can 100% relate - I have been there sooo many times
Sure! I think it's fine to catch all the cases, so that we know what we need to address afterwards. Also, I forgot to address the following:
Don't worry about it, we will squash and merge as one commit at the end |
elif all(x in str(constructor_eager) for x in ["pandas", "nullable"]):
    floordiv_result = df["a"] // df["b"]
    assert_equal_data({"a": floordiv_result}, {"a": [0]})
@MarcoGorelli this might be worth reporting upstream as well. WDYT? Is this expected?
Repro:
pd.Series([-1, 0, 1]).convert_dtypes("numpy_nullable")//0
Out[4]:
0 0
1 0
2 0
dtype: Int64
@FBruzzesi upstream is aware: pandas-dev/pandas#30188 on this one. This corner of NaN producing operations on "numpy_nullable" backed values likely won't be resolved (pandas may be at yet another crossroads on this) so I think we should just xfail for this specific test.
Pinging @MarcoGorelli to verify.
@camriddell, @FBruzzesi, and @MarcoGorelli, thank you for all of the review. I am happy to change this case to xfail if needed, so just let me know.
On second thought, let's keep the current code as it captures the behaviors of each of these backends, so if they change in the future we'll be aware of this shift.
Would it be possible to also add another variant of these tests that works with floating point values as the inputs for the numerator/denominator? We should see much more consistent results returned (e.g. see [inf, NaN, -inf]) across each of the backends.
Some thoughts on the oddities you observed. Feel free to ignore, I wanted to capture this in case we ever need to revisit this decision.
It seems that there are a few camps on the set of results that one would obtain from floor dividing two integer arrays (where the denom is 0):
- return i64 of nulls (Polars): if you divide floats by 0.0, you end with values of inf, -inf, or NaN which do not exist in integer dtypes, therefore the result is (shortcutted?) i64 with Null values.
- return i64 all 0s (pandas[nullable], numpy). Tough to reason about, numpy issues a RuntimeWarning but pandas[nullable] does not.
- return f64 output (pandas). Perhaps more mathematically sound, but some may be surprised at seeing floats returned when floor-dividing two integers.
- raise (PyArrow, pandas[pyarrow]): this is an opt-in behavior and we followed pandas
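The four camps listed above can be restated as a toy model in plain Python. This is only an illustration under stated assumptions: none of these functions call Polars, NumPy, pandas, or PyArrow, the function names are hypothetical, the float camp is assumed to produce the `-inf` / `NaN` / `inf` pattern mentioned earlier in this comment, and `ZeroDivisionError` stands in for the `ArrowInvalid` error that PyArrow raises.

```python
import math


def floordiv_null(left, right):
    """'i64 of nulls' camp: integer result, null (None here) for a zero divisor."""
    return None if right == 0 else left // right


def floordiv_zero(left, right):
    """'i64 all 0s' camp: integer result, 0 for a zero divisor."""
    return 0 if right == 0 else left // right


def floordiv_float(left, right):
    """'f64 output' camp: float result, signed infinity or NaN for a zero divisor."""
    if right != 0:
        return float(left // right)
    if left == 0:
        return math.nan
    return math.copysign(math.inf, left)


def floordiv_raise(left, right):
    """'raise' camp: error on a zero divisor (ZeroDivisionError is a stand-in)."""
    if right == 0:
        raise ZeroDivisionError("divide by zero")
    return left // right


# Print what each camp yields for the numerators used in the repro above.
numerators = [-1, 0, 1]
for policy in (floordiv_null, floordiv_zero, floordiv_float):
    print(policy.__name__, [policy(n, 0) for n in numerators])

try:
    floordiv_raise(1, 0)
except ZeroDivisionError as exc:
    print("floordiv_raise", repr(exc))
```

Writing the camps side by side makes it easier to see why a single generic assertion cannot cover every backend: the results differ in dtype, in value, and in whether a value is produced at all.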
Interesting with the floating point differences, I will add that in. Given that there are now going to be 8 different test functions, should I create a test_divsion_by_zero.py file to house all these tests instead of bulking up arithmetic_tests.py even more?
|
thanks all for discussions i haven't read everything yet, just a quick comment:
happy to keep this PR focused to the |
|
@MarcoGorelli, the truediv behavior is consistent across all backends for |
for more information, see https://pre-commit.ci
FBruzzesi
left a comment
|
this is failing for pyspark, see #2753 |

What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below
>>> import narwhals as nw
>>> import pyarrow as pa
>>> data = {"a": [3], "b": [0]}
>>> test_pa_table = pa.table(data)
>>> test_df = nw.from_native(test_pa_table)
>>> test_df["a"] // test_df["b"]
returns
pyarrow.lib.ArrowInvalid: divide by zero