Commit 449cba2

ym-pett, MarcoGorelli, and dangotbanned authored
docs: Nan non compliance (#3093)
* docs: Update Null vs NaN handling
* changed to mkdocs admonition
* added pandas doc link, corrected tip text
* took out repetition
* docs(suggestion): Use a table with links. Was out of range on this PR, but changed in #3037
* added 2nd link and changed text

---------

Co-authored-by: Marco Gorelli <[email protected]>
Co-authored-by: dangotbanned <[email protected]>
1 parent f47abd5 commit 449cba2

File tree

1 file changed: +71 -40 lines changed

docs/concepts/null_handling.md

Lines changed: 71 additions & 40 deletions
````diff
@@ -1,23 +1,51 @@
 # Null/NaN handling
 
-pandas doesn't distinguish between Null and NaN values as Polars and PyArrow do.
+## TL;DR
 
-Depending on the data type of the underlying data structure, `np.nan`, `pd.NaT`, `None` and `pd.NA` all encode missing data in pandas.
+All dataframe tools, except for those which piggy-back off of pandas, make a clear
+distinction between NaN and null values.
 
-Polars and PyArrow, instead, treat `NaN` as a valid floating point value which is rare to encounter and more often produced as the result of a computation than explicitly set during data initialization; they treat `null` as the missing data indicator, regardless of the data type.
+!!! tip
+    **We recommend only handling null values in applications and leaving NaN values as an
+    edge case resulting from users having performed undefined mathematical operations.**
 
-In Narwhals, then, `is_null` behaves differently across backends (and so do `drop_nulls`, `fill_null` and `null_count`):
+## What's the difference?
+
+Most data tools except pandas make a clear distinction between:
+
+- Null values, representing missing data.
+- NaN values, resulting from "illegal" mathematical operations like `0/0`.
+
+In Narwhals, this is reflected in separate methods for Null/NaN values:
+
+| Operation | Null                                          | NaN                                       |
+| --------- | --------------------------------------------- | ----------------------------------------- |
+| is        | [`Expr.is_null`][narwhals.Expr.is_null]       | [`Expr.is_nan`][narwhals.Expr.is_nan]     |
+| fill      | [`Expr.fill_null`][narwhals.Expr.fill_null]   | [`Expr.fill_nan`][narwhals.Expr.fill_nan] |
+| drop      | [`Expr.drop_nulls`][narwhals.Expr.drop_nulls] | *Not yet implemented (See [discussion](https://github.com/narwhals-dev/narwhals/issues/3031#issuecomment-3219910366))*<br>[`polars.Expr.drop_nans`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.drop_nans.html) |
+| count     | [`Expr.null_count`][narwhals.Expr.null_count] | *No upstream equivalent*                  |
+
````
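As a quick illustration of the methods in the new table, here is a minimal sketch, assuming a Polars-backed frame (the column names are made up for the example). Note that in Polars, `is_nan` on a null entry propagates null rather than returning `False`:

```python
import narwhals as nw
import polars as pl

# One value of each kind: a regular float, a NaN, and a null.
df = nw.from_native(pl.DataFrame({"a": [1.0, float("nan"), None]}))

print(
    df.with_columns(
        a_is_null=nw.col("a").is_null(),  # True only for the null entry
        a_is_nan=nw.col("a").is_nan(),  # True for the NaN; the null propagates
        a_filled=nw.col("a").fill_null(0.0),  # fills the null, leaves the NaN intact
    ).to_native()
)
print(df.select(nw.col("a").null_count()).to_native())  # counts the null only, not the NaN
```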
````diff
+In pandas, however, the concepts are muddied, as different sentinel values represent *missing* [depending on the data type](https://pandas.pydata.org/docs/user_guide/missing_data.html).
+
````
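To make that muddying concrete, a small sketch of the classic NumPy-backed dtypes, where `None` is coerced to `NaN` for floats and to `NaT` for datetimes:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, None, np.nan])  # float64: the None is stored as NaN
print(s.isna().tolist())  # [False, True, True]; missing and NaN are indistinguishable

t = pd.Series([pd.Timestamp("2020-01-01"), None])  # datetime64[ns]: missing is NaT
print(t.isna().tolist())  # [False, True]
```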
````diff
+Check how different tools distinguish them (or don't) in the following example:
 
 ```python exec="1" source="above" session="null_handling"
 import narwhals as nw
 import numpy as np
 from narwhals.typing import IntoFrameT
 
-data = {"a": [1.4, float("nan"), np.nan, 4.2, None]}
+data = {"a": [1.0, 0.0, None]}
 
 
 def check_null_behavior(df: IntoFrameT) -> IntoFrameT:
-    return nw.from_native(df).with_columns(a_is_null=nw.col("a").is_null()).to_native()
+    return (
+        nw.from_native(df)
+        .with_columns(a=nw.col("a") / nw.col("a"))
+        .with_columns(
+            a_is_null=nw.col("a").is_null(),
+            a_is_nan=nw.col("a").is_nan(),
+        )
+    ).to_native()
 ```
 
 === "pandas"
````
````diff
@@ -28,6 +56,14 @@ def check_null_behavior(df: IntoFrameT) -> IntoFrameT:
     print(check_null_behavior(df))
     ```
 
+=== "pandas (pyarrow-backed)"
+    ```python exec="true" source="material-block" result="python" session="null_handling"
+    import pandas as pd
+
+    df = pd.DataFrame(data).convert_dtypes(dtype_backend="pyarrow")
+    print(check_null_behavior(df))
+    ```
+
 === "Polars (eager)"
     ```python exec="true" source="material-block" result="python" session="null_handling"
     import polars as pl
````
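For context on the new tab: `convert_dtypes(dtype_backend="pyarrow")` swaps the classic NumPy dtypes for Arrow-backed ones that carry a real null. A small sketch (the dtype names assume pandas >= 2.0):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 0.0, None]})
print(df.dtypes)  # a: float64 (NumPy-backed; the None was coerced to NaN)
print(df.convert_dtypes(dtype_backend="pyarrow").dtypes)  # a: double[pyarrow]
```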
````diff
@@ -44,47 +80,42 @@ def check_null_behavior(df: IntoFrameT) -> IntoFrameT:
     print(check_null_behavior(df))
     ```
 
-Conversely, `is_nan` is consistent across backends. This consistency comes from Narwhals exploiting its native implementations
-in Polars and PyArrow, while ensuring that pandas only identifies the floating-point NaN values and not those encoding the missing value indicator.
+Notice how the classic pandas dtypes make no distinction between the concepts, whereas the other
+libraries do. Note, however, that discussion on what PyArrow-backed pandas dataframes should do
+[is ongoing](https://github.com/pandas-dev/pandas/issues/32265).
 
-```python exec="1" source="above" session="null_handling"
-import narwhals as nw
-from narwhals.typing import IntoFrameT
+## NaN comparisons
 
-data = {"a": [0.0, None, 2.0]}
+According to the IEEE-754 standard, NaN should compare as not equal to itself, and cannot
+be compared with other floating-point numbers. Python and PyArrow follow these rules:
 
+```python exec="1" source="above" session="nan-comparisons" result="python"
+import pyarrow as pa
+import pyarrow.compute as pc
 
-def check_nan_behavior(df: IntoFrameT) -> IntoFrameT:
-    return (
-        nw.from_native(df)
-        .with_columns(
-            a_div_a=(nw.col("a") / nw.col("a")),
-            a_div_a_is_nan=(nw.col("a") / nw.col("a")).is_nan(),
-        )
-        .to_native()
-    )
+print("Python result:")
+print(float("nan") == float("nan"), 0.0 == 0.0)
+print()
+print("PyArrow result:")
+arr = pa.array([float("nan"), 0.0])
+print(pc.equal(arr, arr))
 ```
 
-=== "pandas"
-    ```python exec="true" source="material-block" result="python" session="null_handling"
-    import pandas as pd
+Polars and DuckDB, however, don't follow this rule and treat NaN as equal to itself.
 
-    df = pd.DataFrame(data).astype({"a": "Float64"})
-    print(check_nan_behavior(df))
-    ```
+```python exec="1" source="above" session="nan-comparisons" result="python"
+import polars as pl
+import duckdb
 
-=== "Polars (eager)"
-    ```python exec="true" source="material-block" result="python" session="null_handling"
-    import polars as pl
+print("Polars result")
+df = pl.DataFrame({"a": [float("nan"), 0.0]})
+print(df.with_columns(a_equals_a=pl.col("a") == pl.col("a")))
+print()
+print("DuckDB result")
+print(duckdb.sql("from df select a, a == a as a_equals_a"))
+```
 
-df = pl.DataFrame(data)
-print(check_nan_behavior(df))
-```
+Furthermore, Polars [excludes NaN values in `max`](https://github.com/pola-rs/polars/issues/23635)
+whereas DuckDB treats them as larger than any other floating-point value.
 
````
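A sketch of that aggregation divergence; exact outputs depend on your Polars and DuckDB versions, and the linked Polars issue tracks the `max` question:

```python
import duckdb
import polars as pl

s = pl.Series("a", [1.0, float("nan")])
print(s.max())  # Polars leaves the NaN out of max, per the linked issue

rel = duckdb.sql("select max(a) as m from (values (1.0), ('nan'::double)) t(a)")
print(rel)  # DuckDB sorts NaN above every other float, so max is NaN
```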

````diff
-=== "PyArrow"
-    ```python exec="true" source="material-block" result="python" session="null_handling"
-    import pyarrow as pa
-
-    df = pa.table(data)
-    print(check_nan_behavior(df))
-    ```
+For all these reasons, it bears reiterating that our recommendation is to only handle null values in applications, and leave NaN values as an edge case resulting from users having performed undefined mathematical operations.
````
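In practice, one way to follow that recommendation is sketched below with the fill methods from the table above; this assumes `fill_nan(None)` mirrors the Polars behavior of turning NaN into null:

```python
import narwhals as nw
import polars as pl

df = nw.from_native(pl.DataFrame({"a": [1.0, float("nan"), None]}))

# Normalize NaN to null at the boundary, then treat everything as missing data.
cleaned = df.with_columns(nw.col("a").fill_nan(None).fill_null(0.0))
print(cleaned.to_native())
```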
