* docs: Update Null vs NaN handling
* changed to mkdocs admonition
* added pandas doc link, corrected tip text
* took out repetition
* docs(suggestion): Use a table with links
Was out of range on this PR, but changed in #3037
* added 2nd link and changed text
---------
Co-authored-by: Marco Gorelli <[email protected]>
Co-authored-by: dangotbanned <[email protected]>
pandas doesn't distinguish between Null and NaN values as Polars and PyArrow do.
+ ## TL;DR

- Depending on the data type of the underlying data structure, `np.nan`, `pd.NaT`, `None` and `pd.NA` all encode missing data in pandas.
+ All dataframe tools, except for those which piggy-back off of pandas, make a clear
+ distinction between NaN and null values.

- Polars and PyArrow, instead, treat `NaN` as a valid floating point value which is rare to encounter and more often produced as the result of a computation than explicitly set during data initialization; they treat `null` as the missing data indicator, regardless of the data type.
+ !!! tip
+     **We recommend only handling null values in applications and leaving NaN values as an
+     edge case resulting from users having performed undefined mathematical operations.**

- In Narwhals, then, `is_null` behaves differently across backends (and so do `drop_nulls`, `fill_null` and `null_count`):
+ ## What's the difference?
+
+ Most data tools except pandas make a clear distinction between:
+
+ - Null values, representing missing data.
+ - NaN values, resulting from "illegal" mathematical operations like `0/0`.
+
+ In Narwhals, this is reflected in separate methods for Null/NaN values:
+ | is | [`Expr.is_null`][narwhals.Expr.is_null] | [`Expr.is_nan`][narwhals.Expr.is_nan] |
+ | fill | [`Expr.fill_null`][narwhals.Expr.fill_null] | [`Expr.fill_nan`][narwhals.Expr.fill_nan] |
+ | drop | [`Expr.drop_nulls`][narwhals.Expr.drop_nulls] | *Not yet implemented (See [discussion](https://github.com/narwhals-dev/narwhals/issues/3031#issuecomment-3219910366))*<br>[`polars.Expr.drop_nans`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.drop_nans.html) |
+ In pandas, however, the concepts are muddied, as different sentinel values represent *missing* [depending on the data type](https://pandas.pydata.org/docs/user_guide/missing_data.html).
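To make the sentinel point concrete, the following sketch (assuming only that `pandas` is installed; the sample values are hypothetical) builds one Series per dtype and looks at what pandas stores in the missing slot.

```python
import pandas as pd

# Each Series holds one real value and one missing value; the sentinel
# pandas stores for the missing slot depends on the dtype.
examples = {
    "float64": pd.Series([1.0, None], dtype="float64"),
    "object": pd.Series(["a", None], dtype="object"),
    "datetime": pd.Series([pd.Timestamp("2020-01-01"), None]),
    "Int64": pd.Series([1, None], dtype="Int64"),
}

# The raw value stored in the second (missing) position of each Series.
sentinels = {name: ser.iloc[1] for name, ser in examples.items()}

for name, ser in examples.items():
    print(f"{name:>8}: sentinel={sentinels[name]!r}, isna={ser.isna().tolist()}")
```

Every Series reports its second element as missing through `isna()`, even though the stored sentinel differs: NaN for `float64`, `None` for `object`, `pd.NaT` for datetimes and `pd.NA` for the nullable `Int64` dtype.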
+
+ Check how different tools distinguish them (or don't) in the following example:
For all these reasons it bears reiterating that our recommendation is to only handle null values in applications, and leave NaN values as an edge case resulting from users having performed undefined mathematical operations.