API: mode.nan_is_na to consistently distinguish NaN-vs-NA #62040

jbrockmendel · 2025-08-04T15:42:47Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

As discussed on the last dev call, this implements "mode.nan_is_na" (default True) to consider NaN as either always-equivalent or never-equivalent to NA.
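
A minimal sketch of what the two settings mean for a missing-value check, written in plain Python rather than pandas. `NA` and `is_missing` are hypothetical stand-ins for illustration only and are not how pandas implements the option.

```python
import math


class _NAType:
    """Hypothetical stand-in for pd.NA (not the pandas implementation)."""

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def is_missing(value, nan_is_na=True):
    """Return True if `value` counts as missing under the given setting."""
    if value is NA:
        return True
    is_nan = isinstance(value, float) and math.isnan(value)
    # nan_is_na=True: NaN is always treated as equivalent to NA.
    # nan_is_na=False: NaN is an ordinary float value, never NA.
    return is_nan and nan_is_na


values = [1.0, NA, float("nan")]
print([is_missing(v, nan_is_na=True) for v in values])   # [False, True, True]
print([is_missing(v, nan_is_na=False) for v in values])  # [False, True, False]
```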

This sits on top of

TST: nan->NA in non-construction tests #62021, which trims the diff here by updating some tests to use NA instead of NaN.
API: consistent NaN treatment for pyarrow dtypes #61732 which implements the option but only for pyarrow dtypes.
API: improve dtype in df.where with EA other #62038 which addresses an issue in DataFrame.where
BUG: read_csv with engine=pyarrow and numpy-nullable dtype #62053 which addresses a kludge in read_csv with engine="pyarrow"

Still need to

Add docs for the new option, including whatsnew section
deal with a kludge in algorithms.rank; fixed by API: rank with nullable dtypes preserve NA #62043
deal with a kludge in read_csv with engine="pyarrow"; fixed by BUG: read_csv with engine=pyarrow and numpy-nullable dtype #62053
Add tests for the issues this addresses

mroeschke · 2025-09-10T19:02:36Z

pandas/core/config_init.py


+    cf.register_option(
+        "nan_is_na",
+        os.environ.get("PANDAS_NAN_IS_NA", "1") == "1",


Curious, I thought you were not fond of the environment variable pattern?

that does sound like the kind of opinion i would have, but ATM i don't find myself bothered by it
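
For readers unfamiliar with the pattern under review, here is a small sketch of how the default in the diff excerpt is derived from the environment. The helper name `default_nan_is_na` is hypothetical. Only the `os.environ.get(...) == "1"` expression comes from the diff, and the remaining arguments to `cf.register_option` are not shown there.

```python
import os


def default_nan_is_na(environ=None):
    """Evaluate the same expression as the diff against a mapping.

    Only an unset variable or the exact string "1" yields True.
    """
    env = os.environ if environ is None else environ
    return env.get("PANDAS_NAN_IS_NA", "1") == "1"


print(default_nan_is_na({}))                            # True (unset)
print(default_nan_is_na({"PANDAS_NAN_IS_NA": "1"}))     # True
print(default_nan_is_na({"PANDAS_NAN_IS_NA": "0"}))     # False
print(default_nan_is_na({"PANDAS_NAN_IS_NA": "true"}))  # False
```

Note that with this comparison any value other than "1", including "true", selects the distinguishing behaviour.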

jbrockmendel · 2025-09-26T00:24:01Z

Gentle ping

mroeschke · 2025-09-26T02:53:19Z

Thanks @jbrockmendel

jorisvandenbossche · 2025-10-07T18:23:01Z

Sorry for the late timing of this comment (I wasn't aware that this PR was planning to be merged), but: personally I don't think this PR should have been merged without more discussion.

In general we should strive for transparent and public discussions as much as possible, but especially for a significant change like this.

The only reference to any discussion was:

Discussed in the dev call before last where I, @mroeschke, and @Dr-Irv were +1. Joris was unenthused but "not necessarily opposed". On slack @rhshadrach expressed a +1. All those opinions were to the concept, not the execution.

(and I wasn't pinged so I wasn't aware that I was being quoted ...)
While the dev meetings are very useful, we should try to ensure discussions/decisions are still documented on github (while here we only have that mention above and some brief notes from the meeting (included them below), and even the people that gave a +1 privately didn't participate in the discussion on github).
But in this case, as far as I can see, there is just no public discussion whatsoever, not even an explanation of why this change is made. Yes, we already have had a lot of discussions around NA vs NaN (eg #32265), but I don't think this particular solution has come up in those. And those PRs were mentioned in the broad pyarrow dtypes discussion (#61618), but again not exactly discussed specifically. So why this solution and not another one? Why including a breaking change? Do we intend to keep the flag long term? Which of the two modes do we want in the future?

When going forward with an impactful global option and a breaking change, we should at least consider and try to answer those questions.

And while at the dev meeting I said to not necessarily be opposed to a global option, I don't think this version of a global option is useful (and by adding it now, it prevents us from using it for a (IMO) better global option in the future) and doesn't necessarily warrant a breaking change.

So much for a comment on the process; I will do a follow-up post below detailing why I personally think the change in this PR is not the best solution (or at least requires a discussion of those concerns before moving forward).

For completeness, the terse notes from the mentioned dev meeting (click to expand):

Brock’s proposal is to do both behind a global flag
Joris says the inconsistent behavior is intentional. Not necessarily opposed to a global option. “I think in practice it's not going to be that useful”
Irv likes long-term plan to do global flag, start with not-distinguish for now, eventually move to always-distinguish
Matt “leans to Irv and Brock’s camp”
Probably need a pd.isNaN() method to test for NaN
Need methods for converting from NaN to NA and NA to NaN
Issue raised on data integrity with numpy operation where pd.NA and np.NaN specifiers would be lost due to numpy only using np.NaN.

jbrockmendel · 2025-10-07T18:37:09Z

(and I wasn't pinged so I wasn't aware that I was being quoted ...)

That one is on me, my bad.

jorisvandenbossche · 2025-10-07T18:40:10Z

First, I definitely understand people want to be able to move forward on the NaN vs NA topic, as it is something that has been discussed for a long time without much movement (#32265) and has a big impact on various related issues (i.e. the many interlinked open issues about users seeing unexpected behaviour / our API not being sufficient for distinguishing them).

But adding a global mode supporting both behaviours is a significant change (although the code change to support both is not that huge).

I think my main concerns are:

When adding this option, we should ideally already have some clear idea about the longer term intentions with the option: is it meant to be here to stay? (i.e. keep supporting both) And which behaviour do we want to be the default in the future?
While at some point a global option can make sense, I am not convinced that the current scope of the global option (i.e. what it controls) is what we want it to do. Personally I don't think it should control both behaviour while coercing to the dtype (e.g. constructor) and operations of the dtype (e.g. division by 0). And I think it should also not influence conversion to numpy.
As a consequence of the above two (and if we want the distinguishing behaviour as the default in the future), I am not convinced the breaking change aspect of this PR is worth the change for users.

In more detail:

On the first item, I personally think we should (eventually) go with the simpler model of a single behaviour within pandas, and not indefinitely support both (this aspect has seen some discussion in the pyarrow dtypes issue #61618). And if we choose one, I would personally go for distinguishing NaN and NA (the long discussion in #32265 ..).

Those are questions we should discuss, but if we go in that direction, I don't think it is for example useful to introduce a breaking change right now for the people already using the nullable Float64 or pyarrow dtype. Right now (during operations) it already distinguishes, now after this PR with 3.0 it will start to not distinguish (turn NaNs from operations into NA), and then later we would switch back to distinguishing.
If we want the distinguishing behaviour to be the default in the future, this feels like an unnecessary churn. In that sense I think we should have some idea on what behaviour we want in the future before making this change. And if we want the distinguishing behaviour in the future, I would just not make that breaking change right now (and I would rather focus on adding APIs to check for NaNs and change NaNs to something else).

Again if we go in the direction of distinguishing at some point in the future, I certainly think it makes sense to have some nan_is_na option that can be disabled (I would also do it at the function/method level, but then it probably makes sense to have a global override as well). But if the idea is to migrate from current numpy-not-distinguishing to nullable-distinguishing world, in the future such option would (in my mind) only control how to treat NaN in conversions (constructors/setitem, when receiving scalars or numpy data and coercing to the nullable dtype), and not in operations (as at that point we are already "inside" pandas and consistently distinguish).
And personally I think having the global option only control that part, i.e. the conversion, is going to be essential for any migration. People constantly use NaN in input to mean a missing value (whether it is by typing it explicitly, or by getting numpy data through some method and passing that to pandas). But the switch in general from a (non-distinguishing) numpy to (distinguishing) nullable dtype would probably be controlled by a separate option.

(this also ties into the main disagreement about the current state of Float64Dtype (before this PR), as I understand it: while Brock and others say it has inconsistent behaviour, I say that this inconsistency is by design: once the data is in an array of the dtype, it distinguishes NaN/NA, but for coercing to the dtype it treats NaN as missing for backwards compatibility. There are definitely a whole bunch of issues with that current state, but IMO that is in large part because of missing coverage in our APIs to deal with NaNs as not a NA, i.e. in the distinguishing world, you currently cannot check for NaNs, you cannot easily set NaNs (setitem, replace) without having them turn into NA, etc. Those issues are now "solved" for the default NaN-is-NA behaviour, but if you set the option to False, you still run into all those issues)

Some additional points:

The nan_is_na option introduced here is also impacting conversion to numpy (giving object dtype with pd.NA). I am not sure this is the desired behaviour (just as a comparison, also polars or pyarrow will convert a floating array-like with nulls to numpy float with NaNs). Any library that currently converts (implicitly or explicitly) pandas data to numpy using numeric dtypes will break because of this (typical example being scikit-learn, where they expect numeric data with NaNs, not object dtype with pd.NA). You can of course require every such library to do the conversion to numpy specifically for pandas objects with custom APIs (to_numpy() instead of np.asarray()), but I think that is going to be cumbersome, while an array with NaNs is what by and large most people will want (I think).
(minor comment) the disallowing of NaN for nullable integer (when the option is set to False), that should probably also apply to other non-float nullable dtypes like "string"?

jbrockmendel · 2025-10-07T20:34:17Z

i.e. in the distinguishing world, you currently cannot check for NaNs, you cannot easily set NaNs (setitem, replace) without having them turn into NA, etc. Those issues are now "solved" for the default NaN-is-NA behaviour, but if you set the option to False, you still run into all those issues)

The checking for NaNs part is accurate but easily solved (I haven't made a 5-minute PR to give someone else the chance to. If no one does before long, I will). The rest is inaccurate:

import numpy as np
import pandas as pd

pd.set_option("mode.nan_is_na", False)

ser = pd.Series([1, pd.NA, np.nan], dtype="Float64")

>>> ser.replace(pd.NA, np.nan)
0    1.0
1    NaN
2    NaN
dtype: Float64

>>> ser.replace(np.nan, pd.NA)
0     1.0
1    <NA>
2    <NA>
dtype: Float64

>>> ser[1] = np.nan
>>> ser[1]
np.float64(nan)

(minor comment) the disallowing of NaN for nullable integer (when the option is set to False), that should probably also apply to other non-float nullable dtypes like "string"?

I'd be fine with that.

Big picture, 4ish meetings ago when we discussed this I asked you if you had an alternative path (to get us to consistent-behavior + nullable-by-default). You said you had to give it some thought. Do you have anything in mind now? Because it seems like you are calling for this to be reverted and to keep the inconsistent behavior indefinitely. (which I think will prevent us getting to nullable-by-default, but im willing to be convinced otherwise).

jorisvandenbossche · 2025-10-07T21:28:16Z

The checking for NaNs part is accurate but easily solved (I haven't made a 5-minute PR to give someone else the chance to. If no one does before long, I will). The rest is inaccurate:

Apologies, that is completely correct!
(I still don't think this is the behaviour we want while migrating, but that is a separate discussion from the fact that this is now indeed working as it should be on the long term)

Big picture, 4ish meetings ago when we discussed this I asked you if you had an alternative path (to get us to consistent-behavior + nullable-by-default). You said you had to give it some thought. Do you have anything in mind now?

I don't remember the exact details of the context there, but in general I have always argued for all nullable dtypes by default (PDEP-16), using a single logical dtype system (potentially with multiple backends, where it makes sense and needs to be discussed, but at least with generally one "type" of behaviour), and then specifically on the NaN vs NA I have expressed my personal preference for distinguishing (but wanted to compromise on this aspect to get everyone on board with the nullable dtypes, but nowadays more people might generally be in favor of the distinguishing behaviour so that this is maybe not needed?).

Many of those aspects have come up in recent discussions, but I still don't think there is a clear decision on whether we only want a single type system, or keep independent numpy like and nullable/arrow like type systems (and the change to the default behaviour in this PR doesn't actually get us closer to that, because now we have something in between that is nullable but does not follow the NaN/NA distinguishing we might want for the nullable/arrow type system).
I think that is a fundamental discussion/decision that needs to happen first before we know if an option to choose the NaN vs NA behaviour is a good long term choice or not.

For a more concrete plan, personally I think we first need to generally focus on developing the nullable dtypes how we want them to be when enabling them by default in the future (there are still lots of missing gaps (eg categorical), and we need to decide how to handle the pyarrow backed dtypes, i.e. StringDtype vs ArrowDtype model), and then have a global flag like pd.options.future.use_nullable_dtypes for people to opt in. And then when ready in some future major release switch the default.

But for the specifics about how to ease the migration to those nullable dtypes, and then especially how NaN gets treated in this migration, I think the exact order of changes is important, which I explained a bit in the last part of #61618 (comment).
I personally want to end up with a float dtype that distinguishes, but for migrating I think we will have to treat NaNs as missing in context of construction or other input / coercion.

That means I want the nan_is_na = False behaviour for operations, but I am convinced that for most constructors we need the nan_is_na = True behaviour for quite a while. The current option introduced in this PR does not allow that distinction, so I think it is 1) not very useful for the path towards all nullable dtypes by default and 2) if we only want a single behaviour in the future, I think it is confusing to seemingly give the user the choice (if we would not actually be planning to keep that choice).

jorisvandenbossche · 2025-10-07T21:46:02Z

.. especially how NaN gets treated in this migration, I think the exact order of changes is important, ..

That means I want the nan_is_na = False behaviour for operations, but I am convinced that for most constructors we need the nan_is_na = True behaviour for quite a while.

Expanding on this a bit more, it is true one could also go from numpy to never distinguishing NaN and NA to always distinguishing, which is closer to what this PR does.
For a migration path, I think the choice is essentially between (there are probably other options! but those two are in my head):

numpy dtype (only NaN, current default) -> nullable dtype (distinguish, but treat NaN in input as missing for back compat) -> nullable dtype (fully distinguish)
numpy dtype (only NaN, current default) -> nullable dtype (never distinguish, i.e. only NA and coerce any NaN to NA; default for nullable dtypes after this PR) -> nullable dtype (fully distinguish)

The second option gives initially a smaller change when enabling the nullable dtypes, at the cost of having to do another breaking change for moving to distinguishing NaN and NA (if this is a change we want to do for the default behaviour).

Personally I think the first option is better, because I expect it will give a better experience overall (and because I think we will have to keep the "treat NaN in input as missing for back compat" for a very long time, potentially even forever at least as an option in constructors, and then that would also delay the "distinguish" behaviour a long time in option 2 if we only want to introduce it after deprecating NaN as missing in input)

Dr-Irv · 2025-10-07T22:14:20Z

That means I want the nan_is_na = False behaviour for operations, but I am convinced that for most constructors we need the nan_is_na = True behaviour for quite a while. The current option introduced in this PR does not allow that distinction, so I think it is 1) not very useful for the path towards all nullable dtypes by default and 2) if we only want a single behaviour in the future, I think it is confusing to seemingly give the user the choice (if we would not actually be planning to keep that choice).

Not sure if the following idea makes things too complex, but what if the mode.nan_is_na option had 3 choices:

"always" (like what True does now)
"convert" (convert np.NaN to pd.NA on input, but NOT for operations)
"never" (like what False does now)

Then Joris can get his desired behavior with the "convert" option.

rhshadrach · 2025-10-08T12:24:43Z

My understanding is that the core team has general alignment on distinguishing NaN from NA.

I certainly think it makes sense to have some nan_is_na option that can be disabled (I would also do it at the function/method level, but then it probably makes sense to have a global override as well).

This sounds like too much to me. Users can use a context manager if they want to apply one behavior to some sections of code and not others. I do not think we need to introduce arguments to functions/methods only to have to deprecate them later. If we're not planning to deprecate them after the transition, I can be on board with arguments, but in such a case (this is perhaps a nit pick) it should be a global underride and not override - that is, it only takes effect if the user has not specified the argument to the function/method.
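
As a toy model of the context-manager approach (a sketch of the pattern only, with a hypothetical module-level registry, not pandas internals), a scoped override could look like the following. In pandas itself the equivalent tool is `pd.option_context`.

```python
from contextlib import contextmanager

# Toy registry standing in for a global options store.
_options = {"mode.nan_is_na": True}


def get_option(key):
    return _options[key]


@contextmanager
def option_context(key, value):
    """Temporarily set an option, restoring the previous value on exit."""
    previous = _options[key]
    _options[key] = value
    try:
        yield
    finally:
        # Restore even if the body raised.
        _options[key] = previous


with option_context("mode.nan_is_na", False):
    inside = get_option("mode.nan_is_na")
outside = get_option("mode.nan_is_na")
print(inside, outside)  # False True
```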

Many of those aspects have come up in recent discussions, but I still don't think there is a clear decision on whether we only want a single type system, or keep independent numpy like and nullable/arrow like type systems (and the change to the default behaviour in this PR doesn't actually get us closer to that, because now we have something in between that is nullable but does not follow the NaN/NA distinguishing we might want for the nullable/arrow type system).
I think that is a fundamental discussion/decision that needs to happen first before we know if an option to choose the NaN vs NA behaviour is a good long term choice or not.

I would guess that making this a requirement would result in more years of no progress. In particular, I discourage use of nullable dtypes at work because of the inconsistent NA treatment.

Building off of @Dr-Irv's proposal, I would suggest having "never" and "on_construction" options (just a bikeshed of "convert") and removing "always". @jorisvandenbossche - other than general hesitancy, does this satisfy desires indicated above?

jorisvandenbossche · 2025-10-08T15:33:39Z

This sounds like too much to me. Users can use a context manager if they want to apply one behavior to some sections of code and not others. I do not think we need to introduce arguments to functions/methods only to have to deprecate them later. If we're not planning to deprecate them after the transition, ...

Indeed, I don't think we would ever deprecate such an argument. To make it a bit more concrete, suppose we add nan_is_na option to the pd.Series(..) constructor to control, when creating a Series from a numpy array, whether the NaNs should be converted to NA or kept as NaN.
This is something that, as long as people combine numpy and pandas, will always need to be configurable just for this conversion, regardless of how NaN/NA are distinguished in operations (as comparison, the polars Series constructor has a keyword to that effect, and the pyarrow array constructor essentially as well (it is a bit confusingly called from_pandas=True/False, but essentially has that effect)). And for our main constructors, I personally think it is worth adding an explicit option (in addition to a possible global option)

(and regarding override vs underride, I think we are talking about the same thing but just using a different term: the local keyword has priority over the global option, and thus overrides the global option. But that local keyword has also still some default, and if it is not explicitly specified by the user, the global option overrides the default of the local keyword)
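
The precedence described here can be sketched with a sentinel default. All names below (`_UNSET`, `GLOBAL_OPTIONS`, `resolve_nan_is_na`) are hypothetical, and no such keyword exists on the pandas constructors as far as this thread states.

```python
_UNSET = object()  # sentinel meaning "the caller did not pass the keyword"

# Toy stand-in for the global option.
GLOBAL_OPTIONS = {"mode.nan_is_na": True}


def resolve_nan_is_na(nan_is_na=_UNSET):
    """Return the effective setting for one call.

    An explicit keyword wins; otherwise the global option applies.
    """
    if nan_is_na is not _UNSET:
        return nan_is_na
    return GLOBAL_OPTIONS["mode.nan_is_na"]


print(resolve_nan_is_na())                 # True  (falls back to the global option)
print(resolve_nan_is_na(nan_is_na=False))  # False (explicit keyword wins)

GLOBAL_OPTIONS["mode.nan_is_na"] = False
print(resolve_nan_is_na())                 # False (global option changed)
print(resolve_nan_is_na(nan_is_na=True))   # True  (explicit keyword still wins)
```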

Building off of @Dr-Irv's proposal, I would suggest having "never" and "on_construction" options (just a bikeshed of "convert") and removing "always". @jorisvandenbossche - other than general hesitancy, does this satisfy desires indicated above?

I think that additional option would indeed be useful.
But I assume the default would still be "never", i.e. as with this PR? And I still have the question, why choosing that default? You mention "My understanding is that the core team has general alignment on distinguishing NaN from NA.", but so we currently have nullable/pyarrow float that do (although imperfectly) distinguish. So why change to not distinguish if we agree we want to distinguish? As an easier migration?

Dr-Irv · 2025-10-08T16:14:10Z

But I assume the default would still be "never", i.e. as with this PR? And I still have the question, why choosing that default? You mention "My understanding is that the core team has general alignment on distinguishing NaN from NA.", but so we currently have nullable/pyarrow float that do (although imperfectly) distinguish. So why change to not distinguish if we agree we want to distinguish? As an easier migration?

My take on this is that it makes the migration easier. For now, we don't distinguish. Then, we change the default in the future to distinguish.

jbrockmendel · 2025-10-08T16:55:02Z

To make it a bit more concrete, suppose we add nan_is_na option to the pd.Series(..) constructor to control, when creating a Series from a numpy array, whether the NaNs should be converted to NA or kept as NaN.

If that's what it takes to move forward here, I'll make my peace with it. But I'll be grumpy about it because obj.replace(np.nan, pd.NA) is a viable alternative that doesn't require new API.

I do not think we need to introduce arguments to functions/methods only to have to deprecate them later

I don't like adding a keyword to isna etc bc it means adding the keyword to everything that calls isna anywhere in the call stack. And Just Be Consistent is a much simpler alternative.

My take on this is that it makes the migration easier. For now, we don't distinguish. Then, we change the default in the future to distinguish.

That's my thought. Never-distinguish is the only variant in which transitioning users to nullable-by-default is viable.

jbrockmendel mentioned this pull request Aug 4, 2025

POC: NA-only behavior for numpy-nullable dtypes #61708

Closed

jbrockmendel force-pushed the api-nan-vs-na branch 2 times, most recently from 1d85ad8 to 1ccaaa4 Compare August 4, 2025 20:41

This was referenced Aug 5, 2025

API: NaN vs NA in mixed reduction #62024

Open

BUG: read_csv loses precision when engine='pyarrow' and dtype Int64 #56136

Closed

BUG: read_csv with engine=pyarrow and numpy-nullable dtype #62053

Merged

jbrockmendel force-pushed the api-nan-vs-na branch 3 times, most recently from f0e5e34 to 71d1c03 Compare August 6, 2025 14:45

jbrockmendel added 21 commits August 12, 2025 09:07

BUG: read_csv with engine=pyarrow and numpy-nullable dtype

5e88fde

mypy fixup, error message compat for 32bit builds

eae6f64

minimum version compat

2861b16

not-infer-string compat

5369afa

mypy fixup

db35a9c

update usage

505bfb6

CLN: remove redundant check

febe83c

Use Matts idea

c81cbec

re-xfail

26a3049

API: rank with nullable dtypes preserve NA

a70b429

API: improve dtype in df.where with EA other

99a71b7

GH refs

c86747d

doc fixup

9d222d8

BUG: Decimal(NaN) incorrectly allowed in ArrowEA constructor with tim…

6f800b3

…estamp type

GH ref

514a56f

BUG: ArrowEA constructor with timestamp type

fca3c7c

POC: consistent NaN treatment for pyarrow dtypes

f20758a

comment

cc416fa

Down to 40 failing tests

7094d85

Fix rank, json tests

eeb0d32

CLN: remove outdated

814d001

jbrockmendel added 2 commits September 9, 2025 13:21

Merge branch 'main' into api-nan-vs-na

7dcf2eb

NA->dtype.na_value

32a2041

mroeschke reviewed Sep 10, 2025

mroeschke added the Missing-data and NA - MaskedArrays labels Sep 10, 2025

mroeschke approved these changes Sep 10, 2025

jbrockmendel added 3 commits September 15, 2025 14:47

Merge branch 'main' into api-nan-vs-na

3382678

Merge branch 'main' into api-nan-vs-na

d2473ab

Merge branch 'main' into api-nan-vs-na

b4dcfa6

mroeschke merged commit e4ca405 into pandas-dev:main Sep 26, 2025
42 checks passed

jbrockmendel deleted the api-nan-vs-na branch September 26, 2025 14:21

MarcoGorelli mentioned this pull request Sep 28, 2025

deal with pandas new nan_is_na mode narwhals-dev/narwhals#3160

Open

jzwick pushed a commit to jzwick/pandas that referenced this pull request Oct 1, 2025

API: mode.nan_is_na to consistently distinguish NaN-vs-NA (pandas-dev…

c59f312

…#62040)

jorisvandenbossche mentioned this pull request Oct 8, 2025

BUG: np.isnan raises on pyarrow dtypes #62506

Open

3 tasks

MarcoGorelli mentioned this pull request Oct 9, 2025

chore: pandas-nightly and duckdb-nightly fixes narwhals-dev/narwhals#3158

Merged

10 tasks

eicchen pushed a commit to eicchen/pandas that referenced this pull request Oct 18, 2025

API: mode.nan_is_na to consistently distinguish NaN-vs-NA (pandas-dev…

42d06ed

…#62040)

Alvaro-Kothe mentioned this pull request Nov 20, 2025

BUG: skipna=True operations don't skip NaN in FloatingArrays #59965

Open

3 tasks
