
Conversation

@jbrockmendel (Member) commented Aug 4, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

As discussed on the last dev call, this implements "mode.nan_is_na" (default True) to consider NaN as either always-equivalent or never-equivalent to NA.
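
A minimal sketch of the two modes, based on the behaviour described in this PR and thread (the results shown in comments are illustrative):

import numpy as np
import pandas as pd

# Default (mode.nan_is_na=True): NaN in input is treated as NA.
ser = pd.Series([1.0, np.nan], dtype="Float64")
# ser[1] -> <NA>

# Opt out: NaN and NA stay distinct values.
pd.set_option("mode.nan_is_na", False)
ser = pd.Series([1.0, np.nan, pd.NA], dtype="Float64")
# ser[1] -> nan, ser[2] -> <NA>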

This sits on top of

Still need to


cf.register_option(
    "nan_is_na",
    os.environ.get("PANDAS_NAN_IS_NA", "1") == "1",
A reviewer (Member) commented on the cf.register_option snippet above:

Curious, I thought you were not fond of the environment variable pattern?

@jbrockmendel (Member, Author) replied:

that does sound like the kind of opinion i would have, but ATM i don't find myself bothered by it
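
For context on the snippet under discussion: the registration reads the default from an environment variable, so it must be set before pandas is imported. A minimal sketch, assuming the option lives under the "mode" namespace as used elsewhere in this thread:

import os
os.environ["PANDAS_NAN_IS_NA"] = "0"  # must happen before importing pandas

import pandas as pd

print(pd.get_option("mode.nan_is_na"))  # False, per the registration above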

@mroeschke added the "Missing-data" (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) and "NA - MaskedArrays" (related to pd.NA and nullable extension arrays) labels on Sep 10, 2025
@jbrockmendel (Member, Author)

Gentle ping

@mroeschke merged commit e4ca405 into pandas-dev:main on Sep 26, 2025
42 checks passed
@mroeschke (Member)

Thanks @jbrockmendel

@jorisvandenbossche (Member) commented Oct 7, 2025

Sorry for the late timing of this comment (I wasn't aware that this PR was about to be merged), but: personally I don't think this PR should have been merged without more discussion.

In general we should strive for transparent and public discussions as much as possible, but especially for a significant change like this.

The only reference to any discussion was:

Discussed in the dev call before last where I, @mroeschke, and @Dr-Irv were +1. Joris was unenthused but "not necessarily opposed". On slack @rhshadrach expressed a +1. All those opinions were to the concept, not the execution.

(and I wasn't pinged so I wasn't aware that I was being quoted ...)
While the dev meetings are very useful, we should try to ensure that discussions and decisions are still documented on GitHub. Here we only have the mention above and some brief notes from the meeting (included below), and even the people who gave a +1 privately didn't participate in the discussion on GitHub.
But in this case, as far as I can see, there is just no public discussion whatsoever, not even an explanation of why this change is made. Yes, we have already had a lot of discussion around NA vs NaN (e.g. #32265), but I don't think this particular solution has come up there. And these PRs were mentioned in the broad pyarrow dtypes discussion (#61618), but again not discussed specifically. So why this solution and not another one? Why include a breaking change? Do we intend to keep the flag long term? Which of the two modes do we want in the future?

When going forward with an impactful global option and a breaking change, we should at least consider and try to answer those questions.

And while at the dev meeting I said I was not necessarily opposed to a global option, I don't think this version of a global option is useful (and by adding it now, it prevents us from using it for an (IMO) better global option in the future), nor does it necessarily warrant a breaking change.

So much for the comment on the process; I will do a follow-up post below detailing why I personally think the change in this PR is not the best solution (or at least requires a discussion of those concerns before moving forward).


For completeness, the terse notes from the mentioned dev meeting:
  • Brock’s proposal is to do both behind a global flag
  • Joris says the inconsistent behavior is intentional. Not necessarily opposed to a global option. “I think in practice it's not going to be that useful”
  • Irv likes long-term plan to do global flag, start with not-distinguish for now, eventually move to always-distinguish
  • Matt “leans to Irv and Brock’s camp”
  • Probably need a pd.isNaN() method to test for NaN
  • Need methods for converting from NaN to NA and NA to NaN
  • Issue raised about data integrity with numpy operations, where the pd.NA vs np.nan distinction would be lost because numpy only has np.nan.
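
On the pd.isNaN() item in the notes: no dedicated check exists as of this PR, but one hedged workaround with existing APIs is to round-trip through numpy with a non-NaN sentinel standing in for NA:

import numpy as np
import pandas as pd

pd.set_option("mode.nan_is_na", False)
ser = pd.Series([1.0, pd.NA, np.nan], dtype="Float64")

# Map NA to a sentinel that is not NaN, so np.isnan flags only true NaNs.
arr = ser.to_numpy(dtype="float64", na_value=np.inf)
is_nan = np.isnan(arr)  # array([False, False, True])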

@jbrockmendel (Member, Author)

(and I wasn't pinged so I wasn't aware that I was being quoted ...)

That one is on me, my bad.

@jorisvandenbossche (Member)

First, I definitely understand people want to be able to move forward on the NaN vs NA topic, as it is something that has been discussed for a long time without much movement (#32265) and has a big impact on various related issues (i.e. the many interlinked open issues about users seeing unexpected behaviour / our API not being sufficient for distinguishing them).

But adding a global mode supporting both behaviours is a significant change (although the code change to support both is not that huge).

I think my main concerns are:

  • When adding this option, we should ideally already have a clear idea about the longer-term intentions for it: is it meant to be here to stay (i.e. keep supporting both)? And which behaviour do we want to be the default in the future?

  • While at some point a global option can make sense, I am not convinced that the current scope of the global option (i.e. what it controls) is what we want. Personally I don't think it should control both coercion to the dtype (e.g. in constructors) and operations on the dtype (e.g. division by 0). And I think it should also not influence conversion to numpy.

  • As a consequence of the above two (and if we want the distinguishing behaviour as the default in the future), I am not convinced the breaking-change aspect of this PR is worth it for users.


In more detail:

On the first item, I personally think we should (eventually) go with the simpler model of a single behaviour within pandas, and not indefinitely support both (this aspect has seen some discussion in the pyarrow dtypes issue #61618). And if we choose one, I would personally go for distinguishing NaN and NA (see the long discussion in #32265).

Those are questions we should discuss, but if we go in that direction, I don't think it is useful to introduce a breaking change right now for the people already using the nullable Float64 or pyarrow dtypes. Right now it already distinguishes (during operations); after this PR, with 3.0, it will stop distinguishing (turning NaNs from operations into NA); and then later we would switch back to distinguishing.
If we want the distinguishing behaviour to be the default in the future, this feels like unnecessary churn. In that sense I think we should have some idea of what behaviour we want in the future before making this change. And if we want the distinguishing behaviour in the future, I would just not make this breaking change right now (and rather focus on adding APIs to check for NaNs and to change NaNs to something else).

Again, if we go in the direction of distinguishing at some point in the future, I certainly think it makes sense to have some nan_is_na option that can be disabled (I would also do it at the function/method level, but then it probably makes sense to have a global override as well). But if the idea is to migrate from the current numpy-not-distinguishing world to a nullable-distinguishing world, in the future such an option would (in my mind) only control how to treat NaN in conversions (constructors/setitem, when receiving scalars or numpy data and coercing to the nullable dtype), and not in operations (as at that point we are already "inside" pandas and consistently distinguish).
And personally I think having the global option only control that part, i.e. the conversion, is going to be essential for any migration. People constantly use NaN in input to mean a missing value (whether by typing it explicitly, or by getting numpy data through some method and passing that to pandas). But the switch in general from a (non-distinguishing) numpy dtype to a (distinguishing) nullable dtype would probably be controlled by a separate option.

(This also ties into the main disagreement about the current state of Float64Dtype (before this PR), as I understand it: while Brock and others say it has inconsistent behaviour, I say that this inconsistency is by design: once the data is in an array of the dtype, it distinguishes NaN/NA, but when coercing to the dtype it treats NaN as missing for backwards compatibility. There are definitely a whole bunch of issues with that current state, but IMO that is in large part because of missing coverage in our APIs to deal with NaN as not-NA, i.e. in the distinguishing world, you currently cannot check for NaNs, you cannot easily set NaNs (setitem, replace) without having them turn into NA, etc. Those issues are now "solved" for the default NaN-is-NA behaviour, but if you set the option to False, you still run into all those issues)

Some additional points:

  • The nan_is_na option introduced here also impacts conversion to numpy (giving object dtype with pd.NA). I am not sure this is the desired behaviour (as a comparison, polars and pyarrow convert a floating array-like with nulls to numpy float with NaNs). Any library that currently converts (implicitly or explicitly) pandas data to numpy using numeric dtypes will break because of this (a typical example being scikit-learn, which expects numeric data with NaNs, not object dtype with pd.NA); see the sketch after this list. You can of course require every such library to do the conversion to numpy specifically for pandas objects with custom APIs (to_numpy() instead of np.asarray()), but I think that is going to be cumbersome, while an array with NaNs is what most people will want (I think).

  • (minor comment) the disallowing of NaN for nullable integer (when the option is set to False) should probably also apply to other non-float nullable dtypes like "string"?
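
To illustrate the conversion concern in the first point above, a hedged sketch (the object-dtype result is the behaviour being questioned, not necessarily the final one):

import numpy as np
import pandas as pd

pd.set_option("mode.nan_is_na", False)
ser = pd.Series([1.0, pd.NA], dtype="Float64")

np.asarray(ser)  # object-dtype array holding pd.NA, per the point above

# What most downstream numeric libraries (e.g. scikit-learn) expect instead:
ser.to_numpy(dtype="float64", na_value=np.nan)  # float64 array with NaN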

@jbrockmendel (Member, Author)

i.e. in the distinguishing world, you currently cannot check for NaNs, you cannot easily set NaNs (setitem, replace) without having them turn into NA, etc. Those issues are now "solved" for the default NaN-is-NA behaviour, but if you set the option to False, you still run into all those issues)

The checking for NaNs part is accurate but easily solved (I haven't made a 5-minute PR to give someone else the chance to. If no one does before long, I will). The rest is inaccurate:

import numpy as np
import pandas as pd

pd.set_option("mode.nan_is_na", False)

ser = pd.Series([1, pd.NA, np.nan], dtype="Float64")

>>> ser.replace(pd.NA, np.nan)
0    1.0
1    NaN
2    NaN
dtype: Float64

>>> ser.replace(np.nan, pd.NA)
0     1.0
1    <NA>
2    <NA>
dtype: Float64

>>> ser[1] = np.nan
>>> ser[1]
np.float64(nan)

(minor comment) the disallowing of NaN for nullable integer (when the option is set to False) should probably also apply to other non-float nullable dtypes like "string"?

I'd be fine with that.

Big picture, 4ish meetings ago when we discussed this I asked you if you had an alternative path (to get us to consistent-behavior + nullable-by-default). You said you had to give it some thought. Do you have anything in mind now? Because it seems like you are calling for this to be reverted and to keep the inconsistent behavior indefinitely (which I think will prevent us from getting to nullable-by-default, but I'm willing to be convinced otherwise).

@jorisvandenbossche (Member) commented Oct 7, 2025

The checking for NaNs part is accurate but easily solved (I haven't made a 5-minute PR to give someone else the chance to. If no one does before long, I will). The rest is inaccurate:

Apologies, that is completely correct!
(I still don't think this is the behaviour we want while migrating, but that is a separate discussion from the fact that this is indeed now working as it should for the long term.)

Big picture, 4ish meetings ago when we discussed this I asked you if you had an alternative path (to get us to consistent-behavior + nullable-by-default). You said you had to give it some thought. Do you have anything in mind now?

I don't remember the exact details of the context there, but in general I have always argued for all nullable dtypes by default (PDEP-16), using a single logical dtype system (potentially with multiple backends where it makes sense, which needs to be discussed, but at least with generally one "type" of behaviour). And specifically on NaN vs NA, I have expressed my personal preference for distinguishing (but wanted to compromise on this aspect to get everyone on board with the nullable dtypes; nowadays more people might be in favor of the distinguishing behaviour, so that compromise may no longer be needed?).

Many of those aspects have come up in recent discussions, but I still don't think there is a clear decision on whether we only want a single type system, or keep independent numpy-like and nullable/arrow-like type systems (and the change to the default behaviour in this PR doesn't actually get us closer to that, because now we have something in between that is nullable but does not follow the NaN/NA distinguishing we might want for the nullable/arrow type system).
I think that is a fundamental discussion/decision that needs to happen first, before we know if an option to choose the NaN vs NA behaviour is a good long-term choice or not.

For a more concrete plan, personally I think we first need to focus on developing the nullable dtypes into what we want them to be when enabling them by default in the future (there are still lots of gaps, e.g. categorical, and we need to decide how to handle the pyarrow-backed dtypes, i.e. the StringDtype vs ArrowDtype model), then have a global flag like pd.options.future.use_nullable_dtypes for people to opt in, and then switch the default in some future major release when ready.

But for the specifics about how to ease the migration to those nullable dtypes, and then especially how NaN gets treated in this migration, I think the exact order of changes is important, which I explained a bit in the last part of #61618 (comment).
I personally want to end up with a float dtype that distinguishes, but for migrating I think we will have to treat NaNs as missing in context of construction or other input / coercion.

That means I want the nan_is_na = False behaviour for operations, but I am convinced that for most constructors we need the nan_is_na = True behaviour for quite a while. The current option introduced in this PR does not allow that distinction, so I think it is 1) not very useful for the path towards all nullable dtypes by default and 2) if we only want a single behaviour in the future, I think it is confusing to seemingly give the user the choice (if we would not actually be planning to keep that choice).

@jorisvandenbossche (Member) commented Oct 7, 2025

.. especially how NaN gets treated in this migration, I think the exact order of changes is important, ..

That means I want the nan_is_na = False behaviour for operations, but I am convinced that for most constructors we need the nan_is_na = True behaviour for quite a while. ...

Expanding on this a bit more: it is true that one could also go from numpy, to never distinguishing NaN and NA, to always distinguishing, which is closer to what this PR does.
For a migration path, I think the choice is essentially between (there are probably other options! but these two are in my head):

  • numpy dtype (only NaN, current default) -> nullable dtype (distinguish, but treat NaN in input as missing for back compat) -> nullable dtype (fully distinguish)
  • numpy dtype (only NaN, current default) -> nullable dtype (never distinguish, i.e. only NA and coerce any NaN to NA; default for nullable dtypes after this PR) -> nullable dtype (fully distinguish)

The second option initially gives a smaller change when enabling the nullable dtypes, at the cost of another breaking change later to move to distinguishing NaN and NA (if that is a change we want for the default behaviour).

Personally I think the first option is better, because I expect it will give a better experience overall, and because I think we will have to keep the "treat NaN in input as missing for back compat" behaviour for a very long time, potentially even forever, at least as an option in constructors; under option 2 that would also delay the "distinguish" behaviour for a long time, if we only introduce it after deprecating NaN-as-missing in input.

@Dr-Irv (Contributor) commented Oct 7, 2025

That means I want the nan_is_na = False behaviour for operations, but I am convinced that for most constructors we need the nan_is_na = True behaviour for quite a while. The current option introduced in this PR does not allow that distinction, so I think it is 1) not very useful for the path towards all nullable dtypes by default and 2) if we only want a single behaviour in the future, I think it is confusing to seemingly give the user the choice (if we would not actually be planning to keep that choice).

Not sure if the following idea makes things too complex, but what if the mode.nan_is_na option had 3 choices:

  • "always" (like what True does now)
  • "convert" (convert np.NaN to pd.NA on input, but NOT for operations)
  • "never" (like what False does now)

Then Joris can get his desired behavior with the "convert" option.
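
A sketch of how the three proposed values might behave; the string values are hypothetical (only the boolean option exists as of this PR), and the semantics are inferred from the proposal above:

import numpy as np
import pandas as pd

inp = pd.Series([0.0, np.nan], dtype="Float64")      # NaN in input
res = inp / pd.Series([0.0, 1.0], dtype="Float64")   # 0/0 produces NaN in operations

# mode         NaN in input becomes    0/0 result in operations
# "always"     pd.NA                   pd.NA
# "convert"    pd.NA                   NaN
# "never"      NaN                     NaN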

@rhshadrach (Member)

My understanding is that the core team has general alignment on distinguishing NaN from NA.

I certainly think it makes sense to have some nan_is_na option that can be disabled (I would also do it at the function/method level, but then it probably makes sense to have a global override as well).

This sounds like too much to me. Users can use a context manager if they want to apply one behavior to some sections of code and not others. I do not think we need to introduce arguments to functions/methods only to have to deprecate them later. If we're not planning to deprecate them after the transition, I can be on board with arguments, but in such a case (this is perhaps a nit pick) it should be a global underride and not override - that is, it only takes effect if the user has not specified the argument to the function/method.

Many of those aspects have come up in recent discussions, but I still don't think there is a clear decision on whether we only want a single type system, or keep independent numpy-like and nullable/arrow-like type systems (and the change to the default behaviour in this PR doesn't actually get us closer to that, because now we have something in between that is nullable but does not follow the NaN/NA distinguishing we might want for the nullable/arrow type system).
I think that is a fundamental discussion/decision that needs to happen first, before we know if an option to choose the NaN vs NA behaviour is a good long-term choice or not.

I would guess that making this a requirement would result in more years of no progress. In particular, I discourage use of nullable dtypes at work because of the inconsistent NA treatment.

Building off of @Dr-Irv's proposal, I would suggest having "never" and "on_construction" options (just a bikeshed of "convert") and removing "always". @jorisvandenbossche - other than general hesitancy, does this satisfy desires indicated above?

@jorisvandenbossche (Member)

This sounds like too much to me. Users can use a context manager if they want to apply one behavior to some sections of code and not others. I do not think we need to introduce arguments to functions/methods only to have to deprecate them later. If we're not planning to deprecate them after the transition, ...

Indeed, I don't think we would ever deprecate such an argument. To make it a bit more concrete, suppose we add a nan_is_na option to the pd.Series(..) constructor to control, when creating a Series from a numpy array, whether the NaNs should be converted to NA or kept as NaN.
This is something that, as long as people combine numpy and pandas, will always need to be configurable for this conversion, regardless of how NaN/NA are distinguished in operations (as a comparison, the polars Series constructor has a keyword to that effect, and the pyarrow array constructor essentially does as well: it is a bit confusingly called from_pandas=True/False, but has essentially that effect). And for our main constructors, I personally think it is worth adding an explicit keyword (in addition to a possible global option).
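
A hedged sketch of what that could look like; the pandas keyword is hypothetical (not an existing API), while the polars and pyarrow equivalents mentioned above do exist:

import numpy as np
import polars as pl
import pyarrow as pa

data = np.array([1.0, np.nan])

# Hypothetical pandas API (illustration only, not implemented):
# pd.Series(data, dtype="Float64", nan_is_na=True)   # NaN -> pd.NA
# pd.Series(data, dtype="Float64", nan_is_na=False)  # NaN stays NaN

# Existing equivalents in other libraries:
pl.Series(data, nan_to_null=True)   # NaN becomes null
pa.array(data, from_pandas=True)    # NaN becomes null
pa.array(data, from_pandas=False)   # NaN stays NaN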

(and regarding override vs underride, I think we are talking about the same thing but just using a different term: the local keyword has priority over the global option, and thus overrides the global option. But that local keyword has also still some default, and if it is not explicitly specified by the user, the global option overrides the default of the local keyword)

Building off of @Dr-Irv's proposal, I would suggest having "never" and "on_construction" options (just a bikeshed of "convert") and removing "always". @jorisvandenbossche - other than general hesitancy, does this satisfy desires indicated above?

I think that additional option would indeed be useful.
But I assume the default would still be "never", i.e. as with this PR? And I still have the question: why choose that default? You mention "My understanding is that the core team has general alignment on distinguishing NaN from NA.", but we currently have nullable/pyarrow float dtypes that do (although imperfectly) distinguish. So why change to not distinguishing if we agree we want to distinguish? As an easier migration?

@Dr-Irv (Contributor) commented Oct 8, 2025

But I assume the default would still be "never", i.e. as with this PR? And I still have the question: why choose that default? You mention "My understanding is that the core team has general alignment on distinguishing NaN from NA.", but we currently have nullable/pyarrow float dtypes that do (although imperfectly) distinguish. So why change to not distinguishing if we agree we want to distinguish? As an easier migration?

My take on this is that it makes the migration easier. For now, we don't distinguish. Then, we change the default in the future to distinguish.

@jbrockmendel (Member, Author)

To make it a bit more concrete, suppose we add a nan_is_na option to the pd.Series(..) constructor to control, when creating a Series from a numpy array, whether the NaNs should be converted to NA or kept as NaN.

If that's what it takes to move forward here, I'll make my peace with it. But I'll be grumpy about it because obj.replace(np.nan, pd.NA) is a viable alternative that doesn't require new API.

I do not think we need to introduce arguments to functions/methods only to have to deprecate them later

I don't like adding a keyword to isna etc. because it means adding the keyword to everything that calls isna anywhere in the call stack. And Just Be Consistent is a much simpler alternative.

My take on this is that it makes the migration easier. For now, we don't distinguish. Then, we change the default in the future to distinguish.

That's my thought. Never-distinguish is the only variant in which transitioning users to nullable-by-default is viable.
