-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
API: mode.nan_is_na to consistently distinguish NaN-vs-NA #62040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1d85ad8
to
1ccaaa4
Compare
f0e5e34
to
71d1c03
Compare
|
||
cf.register_option( | ||
"nan_is_na", | ||
os.environ.get("PANDAS_NAN_IS_NA", "1") == "1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, I thought you were not fond of the environment variable pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that does sound like the kind of opinion i would have, but ATM i don't find myself bothered by it
Gentle ping |
Thanks @jbrockmendel |
Sorry for the late timing of this comment (I wasn't aware that this PR was planning to be merged), but: personally I don't think this PR should have been merged without more discussion. In general we should strive for transparent and public discussions as much as possible, but especially for a significant change like this. The only reference to any discussion was:
(and I wasn't pinged so I wasn't aware that I was being quoted ...) When going forward with an impactful global option and a breaking change, we should at least consider and try to answer those questions. And while at the dev meeting I said to not necessarily be opposed to a global option, I don't think this version of a global option is useful (and by adding it now, it prevents us from using it for a (IMO) better global option in the future) and doesn't necessarily warrant a breaking change. So far for a comment on the process, will do a follow-up post below detailing why I am personally think the change in this PR is not the best solution (or at least requires a discussion of those concerns before moving forward). For completeness, the terse notes from the mentioned dev meeting (click to expand):
|
That one is on me, my bad. |
First, I definitely understand people want to be able to move forward on the NaN vs NA topic, as it is something that has been discussed for a long time without much movement (#32265) and has a big impact on various related issues (i.e. the many interlinked open issues about users seeing unexpected behaviour / our API not being sufficient for distinguishing them). But adding a global mode supporting both behaviours is a significant change (although the code change to support both is not that huge). I think my main concerns are:
In more detail: On the first item, I personally think we should (eventually) go with the simpler model of a single behaviour within pandas, and not indefinitely support both (this aspect has seen some discussion in the pyarrow dtypes issue #61618)). And if we choose one, I would personally go for distinguishing NaN and NA (the long discussion in #32265 ..). Those are questions we should discuss, but if go in that direction, I don't think it is for example useful to introduce a breaking change right now for the people already using the nullable Float64 or pyarrow dtype. Right now (during operations) it already distinguishes, now after this PR with 3.0 it will start to not distinguish (turn NaNs from operations into NA), and then later we would switch back to distinguishing. Again if we go in the direction of distinguishing at some point in the future, I certainly think it makes sense to have some (this also ties into the main disagreement about the current state of Float64Dtype (before this PR), as I understand it: while Brock and others say it has inconsistent behaviour, I say that this inconsistency is by design: once the data is in an array of the dtype, it distinguishes NaN/NA, but for coercing to the dtype it treats NaN as missing for backwards compatibility. There are definitely a whole bunch of issues with that current state, but IMO that is in large part because of missing coverage in our APIs to deal with NaNs as not a NA, i.e. in the distinguishing world, you currently cannot check for NaNs, you cannot easily set NaNs (setitem, replace) without having them turn into NA, etc. Those issues are now "solved" for the default NaN-is-NA behaviour, but if you set the option to False, you still run into all those issues) Some additional points:
|
The checking for NaNs part is accurate but easily solved (I haven't made a 5-minute PR to give someone else the chance to. If no one does before long, I will). The rest is inaccurate:
I'd be fine with that. Big picture, 4ish meetings ago when we discussed this I asked you if you had an alternative path (to get us to consistent-behavior + nullable-by-default). You said you had to give it some thought. Do you have anything in mind now? Because it seems like you are calling for this to be reverted and to keep the inconsistent behavior indefinitely. (which I think will prevent us getting to nullable-by-default, but im willing to be convinced otherwise). |
Apologies, that is completely correct!
I don't remember the exact details of the context there, but in general I have always argued for all nullable dtypes by default (PDEP-16), using a single logical dtype system (potentially with multiple backends, where it makes sense and needs to be discussed, but at least with generally one "type" of behaviour), and then specifically on the NaN vs NA I have expressed my personal preference for distinguishing (but wanted to compromise on this aspect to get everyone on board with the nullable dtypes, but nowadays more people might generally be in favor of the distinguishing behaviour so that this is maybe not needed?). Many of those aspects have come up in recent discussions, but I still don't think there is a clear decision on whether we only want a single type system, or keep independent numpy like and nullable/arrow like type systems (and the change to the default behaviour in this PR doesn't actually get us closer to that, because now we have something in between that is nullable but does not follow the NaN/NA distinguishing we might want for the nullable/arrow type system). For a more concrete plan, personally I think we first need to generally focus on developing the nullable dtypes how we want them to be when enabling them by default in the future (there are still lots of missing gaps (eg categorical), and we need to decide how to handle the pyarrow backed dtypes, i.e. StringDtype vs ArrowDtype model), and then have a global flag like But for the specifics about how to ease the migration to those nullable dtypes, and then especially how NaN gets treated in this migration, I think the exact order of changes is important, which I explained a bit in the last part of #61618 (comment). That means I want the |
Expanding on this a bit more, it is true one could also go from numpy to never distinguishing NaN and NA to always distinguishing, which is closer to what this PR does.
The second option gives initially a smaller change when enabling the nullable dtypes, at the cost of having to do another breaking change for moving to distinguishing NaN and NA (if this is a change we want to do for the default behaviour). Personally I think the first option is better, because I expect it will give a better experience overall (and because I think we will have to keep the "treat NaN in input as missing for back compat" for a very long time, potentially even forever at least as an option in constructors, and then that would also delay the "distinguish" behaviour a long time in option 2 if we only want to introduce it after deprecating NaN as missing in input) |
Not sure if the following idea makes things too complex, but what if the
Then Joris can get his desired behavior with the "convert" option. |
My understanding is that the core team has general alignment on distinguishing NaN from NA.
This sounds like too much to me. Users can use a context manager if they want to apply one behavior to some sections of code and not others. I do not think we need to introduce arguments to functions/methods only to have to deprecate them later. If we're not planning to deprecate them after the transition, I can be on board with arguments, but in such a case (this is perhaps a nit pick) it should be a global underride and not override - that is, it only takes effect if the user has not specified the argument to the function/method.
I would guess that making this a requirement would result in more years of no progress. In particular, I discourage use of nullable dtypes at work because of the inconsistent NA treatment. Building off of @Dr-Irv's proposal, I would suggest having |
Indeed, I don't think we would ever deprecate such an argument. To make it a bit more concrete, suppose we add (and regarding override vs underride, I think we are talking about the same thing but just using a different term: the local keyword has priority over the global option, and thus overrides the global option. But that local keyword has also still some default, and if it is not explicitly specified by the user, the global option overrides the default of the local keyword)
I think that additional option would indeed be useful. |
My take on this is that it makes the migration easier. For now, we don't distinguish. Then, we change the default in the future to distinguish. |
If that's what it takes to move forward here, I'll make my peace with it. But I'll be grumpy about it because
I don't like adding a keyword to isna etc bc it means adding the keyword to everything that calls isna anywhere in the call stack. And Just Be Consistent is a much simpler alternative.
That's my thought. Never-distinguish is the only variant in which transitioning users to nullable-by-default is viable. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.As discussed on the last dev call, this implements
"mode.nan_is_na"
(defaultTrue
) to consider NaN as either always-equivalent or never-equivalent to NA.This sits on top of
DataFrame.where
Still need to