-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
API/ERR: allow iterators in df.set_index & improve errors #24984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/frame.py
Outdated
# make sure we have a container of keys/arrays we can iterate over | ||
# tuples can appear as valid column keys! | ||
if not isinstance(keys, list): | ||
keys = [keys] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If supporting custom types, we need to go back to the state pre-#22486 that just puts everything that's not listlike in list. Otherwise we can't guarantee that iteration below will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is best, let's just revert entirely to pre 22486
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pre-#22486 had some other problems we do not need to re-instate (e.g. weird KeyErrors if it's an iter).
I agree that the code complexity must stay maintainable, and so the most reasonable thing (IMO), ist to just not bother trying to "fix" unhashable custom types being used as keys. At that point, the user is so far gone off the reservation that it's really not our job to give them a good error message (again, even the repr of such a frame would crash).
pandas/core/frame.py
Outdated
# everything else gets tried as a key; see GH 24969 | ||
try: | ||
self[col] | ||
str(col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can str(col)
normally raise a KeyError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error might be different, that's true. On second thought, str
defaults to repr
if it's not implemented, so unless the class manages to break its own repr
(haha), this should not be necessary.
Codecov Report
@@ Coverage Diff @@
## master #24984 +/- ##
==========================================
- Coverage 92.38% 92.38% -0.01%
==========================================
Files 166 166
Lines 52400 52405 +5
==========================================
+ Hits 48409 48413 +4
- Misses 3991 3992 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24984 +/- ##
==========================================
- Coverage 91.73% 91.73% -0.01%
==========================================
Files 173 173
Lines 52856 52877 +21
==========================================
+ Hits 48490 48509 +19
- Misses 4366 4368 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
I'm not sure what should be done going forward, or whether we should have a deprecation now. @toobaz may have thoughts here too. I'm fine with leaving that for a comprehensive "what's allowed an Index" overhaul.
Reposting #24969 (comment) here in case the discussion of the deprecation of custom label types proceeds here instead of at the original ticket.
|
I'm boring, I always have the same thoughts, every time the discussion comes out :-) And they are:
|
Don't get me wrong: it is good to have this discussion once and for all if it results in cleaner docs/assumptions. But my opinion is that the docs should say "you can use as keys anything which is not mutable". |
"Not mutable" is a bit fuzzy in Python. If, instead, keys are required to be hashable (which usually implies some sort of recursive immutability), then users already familiar with Python dicts would immediately get it, you can test for key-eligibility with The documentation should then also have a sentence or two about the importance of immutability and compatibility between the object's In the Python 3.7.2 session below, >>> a = []
>>> b = (a,)
>>> hash(b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list' |
Yep, my mistake, "hashability" is the right property. |
I'm not a core dev, but FWIW, here are some counter points to @toobaz arguments:
If the code is not designed and tested for it, such a carte blanche is (not in this example, but across 1000s of small cuts) an immense burden down the road when something that wasn't explicitly forbidden suddenly breaks.
See lack of testing above. Duck typing really bites us in the ass here. Stability (a possible SemVer after 1.0) even more.
The previous behaviour was specifically causing some weird KeyErrors for stuff that should have never been tested as a key (hence I'm adding a try-except in this PR), see #22484. I could have easily maintained this capability, if it had been tested, or even documented.
That - in contrast to the above - would make it a controlled (non-breaking) expansion where necessary (IMO far preferable). I'm also thinking the ecosystem has matured enough by now to broadly have a common understanding what "scalar" means.
That line can be blurred arbitrarily far (see my example in the issue).
In #24702, it sounded like you're advocating against using tuples as keys (which I would agree with), which also IMO, the API should have a clearly defined surface, not an arbitrary "whatever happens to work". The latter would mean (and already does to some extent) a disproportionate amount of dev time to chase down weird corner cases, rather than focus on a consistent API that's useful and unambiguous for 99.99% of the cases. |
You test lines of code, not user behavior. Otherwise tests suites would have to be infinite. You write code that satisfies properties, not just that doesn't break your tests. Tests are a very useful addition, but you will never test any possible combination of parameters, even if you list one by one the possible values they can take. Otherwise, since duck typing is the norm in Python, you would be accusing a very large number of projects of not being able to write effective tests. Notice, by the way, that "custom types" is not a precise definition. People might feed our indices with types that they did not create, but other libraries did.
Feel free to expand, I'm not following you. Notice that
Not at all. I said multiple times in many other discussions that tuples (with hashable content) are perfectly valid keys. I actually used exactly the same arguments I'm using here, that's why I'm boring :-)
That's sure. I'm being very clear (although @wkschwartz beat me at it) on what the API should support. And it should work on what it supports.
Not following you. I'm replying there. |
Not if there's a well-defined set of input types. ;-)
The iterator that gets passed to
Ok, that's a more nuanced discussion then (vs. "everything that works is allowed"). If all hashable objects should be usable as keys, we'd have to drastically improve our testing for it, but it's not impossible to try to support it, certainly. I just think it's far too wide a line to draw. I mentioned in the issue that it would be more feasible to add |
Again, non sequitur. Testing is good, but we want to write code that works even in those cases in which it was not tested.
OK, let's cut it short: will you soon be asking us to limit the values of If you need to test any possible combination of allowed input types with every possible behavior of every method in the pandas API, tests will be just ''slightly short'' of infinite.
I see three points there, and not sure if any of them is a problem. Can you assume I'm even more stupid than I am? General comment: I still wasn't able to understand if this regression was a mistake or was made on purpose. In the second case, while I can only blame myself for not being able to keep up with the PRs, I still think such a change deserved some more discussion, e.g. in mailing list. |
To expand on @toobaz's #24984 (comment): in a duck-typed language, APIs should be based on protocols, interfaces, or capabilities, not on exact class hierarchies. For example, Python core does fine defining the API surface of Regarding whether that which is not forbidden is allowed (TWINFIA): Pandas' having a version < 1 means the maintainers have the right to change semantics. Please balance that right against the mass of established code bases in production that rely on Pandas. I couldn't be the only author to assume TWINFIA. Thank you, @h-vetinari and @toobaz, for taking an interest in this! |
I agree that properly written code, that only ever depends on hashability for keys, is possible. But many methods (not just
Let me clarify my language here: it's certainly possible to test methods against various container types that it should accept. What's in those containers is another story.
I know you're not (stupid), but ok, here's another try:
There's a couple things that are confusing here. How did an iterator get turned into an
There was no informed decision on this regression because it was neither tested nor documented (if the documented standard had been "all hashable objects can be key", it would have been easy to conform to that). The problems were much more mundane, i.e. having to inspect the items of the outer-most container and determine whether they are keys / The common ground here is (as we've seen above) to have a well-defined API. If what can be keys is explicitly documented somewhere, it can be kept in mind for whatever code that's being written, and hashability would be one of several valid choices for that. My main point above was that enforcing code to stay runnable that only works because of an oversight (and there's a bunch of that in pandas, no matter how much we strive for clean code) is way too generous. |
Claim: identifying lists (more in general, containers) requres gymnastics (and, incidentally, detailed docs), while identifying valid keys is extremely simple if their definition is "hashable object". And requires more complicated code, more complicated docs, and more waste of the user's memory, if the definition is more complicated than that. As in: ExtensionArray (and having even our own types rely on them) is already making (I think) the pandas codebase cleaner, and easier to maintain, precisely because we gave up setting the admissible values.
Thanks, I appreciate. You are perfectly right that pandas' definition of "list-like" is a mess, and that's a ground where we desperately need to set standards. Or maybe, we already all agree on standards (iterators are list-like, tuples are not), and there is just annoying legacy code that doesn't follow them (for sure I think that's not the case for keys, where things are... simple. EDIT: sure, list-likes and keys definitions are not so orthogonal as I am putting them. I'm clearly assuming that if something is a list-like, then it is not a key, by definition. So if the user has hashable list-likes, the user will need to know that they are not valid keys (as they are valid containers). But "outlawing" types which are not list-likes is orthogonal. |
Steering the discussion back to the PR: what needs to be done for 0.24.1? IMO, the changes and tests look good, just need a release note under "fixed regressions". Anything else? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just a minor comment, and the whatsnew note that Tom asked for.
pandas/core/frame.py
Outdated
else: | ||
# everything else gets tried as a key; see GH 24969 | ||
try: | ||
self[col] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think doing col in self.columns
is a bit more efficient? (doesn't need construct the series)
Although then you also need to care about the False case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried, but col in self.columns
works without the hashing, and even if check
hash(col) and col in self.columns
this does not catch the iter
case (an iterator is hashable, who would have thought...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the hashing needed? We just need to know whether it is a column name or not?
Or you want to raise a different error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it certainly checks hashing for some cases:
In [5]: {1} in pd.Index([1, 2, 3])
...
TypeError: unhashable type: 'set'
(just not an iter because as you said, it is hashable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback PTAL
doc/source/whatsnew/v0.25.0.rst
Outdated
|
||
- :meth:`Timestamp.replace` now supports the ``fold`` argument to disambiguate DST transition times (:issue:`25017`) | ||
- :meth:`DataFrame.at_time` and :meth:`Series.at_time` now support :meth:`datetime.time` objects with timezones (:issue:`24043`) | ||
- :meth:`DataFrame.set_index` now works for instances of ``abc.Iterator``, provided their output is of the same length as the calling frame (:issue:`22484`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use this issue number, what about the note for the closed issue in the PR itself? #24969
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this still close #24969 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the OP to clarify, and added the this PR as you asked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, PTAL
doc/source/whatsnew/v0.25.0.rst
Outdated
|
||
- :meth:`Timestamp.replace` now supports the ``fold`` argument to disambiguate DST transition times (:issue:`25017`) | ||
- :meth:`DataFrame.at_time` and :meth:`Series.at_time` now support :meth:`datetime.time` objects with timezones (:issue:`24043`) | ||
- :meth:`DataFrame.set_index` now works for instances of ``abc.Iterator``, provided their output is of the same length as the calling frame (:issue:`22484`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the OP to clarify, and added the this PR as you asked
One of the azure jobs failed with a
Could someone please restart it? @jreback @TomAugspurger @jorisvandenbossche @gfyoung |
Thanks to whoever restarted the job. Unfortunately, it seems to be a non-transient break (which is good, at least from the point of hunting My operating assumption is that a new boto release broke something, but unfortunately I can't see the version in the build script. The |
Added the |
Nevermind, this is passing again. Seems those |
Since it is passing again, can you then remove the conda list changes here? |
@jorisvandenbossche |
source activate pandas-dev | ||
set -v | ||
|
||
# Display pandas-dev environment (for debugging) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah pls revert all of this
@jreback @jorisvandenbossche |
- Indexing of ``DataFrame`` and ``Series`` now accepts zerodim ``np.ndarray`` (:issue:`24919`) | ||
- :meth:`Timestamp.replace` now supports the ``fold`` argument to disambiguate DST transition times (:issue:`25017`) | ||
- :meth:`DataFrame.at_time` and :meth:`Series.at_time` now support :meth:`datetime.time` objects with timezones (:issue:`24043`) | ||
- :meth:`DataFrame.set_index` now works for instances of ``abc.Iterator``, provided their output is of the same length as the calling frame (:issue:`22484`, :issue:`24984`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just a code reorg yes? with slightly better error messages in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback:
Minimal reorg, just better errors and enabling iterators (due to your review).
The custom classes were re-enabled by #25085 (which took over the tests from this PR), which closed the regression #24969, and has a corresponding whatsnew note. I guess the issue didn't get closed yet, because I only noted that #25085 was an alternative to this PR (at the time, for solving #24969), but didn't add the "closes #24969" explicitly - sorry.
@jreback |
closes Regression in DataFrame.set_index with class instance column keys #24969git diff upstream/master -u -- "*.py" | flake8 --diff
This is a quick fix for the regression - however, I think this should be immediately (i.e. 0.24.1) be deprecated. I haven't yet added a deprecation warning here, pending further discussion in the issue.
@jorisvandenbossche @TomAugspurger @jreback