Description
I ran into the following surprising behavior:
We can set a MultiIndex with a list of column names:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
# Works fine
df = df.set_index(['A', 'B'])
However, attempting to set a MultiIndex with a tuple of column names raises a KeyError:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
# Raises KeyError: "None of [('A', 'B')] are in the columns"
df = df.set_index(('A', 'B'))
This is technically consistent with the documentation of the keys parameter (my emphasis):
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays.
Nonetheless, this is quite surprising behavior in Python.
(This occurs in the latest release as well as on the main branch.)
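In the meantime, a simple workaround is to convert the tuple to a list before calling set_index. A minimal sketch, assuming the same example frame as above (the variable names keys and indexed are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Column names that happen to arrive as a tuple, e.g. from another function
keys = ('A', 'B')

# Converting to a list makes set_index treat each element as a separate key
indexed = df.set_index(list(keys))

print(list(indexed.index.names))
print(list(indexed.columns))
```

This sidesteps the problem, but it relies on the caller knowing that a tuple is treated as a single key rather than as a sequence of keys.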
The culprit appears to be this isinstance check at the beginning of the set_index method:
if not isinstance(keys, list):
    keys = [keys]
My suggestion would be to use the pandas is_list_like helper function instead:
if not is_list_like(keys):
    keys = [keys]
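To illustrate how the two checks differ, here is a rough sketch using the public pandas.api.types.is_list_like. The normalize_keys helper is hypothetical and is not part of pandas. It also converts the result to a list, which the two-line change above does not do.

```python
import numpy as np
from pandas.api.types import is_list_like

def normalize_keys(keys):
    """Hypothetical helper: wrap scalar keys, pass list-likes through as a list."""
    if not is_list_like(keys):
        keys = [keys]
    return list(keys)

# Lists and tuples are both list-like, so both yield two separate keys
print(normalize_keys(['A', 'B']))
print(normalize_keys(('A', 'B')))

# Strings are not considered list-like, so a single column name is wrapped
print(normalize_keys('A'))

# Arrays are list-like too, so this check alone would not wrap a single array
print(is_list_like(np.array([1, 2])))
```

One caveat with this sketch: because arrays are also list-like, a real change would still need to treat a single array as one key, as the documentation describes. A tuple can also be a legitimate single column label (for example with MultiIndex columns), so that case would need a decision as well.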
To unambiguously demonstrate what I mean, I've put up a branch for this with tests: https://github.com/pandas-dev/pandas/compare/main...tadamcz:pandas:tadamcz/set-index-multiindex-from-tuple?expand=1.
If you think this would indeed be an improvement, I'd be happy to see this turned into a PR. But since I can't promise I'll have time to shepherd it through to the finish line, I thought I'd hold off on officially opening one for now.