Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Given a data frame and a multi index object, it would be nice to select the entries of the data frame matching the multi index object, even if the multi index is a subset of the index of the data frame. Consider this example:
import pandas

df = pandas.DataFrame(
    dict(
        a = [1,2,3,4,5,6,7,8,9],
        b = [9,8,7,6,5,4,3,2,1],
        c = [4,2,5,4,6,5,7,6,8],
        d = [1,4,2,5,3,6,4,7,5],
    )
).set_index(['a','b','c'])
select_this = pandas.MultiIndex.from_tuples([(1, 9), (2, 8), (3, 7), (4, 6), (9, 1)], names=['a', 'b'])
selected = df.loc[select_this]  # desired behavior; currently fails because df.index has the extra level 'c'
The expected output is obvious and unambiguous to me: all the names of select_this are present in df.index.names, so there is no ambiguity in what is desired. Though this functionality is easy to achieve in just a couple of lines, it would be very natural, from my point of view, to have it included in the .loc method. The current implementation of pandas allows this only when the names in the multi index object match those in the data frame exactly. If the data frame has extra index levels, the selection fails, even in cases like the example above where it is pretty obvious what the user wants to do.
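For reference, one possible version of the "couple of lines" workaround mentioned above is a boolean mask built from the shared levels. This is only a sketch, not a proposal for how .loc should be implemented; the names extra_levels and mask are made up for illustration.

```python
import pandas

df = pandas.DataFrame(
    dict(
        a = [1,2,3,4,5,6,7,8,9],
        b = [9,8,7,6,5,4,3,2,1],
        c = [4,2,5,4,6,5,7,6,8],
        d = [1,4,2,5,3,6,4,7,5],
    )
).set_index(['a','b','c'])
select_this = pandas.MultiIndex.from_tuples(
    [(1, 9), (2, 8), (3, 7), (4, 6), (9, 1)], names=['a', 'b']
)

# Index levels of df that select_this does not name (here: ['c']).
extra_levels = [name for name in df.index.names if name not in select_this.names]

# Drop those levels so only ('a', 'b') remain, then test each row's
# (a, b) pair for membership in select_this.
mask = df.index.droplevel(extra_levels).isin(select_this)

# Boolean selection keeps the full original index ('a', 'b', 'c').
selected = df[mask]
print(selected)
```

Note that this is not quite what .loc would do: the mask keeps the rows in the order of df rather than the order of select_this, and entries of select_this that are missing from df are silently ignored instead of raising an error. It also assumes the remaining levels of df.index appear in the same order as the names of select_this.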
Feature Description
The implementation of this is probably close to trivial for anyone who knows where .loc is defined. I tried to find it but couldn't.
Alternative Solutions
An example of a function that does exactly what I suggest:
def select_by_multiindex(df:pandas.DataFrame, idx:pandas.MultiIndex)->pandas.DataFrame:
    """Given a DataFrame and a MultiIndex object, selects the entries
    from the data frame matching the multi index. Example:

    DataFrame:
    ```
           d
    a b c
    1 9 4  1
    2 8 2  4
    3 7 5  2
    4 6 4  5
    5 5 6  3
    6 4 5  6
    7 3 7  4
    8 2 6  7
    9 1 8  5
    ```
    MultiIndex:
    ```
    MultiIndex([(1, 9),
                (2, 8),
                (3, 7),
                (4, 6),
                (9, 1)],
               names=['a', 'b'])
    ```
    Output:
    ```
           d
    a b c
    1 9 4  1
    2 8 2  4
    3 7 5  2
    4 6 4  5
    9 1 8  5
    ```
    """
    # Check the types first, so the name check below never runs on the wrong kind of object.
    if not isinstance(df, pandas.DataFrame) or not isinstance(idx, pandas.MultiIndex):
        raise TypeError('`df` or `idx` are of the wrong type.')
    if not set(idx.names) <= set(df.index.names):
        raise ValueError('Names in `idx` not present in `df.index`')
    df_original_index_names = df.index.names
    return (
        df.reset_index(drop=False)
        .set_index(idx.names)  # index only by the levels named in `idx`
        .loc[idx]  # select in the order given by `idx`
        .reset_index(drop=False)
        .set_index(df_original_index_names)  # restore the original index levels
    )
Additional Context
No response