ENH: Allow selection with multi index subset matching names #55279

@SengerM

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Given a DataFrame and a MultiIndex object, it would be nice to select the rows of the DataFrame that match the MultiIndex, even when the MultiIndex covers only a subset of the levels in the DataFrame's index. Consider this example:

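import pandas

df = pandas.DataFrame(
	dict(
		a = [1,2,3,4,5,6,7,8,9],
		b = [9,8,7,6,5,4,3,2,1],
		c = [4,2,5,4,6,5,7,6,8],
		d = [1,4,2,5,3,6,4,7,5],
	)
).set_index(['a','b','c'])
# `select_this` names only two of the three index levels of `df`.
select_this = pandas.MultiIndex.from_tuples([(1, 9), (2, 8), (3, 7), (4, 6), (9, 1)], names=['a', 'b'])

# Desired behaviour; this currently fails because `df.index` has the extra level `c`.
selected = df.loc[select_this]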

The expected output seems unambiguous to me: every name in select_this is present in df.index.names, so it is clear which levels to match on. Although this is easy to achieve in a couple of lines, it would feel very natural to have it built into the .loc method. Currently, pandas allows this only when the names in the MultiIndex exactly match those of the DataFrame's index; if the DataFrame has extra levels, the selection fails, even in cases like the example above where the intent is clear.
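To make the current behaviour concrete, here is a minimal sketch of the case that already works: once the extra level `c` is dropped so that the index names match `select_this.names` exactly, `.loc` accepts the MultiIndex directly. The names `df_ab` and `selected_ab` are introduced here only for illustration.

```python
import pandas

df = pandas.DataFrame(
    dict(
        a=[1, 2, 3, 4, 5, 6, 7, 8, 9],
        b=[9, 8, 7, 6, 5, 4, 3, 2, 1],
        c=[4, 2, 5, 4, 6, 5, 7, 6, 8],
        d=[1, 4, 2, 5, 3, 6, 4, 7, 5],
    )
).set_index(['a', 'b', 'c'])
select_this = pandas.MultiIndex.from_tuples(
    [(1, 9), (2, 8), (3, 7), (4, 6), (9, 1)], names=['a', 'b']
)

# Drop the extra level so that the index names are exactly ['a', 'b'].
df_ab = df.droplevel('c')

# With matching levels, .loc selects the requested rows.
selected_ab = df_ab.loc[select_this]
print(selected_ab)
```

The drawback of this workaround is that level `c` is lost from the result, which is what the proposed behaviour would avoid.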

Feature Description

The implementation of this is probably close to trivial for anyone who knows where .loc is defined. I tried to find it but couldn't.

Alternative Solutions

An example of a function that does exactly what I suggest:

def select_by_multiindex(df:pandas.DataFrame, idx:pandas.MultiIndex)->pandas.DataFrame:
	"""Given a DataFrame and a MultiIndex object, selects the entries
	from the data frame matching the multi index. Example:
	DataFrame:
	```
	       d
	a b c   
	1 9 4  1
	2 8 2  4
	3 7 5  2
	4 6 4  5
	5 5 6  3
	6 4 5  6
	7 3 7  4
	8 2 6  7
	9 1 8  5
	```
	MultiIndex:
	```
	MultiIndex([(1, 9),
	            (2, 8),
	            (3, 7),
	            (4, 6),
	            (9, 1)],
	           names=['a', 'b'])
	```
	Output:
	```
	       d
	a b c   
	1 9 4  1
	2 8 2  4
	3 7 5  2
	4 6 4  5
	9 1 8  5

	```
	"""
	# Check types first, so that the `.names` lookups below are safe.
	if not isinstance(df, pandas.DataFrame) or not isinstance(idx, pandas.MultiIndex):
		raise TypeError('`df` must be a DataFrame and `idx` a MultiIndex.')
	if not set(idx.names) <= set(df.index.names):
		raise ValueError('Names in `idx` not present in `df.index`')
	df_original_index_names = list(df.index.names)
	# Index by the levels named in `idx`, select, then restore the original index.
	selected = df.reset_index().set_index(list(idx.names)).loc[idx]
	return selected.reset_index().set_index(df_original_index_names)
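
For comparison, a different way to get the same rows is to build a boolean mask instead of re-indexing. This is only a sketch, not what the function above does: the name `select_by_multiindex_mask` is hypothetical, and it assumes `idx` has at least two levels. Unlike `.loc`, the mask keeps the rows in the DataFrame's original order and silently skips entries of `idx` that are not present.

```python
import pandas


def select_by_multiindex_mask(df: pandas.DataFrame, idx: pandas.MultiIndex) -> pandas.DataFrame:
    """Hypothetical variant: keep rows whose values on the levels named in `idx` occur in `idx`."""
    if not set(idx.names) <= set(df.index.names):
        raise ValueError('Names in `idx` not present in `df.index`')
    extra_levels = [name for name in df.index.names if name not in idx.names]
    # Project the DataFrame's index onto the levels of `idx`, in the same order.
    projected = df.index.droplevel(extra_levels).reorder_levels(list(idx.names))
    # Boolean mask: True where the projected tuple occurs in `idx`.
    return df[projected.isin(idx)]


df = pandas.DataFrame(
    dict(
        a=[1, 2, 3, 4, 5, 6, 7, 8, 9],
        b=[9, 8, 7, 6, 5, 4, 3, 2, 1],
        c=[4, 2, 5, 4, 6, 5, 7, 6, 8],
        d=[1, 4, 2, 5, 3, 6, 4, 7, 5],
    )
).set_index(['a', 'b', 'c'])
select_this = pandas.MultiIndex.from_tuples(
    [(1, 9), (2, 8), (3, 7), (4, 6), (9, 1)], names=['a', 'b']
)

selected = select_by_multiindex_mask(df, select_this)
print(selected)
```

Whether missing entries should raise (as with `.loc`) or be ignored (as with the mask) is a design choice that the feature request leaves open.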

Additional Context

No response

Metadata

Assignees

No one assigned

    Labels

    Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)
