ENH: Implement DataFrame.select #61527

Closed
wants to merge 10 commits into from
12 changes: 7 additions & 5 deletions pandas/core/frame.py
@@ -4535,12 +4535,12 @@ def _get_item(self, item: Hashable) -> Series:
     # ----------------------------------------------------------------------
     # Unsorted

-    def select(self, *args):
+    def select(self, *args: Hashable | list[Hashable]):
Contributor

So the problem with this typing is that it will accept select(["a", "b"], ["c", "d"]).

That's why I suggested the following:

@overload
def select(self, arg0: list[Hashable] | Hashable = ...) -> pd.DataFrame: ...
@overload
def select(self, *args: Hashable) -> pd.DataFrame: ...

def select(self, arg0: list[Hashable] | Hashable = [], *args: Hashable) -> pd.DataFrame: ...

Member Author

Your proposal uses a mutable default parameter, which is considered bad practice (for good reason), and I assume it will also break the CI, as ruff has a rule for it. I understand that the typing of the current implementation isn't perfect and could be stricter. I added a better error message for when someone uses select(["a", "b"], ["c", "d"]), but I think this is the best we can do.
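
For context, here is a minimal sketch of why a mutable default is usually flagged (ruff's B006 rule covers this pattern; whether it is enabled in this project's configuration is not shown here). The function names `select_bad` and `select_ok` are hypothetical, and the `append` call exists only to make the sharing visible. The problem only materializes if the body mutates the default.

```python
def select_bad(arg0=[], *args):
    # The default list is created once, when the function is defined,
    # and is shared by every call that omits arg0.
    arg0.append("extra")
    return arg0


def select_ok(arg0=None, *args):
    # A None sentinel lets each call build its own fresh list.
    columns = [] if arg0 is None else list(arg0)
    columns.append("extra")
    return columns


first = select_bad()
second = select_bad()
# Both names refer to the same list object, which has now grown twice.
```
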

Contributor

The other way of handling that is to make the last part of the sequence of declarations:

def select(self, arg0: list[Hashable] | Hashable | None = None, *args: Hashable) -> pd.DataFrame: ...

Then, in the code, if arg0 is None, either len(args) == 0 (in which case the result is an empty DataFrame) or you just use args.

Member Author

With this, the typing allows df.select(None, "col1", "col2"), which I don't see as an improvement on what you are trying to solve.

Contributor

> With this, the typing allows df.select(None, "col1", "col2"), which I don't see as an improvement on what you are trying to solve.

That's not correct IF you include the overloads. If overloads are included, then only the overloads can be matched.

@overload
def select(self, arg0: list[Hashable] | Hashable = ...) -> pd.DataFrame: ...
@overload
def select(self, *args: Hashable) -> pd.DataFrame: ...

def select(self, arg0: list[Hashable] | Hashable | None = None, *args: Hashable) -> pd.DataFrame: ...

The type checkers ONLY check the overloads, not the final declaration. So select(None, "col1", "col2") would be flagged by the type checker.
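
A runnable sketch of this overload pattern, written as a standalone function rather than a DataFrame method. The name `select_columns`, the list return type, and the error message are assumptions for illustration. At runtime only the final implementation executes, so nothing here enforces the overloads. How a given type checker resolves them (for example, how it treats `None` against `Hashable`) is not verified by this sketch.

```python
from __future__ import annotations

from collections.abc import Hashable
from typing import overload


@overload
def select_columns(arg0: list[Hashable] | Hashable = ...) -> list[Hashable]: ...
@overload
def select_columns(*args: Hashable) -> list[Hashable]: ...


def select_columns(
    arg0: list[Hashable] | Hashable | None = None, *args: Hashable
) -> list[Hashable]:
    # Only this implementation runs; the overloads exist for static checkers.
    if arg0 is None:
        # No leading argument: use whatever else was passed (possibly nothing).
        return list(args)
    if isinstance(arg0, list):
        if args:
            raise ValueError("pass either a single list or individual columns")
        return arg0
    return [arg0, *args]
```

Calling `select_columns()`, `select_columns("a", "b")`, and `select_columns(["a", "b"])` all go through the same body, while `select_columns(["a"], ["b"])` raises `ValueError`.
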

         """
         Select a subset of columns from the DataFrame.

         Select can be used to return a DataFrame with some specific columns.
-        This can be used to remove unwanted columns, as well as to return a
+        This can be used to select a subset of the columns, as well as to return a
         DataFrame with the columns sorted in a specific order.

         Parameters
@@ -4549,7 +4549,7 @@ def select(self, *args):
             The names of the columns to return. In general this will be strings,
             but pandas supports other types of column names, if they are hashable.
             If only one argument of type list is provided, the elements of the
-            list will be considered the named of the columns to be returned
+            list will be considered the names of the columns to be returned

         Returns
         -------
@@ -4641,7 +4641,7 @@ def select(self, *args):
         """
         if args and isinstance(args[0], list):
             if len(args) == 1:
-                args = args[0]
+                columns = args[0]
             else:
                 raise ValueError(
                     "`DataFrame.select` supports individual columns "
@@ -4650,8 +4650,10 @@ def select(self, *args):
                     "You can unpack the list if you have a mix: "
                     "`df.select(*['col1', 'col2'], 'col3')`."
                 )
+        else:
+            columns = list(args)

-        indexer = self.columns._get_indexer_strict(list(args), "columns")[1]
+        indexer = self.columns._get_indexer_strict(columns, "columns")[1]
         return self.take(indexer, axis=1)

     @overload
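
The argument handling in this hunk can be sketched without pandas. In the example below a plain dict of column name to values stands in for the DataFrame. `select_from_table` and its error messages are hypothetical, and raising `KeyError` for unknown names is an assumption about what a strict indexer does, not something shown in the diff.

```python
def select_from_table(table: dict, *args):
    # Accept either a single list of names or individual names.
    if args and isinstance(args[0], list):
        if len(args) == 1:
            columns = args[0]
        else:
            raise ValueError(
                "pass individual columns or a single list of columns, not both"
            )
    else:
        columns = list(args)

    # Assumed strict lookup: fail loudly on names that do not exist.
    missing = [name for name in columns if name not in table]
    if missing:
        raise KeyError(missing)

    # Dicts preserve insertion order, so the result follows the requested order.
    return {name: table[name] for name in columns}


table = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
reordered = select_from_table(table, "c", "a")
from_list = select_from_table(table, ["b"])
```

Binding the normalized names to a separate `columns` variable, as the diff does, keeps `args` untouched and makes both branches produce a list before the lookup.
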