Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion pandas/_version.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
import re
import subprocess
import sys
from typing import Any


def get_keywords():
Expand Down Expand Up @@ -640,7 +641,7 @@ def render(pieces, style):
}


def get_versions():
def get_versions() -> dict[str, Any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_versions() -> dict[str, Any]:
def get_versions() -> dict[str, str | None]:

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is invariant, so it's generally speaking not a good idea to use unions as type arguments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t understand this point. Why would being more specific not be better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to be more specific, but you should keep in mind that dict[str, str] is not assignable to dict[str, str | None] due to its invariance. Depending on how this will be used, that could lead to unexpected typing errors down the line.

If you want to be specific here, then the best you can do is use a TypedDict, or a union thereof. If this function is frequently used, then the effort might be worth it.

Another option would be to return a Mapping[str, str | None] instead. The Mapping value type parameter is covariant, so both dict[str, str] and dict[str, str | None] can be assigned to it (since dict <: Mapping).

The "Any trick" might also work, i.e. dict[str, str | Any]. Personally I'm not a fan of this "Any trick"` and don't think it should be promoted like that, but in this specific case it's not all that bad.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all for the input.

A few observations

  1. the goal of this PR is to remove some #type: ignore from the pandas codebase. This annotation is on vendored code. So firstly, I wasn't particularly concerned about using Any here, but that's just my opinion and maybe because we effectively maintain this code we should treat it equally with the pandas code. Secondly, because it is vendored i would have probably have just created a .pyi file. However, i noticed that others have added type hints to the file directly. adding a .pyi would shadow these. However, i could have added the existing to the .pyi as well so also an option.

  2. i've not contributed greatly to the type annotations of late so my understanding of policies may be outdated. But IIRC we always used to be precise in the return types, wrt to type and therefore would not have used Mapping for return types in the past. Happy to change if this is now acceptable and used elsewhere. If the reason for the original comment here is the reluctance to use Any then I would also be happy using just dict without any type arguments. Still better than untyped function.

  3. Strictness flags and other mypy command line options, or even grep, easily allows us to audit the use of Any. In the past i've not be adverse to use Any. To me this is a lesser offence than a #type: ignore. Hence adding a Any to remove a #type: ignore is IMO a win?

If you want to be specific here, then the best you can do is use a TypedDict, or a union thereof. If this function is frequently used, then the effort might be worth it.

I guess this is likely to be the most acceptable to all?

"""Get version information or return default if unable to do so."""
# I am in _version.py, which lives at ROOT/VERSIONFILE_SOURCE. If we have
# __file__, we can work backwards from there to the root. Some
Expand Down
13 changes: 13 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
TYPE_CHECKING,
Literal,
cast,
overload,
)
import warnings

Expand Down Expand Up @@ -314,6 +315,18 @@ def _check_object_for_strings(values: np.ndarray) -> str:
# --------------- #


@overload
def unique(values: np.ndarray) -> np.ndarray: ...
@overload
def unique(values: Index) -> Index: ...
@overload
def unique(values: Series) -> np.ndarray: ...
@overload
def unique(values: Categorical) -> Categorical: ...
@overload
def unique(values: ExtensionArray) -> ExtensionArray: ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Series -> ndarray and ndarray -> ndarray overloads can be combined, e.g.:

Suggested change
@overload
def unique(values: np.ndarray) -> np.ndarray: ...
@overload
def unique(values: Index) -> Index: ...
@overload
def unique(values: Series) -> np.ndarray: ...
@overload
def unique(values: Categorical) -> Categorical: ...
@overload
def unique(values: ExtensionArray) -> ExtensionArray: ...
@overload
def unique(values: Index) -> Index: ...
@overload
def unique(values: Categorical) -> Categorical: ...
@overload
def unique(values: ExtensionArray) -> ExtensionArray: ...
@overload
def unique(values: np.ndarray | Series) -> np.ndarray: ...

Additionally, you could merge the three other overloads by using a TypeVar(..., bound=Index | Categorical | ExtensionArray)



def unique(values):
"""
Return unique values based on a hash table.
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1100,7 +1100,7 @@ def unique(self):
values = self._values
if not isinstance(values, np.ndarray):
# i.e. ExtensionArray
result = values.unique()
result: np.ndarray | ExtensionArray = values.unique()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you run mypy with --local-partial-types --allow-redefinition-new this wouldn't be needed anymore.

https://mypy.readthedocs.io/en/latest/command_line.html#cmdoption-mypy-allow-redefinition-new

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started to look into this, and our global settings were set a long while ago now so a review is perhaps long overdue. Also bearing in mind that these flags will become the defaults in the future then also worth being ahead of the game. My initial impression is that i'm +1 on this. After a quick review it looks like in most cases people have being adding ignores which is IMO worse than adding the union variable annotation up front. But if mypy can now infer these without explicitly defining them then i assume that'll be a plus. 200-300 mypy errors at the moment. Adding the explicit type annotations for the variables defined outside the local scope will probably take a while but i've started on that and will do this as a precursor PR. Will probably also want to remove the variable annotations used like here where mypy can now infer the types. The only issue here maybe that mypy has flagged this as experimental and others may be against. Just to reiterate this looks like a good idea. Thanks for the suggestion.

else:
result = algorithms.unique1d(values)
return result
Expand Down
4 changes: 2 additions & 2 deletions pandas/core/groupby/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ def recode_for_groupby(c: Categorical, sort: bool, observed: bool) -> Categorica
# In cases with c.ordered, this is equivalent to
# return c.remove_unused_categories(), c

take_codes = unique1d(c.codes[c.codes != -1]) # type: ignore[no-untyped-call]
take_codes = unique1d(c.codes[c.codes != -1])

if sort:
take_codes = np.sort(take_codes)
Expand All @@ -68,7 +68,7 @@ def recode_for_groupby(c: Categorical, sort: bool, observed: bool) -> Categorica

# GH:46909: Re-ordering codes faster than using (set|add|reorder)_categories
# GH 38140: exclude nan from indexer for categories
unique_notnan_codes = unique1d(c.codes[c.codes != -1]) # type: ignore[no-untyped-call]
unique_notnan_codes = unique1d(c.codes[c.codes != -1])
if sort:
unique_notnan_codes = np.sort(unique_notnan_codes)
if (num_cat := len(c.categories)) > len(unique_notnan_codes):
Expand Down
4 changes: 3 additions & 1 deletion pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -3301,7 +3301,9 @@ def _intersection(self, other: Index, sort: bool = False):
if is_numeric_dtype(self.dtype):
# This is faster, because Index.unique() checks for uniqueness
# before calculating the unique values.
res = algos.unique1d(res_indexer)
res: Index | ExtensionArray | np.ndarray = algos.unique1d(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so for now have added ignore[assignment] instead, since this PR was specifically targeting ignore[no-untyped-call]. These should either be unnecessary with a precursor PR or can merge before if consider these the lesser of two evils.

res_indexer
)
else:
result = self.take(indexer)
res = result.drop_duplicates()
Expand Down
2 changes: 1 addition & 1 deletion pandas/util/_print_versions.py
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ def _get_commit_hash() -> str | None:
except ImportError:
from pandas._version import get_versions

versions = get_versions() # type: ignore[no-untyped-call]
versions = get_versions()
return versions["full-revisionid"]


Expand Down
Loading