
Conversation

Tolker-KU
Contributor

@Tolker-KU Tolker-KU commented Aug 20, 2024

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Add native version of is_hashable(...) for some marginal performance improvements.

@Tolker-KU Tolker-KU requested a review from WillAyd as a code owner August 20, 2024 18:43
@Tolker-KU Tolker-KU marked this pull request as draft August 20, 2024 18:50
# fail this test.

# Reconsider this decision once this numpy bug is fixed:
# https://github.com/numpy/numpy/issues/5562
Member

looks like this issue has been closed. maybe this can be reconsidered?

Contributor Author

@Tolker-KU Tolker-KU Aug 21, 2024

I've benchmarked a version using isinstance(obj, collections.abc.Hashable). It is considerably slower (~2x) than the current Python implementation.
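Beyond the speed difference, the ABC-based check only inspects whether the object's type defines __hash__, so it can disagree with hash() on containers of unhashable elements. A minimal illustration (not code from the PR):

```python
import collections.abc

# isinstance(..., Hashable) only checks that type(obj) defines __hash__,
# so a tuple containing an unhashable list still passes the check:
obj = ([],)
print(isinstance(obj, collections.abc.Hashable))  # True

# ...even though actually hashing it raises:
try:
    hash(obj)
except TypeError as exc:
    print(f"hash() failed: {exc}")
```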

@jbrockmendel
Member

Can you provide some measurements on the perf impact?

@Tolker-KU Tolker-KU force-pushed the feature/fast-is_hashable branch from 215fe25 to 40f6965 Compare August 21, 2024 19:14
@Tolker-KU
Contributor Author

Tolker-KU commented Aug 21, 2024

> Can you provide some measurements on the perf impact?

I've benchmarked the current pure Python implementation against two different Cython versions, with the following results. Overall, the proposed Cython version is twice as fast as the pure Python version on the happy path and 10x faster on the exceptional path. The proposed Cython version is also ~33% faster than the simple Cython version on the happy path and 10x faster on the exceptional path.

def is_hashable(obj: object) -> bool:
    try:
        hash(obj)
    except TypeError:
        return False
    else:
        return True


cpdef bint c_is_hashable(object obj) noexcept:
    if PyLong_CheckExact(obj) or PyFloat_CheckExact(obj) or PyUnicode_CheckExact(obj):
        return True

    if PyTuple_CheckExact(obj) or PyFrozenSet_CheckExact(obj):
        for o in obj:
            if not c_is_hashable(o):
                return False
        return True

    if PyAnySet_CheckExact(obj) or PyDict_CheckExact(obj) or PyList_CheckExact(obj):
        return False

    try:
        hash(obj)
    except TypeError:
        return False
    else:
        return True


cpdef bint c_is_hashable2(object obj) noexcept:
    try:
        hash(obj)
    except TypeError:
        return False
    else:
        return True


for obj in (2, "hello world", ("hello", "world", 1, 2.0), ([],), np.int64(5), []):
    print(repr(obj))
    %timeit is_hashable(obj)
    %timeit c_is_hashable(obj)
    %timeit c_is_hashable2(obj)
    print("")
 
>>> 2
>>> 58.8 ns ± 0.344 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 33.3 ns ± 0.333 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 47.9 ns ± 0.284 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

>>> 'hello world'
>>> 75.4 ns ± 0.703 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 33.7 ns ± 0.561 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 42.1 ns ± 0.508 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

>>> ('hello', 'world', 1, 2.0)
>>> 101 ns ± 0.421 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 46.8 ns ± 0.493 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 71.9 ns ± 0.244 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

>>> ([],)
>>> 392 ns ± 2.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
>>> 39.4 ns ± 0.87 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 464 ns ± 1.95 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

>>> np.int64(5)
>>> 66 ns ± 1.82 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 47.8 ns ± 1.08 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 46.6 ns ± 0.377 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

>>> []
>>> 396 ns ± 6.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
>>> 36 ns ± 0.578 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> 477 ns ± 4.88 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
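For readers without a Cython toolchain, the short-circuit logic of c_is_hashable can be sketched in pure Python using exact type checks. This is an illustration mirroring the *_CheckExact branches, not the PR's code:

```python
def is_hashable_sketch(obj: object) -> bool:
    """Pure-Python sketch of the Cython fast paths above."""
    t = type(obj)  # exact type, mirroring the *_CheckExact calls
    if t in (int, float, str):
        return True
    if t in (tuple, frozenset):
        # hashable iff every element is hashable
        return all(is_hashable_sketch(o) for o in obj)
    if t in (set, dict, list):
        return False
    # fall back to actually hashing, as in the original implementation
    try:
        hash(obj)
    except TypeError:
        return False
    return True
```

Note that bool is a subclass of int, so booleans fail the exact-type test and take the generic hash() fallback, just as they would with PyLong_CheckExact in the Cython version.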

@Tolker-KU Tolker-KU force-pushed the feature/fast-is_hashable branch 3 times, most recently from 9c80d17 to befea90 Compare August 21, 2024 19:35
@Tolker-KU Tolker-KU marked this pull request as ready for review August 22, 2024 04:01
@Tolker-KU Tolker-KU requested a review from mroeschke as a code owner August 22, 2024 04:01
@Tolker-KU Tolker-KU force-pushed the feature/fast-is_hashable branch from 59fca35 to f9a3950 Compare August 22, 2024 04:09
return util.is_float_object(obj)


cpdef bint is_hashable(object obj):
Member

Does moving this to Cython make a difference for performance? I assume most of the gain is just from short-circuiting a hash attempt through the isinstance checks?

Contributor Author

I've done some benchmarking, but not addressing exactly this concern. However, the checks for tuple and frozenset will probably be slow in pure Python, even using all()

#59565 (comment)

@rhshadrach rhshadrach added Performance Memory or execution speed performance hashing hash_pandas_object labels Aug 25, 2024
@Tolker-KU Tolker-KU force-pushed the feature/fast-is_hashable branch from f9a3950 to 5f79b59 Compare August 28, 2024 19:39
is_long = PyLong_CheckExact(obj)
is_float = PyFloat_CheckExact(obj)
is_unicode = PyUnicode_CheckExact(obj)
is_tuple = PyTuple_CheckExact(obj)
Member

Is there any particular reason to use CheckExact over Check for the container types? I think the latter is preferable, given subclasses of these aren't that uncommon

Contributor Author

The motivation for this is pandas' own FrozenList, which is hashable and a subclass of list. I agree that PyXXXX_Check(...) would be more intuitive, but due to corner cases like FrozenList I think we need the exact checks to maintain the correct behaviour
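The corner case can be demonstrated with a hypothetical stand-in for FrozenList (importing pandas internals is beside the point here):

```python
# Hypothetical stand-in for pandas' FrozenList: a hashable list subclass.
class HashableList(list):
    def __hash__(self):
        return hash(tuple(self))

obj = HashableList([1, 2, 3])

# PyList_Check corresponds to an isinstance check and would wrongly
# classify this object as unhashable:
print(isinstance(obj, list))          # True

# PyList_CheckExact corresponds to an exact type check and correctly
# falls through to the hash() attempt, which succeeds:
print(type(obj) is list)              # False
print(hash(obj) == hash((1, 2, 3)))   # True
```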


# tuple and frozenset are hashable if and only if all elements are hashable
if is_tuple:
    for o in <tuple>obj:
Member

Does the cast actually matter? I again would think Cython would just always defer to PySequence_GetItem for iteration.

If that's true, I suppose there could be some difference between PySequence_GetItem and a more exact PyTuple_GETITEM call that could be dispatched to, with the latter not having to deal with reference counting. However, the point of writing in Cython is to happily trade some performance compared to the raw C code for some higher level abstractions.

Where that line really matters is up for debate, but I generally worry about trying to be too exact in this function. I'm also a little unclear why the non-hashable items can't short-circuit to quickly throw an error during a PyObject_Hash call. I would think the Python built-in tuple and frozen set should be able to do this faster than us?

Contributor Author

Thanks!

I've looked into the generated C code for this version with the casts and for a version without the casts. Assuming that the C compiler is able to hoist some conditions out of the generated loops, the generated assembly for the two versions should be exactly the same. I went too far down the path of micro-optimisations.

As to whether the built-in tuple and frozenset can do this faster than us, I've looked into the implementations in CPython. tuple essentially calls the hash function on all the elements in a loop and does some bit-twiddling to combine the hashes into one hash. The C code of that loop is very similar to that generated by Cython, but we save calls to PyObject_Hash(...) on the elements, which at the micro level should give a nice speed-up. For frozenset I've discovered that it is always hashable, which makes perfect sense, as elements must be hashable to be put in a set. This discovery allows me to make some simplifications to the code.
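The frozenset observation can be checked directly from Python (a small illustration of the invariant, not code from the PR):

```python
# frozenset enforces element hashability at construction time, so any
# frozenset that exists is itself hashable:
try:
    frozenset([[]])  # unhashable element is rejected up front
except TypeError as exc:
    print(f"construction failed: {exc}")

# hashing an existing frozenset therefore always succeeds, and is
# order-independent:
print(hash(frozenset({1, 2})) == hash(frozenset({2, 1})))  # True
```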

Generated C-code

With casts: [screenshots of the generated C loop omitted]

Without casts: [screenshot of the generated C loop omitted]

CPython implementation

tuple

https://github.com/python/cpython/blob/0cba289870d5cd41f24b2f63b9480e4593aa2330/Objects/tupleobject.c#L318-L342

Note PyTuple_GET_ITEM(...) is essentially just indexing into the underlying array
https://github.com/python/cpython/blob/0cba289870d5cd41f24b2f63b9480e4593aa2330/Include/cpython/tupleobject.h#L27

@WillAyd
Member

WillAyd commented Aug 31, 2024

> I've benchmarked the current pure Python implementation against two different Cython versions with the following results.

Is there anything user-facing that would be improved by this? I see the performance improvements in the benchmarks you posted, but being down in the range of nanoseconds I'm unsure if it's worth the added complexity

@Tolker-KU
Contributor Author

> I've benchmarked the current pure Python implementation against two different Cython versions with the following results.
>
> Is there anything user-facing that would be improved by this? I see the performance improvements in the benchmarks you posted, but being down in the range of nanoseconds I'm unsure if it's worth the added complexity

Fair point! See the user-facing benchmark below; I cannot judge whether it is worth it or not.

>>> idx = pd.MultiIndex([[] for _ in range(20)], [[] for _ in range(20)])
>>> new_names = [f"new_name_{i}" for i in range(20)]
>>> %timeit idx.set_names(new_names, inplace=True)
8.68 μs ± 1.71 μs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  <-- before
7.79 μs ± 3.86 μs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  <-- after

Contributor

github-actions bot commented Oct 2, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 2, 2024
@mroeschke
Member

Thanks for the PR, but agreed that I'm not sure the maintenance burden of this new implementation is worth the minimal speedup. Thanks for the effort here, but closing.

@mroeschke mroeschke closed this Oct 29, 2024