@ajpotts ajpotts commented Jan 5, 2026

PR: Align ExtensionArray factorize and argsort with pandas expectations

Summary

Align Arkouda's ExtensionArray implementation with the pandas ExtensionArray contract by
returning NumPy arrays (not Arkouda arrays) for factorize codes and argsort indices, while
keeping all grouping and sorting computation server-side in Arkouda.

This improves pandas compatibility, simplifies downstream pandas internals
(e.g. groupby, take, iloc), and clarifies API semantics.


Key changes

ArkoudaExtensionArray.factorize

  • Return type updated to:
    Tuple[np.ndarray, ArkoudaExtensionArray]
  • codes are now returned as a NumPy array of dtype np.intp, as expected by pandas.
  • uniques are returned as an ExtensionArray of the same type as self.
  • Removed sort argument:
    • Factorization now consistently uses first-appearance order, matching pandas’ default behavior.
  • Clarified missing-value handling:
    • Only floating dtypes treat NaN as missing.
    • use_na_sentinel controls the code assigned to missing values: -1 when True, len(uniques) when False.
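
The factorize semantics listed above can be sketched client-side in plain NumPy. This is only an illustration of the described behavior (first-appearance order, np.intp codes, NaN handling for floating dtypes), not the code in this PR: the function name factorize_first_appearance is hypothetical, the uniques here are a NumPy array rather than an ArkoudaExtensionArray, and the real computation runs server-side in Arkouda.

```python
import numpy as np


def factorize_first_appearance(values, use_na_sentinel=True):
    """Hypothetical NumPy reference for the semantics described above."""
    values = np.asarray(values)
    # Only floating dtypes treat NaN as missing.
    if np.issubdtype(values.dtype, np.floating):
        missing = np.isnan(values)
    else:
        missing = np.zeros(values.shape, dtype=bool)

    present = values[~missing]
    # np.unique returns sorted uniques; first_idx lets us restore
    # first-appearance order.
    sorted_uniques, first_idx, inverse = np.unique(
        present, return_index=True, return_inverse=True
    )
    order = np.argsort(first_idx, kind="stable")
    uniques = sorted_uniques[order]

    # rank[j] is the first-appearance code of sorted_uniques[j].
    rank = np.empty(len(order), dtype=np.intp)
    rank[order] = np.arange(len(order), dtype=np.intp)

    codes = np.empty(values.shape, dtype=np.intp)
    codes[~missing] = rank[inverse]
    codes[missing] = -1 if use_na_sentinel else len(uniques)
    return codes, uniques


codes, uniques = factorize_first_appearance(np.array([2.0, np.nan, 1.0, 2.0]))
print(codes)    # [ 0 -1  1  0]
print(uniques)  # [2. 1.]
```

With use_na_sentinel=False the same input yields codes [0, 2, 1, 0], since the missing value maps to len(uniques) == 2.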

ArkoudaExtensionArray.argsort

  • Return type changed from Arkouda pdarray to NumPy ndarray[np.intp].
  • Sorting computation remains server-side; only the permutation indices are materialized client-side.
  • na_position is now accepted via **kwargs for pandas compatibility.
  • Updated docstring to reflect pandas ExtensionArray contract more precisely.
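
As a rough illustration of the argsort contract described above, the following NumPy-only sketch returns an np.intp permutation and reads na_position from **kwargs. The function name argsort_with_na_position is hypothetical, the "last" default is an assumption borrowed from pandas' own ExtensionArray.argsort signature, and the sketch sorts locally, whereas the PR keeps the sort server-side and only materializes the indices.

```python
import numpy as np


def argsort_with_na_position(values, **kwargs):
    """Hypothetical sketch: ascending argsort returning np.intp indices."""
    # Assumed default, mirroring pandas' ExtensionArray.argsort.
    na_position = kwargs.pop("na_position", "last")
    if na_position not in ("first", "last"):
        raise ValueError(f"invalid na_position: {na_position!r}")

    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    present_idx = np.flatnonzero(~missing)
    missing_idx = np.flatnonzero(missing)

    # Stable sort of the non-missing values, mapped back to original positions.
    sorted_present = present_idx[np.argsort(values[present_idx], kind="stable")]

    if na_position == "first":
        parts = [missing_idx, sorted_present]
    else:
        parts = [sorted_present, missing_idx]
    return np.concatenate(parts).astype(np.intp, copy=False)


perm = argsort_with_na_position([3.0, np.nan, 1.0, 2.0])
print(perm)  # [2 3 0 1]
```

Because the result is a plain NumPy index array, it can be passed directly to positional indexing paths such as take or iloc.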

Tests

  • Updated tests to assert against NumPy arrays instead of Arkouda arrays for:
    • factorize codes
    • argsort indices
  • Removed sort-parameter test cases and adjusted expectations to first-appearance semantics.
  • Added explicit NumPy assertions (numpy.testing.assert_equal) for clarity and correctness.
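
A minimal sketch of the kind of assertion these tests rely on is shown below. The helper name check_intp_array is hypothetical and not taken from the test suite. One detail worth keeping in mind: numpy.testing.assert_equal compares values but not dtypes, so the np.intp requirement needs its own check.

```python
import numpy as np
from numpy.testing import assert_equal


def check_intp_array(result, expected):
    """Hypothetical helper: result must be a NumPy intp array with these values."""
    assert isinstance(result, np.ndarray)
    # assert_equal does not compare dtypes, so check the dtype explicitly.
    assert result.dtype == np.intp
    assert_equal(result, np.asarray(expected, dtype=np.intp))


# Stand-in for factorize codes returned at the API boundary.
check_intp_array(np.array([0, -1, 1, 0], dtype=np.intp), [0, -1, 1, 0])
```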

Motivation

Pandas internals expect:

  • factorize → NumPy integer codes
  • argsort → NumPy permutation indices

Returning Arkouda arrays in these paths caused unnecessary friction and divergence from the
pandas ExtensionArray contract. This PR preserves Arkouda’s distributed execution model
while presenting pandas-native results at the API boundary.

Closes #5228: remove type: ignore from factorize in extension module

@ajpotts ajpotts force-pushed the 5228_remove_type_ignore_from_factorize_in_extension_module branch from 74ee5f0 to 937d368 Compare January 5, 2026 21:30
@ajpotts ajpotts marked this pull request as ready for review January 6, 2026 00:35