-
Notifications
You must be signed in to change notification settings - Fork 180
equi joins improvement #1567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
equi joins improvement #1567
Conversation
Added win-64 platform dependencies to the 'default' and 'chemistry' environments in the pixi.lock file, enabling Windows compatibility. Updated pyproject.toml to reflect these changes.
Refactored _multiple_conditional_join_ne to use _not_equal_indices directly and improved handling of empty index results. Removed unused parameters and streamlined output formatting for consistency.
Major refactor of conditional join internals for improved performance and maintainability. Adds optimized index calculation for equi and non-equi joins, introduces binary search helpers, and removes legacy pandas merge code. Updates error handling, code style, and test coverage for new join logic.
Lowered the max_examples parameter from 600 to 10 in all hypothesis-based tests in test_conditional_join.py. This change speeds up test execution, likely for faster development cycles or to avoid long runtimes during CI.
Replaces local janitor_rs wheel references with platform-specific URLs from PyPI in pixi.lock. This ensures that the correct prebuilt wheels are used for various environments and platforms.
Moved several helper imports from utils to _helpers module for better organization. Updated type hints for clarity and consistency. Added deprecation notes and comments regarding numba support, indicating that numba-based implementations are no longer maintained or supported.
|
Deprecated warnings for the df_columns and right_columns arguments have been removed from the conditional_join function's docstring. This streamlines the documentation and removes redundant warning messages.
Added a deprecation warning for numba support in _conditional_join_preliminary_checks. Replaced deprecated 'select' method with 'select_columns' in _create_frame to ensure compatibility with updated DataFrame API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request improves the performance of equi joins in the conditional_join function by moving slow loop parts to Rust (via the janitor-rs library) and deprecating numba support. The PR demonstrates significant performance improvements, with non-numba execution time dropping from 4.51s to ~93ms in the provided benchmark.
Key Changes:
- Moved equi join logic to Rust implementations via new
_get_indices_equi.pymodule - Added binary search helper functions that call janitor-rs Rust functions
- Deprecated numba support with deprecation warnings
- Updated janitor-rs dependency from 0.3.x to 0.4.x
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Updated janitor-rs dependency to 0.4.x, added development dependencies (incorrectly placed), added win-64 platform support |
| janitor/functions/conditional_join.py | Added deprecation warning for numba, refactored code to use new Rust-based implementations, modernized type hints (Union → |), updated method calls from select to select_columns |
| janitor/functions/_conditional_join/_get_indices_equi.py | New file containing Rust-accelerated equi join logic extracted from conditional_join.py |
| janitor/functions/_conditional_join/_helpers.py | Added binary search and comparison helper functions that dispatch to Rust implementations based on dtype |
| janitor/functions/_conditional_join/_get_indices_non_equi.py | Fixed column assignment order for range joins (swapped r1_col and r2_col) |
| janitor/functions/_conditional_join/_not_equal_indices.py | Whitespace-only formatting change |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@samukweku I got copilot to do a first pass check, given that the change set is pretty large. |
|
@samukweku give me a bit more time to look through the changes in the PR, I'll get back to you asap! |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
PR Description
Please describe the changes proposed in the pull request:
This PR relates #1497 .
performance example:
dev:
this PR:
Please tag maintainers to review.