Conversation

@samukweku
Collaborator

@samukweku samukweku commented Jan 7, 2026

PR Description

Please describe the changes proposed in the pull request:

  • Move the slow loop parts of the equi join to Rust
  • Remove the performance dependency on numba

This PR relates to #1497.
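
For readers unfamiliar with the join shape being optimized, here is a minimal pure-Python sketch of an equi join combined with a range condition. It is not the code in this PR: the function name `equi_then_range_join` and the toy rows are hypothetical, and it only illustrates the kind of per-row loop that is slow in Python and is a natural candidate for moving to compiled code.

```python
from collections import defaultdict


def equi_then_range_join(left, right):
    """Return (left_index, right_index) pairs for matching rows.

    left rows are (key, start, end); right rows are (key, value).
    A pair matches when the keys are equal and start <= value <= end.
    Hypothetical helper for illustration only, not pyjanitor's API.
    """
    # Equi step: bucket the right rows by key so only equal keys are compared.
    by_key = defaultdict(list)
    for j, (key, value) in enumerate(right):
        by_key[key].append((value, j))

    # Non-equi step: scan each bucket and keep rows inside the interval.
    pairs = []
    for i, (key, start, end) in enumerate(left):
        for value, j in by_key.get(key, ()):
            if start <= value <= end:
                pairs.append((i, j))
    return pairs


left = [("a", 1, 5), ("b", 2, 3), ("a", 6, 9)]
right = [("a", 4), ("a", 7), ("b", 10), ("c", 2)]
pairs = equi_then_range_join(left, right)
print(pairs)
```

The nested loop does work proportional to the bucket size for every left row, which is why the interpreted version is expensive on large frames.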

performance example:

dev:

In [3]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=False)
   ...: )
   ...:
   ...:

4.51 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [3]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=True)
   ...: )
   ...:
   ...:

81.5 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

this PR:

In [1]: import pandas as pd; import janitor as jn; import numpy as np

In [2]: url = "https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/flights.csv"
   ...: flights = pd.read_csv(url, sep = '\t', names=['orig','dest','orig_time', 'dest_time'], parse_dates = ['orig_time', 'dest_time'])
   ...: flights = flights.factorize_columns(['orig','dest']).iloc[:, 2:]
   ...: flights.columns = [ent.split("_")[0] if ent.endswith('enc') else ent for ent in flights]
   ...: flights.columns = ['takeoff','landing','orig','dest']
   ...: flights = flights.assign(start=flights.landing+pd.Timedelta(minutes=45), end=flights.landing+pd.Timedelta(hours=3))
   ...: flights.head()
Out[2]: 
              takeoff  ...                 end
0 2021-11-27 07:15:00  ... 2021-11-27 11:55:00
1 2021-11-27 20:05:00  ... 2021-11-28 03:50:00
2 2021-11-27 21:00:00  ... 2021-11-28 00:35:00
3 2021-11-27 21:15:00  ... 2021-11-28 01:25:00
4 2021-11-26 11:40:00  ... 2021-11-26 17:45:00

[5 rows x 6 columns]

In [3]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],        
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=False)
   ...: )
   ...:
   ...:
93.2 ms ± 5.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],        
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=True)
   ...: )
   ...:
   ...:


73.2 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],        
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=False)
   ...: )
   ...:
   ...:
81.8 ms ± 3.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]:

Please tag maintainers to review.

Added win-64 platform dependencies to the 'default' and 'chemistry' environments in the pixi.lock file, enabling Windows compatibility. Updated pyproject.toml to reflect these changes.
Refactored _multiple_conditional_join_ne to use _not_equal_indices directly and improved handling of empty index results. Removed unused parameters and streamlined output formatting for consistency.
wip
Major refactor of conditional join internals for improved performance and maintainability. Adds optimized index calculation for equi and non-equi joins, introduces binary search helpers, and removes legacy pandas merge code. Updates error handling, code style, and test coverage for new join logic.
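
As a rough illustration of what a binary search helper does in a non-equi join, the sketch below uses the standard-library `bisect` module. The function name `range_match_slices` and the sample data are assumptions made for this example; the PR's helpers call into janitor-rs and may be organized differently.

```python
from bisect import bisect_left, bisect_right


def range_match_slices(sorted_values, intervals):
    """For each (start, end), return the slice [lo, hi) of sorted_values
    whose elements satisfy start <= value <= end.

    Hypothetical helper for illustration; assumes sorted_values is ascending.
    """
    slices = []
    for start, end in intervals:
        lo = bisect_left(sorted_values, start)  # first position with value >= start
        hi = bisect_right(sorted_values, end)   # first position with value > end
        slices.append((lo, hi))
    return slices


values = [1, 3, 3, 7, 10]
slices = range_match_slices(values, [(3, 7), (4, 6), (0, 1)])
print(slices)
```

Each lookup costs O(log n) instead of a full scan, and an empty match shows up as a slice with `lo == hi`.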
Lowered the max_examples parameter from 600 to 10 in all hypothesis-based tests in test_conditional_join.py. This change speeds up test execution, likely for faster development cycles or to avoid long runtimes during CI.
Replaces local janitor_rs wheel references with platform-specific URLs from PyPI in pixi.lock. This ensures that the correct prebuilt wheels are used for various environments and platforms.
Moved several helper imports from utils to _helpers module for better organization. Updated type hints for clarity and consistency. Added deprecation notes and comments regarding numba support, indicating that numba-based implementations are no longer maintained or supported.
@samukweku samukweku self-assigned this Jan 7, 2026
@github-actions

github-actions bot commented Jan 7, 2026

PR Preview Action v1.8.0

🚀 View preview at
https://pyjanitor-devs.github.io/pyjanitor/pr-preview/pr-1567/

Built to branch gh-pages at 2026-01-13 12:22 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Deprecated warnings for the df_columns and right_columns arguments have been removed from the conditional_join function's docstring. This streamlines the documentation and removes redundant warning messages.
Added a deprecation warning for numba support in _conditional_join_preliminary_checks. Replaced deprecated 'select' method with 'select_columns' in _create_frame to ensure compatibility with updated DataFrame API.
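
A deprecation warning of this kind is usually emitted with the standard-library `warnings` module. The sketch below is an assumption about the general shape only: the function name `check_use_numba`, the message text, and the choice of `DeprecationWarning` are hypothetical and are not taken from `_conditional_join_preliminary_checks`.

```python
import warnings


def check_use_numba(use_numba: bool) -> bool:
    """Warn when the caller opts into the numba code path.

    Hypothetical stand-in for a preliminary check; the real message and
    warning category in the PR may differ.
    """
    if use_numba:
        warnings.warn(
            "numba support is deprecated and no longer maintained.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
    return use_numba


# Capture warnings so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_use_numba(True)
    check_use_numba(False)

print([str(w.message) for w in caught])
```

Only the `use_numba=True` call produces a warning, so callers that never set the flag see no change.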
Contributor

Copilot AI left a comment

Pull request overview

This pull request improves the performance of equi joins in the conditional_join function by moving slow loop parts to Rust (via the janitor-rs library) and deprecating numba support. The PR demonstrates significant performance improvements, with non-numba execution time dropping from 4.51s to ~93ms in the provided benchmark.

Key Changes:

  • Moved equi join logic to Rust implementations via new _get_indices_equi.py module
  • Added binary search helper functions that call janitor-rs Rust functions
  • Deprecated numba support with deprecation warnings
  • Updated janitor-rs dependency from 0.3.x to 0.4.x

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Updated janitor-rs dependency to 0.4.x, added development dependencies (incorrectly placed), added win-64 platform support |
| janitor/functions/conditional_join.py | Added deprecation warning for numba, refactored code to use new Rust-based implementations, modernized type hints (Union → \|), updated method calls from select to select_columns |
| janitor/functions/_conditional_join/_get_indices_equi.py | New file containing Rust-accelerated equi join logic extracted from conditional_join.py |
| janitor/functions/_conditional_join/_helpers.py | Added binary search and comparison helper functions that dispatch to Rust implementations based on dtype |
| janitor/functions/_conditional_join/_get_indices_non_equi.py | Fixed column assignment order for range joins (swapped r1_col and r2_col) |
| janitor/functions/_conditional_join/_not_equal_indices.py | Whitespace-only formatting change |
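
The "dispatch ... based on dtype" pattern mentioned for `_helpers.py` can be pictured as a lookup from a dtype name to a type-specific kernel. The sketch below is hypothetical throughout: the registry, the `register` decorator, and the `search` function are invented for illustration, and plain `bisect_left` stands in for the typed Rust functions.

```python
from bisect import bisect_left

# Hypothetical registry mapping a dtype name to a search kernel.
_KERNELS = {}


def register(dtype_name):
    """Decorator that records a kernel under the given dtype name."""
    def decorator(func):
        _KERNELS[dtype_name] = func
        return func
    return decorator


@register("int64")
def _search_int64(sorted_values, target):
    # Stand-in for a kernel specialised for 64-bit integers.
    return bisect_left(sorted_values, int(target))


@register("float64")
def _search_float64(sorted_values, target):
    # Stand-in for a kernel specialised for 64-bit floats.
    return bisect_left(sorted_values, float(target))


def search(sorted_values, target, dtype_name):
    """Pick the kernel for dtype_name and run it; fail loudly if none exists."""
    try:
        kernel = _KERNELS[dtype_name]
    except KeyError:
        raise TypeError(f"no search kernel registered for dtype {dtype_name!r}") from None
    return kernel(sorted_values, target)


int_pos = search([1, 3, 5, 7], 5, "int64")
float_pos = search([0.5, 1.5, 2.5], 2.0, "float64")
print(int_pos, float_pos)
```

Keeping the dtype check in one place means an unsupported dtype raises a clear error rather than silently taking a slow or incorrect path.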

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ericmjl
Member

ericmjl commented Jan 10, 2026

@samukweku I got Copilot to do a first-pass check, given that the change set is pretty large.

@ericmjl
Member

ericmjl commented Jan 10, 2026

@samukweku give me a bit more time to look through the changes in the PR, I'll get back to you asap!
