equi joins improvement #1567

samukweku · 2026-01-07T16:14:49Z

PR Description

Please describe the changes proposed in the pull request:

Move the slow loop parts of the equi join to Rust.
Remove the performance dependency on numba.

This PR relates to #1497.

performance example:

dev:

In [3]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=False)
   ...: )
   ...:
   ...:

4.51 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [3]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=True)
   ...: )
   ...:
   ...:

81.5 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

this PR:

In [1]: import pandas as pd; import janitor as jn; import numpy as np

In [2]: url = "https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/flights.csv"
   ...: flights = pd.read_csv(url, sep = '\t', names=['orig','dest','orig_time', 'dest_time'], parse_dates = ['orig_time', 'dest_time'])
   ...: flights = flights.factorize_columns(['orig','dest']).iloc[:, 2:]
   ...: flights.columns = [ent.split("_")[0] if ent.endswith('enc') else ent for ent in flights]
   ...: flights.columns = ['takeoff','landing','orig','dest']
   ...: flights = flights.assign(start=flights.landing+pd.Timedelta(minutes=45), end=flights.landing+pd.Timedelta(hours=3))
   ...: flights.head()
Out[2]: 
              takeoff  ...                 end
0 2021-11-27 07:15:00  ... 2021-11-27 11:55:00
1 2021-11-27 20:05:00  ... 2021-11-28 03:50:00
2 2021-11-27 21:00:00  ... 2021-11-28 00:35:00
3 2021-11-27 21:15:00  ... 2021-11-28 01:25:00
4 2021-11-26 11:40:00  ... 2021-11-26 17:45:00

[5 rows x 6 columns]

In [3]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],        
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=False)
   ...: )
   ...:
   ...:
93.2 ms ± 5.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],        
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=True)
   ...: )
   ...:
   ...:


73.2 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %%timeit
   ...: out = (flights
   ...:  .conditional_join(
   ...:      flights,
   ...:      ('end', 'takeoff' ,'>='),
   ...:     ('start', 'takeoff', '<='),
   ...:      ('orig','orig','!='),
   ...:      ('dest', 'orig', '=='),
   ...:      df_columns = ['start', 'end', 'dest'],        
   ...:      right_columns = ['takeoff', 'orig'],
   ...:      use_numba=False)
   ...: )
   ...:
   ...:
81.8 ms ± 3.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]:

Please tag maintainers to review.

@ericmjl

Major refactor of conditional join internals for improved performance and maintainability. Adds optimized index calculation for equi and non-equi joins, introduces binary search helpers, and removes legacy pandas merge code. Updates error handling, code style, and test coverage for new join logic.
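As a rough illustration of the binary-search idea mentioned above, here is a minimal pure-Python sketch of computing equi join index pairs by sorting one side and looking up each value with `bisect`. The function name `equi_join_indices` and its list-based inputs are assumptions made for this example; this is not the code in the PR, which delegates the hot loops to Rust.

```python
from bisect import bisect_left, bisect_right


def equi_join_indices(left, right):
    """Return (left_idx, right_idx) positions where left[i] == right[j].

    Hypothetical helper for illustration only; not part of pyjanitor.
    """
    # Sort the right-hand positions by value once, so each lookup is O(log n).
    order = sorted(range(len(right)), key=right.__getitem__)
    sorted_right = [right[j] for j in order]

    left_idx, right_idx = [], []
    for i, value in enumerate(left):
        # All matches for `value` occupy the half-open slice [lo, hi).
        lo = bisect_left(sorted_right, value)
        hi = bisect_right(sorted_right, value)
        for pos in range(lo, hi):
            left_idx.append(i)
            # Map the sorted position back to the original row position.
            right_idx.append(order[pos])
    return left_idx, right_idx


left_idx, right_idx = equi_join_indices([3, 1, 2], [2, 3, 3, 5])
print(list(zip(left_idx, right_idx)))  # [(0, 1), (0, 2), (2, 0)]
```

The per-row Python loop here is exactly the kind of work that is slow in the interpreter, which is presumably why moving it to compiled code pays off.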

Moved several helper imports from utils to _helpers module for better organization. Updated type hints for clarity and consistency. Added deprecation notes and comments regarding numba support, indicating that numba-based implementations are no longer maintained or supported.

github-actions · 2026-01-07T16:50:01Z

PR Preview Action v1.8.0
🚀 View preview at https://pyjanitor-devs.github.io/pyjanitor/pr-preview/pr-1567/
Built to branch `gh-pages` at 2026-01-13 12:22 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Copilot

Pull request overview

This pull request improves the performance of equi joins in the conditional_join function by moving slow loop parts to Rust (via the janitor-rs library) and deprecating numba support. The PR demonstrates significant performance improvements, with non-numba execution time dropping from 4.51s to ~93ms in the provided benchmark.

Key Changes:

Moved equi join logic to Rust implementations via new _get_indices_equi.py module
Added binary search helper functions that call janitor-rs Rust functions
Deprecated numba support with deprecation warnings
Updated janitor-rs dependency from 0.3.x to 0.4.x

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 14 comments.


| File | Description |
| --- | --- |
| pyproject.toml | Updated janitor-rs dependency to 0.4.x, added development dependencies (incorrectly placed), added win-64 platform support |
| janitor/functions/conditional_join.py | Added deprecation warning for numba, refactored code to use new Rust-based implementations, modernized type hints (Union → \|), updated method calls from select to select_columns |
| janitor/functions/_conditional_join/_get_indices_equi.py | New file containing Rust-accelerated equi join logic extracted from conditional_join.py |
| janitor/functions/_conditional_join/_helpers.py | Added binary search and comparison helper functions that dispatch to Rust implementations based on dtype |
| janitor/functions/_conditional_join/_get_indices_non_equi.py | Fixed column assignment order for range joins (swapped r1_col and r2_col) |
| janitor/functions/_conditional_join/_not_equal_indices.py | Whitespace-only formatting change |
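
To make the "dispatch based on dtype" description of `_helpers.py` more concrete, here is a small sketch of what such a dispatch layer could look like. Everything in it is an assumption for illustration: the names `_search_int64`, `_search_float64`, `_DISPATCH`, and `binary_search` are hypothetical, and the typed functions are plain Python stand-ins for where compiled routines would be called. The actual janitor-rs function names are not shown in this thread.

```python
from bisect import bisect_left
from typing import Callable, Dict, Sequence


def _search_int64(haystack: Sequence[int], needle: int) -> int:
    # Stand-in for a compiled int64 routine (hypothetical).
    return bisect_left(haystack, needle)


def _search_float64(haystack: Sequence[float], needle: float) -> int:
    # Stand-in for a compiled float64 routine (hypothetical).
    return bisect_left(haystack, needle)


# One entry per supported dtype name; unsupported dtypes fail loudly.
_DISPATCH: Dict[str, Callable[[Sequence, object], int]] = {
    "int64": _search_int64,
    "float64": _search_float64,
}


def binary_search(haystack: Sequence, needle, dtype: str) -> int:
    """Pick the typed implementation for `dtype` and run it."""
    try:
        func = _DISPATCH[dtype]
    except KeyError:
        raise TypeError(f"unsupported dtype for binary search: {dtype!r}") from None
    return func(haystack, needle)


print(binary_search([1, 3, 5], 3, "int64"))            # 1
print(binary_search([1.0, 2.5, 4.0], 3.0, "float64"))  # 2
```

A lookup table keeps the dtype handling in one place, so adding a new typed routine is a one-line change.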


ericmjl · 2026-01-10T11:48:15Z

@samukweku I got Copilot to do a first-pass check, given that the change set is pretty large.

ericmjl · 2026-01-10T12:02:30Z

@samukweku give me a bit more time to look through the changes in the PR, I'll get back to you asap!

samukweku added 8 commits December 29, 2025 03:35

Add Windows support to environments in pixi config

144ff14

Added win-64 platform dependencies to the 'default' and 'chemistry' environments in the pixi.lock file, enabling Windows compatibility. Updated pyproject.toml to reflect these changes.

Refactor multiple condition not-equal join logic

5e2c2a2

Refactored _multiple_conditional_join_ne to use _not_equal_indices directly and improved handling of empty index results. Removed unused parameters and streamlined output formatting for consistency.

wip

d54d7e4

wip

wip

42b17a7

Reduce max_examples in hypothesis tests to 10

0abf92c

Lowered the max_examples parameter from 600 to 10 in all hypothesis-based tests in test_conditional_join.py. This change speeds up test execution, likely for faster development cycles or to avoid long runtimes during CI.

Update janitor_rs wheel URLs in lock file

2fa1cec

Replaces local janitor_rs wheel references with platform-specific URLs from PyPI in pixi.lock. This ensures that the correct prebuilt wheels are used for various environments and platforms.

samukweku requested review from ericmjl and thatlittleboy January 7, 2026 16:14

samukweku self-assigned this Jan 7, 2026

Merge remote-tracking branch 'origin/dev' into sammywemmy-equi-join-u…

a59d3b2

…pgrade

samukweku added 2 commits January 8, 2026 07:30

Remove deprecation warnings for df_columns and right_columns

c2edfef

Deprecation warnings for the df_columns and right_columns arguments have been removed from the conditional_join function's docstring. This streamlines the documentation and removes redundant warning messages.

Deprecate numba support and fix select_columns usage

1dc7dcf

Added a deprecation warning for numba support in _conditional_join_preliminary_checks. Replaced deprecated 'select' method with 'select_columns' in _create_frame to ensure compatibility with updated DataFrame API.
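
For readers unfamiliar with the pattern, a deprecation warning of this kind can be sketched with the standard `warnings` module. The function name `preliminary_checks`, the message text, and the choice of `DeprecationWarning` as the category are assumptions for this example; the thread does not show the exact warning emitted by `_conditional_join_preliminary_checks`.

```python
import warnings


def preliminary_checks(use_numba: bool = False) -> None:
    """Hypothetical stand-in for a pre-join validation step."""
    if use_numba:
        warnings.warn(
            "numba support is deprecated and no longer maintained; "
            "pass use_numba=False instead.",  # illustrative message only
            DeprecationWarning,
            # Point the warning at the caller rather than at this helper.
            stacklevel=2,
        )


# Record warnings so the behaviour can be inspected directly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    preliminary_checks(use_numba=True)

print(caught[0].category.__name__)  # DeprecationWarning
```

Note that Python hides `DeprecationWarning` by default outside of `__main__` and test runners, so a library that wants end users to see the message may prefer a different category.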

ericmjl requested a review from Copilot January 10, 2026 00:24

Copilot started reviewing on behalf of ericmjl January 10, 2026 00:24

Copilot AI reviewed Jan 10, 2026

Update pyproject.toml

4734ed6

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

samukweku and others added 7 commits January 12, 2026 12:04

Update janitor/functions/_conditional_join/_get_indices_equi.py

552e668

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update janitor/functions/_conditional_join/_get_indices_equi.py

6ae9156

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

updates

4ea2ef6

fixes

6addf4a

updates based on copilot feedback

2206372

remove unused variables

935ec16

keep return_matching_indices simple

44664fa
