
Commit 9027852

Authored by samukweku, ericmjl, Copilot, and samuel.oranyeli
equi joins improvement (#1567)
* Add Windows support to environments in pixi config
  Added win-64 platform dependencies to the 'default' and 'chemistry' environments in the pixi.lock file, enabling Windows compatibility. Updated pyproject.toml to reflect these changes.
* Refactor multiple condition not-equal join logic
  Refactored _multiple_conditional_join_ne to use _not_equal_indices directly and improved handling of empty index results. Removed unused parameters and streamlined output formatting for consistency.
* wip wip
* wip
* Refactor and optimize conditional join logic
  Major refactor of conditional join internals for improved performance and maintainability. Adds optimized index calculation for equi and non-equi joins, introduces binary search helpers, and removes legacy pandas merge code. Updates error handling, code style, and test coverage for new join logic.
* Reduce max_examples in hypothesis tests to 10
  Lowered the max_examples parameter from 600 to 10 in all hypothesis-based tests in test_conditional_join.py. This change speeds up test execution, likely for faster development cycles or to avoid long runtimes during CI.
* Update janitor_rs wheel URLs in lock file
  Replaces local janitor_rs wheel references with platform-specific URLs from PyPI in pixi.lock. This ensures that the correct prebuilt wheels are used for various environments and platforms.
* Refactor conditional_join imports and deprecate numba usage
  Moved several helper imports from utils to _helpers module for better organization. Updated type hints for clarity and consistency. Added deprecation notes and comments regarding numba support, indicating that numba-based implementations are no longer maintained or supported.
* Remove deprecation warnings for df_columns and right_columns
  Deprecated warnings for the df_columns and right_columns arguments have been removed from the conditional_join function's docstring. This streamlines the documentation and removes redundant warning messages.
* Deprecate numba support and fix select_columns usage
  Added a deprecation warning for numba support in _conditional_join_preliminary_checks. Replaced deprecated 'select' method with 'select_columns' in _create_frame to ensure compatibility with updated DataFrame API.
* Update pyproject.toml
  Co-authored-by: Copilot <[email protected]>
* Update janitor/functions/_conditional_join/_get_indices_equi.py
  Co-authored-by: Copilot <[email protected]>
* Update janitor/functions/_conditional_join/_get_indices_equi.py
  Co-authored-by: Copilot <[email protected]>
* updates
* fixes
* updates based on copilot feedback
* remove unused variables
* keep return_matching_indices simple
* update pyproject
* update pyproject
* fix doctests
* fix doctests
* fix doctests
* fix tests
* fix tests
* fix tests
* test fixes
* fix tests

---------

Co-authored-by: Eric Ma <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: samuel.oranyeli <[email protected]>
1 parent 25db133 commit 9027852

37 files changed, +17366 -4770 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ repos:
       - id: end-of-file-fixer
       - id: trailing-whitespace
       - id: check-added-large-files
-        args: ['--maxkb=1024'] # Limits files to 1MB
+        args: ['--maxkb=3000'] # Limits files to ~ 3MB
   - repo: https://github.com/kynan/nbstripout
     rev: 0.6.1
     hooks:
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
+
+
+import numpy as np
+import janitor_rs
+
+
+
+
+def _binary_search_lt(
+    left: np.ndarray,
+    right: np.ndarray,
+    starts: np.ndarray,
+    ends: np.ndarray,
+) -> tuple:
+    """
+    Get starts for < joins
+    """
+    mapping = {
+        "int64": janitor_rs.binary_search_lt_int64,
+        "int32": janitor_rs.binary_search_lt_int32,
+        "int16": janitor_rs.binary_search_lt_int16,
+        "int8": janitor_rs.binary_search_lt_int8,
+        "uint64": janitor_rs.binary_search_lt_uint64,
+        "uint32": janitor_rs.binary_search_lt_uint32,
+        "uint16": janitor_rs.binary_search_lt_uint16,
+        "uint8": janitor_rs.binary_search_lt_uint8,
+        "float64": janitor_rs.binary_search_lt_float64,
+        "float32": janitor_rs.binary_search_lt_float32,
+    }
+    dtype_name = left.dtype.name
+    try:
+        func = mapping[dtype_name]
+    except KeyError:
+        raise KeyError(f"Unsupported data type -> {dtype_name}")
+    return func(left, right, starts, ends)
+
+
+def _binary_search_le(
+    left: np.ndarray,
+    right: np.ndarray,
+    starts: np.ndarray,
+    ends: np.ndarray,
+) -> tuple:
+    """
+    Get starts for <= joins
+    """
+    mapping = {
+        "int64": janitor_rs.binary_search_le_int64,
+        "int32": janitor_rs.binary_search_le_int32,
+        "int16": janitor_rs.binary_search_le_int16,
+        "int8": janitor_rs.binary_search_le_int8,
+        "uint64": janitor_rs.binary_search_le_uint64,
+        "uint32": janitor_rs.binary_search_le_uint32,
+        "uint16": janitor_rs.binary_search_le_uint16,
+        "uint8": janitor_rs.binary_search_le_uint8,
+        "float64": janitor_rs.binary_search_le_float64,
+        "float32": janitor_rs.binary_search_le_float32,
+    }
+    dtype_name = left.dtype.name
+    try:
+        func = mapping[dtype_name]
+    except KeyError:
+        raise KeyError(f"Unsupported data type -> {dtype_name}")
+    return func(left, right, starts, ends)
+
+
+def _binary_search_gt(
+    left: np.ndarray,
+    right: np.ndarray,
+    starts: np.ndarray,
+    ends: np.ndarray,
+) -> tuple:
+    """
+    Get ends for > joins
+    """
+    mapping = {
+        "int64": janitor_rs.binary_search_gt_int64,
+        "int32": janitor_rs.binary_search_gt_int32,
+        "int16": janitor_rs.binary_search_gt_int16,
+        "int8": janitor_rs.binary_search_gt_int8,
+        "uint64": janitor_rs.binary_search_gt_uint64,
+        "uint32": janitor_rs.binary_search_gt_uint32,
+        "uint16": janitor_rs.binary_search_gt_uint16,
+        "uint8": janitor_rs.binary_search_gt_uint8,
+        "float64": janitor_rs.binary_search_gt_float64,
+        "float32": janitor_rs.binary_search_gt_float32,
+    }
+    dtype_name = left.dtype.name
+    try:
+        func = mapping[dtype_name]
+    except KeyError:
+        raise KeyError(f"Unsupported data type -> {dtype_name}")
+    return func(left, right, starts, ends)
+
+
+def _binary_search_ge(
+    left: np.ndarray,
+    right: np.ndarray,
+    starts: np.ndarray,
+    ends: np.ndarray,
+) -> tuple:
+    """
+    Get ends for >= joins
+    """
+    mapping = {
+        "int64": janitor_rs.binary_search_ge_int64,
+        "int32": janitor_rs.binary_search_ge_int32,
+        "int16": janitor_rs.binary_search_ge_int16,
+        "int8": janitor_rs.binary_search_ge_int8,
+        "uint64": janitor_rs.binary_search_ge_uint64,
+        "uint32": janitor_rs.binary_search_ge_uint32,
+        "uint16": janitor_rs.binary_search_ge_uint16,
+        "uint8": janitor_rs.binary_search_ge_uint8,
+        "float64": janitor_rs.binary_search_ge_float64,
+        "float32": janitor_rs.binary_search_ge_float32,
+    }
+    dtype_name = left.dtype.name
+    try:
+        func = mapping[dtype_name]
+    except KeyError:
+        raise KeyError(f"Unsupported data type -> {dtype_name}")
+    return func(left, right, starts, ends)
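
Review note: the four helpers in this file differ only in the operator part of the kernel name, so the lookup could be driven by one table. The sketch below is a rough illustration of that idea and is not code from this commit. It does not import janitor_rs. Instead, `fake_rs`, `_make_kernel`, and `binary_search` are hypothetical names, and the kernels are pure-Python stand-ins built on the standard library's `bisect`. The sketch assumes that `right` is sorted ascending, that `starts`/`ends` bound the search window for each value in `left`, and that a kernel returns one position per left value. The real Rust kernels may behave differently and return a different shape.

```python
from bisect import bisect_left, bisect_right
from types import SimpleNamespace

# dtype names taken from the mapping tables in the diff
DTYPES = (
    "int64", "int32", "int16", "int8",
    "uint64", "uint32", "uint16", "uint8",
    "float64", "float32",
)

# Which bisect variant finds the boundary for each operator, assuming
# `right` is sorted ascending (an assumption of this sketch):
#   left <  right -> matches start at bisect_right
#   left <= right -> matches start at bisect_left
#   left >  right -> matches end at bisect_left
#   left >= right -> matches end at bisect_right
OPS = {
    "lt": bisect_right,
    "le": bisect_left,
    "gt": bisect_left,
    "ge": bisect_right,
}


def _make_kernel(bisect_fn):
    """Build a pure-Python stand-in for one compiled kernel (hypothetical)."""

    def kernel(left, right, starts, ends):
        # Search each left value only inside its own [start, end) window.
        return [
            bisect_fn(right, value, lo, hi)
            for value, lo, hi in zip(left, starts, ends)
        ]

    return kernel


# Stand-in for the compiled module: one attribute per (operator, dtype) pair,
# named the same way as the functions referenced in the diff.
fake_rs = SimpleNamespace()
for _op, _fn in OPS.items():
    for _dtype in DTYPES:
        setattr(fake_rs, f"binary_search_{_op}_{_dtype}", _make_kernel(_fn))


def binary_search(op, dtype_name, left, right, starts, ends, module=fake_rs):
    """Resolve the kernel by name and call it, for any of the four operators."""
    if op not in OPS:
        raise ValueError(f"Unsupported operator -> {op}")
    func = getattr(module, f"binary_search_{op}_{dtype_name}", None)
    if func is None:
        # Same error type and message as the helpers in the diff.
        raise KeyError(f"Unsupported data type -> {dtype_name}")
    return func(left, right, starts, ends)


if __name__ == "__main__":
    right = [1, 3, 3, 5]
    left = [3, 0]
    starts, ends = [0, 0], [4, 4]
    for op in OPS:
        print(op, binary_search(op, "int64", left, right, starts, ends))
```

Resolving the kernel with `getattr` keeps the supported dtypes in one place, at the cost of a less greppable call site than the explicit dictionaries in the diff. With real NumPy inputs the dtype name would come from `left.dtype.name`, as in the committed helpers.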
