@samukweku commented Aug 4, 2025

PR Description

Please describe the changes proposed in the pull request:

  • Add Cythonised code, which improves performance significantly.
  • Take advantage of Cython 3.0, which supports writing .py files in pure Python mode while still getting the benefits of a lower-level language (in this case, C).
  • The Cython code consists mainly of loops.
  • This is a precursor to adding support for aggregations within conditional join.
  • This is a fairly large PR; the tests remain the same, and the changes are mostly refactoring plus the new Cython functions.
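
For readers unfamiliar with pure Python mode, a minimal sketch follows. It is not taken from this PR: the file name, the function name `count_pairs_le`, and the data are hypothetical, and it assumes Cython 3.0 or later is installed. The same .py file runs unchanged under the plain interpreter, where the `cython` decorators and annotations have no effect, and it can also be compiled so that the nested loop becomes C.

```python
# pure_mode_sketch.py (hypothetical file name, not part of this PR)
from array import array

import cython


@cython.boundscheck(False)  # skip index bounds checks once compiled
@cython.wraparound(False)  # disallow negative indexing once compiled
def count_pairs_le(left: cython.double[:], right: cython.double[:]) -> cython.Py_ssize_t:
    """Count (i, j) pairs with left[i] <= right[j] using a plain nested loop."""
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t
    total: cython.Py_ssize_t = 0
    for i in range(len(left)):
        for j in range(len(right)):
            if left[i] <= right[j]:
                total += 1
    return total


# array.array exposes the buffer protocol, so the call works compiled or not
left = array("d", [1.0, 5.0, 9.0])
right = array("d", [2.0, 6.0])
print(count_pairs_le(left, right), "compiled:", cython.compiled)
```

`cython.compiled` reports whether the module is running as compiled C or as ordinary Python, which is a quick way to confirm that a build step actually took effect.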

This PR resolves #1490.

Example (as always with benchmarks/tests, take these with a pinch of salt; YMMV):

import pandas as pd
import janitor  # pyjanitor: registers factorize_columns and conditional_join on DataFrame

url = "https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/flights.csv"
flights = pd.read_csv(url, sep='\t', names=['orig', 'dest', 'orig_time', 'dest_time'], parse_dates=['orig_time', 'dest_time'])
# encode the airport codes as integers, then drop the two original string columns
flights = flights.factorize_columns(['orig', 'dest']).iloc[:, 2:]
flights.columns = [ent.split("_")[0] if ent.endswith('enc') else ent for ent in flights]
flights.columns = ['takeoff', 'landing', 'orig', 'dest']
# connection window: opens 45 minutes after landing and closes 3 hours after landing
flights = flights.assign(start=flights.landing + pd.Timedelta(minutes=45), end=flights.landing + pd.Timedelta(hours=3))
flights.head()

               takeoff              landing  orig  dest                start                  end
0  2021-11-27 07:15:00  2021-11-27 08:55:00     0     0  2021-11-27 09:40:00  2021-11-27 11:55:00
1  2021-11-27 20:05:00  2021-11-28 00:50:00     1     1  2021-11-28 01:35:00  2021-11-28 03:50:00
2  2021-11-27 21:00:00  2021-11-27 21:35:00     2     2  2021-11-27 22:20:00  2021-11-28 00:35:00
3  2021-11-27 21:15:00  2021-11-27 22:25:00     0     2  2021-11-27 23:10:00  2021-11-28 01:25:00
4  2021-11-26 11:40:00  2021-11-26 14:45:00     3     3  2021-11-26 15:30:00  2021-11-26 17:45:00

# classic Pandas merge and filter
%%timeit 
outt = (flights
        .merge(flights, left_on='dest', right_on='orig')
        .loc[lambda f: f.takeoff_y.between(f.start_x, f.end_x) & (f.orig_x != f.orig_y), 
             ['start_x', 'takeoff_y', 'start_y', 'dest_x', 'orig_y']
             ]
        )
9.85 s ± 69.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
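
To spell out what this join selects, here is a naive reference loop over a handful of made-up rows. It is only a sketch of the four conditions (destination of the first flight equals origin of the second, different origins, and the second takeoff falling inside the first flight's start/end window); it is not how pandas or pyjanitor evaluate the join, and the rows and the function name `connecting_pairs` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical rows, not taken from flights.csv: (takeoff, landing, orig, dest)
flights_rows = [
    (datetime(2021, 11, 27, 7, 15), datetime(2021, 11, 27, 8, 55), 0, 1),
    (datetime(2021, 11, 27, 10, 0), datetime(2021, 11, 27, 12, 0), 1, 2),
    (datetime(2021, 11, 27, 9, 45), datetime(2021, 11, 27, 11, 0), 1, 0),
    (datetime(2021, 11, 27, 13, 0), datetime(2021, 11, 27, 15, 0), 1, 3),
]


def connecting_pairs(rows):
    """Return (left, right) row positions that satisfy all four join conditions."""
    pairs = []
    for i, (_, landing, orig_x, dest_x) in enumerate(rows):
        # same window as the start/end columns: 45 minutes to 3 hours after landing
        start = landing + timedelta(minutes=45)
        end = landing + timedelta(hours=3)
        for j, (takeoff_y, _, orig_y, _) in enumerate(rows):
            if dest_x == orig_y and orig_x != orig_y and start <= takeoff_y <= end:
                pairs.append((i, j))
    return pairs


print(connecting_pairs(flights_rows))  # [(0, 1), (0, 2)]
```

The loop compares every row with every other row, so its cost grows quadratically with the number of flights; it is useful as a correctness reference on small inputs, not as a benchmark candidate.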

# dev, with use_numba=False
%%timeit
outer = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        use_numba=False)
)
2.03 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



# this PR, with use_numba=False
%%timeit
outerr = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        use_numba=False)
)
29.9 ms ± 470 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)





# dev, with force=True
%%timeit
out = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        force=True,
        use_numba=False)
)
255 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR, with force=True
%%timeit
outt = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        force=True,
        use_numba=False)
)
83.7 ms ± 552 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
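
The speedups implied by the timings above can be worked out directly. The snippet below only does arithmetic on the reported means; the dictionary keys and the `speedup` helper are labels introduced here for illustration.

```python
# Mean timings in seconds, copied from the %%timeit output above
timings = {
    "pandas merge + filter": 9.85,
    "dev, use_numba=False": 2.03,
    "this PR, use_numba=False": 29.9e-3,
    "dev, force=True": 255e-3,
    "this PR, force=True": 83.7e-3,
}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return timings[baseline] / timings[candidate]


comparisons = [
    ("pandas merge + filter", "this PR, use_numba=False"),
    ("dev, use_numba=False", "this PR, use_numba=False"),
    ("dev, force=True", "this PR, force=True"),
]
for baseline, candidate in comparisons:
    print(f"{candidate} vs {baseline}: {speedup(baseline, candidate):.1f}x")
```

On these numbers, the PR is roughly 68 times faster than dev on the default path and roughly 3 times faster with force=True; the usual benchmark caveats apply.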

PR Checklist

Please ensure that you have done the following:

  1. Open the PR from a feature branch on your fork. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  2. If you're not on the contributors list, add yourself to AUTHORS.md.
  3. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests pass
  • Making sure that code coverage doesn't go down

Relevant Reviewers

Please tag maintainers to review.

@samukweku self-assigned this Aug 4, 2025
@samukweku linked an issue Aug 4, 2025 that may be closed by this pull request
@samukweku force-pushed the 1490-set-up-cython branch from 806b08f to 4152bf7 Compare August 9, 2025 13:34
@samukweku marked this pull request as draft August 17, 2025 13:14
@samukweku marked this pull request as ready for review August 27, 2025 12:51
@samukweku marked this pull request as draft August 27, 2025 13:26
@samukweku marked this pull request as ready for review August 30, 2025 12:37
Development

Successfully merging this pull request may close these issues.

set up cython
2 participants