PERF: replace _first_non_null with builtin "first" aggregation #292
bnaul wants to merge 4 commits into uscuni:main
Since pandas 2.2.1, GroupBy.first() skips NaN by default (skipna=True), making _first_non_null redundant. The builtin "first" string uses the optimized C path instead of the pure-Python aggregation path, reducing wall-clock time on Aleppo (78K edges) from ~508s to ~150s (3.4x).

The example from uscuni#261 continues to print 0:

    import osmnx as ox
    import neatnet

    G = ox.graph_from_bbox((-73.86, 40.73, -73.85, 40.74))
    Gedges = ox.convert.graph_to_gdfs(G, nodes=False)[['geometry']].reset_index(drop=True).to_crs('EPSG:3857')
    Gedges['attribute'] = Gedges.index.astype(int)
    neatified_edges = neatnet.neatify(Gedges)
    bug_changed = (neatified_edges._status == 'changed') & (neatified_edges.attribute.isna())
    print(bug_changed.sum())  # 0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1340f25 to ae34c35
Codecov Report

✅ All modified and coverable lines are covered by tests.

| | main | #292 | +/- |
|---|---|---|---|
| Coverage | 98.8% | 98.8% | -0.0% |
| Files | 7 | 7 | |
| Lines | 1282 | 1277 | -5 |
| Hits | 1267 | 1262 | -5 |
| Misses | 15 | 15 | |
martinfleis left a comment
Thanks, nice catch! Given this is the only place where _first_non_null is used, can you also remove the function?
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jGaboardi left a comment
@bnaul -- Thanks for the contribution! This is indeed an impressive speedup!
We need to determine whether the failing CI in dev is due to this change; here it was green several days ago.
It may simply be differing
I think we just need to relax those tests so they don't fail on a dtype difference while still checking that the values match. If they fail on values, then we have a problem.
Agreed.
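A minimal sketch of what "relax the dtype check but keep the value check" could look like with `pandas.testing.assert_series_equal`. The int-versus-float pair below is only a stand-in for the dtype mismatch discussed in this thread, and the variable names are illustrative, not taken from the neatnet test suite.

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Illustrative data: same values, different dtypes.
expected = pd.Series([1, 2, 3], dtype="int64")
obtained = pd.Series([1.0, 2.0, 3.0], dtype="float64")

# Strict comparison fails on the dtype alone.
try:
    assert_series_equal(obtained, expected)
    strict_passed = True
except AssertionError:
    strict_passed = False

# Relaxed comparison ignores the dtype but still compares values.
assert_series_equal(obtained, expected, check_dtype=False)

# A genuine value difference is still caught by the relaxed comparison.
wrong_values = pd.Series([1.0, 2.0, 4.0], dtype="float64")
try:
    assert_series_equal(wrong_values, expected, check_dtype=False)
    value_mismatch_caught = False
except AssertionError:
    value_mismatch_caught = True

print(strict_passed, value_mismatch_caught)
```

Whether `check_dtype=False` alone is enough for the string-versus-object case in the dev build is not something this sketch establishes.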
Shall we do that in a quick preliminary PR before merging here?
Following #293, the failures in But I can't reproduce locally for some reason. Probably an easy fix I am missing. @martinfleis -- thoughts?
It is just a matter of the series dtype. One is the pandas string dtype using NaN, the other is object using None. I am not sure what is causing it to appear in the dev build and not elsewhere, given that latest comes with pandas 3 as well.
I think this is simply because we are only running that specific block of tests if both on Ubuntu and in the
We are passing in
I don't think so. I'd cast obtained to string dtype after we get it. The mismatch is caused by Parquet IO, which does not preserve the object dtype neatnet returns.
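A rough sketch of the cast suggested here, on made-up data. It assumes the expected values use the `"string"` alias (the `pd.NA`-backed string dtype); the dev build discussed in this thread may use a NaN-backed variant instead, so treat the alias as an assumption. `obtained` and `expected` are illustrative names.

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Stand-in for what the function under test returns: object dtype, None for missing.
obtained = pd.Series(["residential", None, "primary"], dtype=object)

# Stand-in for expected values read back from disk: string dtype.
expected = pd.Series(["residential", None, "primary"], dtype="string")

# The dtypes differ even though the values agree.
dtypes_match_before = obtained.dtype == expected.dtype

# Cast the obtained values to the string dtype before comparing.
obtained_as_string = obtained.astype("string")
assert_series_equal(obtained_as_string, expected)

print(dtypes_match_before, obtained_as_string.dtype == expected.dtype)
```

Casting only inside the test keeps the returned dtype untouched, which matches the view above that the implementation itself is fine.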
I think Claude's analysis here is right about a difference in handling all-null columns, but I'm not sure what the cleanest fix is. It proposes adding this
That is dangerous: you can have a source dtype of int, then we may introduce missing values, and casting that back to int will fail. I'll just adjust this test; I think the implementation is fine.
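The failure mode described here can be reproduced on toy data. The frame below is hypothetical: an attribute that started as integers but has picked up missing values, so one group has no non-null value at all.

```python
import numpy as np
import pandas as pd

# Hypothetical edges: group 1 has no non-null attribute at all.
edges = pd.DataFrame(
    {
        "group": [0, 0, 1],
        "attribute": [7.0, np.nan, np.nan],
    }
)

# "first" skips NaN within a group, but an all-NaN group still yields NaN.
aggregated = edges.groupby("group")["attribute"].first()

# Casting the result back to a plain integer dtype fails because of the NaN.
try:
    aggregated.astype("int64")
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print(f"cast failed: {err}")
```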
Hi, I was curious about speeding up the `neatify()` function and this helper jumped out as the foremost bottleneck. Seemed like an easy fix since the relevant pandas change was made and the min version here was bumped.

Summary

- `GroupBy.first()` skips NaN by default (`skipna=True`), making `_first_non_null` redundant
- The builtin `"first"` string uses pandas' optimized C path instead of the pure-Python aggregation path

Details

The `_first_non_null` callable was introduced in #264 to fix #261 (changed edges losing attributes). At the time, the intent was to ensure NaN values from new edges don't shadow real attributes from existing edges during groupby aggregation.

However, pandas' builtin `"first"` already skips NaN by default since 2.2.1, and neatnet requires `pandas >= 2.2.3`, so the custom callable is redundant. Using a Python callable forces pandas into `_aggregate_series_pure_python`, which on Aleppo results in 1.3M calls to `_first_non_null` and ~207M `isinstance` checks from pandas type-checking overhead.

The example from #261 continues to print `0`:

🤖 Generated with Claude Code
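To make the equivalence described in this PR concrete, here is a small sketch on toy data. `first_non_null` is a hypothetical stand-in written for this example; it is not the helper that was removed from neatnet, and the frame is made up.

```python
import numpy as np
import pandas as pd


def first_non_null(series: pd.Series):
    """Hypothetical stand-in: return the first non-null value, or NaN if none exists."""
    non_null = series.dropna()
    return non_null.iloc[0] if len(non_null) else np.nan


df = pd.DataFrame(
    {
        "group": [0, 0, 1, 1, 2],
        "attribute": [np.nan, 10.0, 20.0, np.nan, np.nan],
    }
)

# Callable path: pandas invokes the Python function once per group.
via_callable = df.groupby("group")["attribute"].agg(first_non_null)

# Builtin path: the "first" alias uses pandas' own implementation,
# which skips NaN by default.
via_builtin = df.groupby("group")["attribute"].agg("first")

print(via_callable.tolist())
print(via_builtin.tolist())
```

On this frame both paths should pick the same values per group. The difference is only in how the work is dispatched, which is where the speedup reported above comes from.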