Skip to content

PERF: replace _first_non_null with builtin "first" aggregation#292

Open
bnaul wants to merge 4 commits intouscuni:mainfrom
bnaul:perf/replace-first-non-null-with-builtin-first
Open

PERF: replace _first_non_null with builtin "first" aggregation#292
bnaul wants to merge 4 commits intouscuni:mainfrom
bnaul:perf/replace-first-non-null-with-builtin-first

Conversation

@bnaul
Copy link

@bnaul bnaul commented Feb 11, 2026

Hi, I was curious about speeding up the neatify() function and this helper jumped out as the foremost bottleneck. Seemed like an easy fix since the relevant pandas change was made and the min version here was bumped.

Summary

  • Since pandas 2.2.1, GroupBy.first() skips NaN by default (skipna=True), making _first_non_null redundant
  • The builtin "first" string uses pandas' optimized C path instead of the pure-Python aggregation path
  • On Aleppo (78K edges), wall-clock time drops from ~508s to ~150s (3.4x speedup)

Details

The _first_non_null callable was introduced in #264 to fix #261 (changed edges losing attributes). At the time, the intent was to ensure NaN values from new edges don't shadow real attributes from existing edges during groupby aggregation.

However, pandas' builtin "first" already skips NaN by default since 2.2.1, and neatnet requires pandas >= 2.2.3, so the custom callable is redundant. Using a Python callable forces pandas into _aggregate_series_pure_python, which on Aleppo results in 1.3M calls to _first_non_null and ~207M isinstance checks from pandas type-checking overhead.

The example from #261 continues to print 0:

import osmnx as ox
import neatnet

G = ox.graph_from_bbox((-73.86, 40.73,-73.85, 40.74))
Gedges = ox.convert.graph_to_gdfs(G, nodes=False)[['geometry']].reset_index(drop=True).to_crs('EPSG:3857')
Gedges['attribute'] = Gedges.index.astype(int)

neatified_edges = neatnet.neatify(Gedges)

bug_changed = (neatified_edges._status == 'changed')&(neatified_edges.attribute.isna())
print(bug_changed.sum())  # 0

🤖 Generated with Claude Code

Since pandas 2.2.1, GroupBy.first() skips NaN by default (skipna=True),
making _first_non_null redundant. The builtin "first" string uses the
optimized C path instead of the pure-Python aggregation path, reducing
wall-clock time on Aleppo (78K edges) from ~508s to ~150s (3.4x).

The example from uscuni#261 continues to print 0:

    import osmnx as ox
    import neatnet

    G = ox.graph_from_bbox((-73.86, 40.73,-73.85, 40.74))
    Gedges = ox.convert.graph_to_gdfs(G, nodes=False)[['geometry']].reset_index(drop=True).to_crs('EPSG:3857')
    Gedges['attribute'] = Gedges.index.astype(int)

    neatified_edges = neatnet.neatify(Gedges)

    bug_changed = (neatified_edges._status == 'changed')&(neatified_edges.attribute.isna())
    print(bug_changed.sum())  # 0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bnaul bnaul force-pushed the perf/replace-first-non-null-with-builtin-first branch from 1340f25 to ae34c35 Compare February 11, 2026 22:45
@codecov
Copy link

codecov bot commented Feb 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.8%. Comparing base (47f6429) to head (86df318).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #292     +/-   ##
=======================================
- Coverage   98.8%   98.8%   -0.0%     
=======================================
  Files          7       7             
  Lines       1282    1277      -5     
=======================================
- Hits        1267    1262      -5     
  Misses        15      15             
Files with missing lines Coverage Δ
neatnet/nodes.py 98.9% <ø> (+1.1%) ⬆️
neatnet/simplify.py 98.0% <100.0%> (-1.2%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, nice catch! Given this is the only place where _first_non_null is used, can you also remove the function?

bnaul and others added 2 commits February 12, 2026 09:43
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bnaul bnaul requested a review from martinfleis February 12, 2026 14:52
Copy link
Collaborator

@jGaboardi jGaboardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnaul -- Thanks for the contribution! This is indeed an impressive speed up!.

We need to isolate if the failing CI in dev is due to this change or not; here it was green several days ago.

@jGaboardi
Copy link
Collaborator

ubuntu-latest, py314_dev is passing on current main. The failure here may be a fluke, but the differences in the tracebacks do seem to indicate a change in behavior.

@jGaboardi
Copy link
Collaborator

jGaboardi commented Feb 12, 2026

It may simply be differing pandas dtypes behavior, but we should be certain before merging.

FAILED neatnet/tests/test_simplify.py::test_neatify_full_fua[liege_1656-0.3-2350782] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="highway") are different

Attribute "dtype" are different
[left]:  <StringDtype(na_value=nan)>
[right]: object
FAILED neatnet/tests/test_simplify.py::test_neatify_full_fua[slc_4881-0.3-1762456] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="highway") are different

Attribute "dtype" are different
[left]:  <StringDtype(na_value=nan)>
[right]: object
FAILED neatnet/tests/test_simplify.py::test_neatify_full_fua[auckland_869-0.3-1268048] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="highway") are different

Attribute "dtype" are different
[left]:  <StringDtype(na_value=nan)>
[right]: object

@martinfleis
Copy link
Contributor

I think we just need to relax those tests so they won't fail on dtype difference, while checking the value match. If it fails on values, then we have a problem.

@jGaboardi
Copy link
Collaborator

I think we just need to relax those tests so they won't fail on dtype difference, while checking the value match. If it fails on values, then we have a problem.

Agreed.

@jGaboardi
Copy link
Collaborator

Shall we do that in a quick preliminary PR before merging here?

@jGaboardi
Copy link
Collaborator

Following #293, the failures in dev here only seem to be pandas differences representations of 'nan' vs. None.

['residential',  'residential',            nan,            nan,  ...]
[residential, residential, None, None, ...]

But I can't reproduce locally for some reason. Probably an easy fix I am missing. @martinfleis -- thoughts?

@martinfleis
Copy link
Contributor

It is just a matter of the series dtype. One is pandas string dtype using nan, other is object using None. I am not sure what is causing it to appear in the dev build and not elsewhere, given latest comes with pandas 3 as well.

@jGaboardi
Copy link
Collaborator

I am not sure what is causing it to appear in the dev build and not elsewhere, given latest comes with pandas 3 as well.

I think this is simply because we are only running that specific block of tests if both on Ubuntu and in the dev environment, and now that I am remembering this it makes sense why tests pass locally for me.

@jGaboardi
Copy link
Collaborator

It is just a matter of the series dtype

We are passing in check_dtype=False for the comparison, so is there some other keyword in assert_frame_equal that I am overlooking?

@martinfleis
Copy link
Contributor

I don't think so. I'd cast obtained to stringdtype after we get it. It is caused by Parquet IO which does not preserve the object dtype neatnet returns.

@bnaul
Copy link
Author

bnaul commented Feb 13, 2026

I think Claude's analysis here is right about a difference in handling all-null columns but I'm not sure what the cleanest fix is. It proposes adding this source_dtypes param to propagate the type information through:

⏺ The problem is at line 83: gpd.GeoDataFrame(geometry=lines_split, crs=crs) creates a frame with only geometry. Then _fill_attrs assigns scalars
  one-by-one, and pandas infers the dtype from the scalar — if it's pd.NA, you get float64 instead of str.

  The cleanest fix is in _fill_attrs — pass the source dtypes so each column gets the right type. But _fill_attrs only gets a Series (a single row),
  which doesn't carry per-column dtype info.

  The simplest targeted fix is to align dtypes in split() right before each concat:

⏺ Read 1 file (ctrl+o to expand)

⏺ Both concat sites (line 86 and 113) have the same issue. The fix should cast gdf_split columns to match cleaned_streets dtypes before the concat.
  Let me do it cleanly by adding a helper right after _fill_attrs:

      15  from sklearn.cluster import DBSCAN
      16
      17
      18 -def _fill_attrs(gdf: gpd.GeoDataFrame, source_row: pd.Series) -> gpd.GeoDataFrame:
      18 +def _fill_attrs(
      19 +    gdf: gpd.GeoDataFrame,
      20 +    source_row: pd.Series,
      21 +    source_dtypes: pd.Series | None = None,
      22 +) -> gpd.GeoDataFrame:
      23      """Thoughtful attribute assignment to lines split into segments by new nodes –
      24      taking list-like values into consideration. See gh#213. Regarding iterables,
      25      currently only supports list values – others can be added based on input type
...
      43      for col in source_row.index.drop(["geometry", "_status"], errors="ignore"):
      44          gdf[col] = _populate_column(source_row[col])
      45
      46 +    if source_dtypes is not None:
      47 +        for col in gdf.columns.intersection(source_dtypes.index):
      48 +            if col != "geometry" and gdf[col].dtype != source_dtypes[col]:
      49 +                gdf[col] = gdf[col].astype(source_dtypes[col])
      50 +
      51      return gdf

@martinfleis
Copy link
Contributor

That is dangerous as you can have source dtype int, then we may introduce missing values and casting that to int will fail. I'll just adjust this test, I think the implementation is fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tracking attributes of changed edges after neatify

3 participants

Comments