PERF: replace _first_non_null with builtin "first" aggregation by bnaul · Pull Request #292 · uscuni/neatnet

bnaul · 2026-02-11T22:03:18Z

Hi, I was curious about speeding up the neatify() function and this helper jumped out as the foremost bottleneck. Seemed like an easy fix since the relevant pandas change was made and the min version here was bumped.

Summary

Since pandas 2.2.1, GroupBy.first() skips NaN by default (skipna=True), making _first_non_null redundant
The builtin "first" string uses pandas' optimized C path instead of the pure-Python aggregation path
On Aleppo (78K edges), wall-clock time drops from ~508s to ~150s (3.4x speedup)

Details

The _first_non_null callable was introduced in #264 to fix #261 (changed edges losing attributes). At the time, the intent was to ensure NaN values from new edges don't shadow real attributes from existing edges during groupby aggregation.

However, pandas' builtin "first" already skips NaN by default since 2.2.1, and neatnet requires pandas >= 2.2.3, so the custom callable is redundant. Using a Python callable forces pandas into _aggregate_series_pure_python, which on Aleppo results in 1.3M calls to _first_non_null and ~207M isinstance checks from pandas type-checking overhead.

The example from #261 continues to print 0:

import osmnx as ox
import neatnet

G = ox.graph_from_bbox((-73.86, 40.73,-73.85, 40.74))
Gedges = ox.convert.graph_to_gdfs(G, nodes=False)[['geometry']].reset_index(drop=True).to_crs('EPSG:3857')
Gedges['attribute'] = Gedges.index.astype(int)

neatified_edges = neatnet.neatify(Gedges)

bug_changed = (neatified_edges._status == 'changed')&(neatified_edges.attribute.isna())
print(bug_changed.sum())  # 0

🤖 Generated with Claude Code

Since pandas 2.2.1, GroupBy.first() skips NaN by default (skipna=True), making _first_non_null redundant. The builtin "first" string uses the optimized C path instead of the pure-Python aggregation path, reducing wall-clock time on Aleppo (78K edges) from ~508s to ~150s (3.4x). The example from uscuni#261 continues to print 0: import osmnx as ox import neatnet G = ox.graph_from_bbox((-73.86, 40.73,-73.85, 40.74)) Gedges = ox.convert.graph_to_gdfs(G, nodes=False)[['geometry']].reset_index(drop=True).to_crs('EPSG:3857') Gedges['attribute'] = Gedges.index.astype(int) neatified_edges = neatnet.neatify(Gedges) bug_changed = (neatified_edges._status == 'changed')&(neatified_edges.attribute.isna()) print(bug_changed.sum()) # 0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-02-12T06:22:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.8%. Comparing base (47f6429) to head (86df318).

Additional details and impacted files

@@           Coverage Diff           @@
##            main    #292     +/-   ##
=======================================
- Coverage   98.8%   98.8%   -0.0%     
=======================================
  Files          7       7             
  Lines       1282    1277      -5     
=======================================
- Hits        1267    1262      -5     
  Misses        15      15

Files with missing lines	Coverage Δ
neatnet/nodes.py	`98.9% <ø> (+1.1%)`	⬆️
neatnet/simplify.py	`98.0% <100.0%> (-1.2%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

martinfleis

Thanks, nice catch! Given this is the only place where _first_non_null is used, can you also remove the function?

neatnet/simplify.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jGaboardi

@bnaul -- Thanks for the contribution! This is indeed an impressive speed up!.

We need to isolate if the failing CI in dev is due to this change or not; here it was green several days ago.

jGaboardi · 2026-02-12T20:39:27Z

ubuntu-latest, py314_dev is passing on current main. The failure here may be a fluke, but the differences in the tracebacks do seem to indicate a change in behavior.

jGaboardi · 2026-02-12T20:45:20Z

It may simply be differing pandas dtypes behavior, but we should be certain before merging.

FAILED neatnet/tests/test_simplify.py::test_neatify_full_fua[liege_1656-0.3-2350782] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="highway") are different

Attribute "dtype" are different
[left]:  <StringDtype(na_value=nan)>
[right]: object
FAILED neatnet/tests/test_simplify.py::test_neatify_full_fua[slc_4881-0.3-1762456] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="highway") are different

Attribute "dtype" are different
[left]:  <StringDtype(na_value=nan)>
[right]: object
FAILED neatnet/tests/test_simplify.py::test_neatify_full_fua[auckland_869-0.3-1268048] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="highway") are different

Attribute "dtype" are different
[left]:  <StringDtype(na_value=nan)>
[right]: object

martinfleis · 2026-02-12T21:31:58Z

I think we just need to relax those tests so they won't fail on dtype difference, while checking the value match. If it fails on values, then we have a problem.

jGaboardi · 2026-02-12T21:34:13Z

I think we just need to relax those tests so they won't fail on dtype difference, while checking the value match. If it fails on values, then we have a problem.

Agreed.

jGaboardi · 2026-02-12T21:34:42Z

Shall we do that in a quick preliminary PR before merging here?

jGaboardi · 2026-02-13T00:27:52Z

Following #293, the failures in dev here only seem to be pandas differences representations of 'nan' vs. None.

['residential',  'residential',            nan,            nan,  ...]
[residential, residential, None, None, ...]

But I can't reproduce locally for some reason. Probably an easy fix I am missing. @martinfleis -- thoughts?

martinfleis · 2026-02-13T09:42:26Z

It is just a matter of the series dtype. One is pandas string dtype using nan, other is object using None. I am not sure what is causing it to appear in the dev build and not elsewhere, given latest comes with pandas 3 as well.

jGaboardi · 2026-02-13T14:11:06Z

I am not sure what is causing it to appear in the dev build and not elsewhere, given latest comes with pandas 3 as well.

I think this is simply because we are only running that specific block of tests if both on Ubuntu and in the dev environment, and now that I am remembering this it makes sense why tests pass locally for me.

jGaboardi · 2026-02-13T14:19:38Z

It is just a matter of the series dtype

We are passing in check_dtype=False for the comparison, so is there some other keyword in assert_frame_equal that I am overlooking?

martinfleis · 2026-02-13T15:05:58Z

I don't think so. I'd cast obtained to stringdtype after we get it. It is caused by Parquet IO which does not preserve the object dtype neatnet returns.

bnaul · 2026-02-13T15:56:26Z

I think Claude's analysis here is right about a difference in handling all-null columns but I'm not sure what the cleanest fix is. It proposes adding this source_dtypes param to propagate the type information through:

⏺ The problem is at line 83: gpd.GeoDataFrame(geometry=lines_split, crs=crs) creates a frame with only geometry. Then _fill_attrs assigns scalars
  one-by-one, and pandas infers the dtype from the scalar — if it's pd.NA, you get float64 instead of str.

  The cleanest fix is in _fill_attrs — pass the source dtypes so each column gets the right type. But _fill_attrs only gets a Series (a single row),
  which doesn't carry per-column dtype info.

  The simplest targeted fix is to align dtypes in split() right before each concat:

⏺ Read 1 file (ctrl+o to expand)

⏺ Both concat sites (line 86 and 113) have the same issue. The fix should cast gdf_split columns to match cleaned_streets dtypes before the concat.
  Let me do it cleanly by adding a helper right after _fill_attrs:

      15  from sklearn.cluster import DBSCAN
      16
      17
      18 -def _fill_attrs(gdf: gpd.GeoDataFrame, source_row: pd.Series) -> gpd.GeoDataFrame:
      18 +def _fill_attrs(
      19 +    gdf: gpd.GeoDataFrame,
      20 +    source_row: pd.Series,
      21 +    source_dtypes: pd.Series | None = None,
      22 +) -> gpd.GeoDataFrame:
      23      """Thoughtful attribute assignment to lines split into segments by new nodes –
      24      taking list-like values into consideration. See gh#213. Regarding iterables,
      25      currently only supports list values – others can be added based on input type
...
      43      for col in source_row.index.drop(["geometry", "_status"], errors="ignore"):
      44          gdf[col] = _populate_column(source_row[col])
      45
      46 +    if source_dtypes is not None:
      47 +        for col in gdf.columns.intersection(source_dtypes.index):
      48 +            if col != "geometry" and gdf[col].dtype != source_dtypes[col]:
      49 +                gdf[col] = gdf[col].astype(source_dtypes[col])
      50 +
      51      return gdf

martinfleis · 2026-02-13T16:02:31Z

That is dangerous as you can have source dtype int, then we may introduce missing values and casting that to int will fail. I'll just adjust this test, I think the implementation is fine.

bnaul force-pushed the perf/replace-first-non-null-with-builtin-first branch from 1340f25 to ae34c35 Compare February 11, 2026 22:45

martinfleis reviewed Feb 12, 2026

View reviewed changes

neatnet/simplify.py Show resolved Hide resolved

bnaul and others added 2 commits February 12, 2026 09:43

CLN: remove unused _first_non_null function

6c1dd3f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Restore comment

29ce3b5

bnaul requested a review from martinfleis February 12, 2026 14:52

martinfleis approved these changes Feb 12, 2026

View reviewed changes

martinfleis requested a review from jGaboardi February 12, 2026 19:49

jGaboardi assigned bnaul Feb 12, 2026

jGaboardi added algo/method rough edge labels Feb 12, 2026

jGaboardi reviewed Feb 12, 2026

View reviewed changes

jGaboardi mentioned this pull request Feb 12, 2026

check_dtype=False for full fua test #293

Merged

Merge branch 'main' into perf/replace-first-non-null-with-builtin-first

86df318

Conversation

bnaul commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Uh oh!

codecov bot commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

martinfleis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jGaboardi left a comment

Choose a reason for hiding this comment

Uh oh!

jGaboardi commented Feb 12, 2026

Uh oh!

jGaboardi commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinfleis commented Feb 12, 2026

Uh oh!

jGaboardi commented Feb 12, 2026

Uh oh!

jGaboardi commented Feb 12, 2026

Uh oh!

jGaboardi commented Feb 13, 2026

Uh oh!

martinfleis commented Feb 13, 2026

Uh oh!

jGaboardi commented Feb 13, 2026

Uh oh!

jGaboardi commented Feb 13, 2026

Uh oh!

martinfleis commented Feb 13, 2026

Uh oh!

bnaul commented Feb 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinfleis commented Feb 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

bnaul commented Feb 11, 2026 •

edited

Loading

codecov bot commented Feb 12, 2026 •

edited

Loading

jGaboardi commented Feb 12, 2026 •

edited

Loading

bnaul commented Feb 13, 2026 •

edited

Loading