Cache XSSFRow.FirstCellNum/LastCellNum to avoid O(n) LINQ scans by ken-swyfft · Pull Request #1745 · nissl-lab/npoi

ken-swyfft · 2026-03-21T21:57:23Z

Summary

XSSFRow.GetFirstKey() and GetLastKey() used LINQ Min()/Max() which enumerate all dictionary keys on every call — O(n) per access
FirstCellNum and LastCellNum are called in tight loops throughout the codebase (cell range iteration during copy, shift, auto-size, formula evaluation)
Added cached min/max column indices, updated O(1) on CreateCell(), with lazy invalidation on RemoveCell() (only re-scans when a boundary cell is removed)
Also adds 5 safety-net tests for previously uncovered patterns: sparse rows with distant columns, large column indices (16383), interleaved add/remove sequences, .Cells property ordering, and save/reload ordering persistence

Benchmark results (10,000 repeated `FirstCellNum` + `LastCellNum` accesses)

Scenario	Before	After	Speedup	Alloc Before
Sparse row (3 cells)	799 μs	11 μs	71×	2,813 KB
Dense row (200 cells)	25,724 μs	11 μs	2,276×	4,375 KB
Full cell range iteration (200 cells)	277 μs	3.1 μs	89×	44 KB

All 4,542 existing tests pass (OOXML: 1,792, Main: 2,750).

Test plan

All existing XSSFRow tests pass (inherited from BaseTestRow + BaseTestXRow)
New tests cover sparse rows, large column indices, interleaved add/remove, ordering after save/reload
Benchmarks confirm significant speedup and zero allocation
Builds clean on both net8.0 and net472

🤖 Generated with Claude Code

GetFirstKey() and GetLastKey() used LINQ Min()/Max() which enumerate all dictionary keys on every call. FirstCellNum and LastCellNum are called in tight loops (e.g., cell range iteration during copy/shift operations), making this O(n) per access a significant bottleneck. Cache the min/max column indices and update them O(1) on add, with lazy invalidation on remove (only re-scans when a boundary cell is removed). Also adds safety-net tests for sparse rows, large column indices, interleaved add/remove, and ordering persistence. Benchmark results (10,000 repeated FirstCellNum+LastCellNum accesses): - Sparse row (3 cells): 799 us -> 11 us (71x faster), 2813 KB -> 0 B - Dense row (200 cells): 25,724 us -> 11 us (2,276x faster), 4375 KB -> 0 B Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ken-swyfft

Code Review

Overall: Approve. Clean, well-targeted performance optimization. Caching logic is correct, all mutation points are covered, no API/ABI changes.

Correctness — verified complete

All four mutation sites for _cells are properly handled:

Mutation site	Cache action
Constructor (`_cells.Add`)	`UpdateCacheOnAdd` called
`CreateCell` (`_cells[idx] = ...`)	`UpdateCacheOnAdd` called
`RemoveCell` (`_cells.Remove`)	`InvalidateCacheOnRemove` called
`RebuildCells` (`_cells.Clear` + `_cells.Add`)	Full invalidation

The sentinel value of -1 is safe — column indices are always non-negative (0–16383), and _cells.Count == 0 is checked before accessing the cache. The short cast is safe because the maximum value is 16384 (max_column + 1), well within short range.

Performance claims — justified

SortedDictionary.Keys.Min() and .Max() are both O(n) via LINQ (LINQ has no way to know the collection is sorted). Cached access is O(1) with zero allocation. The claimed speedups are credible.

Tests

The interleaved add/remove test is particularly well-designed, covering the full lifecycle: empty → add → add → add → remove middle → remove first → add new first → remove last → remove all.

Minor observations (non-blocking)

GetFirstKey() could use First() instead of Min() — Since _cells is a SortedDictionary, the enumerator yields keys in ascending order, so _cells.Keys.First() is O(1) vs Min() being O(n). With caching this re-scan is rare, so the practical impact is minimal.
RebuildCells could populate the cache eagerly — Since it already iterates all cells, it could track min/max during the rebuild loop instead of invalidating. But RebuildCells is called infrequently, so the lazy approach is fine.

ken-swyfft commented Mar 21, 2026

View reviewed changes

tonyqus added the performance label Mar 22, 2026

tonyqus added this to the NPOI 2.8.0 milestone Mar 22, 2026

tonyqus added the xlsx label Mar 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cache XSSFRow.FirstCellNum/LastCellNum to avoid O(n) LINQ scans#1745

Cache XSSFRow.FirstCellNum/LastCellNum to avoid O(n) LINQ scans#1745
ken-swyfft wants to merge 1 commit intonissl-lab:masterfrom
swyfft-insurance:perf/ks/20260321_xssfrow-cellnum-caching

ken-swyfft commented Mar 21, 2026

Uh oh!

ken-swyfft left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ken-swyfft commented Mar 21, 2026

Summary

Benchmark results (10,000 repeated FirstCellNum + LastCellNum accesses)

Test plan

Uh oh!

ken-swyfft left a comment

Choose a reason for hiding this comment

Code Review

Correctness — verified complete

Performance claims — justified

Tests

Minor observations (non-blocking)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Benchmark results (10,000 repeated `FirstCellNum` + `LastCellNum` accesses)