Skip to content

Cache XSSFSheet.MergedRegions to eliminate repeated XML parsing#1744

Merged
tonyqus merged 4 commits intonissl-lab:masterfrom
swyfft-insurance:perf/ks/20260321_merged-regions-cache
Mar 22, 2026
Merged

Cache XSSFSheet.MergedRegions to eliminate repeated XML parsing#1744
tonyqus merged 4 commits intonissl-lab:masterfrom
swyfft-insurance:perf/ks/20260321_merged-regions-cache

Conversation

@ken-swyfft
Copy link
Contributor

Summary

  • Cache the MergedRegions property result to avoid rebuilding a List<CellRangeAddress> from XML on every access
  • Invalidate cache on Add/Remove mutations (3 code paths)
  • Add MergedRegionsBenchmark for measuring repeated read and add scenarios

The MergedRegions property was parsing every CT_MergeCell.@ref string via CellRangeAddress.ValueOf() on each call. This is called repeatedly during row copy, auto-size, shift, and validation operations.

Benchmark Results

ReadMergedRegions (10 repeated reads per invocation)

RegionCount Master Optimized Speedup Master Alloc Optimized Alloc
100 60.38 μs 21.50 ns 2,808× 193 KB 0 B
1,000 591.37 μs 20.23 ns 29,232× 1,881 KB 0 B
5,000 3,399.71 μs 17.86 ns 190,355× 9,877 KB 0 B

AddMergedRegions (validated adds)

RegionCount Master Optimized Change
100 402.6 μs 360.6 μs ~10% faster
1,000 36,557 μs 34,178 μs ~7% faster
5,000 983,947 μs 865,449 μs ~12% faster

Test plan

  • All 1,787 OOXML tests pass (net8.0 and net472)
  • All 2,750+ main tests pass (net8.0 and net472)
  • Existing TestMergeRegionsSpeed test (50k regions) passes
  • Merge, remove, shift, copy, and auto-size tests all pass

🤖 Generated with Claude Code

The MergedRegions property was rebuilding a List<CellRangeAddress> from
XML on every access, parsing each "ref" string via CellRangeAddress.ValueOf().
This caused significant overhead in scenarios with repeated reads (row copy,
auto-size, validation) especially on sheets with many merged regions.

Add a cached list with invalidation on Add/Remove mutations. Benchmark
with 5,000 regions shows ReadMergedRegions going from 3,400us to ~18ns
(190,000x faster) with zero allocations on cache hits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Contributor Author

@ken-swyfft ken-swyfft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Needs changes before merge. Cache invalidation coverage is complete, but there's a correctness risk from returning the mutable cached list directly.

Critical: Returning the mutable cached list allows callers to corrupt the cache

The MergedRegions property returns the same List<CellRangeAddress> reference on every call. This is a public API. Any external consumer who does:

var regions = sheet.MergedRegions;
regions.Add(someRegion);  // silently corrupts the cache without updating XML

...will corrupt the cache. This is a behavioral change from the pre-cache version, which returned a fresh list each time. Code that previously worked (mutating the returned list as a scratch pad) will now silently break.

Similarly, CellRangeAddress has public setters on FirstRow, LastRow, FirstColumn, LastColumn. Mutating a returned element now corrupts the cached data — previously this was harmless since each call built fresh objects.

Recommendation: Return a shallow copy: new List<CellRangeAddress>(_cachedMergedRegions). The allocation cost of copying a list of object references is trivial compared to the XML parsing cost being eliminated — you still get the vast majority of the perf win while preserving the original behavioral contract.

Alternatively, return _cachedMergedRegions.AsReadOnly(), though this changes the return type to ReadOnlyCollection<T> which is an API break.

Moderate: RemoveMergedRegions early return skips cache invalidation

if(!worksheet.IsSetMergeCells())
{
    return;  // InvalidateMergedRegionsCache() at end of method is never reached
}

This is benign today (if IsSetMergeCells() is false, the cache should already be null or empty), but it's fragile. Adding InvalidateMergedRegionsCache() before the early return costs nothing and is defensive.

Cache invalidation coverage — verified complete

All three mutation paths invalidate correctly:

  • AddMergedRegion
  • RemoveMergedRegion
  • RemoveMergedRegions ✓ (except the early return noted above)

Indirect callers (row/column shifting via RowShifter/ColumnShifter, copy operations) all go through AddMergedRegion/RemoveMergedRegions, so they invalidate properly.

Minor notes

  • No thread safety on _cachedMergedRegions, but XSSFSheet doesn't appear to guarantee thread safety in general, so this is acceptable.
  • The worksheet is loaded once (via EnsureWorksheetLoaded()), so the cache won't go stale from re-loading.
  • ValidateMergedRegions during AddMergedRegion correctly reads the cache before invalidation, checking against existing regions.

ken-swyfft and others added 2 commits March 21, 2026 15:06
Return new List<CellRangeAddress> from cached data so callers who mutate
the returned list cannot corrupt the cache. Also add defensive cache
invalidation on RemoveMergedRegions early-return path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The initial fix only returned a shallow copy on cache hits (2nd+ call),
but the first call still returned the cached reference directly. Callers
mutating that list would corrupt the cache. Restructured the property to
always return a copy regardless of cache state.

Added two tests:
- TestMergedRegionsCacheReturnsFreshList: mutates returned list, verifies
  subsequent calls are unaffected
- TestRemoveMergedRegionsOnEmptySheet: exercises the early-return path in
  RemoveMergedRegions when no merge cells exist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Contributor Author

@ken-swyfft ken-swyfft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated Review

Both concerns from my earlier review have been addressed:

  1. Mutable cached list — Fixed. The property now returns new List<CellRangeAddress>(_cachedMergedRegions) (shallow copy), so callers can't corrupt the cache. The new TestMergedRegionsCacheReturnsFreshList test verifies this by calling .Add() and .Clear() on the returned list and confirming subsequent calls are unaffected.

  2. RemoveMergedRegions early return — Fixed. InvalidateMergedRegionsCache() is now called before the early return on the !IsSetMergeCells() path. Covered by TestRemoveMergedRegionsOnEmptySheet.

The remaining theoretical concern (callers mutating CellRangeAddress elements via public setters) is marginal — it requires deliberate misuse rather than a natural pattern, and a deep copy would undercut the perf win. Acceptable as-is.

Approve.

@tonyqus
Copy link
Member

tonyqus commented Mar 22, 2026

You are teaching me how to use Claude 👯 AI+human review based on the comment.

Claude is much better than Copilot, which doesn't allow comment-based review. I have to go back to the copilot conversation and talk to it, then it will modify the PR description instead of commenting. Sometimes, the original fix description is replaced. It's a bit annoying.

@tonyqus
Copy link
Member

tonyqus commented Mar 22, 2026

My Review

I double-checked that there are a few existing test cases for the MergedRegions property. This fix should be safe.

Sometimes, performance optimization can be a disaster - Performance is so good, but the function is broken.

Lesson Learnt from this PR

Avoid any unnecessary repeated deserialization from OpenXML entity to XSSF entities.

@tonyqus tonyqus added this to the NPOI 2.8.0 milestone Mar 22, 2026
@ken-swyfft
Copy link
Contributor Author

With respect to Claude Code - yeah, one key is to give it access to the tools it needs. The gh command-line client is really helpful in this regard.

And yeah, Claude absolutely needs supervision - but one of the best ways of supervising it is with another Claude instance, with a fresh context. They can both make mistakes, but they don't usually make the same mistakes.

@tonyqus
Copy link
Member

tonyqus commented Mar 22, 2026

but one of the best ways of supervising it is with another Claude instance, with a fresh context. They can both make mistakes, but they don't usually make the same mistakes.

AI Pair Programming ... That's new to me 🍡

@tonyqus tonyqus merged commit c079f02 into nissl-lab:master Mar 22, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants