Cache XSSFSheet.MergedRegions to eliminate repeated XML parsing#1744
Conversation
The MergedRegions property was rebuilding a List<CellRangeAddress> from XML on every access, parsing each "ref" string via CellRangeAddress.ValueOf(). This caused significant overhead in scenarios with repeated reads (row copy, auto-size, validation) especially on sheets with many merged regions. Add a cached list with invalidation on Add/Remove mutations. Benchmark with 5,000 regions shows ReadMergedRegions going from 3,400us to ~18ns (190,000x faster) with zero allocations on cache hits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ken-swyfft
left a comment
There was a problem hiding this comment.
Code Review
Overall: Needs changes before merge. Cache invalidation coverage is complete, but there's a correctness risk from returning the mutable cached list directly.
Critical: Returning the mutable cached list allows callers to corrupt the cache
The MergedRegions property returns the same List<CellRangeAddress> reference on every call. This is a public API. Any external consumer who does:
var regions = sheet.MergedRegions;
regions.Add(someRegion); // silently corrupts the cache without updating XML...will corrupt the cache. This is a behavioral change from the pre-cache version, which returned a fresh list each time. Code that previously worked (mutating the returned list as a scratch pad) will now silently break.
Similarly, CellRangeAddress has public setters on FirstRow, LastRow, FirstColumn, LastColumn. Mutating a returned element now corrupts the cached data — previously this was harmless since each call built fresh objects.
Recommendation: Return a shallow copy: new List<CellRangeAddress>(_cachedMergedRegions). The allocation cost of copying a list of object references is trivial compared to the XML parsing cost being eliminated — you still get the vast majority of the perf win while preserving the original behavioral contract.
Alternatively, return _cachedMergedRegions.AsReadOnly(), though this changes the return type to ReadOnlyCollection<T> which is an API break.
Moderate: RemoveMergedRegions early return skips cache invalidation
if(!worksheet.IsSetMergeCells())
{
return; // InvalidateMergedRegionsCache() at end of method is never reached
}This is benign today (if IsSetMergeCells() is false, the cache should already be null or empty), but it's fragile. Adding InvalidateMergedRegionsCache() before the early return costs nothing and is defensive.
Cache invalidation coverage — verified complete
All three mutation paths invalidate correctly:
AddMergedRegion✓RemoveMergedRegion✓RemoveMergedRegions✓ (except the early return noted above)
Indirect callers (row/column shifting via RowShifter/ColumnShifter, copy operations) all go through AddMergedRegion/RemoveMergedRegions, so they invalidate properly.
Minor notes
- No thread safety on
_cachedMergedRegions, butXSSFSheetdoesn't appear to guarantee thread safety in general, so this is acceptable. - The worksheet is loaded once (via
EnsureWorksheetLoaded()), so the cache won't go stale from re-loading. ValidateMergedRegionsduringAddMergedRegioncorrectly reads the cache before invalidation, checking against existing regions.
Return new List<CellRangeAddress> from cached data so callers who mutate the returned list cannot corrupt the cache. Also add defensive cache invalidation on RemoveMergedRegions early-return path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The initial fix only returned a shallow copy on cache hits (2nd+ call), but the first call still returned the cached reference directly. Callers mutating that list would corrupt the cache. Restructured the property to always return a copy regardless of cache state. Added two tests: - TestMergedRegionsCacheReturnsFreshList: mutates returned list, verifies subsequent calls are unaffected - TestRemoveMergedRegionsOnEmptySheet: exercises the early-return path in RemoveMergedRegions when no merge cells exist Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ken-swyfft
left a comment
There was a problem hiding this comment.
Updated Review
Both concerns from my earlier review have been addressed:
-
Mutable cached list — Fixed. The property now returns
new List<CellRangeAddress>(_cachedMergedRegions)(shallow copy), so callers can't corrupt the cache. The newTestMergedRegionsCacheReturnsFreshListtest verifies this by calling.Add()and.Clear()on the returned list and confirming subsequent calls are unaffected. -
RemoveMergedRegionsearly return — Fixed.InvalidateMergedRegionsCache()is now called before the early return on the!IsSetMergeCells()path. Covered byTestRemoveMergedRegionsOnEmptySheet.
The remaining theoretical concern (callers mutating CellRangeAddress elements via public setters) is marginal — it requires deliberate misuse rather than a natural pattern, and a deep copy would undercut the perf win. Acceptable as-is.
Approve.
|
You are teaching me how to use Claude 👯 AI+human review based on the comment. Claude is much better than Copilot, which doesn't allow comment-based review. I have to go back to the copilot conversation and talk to it, then it will modify the PR description instead of commenting. Sometimes, the original fix description is replaced. It's a bit annoying. |
My ReviewI double-checked that there are a few existing test cases for the MergedRegions property. This fix should be safe. Sometimes, performance optimization can be a disaster - Performance is so good, but the function is broken. Lesson Learnt from this PRAvoid any unnecessary repeated deserialization from OpenXML entity to XSSF entities. |
|
With respect to Claude Code - yeah, one key is to give it access to the tools it needs. The And yeah, Claude absolutely needs supervision - but one of the best ways of supervising it is with another Claude instance, with a fresh context. They can both make mistakes, but they don't usually make the same mistakes. |
AI Pair Programming ... That's new to me 🍡 |
Summary
MergedRegionsproperty result to avoid rebuilding aList<CellRangeAddress>from XML on every accessMergedRegionsBenchmarkfor measuring repeated read and add scenariosThe
MergedRegionsproperty was parsing everyCT_MergeCell.@refstring viaCellRangeAddress.ValueOf()on each call. This is called repeatedly during row copy, auto-size, shift, and validation operations.Benchmark Results
ReadMergedRegions (10 repeated reads per invocation)
AddMergedRegions (validated adds)
Test plan
TestMergeRegionsSpeedtest (50k regions) passes🤖 Generated with Claude Code