Skip to content

Replace Hashtable/ArrayList with generic collections in formula eval hot path#1742

Open
ken-swyfft wants to merge 5 commits intonissl-lab:masterfrom
swyfft-insurance:perf/ks/20260321_generic-collections-formula-eval
Open

Replace Hashtable/ArrayList with generic collections in formula eval hot path#1742
ken-swyfft wants to merge 5 commits intonissl-lab:masterfrom
swyfft-insurance:perf/ks/20260321_generic-collections-formula-eval

Conversation

@ken-swyfft
Copy link
Contributor

Summary

  • Replace Hashtable with Dictionary<K,V> and ArrayList with List<T> in the formula evaluation hot path, eliminating boxing overhead and double-lookup patterns
  • Fix bug in FormulaCellCache.Remove() which was keying on the cell object instead of cell.IdentityKey, causing silent removal failures
  • Fix Decimal overflow crash in MultiplyEval when operands exceed decimal range (~7.9e28), falling back to double arithmetic to match Excel behavior
  • Pass args to BenchmarkSwitcher.Run() so --filter works from CLI

Files changed

  • PlainCellCache.csHashtableDictionary<Loc, PlainValueCellCacheEntry>
  • FormulaCellCache.csHashtableDictionary<object, FormulaCellCacheEntry>
  • OperationEvaluatorFactory.csHashtableDictionary<OperationPtg, Function>
  • FormulaUsedBlankCellSet.csHashtableDictionary<BookSheetKey, BlankCellSheetGroup>
  • Ptg.csArrayListList<Ptg>
  • FormulaParser.csArrayListList<ParseNode> / List<object>
  • MultiplyEval.cs — catch OverflowException for large values
  • Program.cs — pass args through to BenchmarkSwitcher

Benchmark results (EvaluateAll on 1.43M formulas, 17MB .xlsx)

Metric master this PR Change
Mean time 16.19 s 11.00 s -32%
Allocated 14.91 GB 14.92 GB ~same

Test plan

  • All 4,616 tests pass on net8.0 (0 failures)
  • All 4,619 tests pass on net472 (0 failures)
  • Benchmark: LargeExcelFileBenchmark.Evaluate completes without crash
  • Review with Tony for merge suitability

🤖 Generated with Claude Code

ken-swyfft and others added 3 commits March 21, 2026 11:20
…ation hot path

Eliminates boxing overhead and improves type safety in the most frequently
called code during formula evaluation.

- PlainCellCache: Hashtable → Dictionary<Loc, PlainValueCellCacheEntry>
- FormulaCellCache: Hashtable → Dictionary<object, FormulaCellCacheEntry>
  - Also fixes bug: Remove() was keying on cell instead of cell.IdentityKey
- OperationEvaluatorFactory: Hashtable → Dictionary<OperationPtg, Function>
- FormulaUsedBlankCellSet: Hashtable → Dictionary<BookSheetKey, BlankCellSheetGroup>
- Ptg.ReadTokens: ArrayList → List<Ptg>
- FormulaParser.Arguments/ParseArrayRow: ArrayList → List<ParseNode>/List<object>

All lookups converted to TryGetValue to avoid double-lookup patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When both operands are very large doubles (> ~7.9e28), casting to decimal
throws OverflowException. Fall back to double arithmetic, matching Excel
behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Contributor Author

@ken-swyfft ken-swyfft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Overall: Approve with minor suggestions. The collection replacements are mechanically correct, no public API signatures change, and no thread-safety regression.

Good catch: FormulaCellCache.Remove() bug fix

The old code was calling _formulaEntriesByCell.Remove(cell) instead of Remove(cell.IdentityKey) — meaning entries were never actually removed, causing a memory leak during cell mutations. The fix is correct.

Moderate concerns

  1. Missing test coverage for the FormulaCellCache.Remove fix — This is a real behavioral change (fixing a memory leak). A regression test demonstrating that entries are actually evicted after NotifyDeleteCell would guard against this silently regressing.

  2. Missing test coverage for the MultiplyEval overflow fix — Casting a double > ~7.9e28 to decimal throws OverflowException; the fallback to double arithmetic is correct and matches Excel. However, the existing test file (TestMultiplyEval.cs) doesn't cover this scenario. A test case like Confirm(new NumberEval(1e29), new NumberEval(1e29), 1e58) would lock this in.

Minor notes

  • The 32% benchmark improvement is plausible — the double-lookup → TryGetValue in the innermost eval loop is likely the primary driver, and the Remove bug may have caused unbounded cache growth slowing hash operations. Profiler attribution would strengthen the claim.
  • FormulaCellCache key type remains object (because IEvaluationCell.IdentityKey returns Object). Fine for this PR's scope, just noting it.
  • Pre-existing Equals null-safety issues in Loc and BookSheetKey (direct cast without null/type check) — out of scope but worth a follow-up.
  • Removal of unused System.Runtime.Serialization.Formatters.Binary using in Ptg.cs is correct cleanup.

…rflow

- TestFormulaCellCacheRemoveActuallyEvicts: verifies entries are evicted
  after Remove(), guarding against the bug where Remove() keyed on cell
  instead of cell.IdentityKey
- TestLargeValuesOverflowDecimal: verifies MultiplyEval falls back to
  double arithmetic when operands exceed decimal range (~7.9e28)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ken-swyfft
Copy link
Contributor Author

The CI is failing above, but I think those are failures that will be fixed by this PR here: #1746. All tests pass clean on my local.

@ken-swyfft
Copy link
Contributor Author

Thanks for the thorough review! Both moderate concerns were valid — addressed in 42364a3:

1. FormulaCellCache.Remove regression test — Added TestFormulaCellCacheRemoveActuallyEvicts in TestEvaluationCache.cs. Creates a FormulaCellCache directly, puts two entries, removes one, and verifies it's actually evicted (Get returns null) while the other entry is unaffected.

2. MultiplyEval overflow test — Added TestLargeValuesOverflowDecimal in TestMultiplyEval.cs. Multiplies values > 7.9×10²⁸ that would throw OverflowException on the decimal cast, verifying the try/catch fallback to double arithmetic works correctly.

Full suite green: 2,752 main + 1,787 OOXML = 4,539 passed across both net8.0 and net472.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants