Skip to content

Latest commit

 

History

History
72 lines (50 loc) · 3.1 KB

File metadata and controls

72 lines (50 loc) · 3.1 KB

ExcelBench Methodology

Purpose

ExcelBench measures feature fidelity for Python Excel libraries -- how accurately they can read and write Excel features compared to native Excel. It also measures performance (throughput, memory) as a secondary axis.

Fidelity vs Performance

  • Fidelity evaluates correctness and completeness of features (formats, formulas, borders, etc.).
  • Performance measures throughput (cells/s) and memory (RSS, tracemalloc) for read/write workloads.

Both tracks use the same adapter framework but run independently: fidelity uses oracle verification, performance skips it.

Oracle Strategy

ExcelBench uses a hybrid write verification model:

  • Primary oracle: Excel itself via xlwings (highest fidelity).
  • Fallback oracle: openpyxl when Excel is unavailable (CI / headless).

This allows reliable local verification while keeping CI runnable.

Fixtures Policy

Two fixture paths exist:

  • fixtures/excel/ (tracked): canonical Excel-generated files used in CI.
  • fixtures/excel_xls/ (tracked): canonical .xls fixtures for legacy format testing.
  • test_files/ (gitignored): local scratch output for development.

Generate canonical fixtures with:

uv run excelbench generate --output fixtures/excel        # .xlsx
uv run excelbench generate-xls --output fixtures/excel_xls  # .xls

Feature Tiers

Features are organized into tiers reflecting complexity:

Tier Features Count
Tier 0 (Core) cell_values, formulas, multiple_sheets 3
Tier 1 (Formatting) alignment, background_colors, borders, dimensions, number_formats, text_formatting 6
Tier 2 (Advanced) comments, conditional_formatting, data_validation, freeze_panes, hyperlinks, images, merged_cells, pivot_tables 8
Tier 3 (Workbook Metadata) named_ranges, tables 2

Framework coverage: 19 modeled features total. Current public XLSX profile: 17 tested features, where 16 are scoreable per-library in current results (pivot_tables is tested but N/A across adapters on macOS fixtures).

Scoring

Pass-rate model per feature per adapter:

  • 3 (Green): all tests pass -- complete fidelity
  • 2 (Yellow): >= 80% pass -- functional for common cases
  • 1 (Orange): >= 50% pass -- basic recognition, significant limitations
  • 0 (Red): < 50% pass -- errors, corruption, or data loss
  • N/A: adapter does not support this operation direction (read/write)

Test cases have importance weights: basic (must-pass) vs edge (bonus). Both count toward pass rate but edge-case failures are noted in diagnostics.

Reproducibility

All results are written to JSON as the source of truth:

  • results/xlsx/results.json -- XLSX fidelity results
  • results/xls/results.json -- XLS fidelity results
  • results/perf/results.json -- Performance results

Renderers produce human-readable output (README.md, matrix.csv, heatmap, HTML dashboard) from these JSON files. The rendering step is deterministic and can be re-run independently:

uv run excelbench report --input results/xlsx/results.json --output results/xlsx