ExcelBench measures feature fidelity for Python Excel libraries -- how accurately they can read and write Excel features compared to native Excel. It also measures performance (throughput, memory) as a secondary axis.
- Fidelity evaluates correctness and completeness of features (formats, formulas, borders, etc.).
- Performance measures throughput (cells/s) and memory (RSS, tracemalloc) for read/write workloads.
Both tracks use the same adapter framework but run independently: fidelity uses oracle verification, performance skips it.
ExcelBench uses a hybrid write verification model:
- Primary oracle: Excel itself via
xlwings(highest fidelity). - Fallback oracle:
openpyxlwhen Excel is unavailable (CI / headless).
This allows reliable local verification while keeping CI runnable.
Two fixture paths exist:
fixtures/excel/(tracked): canonical Excel-generated files used in CI.fixtures/excel_xls/(tracked): canonical .xls fixtures for legacy format testing.test_files/(gitignored): local scratch output for development.
Generate canonical fixtures with:
uv run excelbench generate --output fixtures/excel # .xlsx
uv run excelbench generate-xls --output fixtures/excel_xls # .xlsFeatures are organized into tiers reflecting complexity:
| Tier | Features | Count |
|---|---|---|
| Tier 0 (Core) | cell_values, formulas, multiple_sheets | 3 |
| Tier 1 (Formatting) | alignment, background_colors, borders, dimensions, number_formats, text_formatting | 6 |
| Tier 2 (Advanced) | comments, conditional_formatting, data_validation, freeze_panes, hyperlinks, images, merged_cells, pivot_tables | 8 |
| Tier 3 (Workbook Metadata) | named_ranges, tables | 2 |
Framework coverage: 19 modeled features total. Current public XLSX profile: 17 tested features, where 16 are scoreable per-library in current results (pivot_tables is tested but N/A across adapters on macOS fixtures).
Pass-rate model per feature per adapter:
- 3 (Green): all tests pass -- complete fidelity
- 2 (Yellow): >= 80% pass -- functional for common cases
- 1 (Orange): >= 50% pass -- basic recognition, significant limitations
- 0 (Red): < 50% pass -- errors, corruption, or data loss
- N/A: adapter does not support this operation direction (read/write)
Test cases have importance weights: basic (must-pass) vs edge (bonus). Both count toward pass rate but edge-case failures are noted in diagnostics.
All results are written to JSON as the source of truth:
results/xlsx/results.json-- XLSX fidelity resultsresults/xls/results.json-- XLS fidelity resultsresults/perf/results.json-- Performance results
Renderers produce human-readable output (README.md, matrix.csv, heatmap, HTML dashboard) from these JSON files. The rendering step is deterministic and can be re-run independently:
uv run excelbench report --input results/xlsx/results.json --output results/xlsx