Commit 9bb29bf
chore: Pull request benchmarking (#255)
* Add GitHub Actions benchmark workflow with interactive comparison charts
This PR adds a comprehensive benchmarking infrastructure for performance testing:
**Features:**
- Automated benchmark workflow comparing baseline (PyPI) vs PR (source build)
- Interactive HTML comparison charts with Plotly visualization
- Branch-specific chart publishing to GitHub Pages
- Library version extraction from benchmark repository
- Smart Cargo caching with RUSTFLAGS awareness
**Changes:**
- Add `.github/workflows/benchmark.yml`: Main benchmark workflow
- Add `benchmarks/compare_benchmark_results.sh`: Result comparison script
- Add `benchmarks/generate_comparison_charts.py`: Interactive chart generation
- Add `benchmarks/README_BENCHMARKS.md`: Documentation for benchmark system
- Update `docs/performance.md`: Link to live comparison charts
- Update `README.md`: Reference to benchmark documentation
- Update `Cargo.lock`: Version bump to 0.18.0
- Update `.gitignore`: Exclude mkdocs cache directory
**Technical Details:**
- Uses skylake CPU target for stable GitHub Actions builds
- Implements proper RUSTFLAGS handling for incremental compilation
- Generates 6 charts per run (3 operations × 2 types)
- Publishes to biodatageeks.org/polars-bio/benchmark-comparison/{branch}/
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Improve cargo caching for faster benchmark builds
**Key improvements:**
1. Move Rust installation before cache step (correct order)
2. Restructure cache key for better fallback strategy
3. Enable incremental compilation for release builds (CARGO_INCREMENTAL=1)
4. Build from repo root to match cached target/ directory
**Cache Strategy:**
- Primary key: cargo-benchmark-{Cargo.lock hash}-skylake
- Fallback 1: cargo-benchmark-{Cargo.lock hash}- (same deps, different RUSTFLAGS)
- Fallback 2: cargo-benchmark- (any recent cache)
This should significantly reduce compilation time on subsequent runs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix maturin build path issue
The previous change broke the build by looking for Cargo.toml in the wrong directory.
Reverted to running maturin from polars-bio-bench with explicit -m ../Cargo.toml path.
This maintains the cargo cache improvements while fixing the build error.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add missing parse_benchmark_results.py script
The compare_benchmark_results.sh script was calling this script, but it did not exist.
This script:
- Parses baseline and PR CSV benchmark results
- Compares polars_bio performance against threshold
- Generates JSON and markdown reports
- Detects performance regressions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
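A minimal sketch of the regression check this script performs, assuming hypothetical CSV column names (`tool`, `mean_time_s`) and a simple two-argument CLI; the real script's interface and report format may differ:

```python
# Sketch only: the CSV column names ("tool", "mean_time_s") and the CLI shape
# are assumptions, not the actual parse_benchmark_results.py interface.
import csv
import json
import sys


def load_mean_times(path: str) -> dict[str, float]:
    """Map tool name -> mean runtime (seconds) from a benchmark CSV."""
    with open(path, newline="") as fh:
        return {row["tool"]: float(row["mean_time_s"]) for row in csv.DictReader(fh)}


def compare(baseline_csv: str, pr_csv: str, threshold: float = 1.10) -> dict:
    baseline = load_mean_times(baseline_csv)
    pr = load_mean_times(pr_csv)
    ratio = pr["polars_bio"] / baseline["polars_bio"]
    return {
        "baseline_s": baseline["polars_bio"],
        "pr_s": pr["polars_bio"],
        "ratio": ratio,
        "regression": ratio > threshold,  # e.g. >10% slower counts as a regression
    }


if __name__ == "__main__":
    report = compare(sys.argv[1], sys.argv[2])
    json.dump(report, sys.stdout, indent=2)
    sys.exit(1 if report["regression"] else 0)
```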
* Fix PR comment step to work with workflow_dispatch trigger
The previous condition only worked for pull_request events, but the workflow
is triggered by workflow_dispatch. Now we:
1. Find the PR number for the target branch using GitHub CLI
2. Comment on that PR if it exists
3. Use the correct PR number instead of context.issue.number
This allows PR comments to work even when manually triggering the workflow.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Simplify PR comment to only show chart and artifact links
Removed detailed benchmark tables from PR comments to keep them concise.
Full details are available via:
- Interactive charts (published to GitHub Pages)
- Workflow artifacts
- Workflow summary
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Make operation name parsing generic and flexible
Changed from hardcoded "-single-4tools" pattern to generic parsing:
- Extracts test case from end using regex: _(\d+-\d+)$
- Extracts operation name as everything before first dash
- Handles operations with underscores (count_overlaps)
- Works with any config: single-1tool, single-4tools, multi-8tools, etc.
- Works with no config pattern at all
Examples:
- overlap-single-4tools_7-8.csv -> operation="overlap"
- count_overlaps-multi-8tools_1-2.csv -> operation="count_overlaps"
- nearest_3-7.csv -> operation="nearest"
- overlap-single-1tool_5-3.csv -> operation="overlap"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
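A small Python sketch of the parsing rule described above (test case extracted with `_(\d+-\d+)$`, operation taken as everything before the first dash); the function name and return shape are illustrative, not taken from the actual script:

```python
# Illustrative sketch of the generic filename parsing described above.
import re
from pathlib import Path


def parse_result_filename(filename: str) -> tuple[str, str]:
    stem = Path(filename).stem                    # drop the .csv extension
    match = re.search(r"_(\d+-\d+)$", stem)       # test-case pair at the end
    test_case = match.group(1) if match else ""
    prefix = stem[: match.start()] if match else stem
    operation = prefix.split("-", 1)[0]           # everything before the first dash
    return operation, test_case


assert parse_result_filename("overlap-single-4tools_7-8.csv") == ("overlap", "7-8")
assert parse_result_filename("count_overlaps-multi-8tools_1-2.csv") == ("count_overlaps", "1-2")
assert parse_result_filename("nearest_3-7.csv") == ("nearest", "3-7")
```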
* Support unary test case numbers in addition to pairs
Updated regex pattern to match both:
- Unary test cases: _7 or _3
- Pair test cases: _7-8 or _1-2
Examples now supported:
- overlap-single-4tools_7.csv -> operation="overlap", test_case="7"
- overlap-single-4tools_7-8.csv -> operation="overlap", test_case="7-8"
- nearest_3.csv -> operation="nearest", test_case="3"
- count_overlaps_1-2.csv -> operation="count_overlaps", test_case="1-2"
Changed pattern from _(\d+-\d+)$ to _(\d+(?:-\d+)?)$
This makes the second number optional.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Support alphanumeric test case names with dashes
Changed test case pattern from strict numeric to flexible alphanumeric:
- Old pattern: _(\d+(?:-\d+)?)$ (only digits and optional -digits)
- New pattern: _([^_]+)$ (anything except underscore)
This now supports:
- Numeric: _7, _7-8, _1-2
- Alphanumeric with dashes: _gnomad-sv-vcf, _test-case-1
- Any other naming: _custom-name, _dataset-v2
Examples:
- overlap_gnomad-sv-vcf.csv -> operation="overlap", test_case="gnomad-sv-vcf"
- overlap-single-4tools_7-8.csv -> operation="overlap", test_case="7-8"
- nearest_custom-test.csv -> operation="nearest", test_case="custom-test"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
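For reference, the same sketch updated to the final pattern `_([^_]+)$` introduced by this and the previous commit; only the test-case regex changes, the operation rule stays the same, and the helper remains illustrative:

```python
# Final form of the illustrative parser after the two follow-up commits.
import re
from pathlib import Path


def parse_result_filename(filename: str) -> tuple[str, str]:
    stem = Path(filename).stem
    match = re.search(r"_([^_]+)$", stem)         # anything except "_" at the end
    test_case = match.group(1) if match else ""
    prefix = stem[: match.start()] if match else stem
    operation = prefix.split("-", 1)[0]
    return operation, test_case


assert parse_result_filename("overlap_gnomad-sv-vcf.csv") == ("overlap", "gnomad-sv-vcf")
assert parse_result_filename("overlap-single-4tools_7.csv") == ("overlap", "7")
assert parse_result_filename("nearest_custom-test.csv") == ("nearest", "custom-test")
```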
* Add run timestamp to HTML benchmark report header
Added UTC timestamp to the header showing when the report was generated.
Format: YYYY-MM-DD HH:MM:SS UTC
The timestamp appears in the subtitle line alongside baseline and PR info:
"Baseline: v0.18.0 | PR: issue-234 | Generated: 2025-10-24 18:30:45 UTC"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
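A one-liner showing the stated timestamp format; how it is wired into the HTML header template is simplified here:

```python
# The timestamp format described above; template wiring is omitted.
from datetime import datetime, timezone

generated = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
subtitle = f"Baseline: v0.18.0 | PR: issue-234 | Generated: {generated}"
print(subtitle)
```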
* Use self-hosted runner with huge-c10m25 label for benchmarks
Changed from ubuntu-latest to self-hosted runner with huge-c10m25 label.
This allows benchmarks to run on dedicated hardware with consistent
performance characteristics for more reliable benchmark comparisons.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Change self-hosted runner label to extra-c20m50
Updated runner label from huge-c10m25 to extra-c20m50 for benchmarks.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add PATH configuration for self-hosted runner
Added /home/gha/.local/bin to PATH for self-hosted runner.
This ensures tools installed in the user's local bin directory
(like poetry) are available to all workflow steps.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix PATH configuration to use GITHUB_PATH
Changed from env-level PATH override to using GITHUB_PATH.
This properly prepends /home/gha/.local/bin to PATH without
removing system paths (where tar and other utilities are located).
Using GITHUB_PATH ensures the path is available to all subsequent
steps while preserving the existing PATH.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Implement multi-runner benchmark support with matrix strategy
**Major Changes:**
1. **Workflow Structure:**
- Added `runners` input parameter (default: 'linux', supports: 'linux,macos')
- New `prepare` job that parses runners and creates matrix
- `benchmark` job now runs in parallel for each selected runner
- New `aggregate` job collects all runner results and generates charts
2. **Runner Configuration:**
- Linux: [self-hosted, Linux, extra-c20m50]
- macOS: [self-hosted, macos, extra-c10m50]
- Easy to add more runners in the matrix setup
3. **Result Storage:**
- Each runner uploads artifacts as `benchmark-results-{runner}`
- Includes `runner_info.json` with metadata (os, arch, timestamp)
- Baseline and PR results stored separately per runner
4. **Chart Generation:**
- Added `--multi-runner` mode to generate_comparison_charts.py
- Discovers all runner results from artifacts directory
- Currently shows first runner (tabbed interface TODO)
- Backward compatible with single-runner mode
5. **Aggregation Job:**
- Downloads all runner artifacts
- Generates combined comparison charts
- Publishes to gh-pages
- Posts PR comments
- Runs even if some benchmarks fail
**Usage:**
- Default (linux only): `gh workflow run benchmark.yml`
- Multiple runners: `gh workflow run benchmark.yml --field runners=linux,macos`
**TODO:**
- Implement full tabbed HTML interface for multiple runners
- Add historical data storage
- Add trend charts across time
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
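A hedged sketch of the matrix the `prepare` job might emit for the runner labels listed above; the actual workflow may assemble this JSON in shell rather than Python, and the `arch` values are assumptions:

```python
# Sketch of turning the `runners` input into a GitHub Actions matrix.
import json

RUNNER_CONFIG = {
    "linux": {"os": "linux", "arch": "amd64",
              "labels": ["self-hosted", "Linux", "extra-c20m50"]},
    "macos": {"os": "macos", "arch": "arm64",
              "labels": ["self-hosted", "macos", "extra-c10m50"]},
}


def build_matrix(runners_input: str) -> dict:
    selected = [r.strip() for r in runners_input.split(",") if r.strip()]
    return {"include": [RUNNER_CONFIG[r] for r in selected]}


# The resulting JSON would be exposed as a job output and consumed by the
# benchmark job's `strategy.matrix` via fromJSON(...).
print(json.dumps(build_matrix("linux,macos")))
```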
* Add benchmark suite selection parameter (fast/full)
- Add 'benchmark_suite' input parameter with choice between 'fast' and 'full'
- Default is 'fast' which uses conf/benchmark-pull-request-fast.yaml
- 'full' option uses conf/benchmark-pull-request.yaml
- Benchmark suite info is saved in runner_info.json for tracking
- Fixed prepare job outputs to reference actual step IDs
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Configure fast mode to use GitHub-hosted runners on both platforms
Fast mode (default):
- Uses GitHub-hosted runners (ubuntu-latest, macos-latest)
- Runs on BOTH linux and macos by default
- Suitable for quick PR validation
Full mode:
- Uses self-hosted runners (extra-c20m50, extra-c10m50)
- Respects runners parameter (default: linux only)
- Suitable for comprehensive benchmarking
Changes:
- Modified matrix generation to select runner type based on benchmark_suite
- Made PATH setup conditional (only for self-hosted runners)
- Split system dependencies installation for Linux (apt-get) vs macOS (brew)
- Fixed typo: jqwould -> jq
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Set architecture-specific Rust CPU targets for optimal performance
- Linux AMD64: Uses -Ctarget-cpu=skylake for x86_64 optimization
- macOS ARM64: Uses -Ctarget-cpu=native for Apple Silicon optimization
- Added dynamic Rust target CPU step that selects based on matrix.arch
- Updated Cargo cache key to include target CPU for proper cache separation
This ensures optimal performance on each platform while maintaining
separate build caches for different architectures.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Use apple-m1 target instead of native for ARM64 builds
Changed from -Ctarget-cpu=native to -Ctarget-cpu=apple-m1 for macOS ARM64
to match the publish workflow configuration and ensure consistent builds
across different workflows.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add macOS-specific linker flags to match publish workflow
For macOS ARM64 builds, now includes the same RUSTFLAGS as the publish workflow:
- -Clink-arg=-undefined
- -Clink-arg=dynamic_lookup
- -Ctarget-cpu=apple-m1
Linux AMD64 continues to use:
- -Ctarget-cpu=skylake
This ensures consistent build configuration between benchmark and publish
workflows, and resolves potential linking issues on macOS.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Implement interactive benchmark comparison with historical data storage
This commit implements a complete solution for interactive benchmark
comparison with historical data storage on GitHub Pages.
Key changes:
1. Fix aggregation error:
- Make comparison_report_combined.md optional in workflow summary
- Prevents failure when file doesn't exist in multi-runner scenarios
2. Add interactive comparison generator:
- New benchmarks/generate_interactive_comparison.py
- Generates HTML report with dropdown selection (no architecture clutter)
- Dynamic tab rendering based on available data
- Preserves original chart layout and styling
- Supports "Baseline vs Target" terminology
3. Add helper scripts:
- scripts/generate_benchmark_metadata.py: Creates metadata.json for each run
- scripts/update_benchmark_index.py: Maintains master index of all datasets
4. Update benchmark workflow:
- Clone/create gh-pages branch automatically
- Store benchmark data in structured format:
* Tags: benchmark-data/tags/{version}/{runner}/
* Commits: benchmark-data/commits/{sha}/{runner}/
- Generate metadata.json for each dataset
- Update master index.json with new datasets
- Generate interactive report at benchmark-comparison/index.html
- Commit and push all changes to gh-pages
Features:
- Historical data storage for all release tags
- Automatic cleanup for old commit data (via cleanup script, to be added)
- Dropdown selection without architecture labels
- Dynamic tabs for multi-runner comparisons
- Pure static HTML (no server required)
- Automatic deployment to GitHub Pages
Resolves: https://github.com/biodatageeks/polars-bio/actions/runs/18800632149/job/53648469437
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
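A sketch of the gh-pages storage layout and per-run metadata described above; the helper names and any metadata fields beyond ref/sha/runner/timestamp are assumptions:

```python
# Sketch of the benchmark-data layout on gh-pages; metadata fields are assumed.
import json
from datetime import datetime, timezone
from pathlib import Path


def dataset_dir(gh_pages: Path, ref_type: str, ref: str, runner: str) -> Path:
    # Tags:    benchmark-data/tags/{version}/{runner}/
    # Commits: benchmark-data/commits/{sha}/{runner}/
    kind = "tags" if ref_type == "tag" else "commits"
    return gh_pages / "benchmark-data" / kind / ref / runner


def write_metadata(target: Path, ref_type: str, ref: str, sha: str, runner: str) -> None:
    target.mkdir(parents=True, exist_ok=True)
    metadata = {
        "ref_type": ref_type,
        "ref": ref,
        "sha": sha,
        "runner": runner,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))


# e.g. write_metadata(dataset_dir(Path("gh-pages"), "tag", "0.18.0", "linux"),
#                     "tag", "0.18.0", "abc1234", "linux")
```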
* Fix static chart publishing to use index.html and avoid overwriting interactive report
- Change static chart filename from static.html to index.html
- Remove main branch special case (always use branch-specific directory)
- This ensures old PR links work correctly
- Main branch URL points to interactive report at root
- PR branch URLs point to static charts in subdirectories
* Fix: Store baseline benchmark data to gh-pages in addition to PR data
Previously only PR/branch data was being stored from pr_results,
but the baseline tag data (from baseline_results) was not saved.
Changes:
- Add new step 'Store baseline benchmark data to gh-pages'
- Process baseline_results directory for each runner
- Store baseline data to tags/{baseline_tag}/{runner}/
- Skip if baseline data already exists (don't overwrite)
- Mark baseline as latest tag in index
- This allows the interactive comparison to show tag data
Now when workflow runs, it will store BOTH:
1. Target/PR data (branch or tag being tested)
2. Baseline data (comparison tag, e.g., 0.18.0)
This populates the dropdown with historical tag data for comparison.
* Fix: Fetch tags in aggregate job for baseline SHA lookup
The 'Store baseline benchmark data' step needs to run git rev-parse
on the baseline tag, but tags weren't being fetched in the aggregate
job checkout.
Changes:
- Add fetch-depth: 0 to get full history
- Add fetch-tags: true to fetch all tags
This fixes: fatal: ambiguous argument '0.18.0': unknown revision
* Implement multi-runner tabs in static PR comparison report
Previously the static report only showed data for the first runner
when multiple runners were present. Now it generates a tabbed
interface similar to the interactive report.
Changes:
- Add _create_tabbed_html() helper function
- Modify generate_multi_runner_html() to generate charts for all runners
- Extract body content from each runner's HTML
- Create tabbed interface with styled tabs
- First tab is active by default
- Tab switching with JavaScript
- Runner labels: Linux AMD64, macOS ARM64
This allows PR comments to show comprehensive multi-architecture
benchmark results with easy tab switching between Linux and macOS.
* Fix: Make chart IDs unique per runner in multi-runner tabs
The previous implementation had all runners using the same chart element
IDs (e.g., 'chart-overlap-total'), causing all scripts to render to the
same elements regardless of which tab was active.
Changes:
- Add runner suffix to all chart div IDs
- Update Plotly.newPlot references to match new IDs
- Each runner now has unique IDs: chart-{operation}-{type}-{runner}
Example:
- Linux: chart-overlap-total-linux
- macOS: chart-overlap-total-macos
This ensures each tab displays its own unique data correctly.
* Keep 10 most recent commits per branch in dropdown
Previously only the latest commit per branch was shown in the
interactive comparison dropdown. Now keeps the 10 most recent
commits sorted by timestamp descending.
Changes:
- Modify add_dataset() to keep multiple commits per branch
- Avoid duplicates by checking commit SHA
- Limit to N most recent commits per branch+runner combination
- Add --max-commits parameter (default: 10)
- Tags still kept as single entry (only latest)
This allows users to compare with recent historical commits,
not just the absolute latest one.
Dropdown will now show:
- All tags (0.18.0, 0.19.0, etc.)
- 10 most recent commits for issue-234
- 10 most recent commits for any other branch
Example:
issue-234 (70eaf33) - 2025-10-25 10:54 UTC
issue-234 (00945fc) - 2025-10-25 10:42 UTC
issue-234 (0ffaa45) - 2025-10-25 10:22 UTC
...
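A sketch of the retention rule described above (dedupe by commit SHA, sort newest first, keep at most `--max-commits` entries per branch+runner); the index structure shown here is an assumption, and tags would be handled separately as a single latest entry:

```python
# Sketch of the per-branch retention rule; the index layout is assumed.
def add_dataset(index: dict, branch: str, runner: str, sha: str,
                timestamp: str, max_commits: int = 10) -> None:
    key = f"{branch}/{runner}"
    commits = index.setdefault(key, [])
    if any(c["sha"] == sha for c in commits):     # avoid duplicate SHAs
        return
    commits.append({"sha": sha, "timestamp": timestamp})
    commits.sort(key=lambda c: c["timestamp"], reverse=True)
    del commits[max_commits:]                     # keep only the N most recent
```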
* Fix: Show all commits in interactive comparison dropdown
Previously, multiple commits from the same branch were collapsed into a
single dropdown entry. Now each commit gets its own entry in the dropdown.
Changes:
- Group branch datasets by commit SHA instead of just ref name
- Use unique key format (ref@sha) to differentiate commits
- Update JavaScript to handle per-commit entries
- Users can now select and compare any of the 10 most recent commits
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix: Make dataset storage keys unique across commits
Previously, all commits from the same branch shared the same dataset ID,
causing later commits to overwrite earlier ones when loading data. Only
the last commit's data was accessible, making comparison ineffective.
Changes:
- Include commit SHA in dataset storage keys for branches
- Update refs_by_type to reference unique dataset keys
- Now each commit's data is stored separately and accessible
Example:
- Before: all commits used 'branch-issue-234-linux'
- After: 'branch-issue-234-linux@f4398c1', 'branch-issue-234-linux@28ed1c2', etc.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
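A tiny illustration of the unique-key scheme from this fix; the helper name is hypothetical:

```python
# Hypothetical helper showing the ref@sha key format for branch datasets.
def dataset_key(ref: str, runner: str, sha: str = "") -> str:
    base = f"branch-{ref}-{runner}"
    return f"{base}@{sha[:7]}" if sha else base


# dataset_key("issue-234", "linux", "f4398c1e...") -> "branch-issue-234-linux@f4398c1"
```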
* Fix: Make data and layout variables unique across runner tabs
Previously, both Linux and macOS tabs used the same variable names
(data_count_overlaps_total, layout_count_overlaps_total, etc.), causing
the second tab's declarations to overwrite the first tab's data. Both tabs
ended up showing the same data (from whichever runner was processed last).
Changes:
- Add runner suffix to all data variable declarations
- Add runner suffix to all layout variable declarations
- Update Plotly.newPlot references to use renamed variables
Example transformation:
- var data_count_overlaps_total -> var data_count_overlaps_total_linux
- var data_count_overlaps_total -> var data_count_overlaps_total_macos
- Plotly.newPlot(..., data_count_overlaps_total_linux, layout_count_overlaps_total_linux)
Now each runner tab has uniquely named variables and displays its own data.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix switchTab function to work without event.target
The switchTab function was crashing because it relied on the global
event.target, which is undefined when the function is called manually or
in certain contexts. This prevented users from actually switching between
the Linux and macOS tabs in static PR reports.
Fix: Find and activate the button using the runnerName parameter
instead of relying on event.target.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Use data-runner attribute for tab button identification
The previous fix using onclick attribute matching was unreliable.
Now using a data-runner attribute on buttons to properly identify
and activate the correct tab button when switching.
Changes:
- Add data-runner attribute to tab buttons
- Use button.dataset.runner for reliable button identification
- Simplifies switchTab logic and ensures tabs switch correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix Plotly chart rendering in hidden tabs with redraw
Charts rendered in hidden tabs (display: none) don't render correctly
in Plotly. When tabs switch, we need to call Plotly.redraw() to force
the charts to re-render with correct dimensions and data.
Changes:
- Add data-runner attribute to tab buttons for reliable identification
- Call Plotly.redraw() on all charts when switching tabs
- Use setTimeout to ensure tab is visible before redrawing
Tested with Playwright automation - confirms charts show different
data for Linux (936.386ms) vs macOS (833.433ms).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix multi-runner chart tab switching with lazy initialization
Charts rendered in hidden tabs (display: none) have incorrect dimensions
and don't update visually when tabs switch. This commit implements lazy
chart initialization to solve the problem:
- Store chart data/layout in window.chartConfigs instead of creating charts immediately
- Initialize charts only when their tab becomes visible for the first time
- Track which tabs have been initialized to avoid recreating charts
- Initialize the first (active) tab on page load via DOMContentLoaded
This approach ensures charts are always created in visible containers with
correct dimensions, fixing the issue where switching tabs showed identical
data despite having different underlying values.
Tested with Playwright automation confirming different data for Linux (936.386ms)
vs macOS (833.433ms) benchmark results.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix comparison generation
---------
Co-authored-by: Claude <noreply@anthropic.com>
File tree: 27 files changed (+4,198 −3,623 lines) across .github/workflows, benchmarks, results, docs, and scripts.