Conversation
This PR adds a comprehensive benchmarking infrastructure for performance testing:
**Features:**
- Automated benchmark workflow comparing baseline (PyPI) vs PR (source build)
- Interactive HTML comparison charts with Plotly visualization
- Branch-specific chart publishing to GitHub Pages
- Library version extraction from benchmark repository
- Smart Cargo caching with RUSTFLAGS awareness
**Changes:**
- Add `.github/workflows/benchmark.yml`: Main benchmark workflow
- Add `benchmarks/compare_benchmark_results.sh`: Result comparison script
- Add `benchmarks/generate_comparison_charts.py`: Interactive chart generation
- Add `benchmarks/README_BENCHMARKS.md`: Documentation for benchmark system
- Update `docs/performance.md`: Link to live comparison charts
- Update `README.md`: Reference to benchmark documentation
- Update `Cargo.lock`: Version bump to 0.18.0
- Update `.gitignore`: Exclude mkdocs cache directory
**Technical Details:**
- Uses skylake CPU target for stable GitHub Actions builds
- Implements proper RUSTFLAGS handling for incremental compilation
- Generates 6 charts per run (3 operations × 2 types)
- Publishes to biodatageeks.org/polars-bio/benchmark-comparison/{branch}/
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
**Key improvements:**
1. Move Rust installation before cache step (correct order)
2. Restructure cache key for better fallback strategy
3. Enable incremental compilation for release builds (CARGO_INCREMENTAL=1)
4. Build from repo root to match cached target/ directory
**Cache Strategy:**
- Primary key: cargo-benchmark-{Cargo.lock hash}-skylake
- Fallback 1: cargo-benchmark-{Cargo.lock hash}- (same deps, different RUSTFLAGS)
- Fallback 2: cargo-benchmark- (any recent cache)
This should significantly reduce compilation time on subsequent runs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The previous change broke the build by looking for Cargo.toml in the wrong directory. Reverted to running maturin from polars-bio-bench with explicit -m ../Cargo.toml path. This maintains the cargo cache improvements while fixing the build error. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The compare_benchmark_results.sh script was calling this file but it didn't exist. This script: - Parses baseline and PR CSV benchmark results - Compares polars_bio performance against threshold - Generates JSON and markdown reports - Detects performance regressions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The previous condition only worked for pull_request events, but the workflow is triggered by workflow_dispatch. Now we: 1. Find the PR number for the target branch using GitHub CLI 2. Comment on that PR if it exists 3. Use the correct PR number instead of context.issue.number This allows PR comments to work even when manually triggering the workflow. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Benchmark Comparison: issue-234 vs 0.18.0📊 View Interactive Charts | 📦 Download Artifacts Benchmark comparison generated by polars-bio CI |
Removed detailed benchmark tables from PR comments to keep them concise. Full details are available via: - Interactive charts (published to GitHub Pages) - Workflow artifacts - Workflow summary 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed from hardcoded "-single-4tools" pattern to generic parsing: - Extracts test case from end using regex: _(\d+-\d+)$ - Extracts operation name as everything before first dash - Handles operations with underscores (count_overlaps) - Works with any config: single-1tool, single-4tools, multi-8tools, etc. - Works with no config pattern at all Examples: - overlap-single-4tools_7-8.csv -> operation="overlap" - count_overlaps-multi-8tools_1-2.csv -> operation="count_overlaps" - nearest_3-7.csv -> operation="nearest" - overlap-single-1tool_5-3.csv -> operation="overlap" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Updated regex pattern to match both: - Unary test cases: _7 or _3 - Pair test cases: _7-8 or _1-2 Examples now supported: - overlap-single-4tools_7.csv -> operation="overlap", test_case="7" - overlap-single-4tools_7-8.csv -> operation="overlap", test_case="7-8" - nearest_3.csv -> operation="nearest", test_case="3" - count_overlaps_1-2.csv -> operation="count_overlaps", test_case="1-2" Changed pattern from _(\d+-\d+)$ to _(\d+(?:-\d+)?)$ This makes the second number optional. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed test case pattern from strict numeric to flexible alphanumeric: - Old pattern: _(\d+(?:-\d+)?)$ (only digits and optional -digits) - New pattern: _([^_]+)$ (anything except underscore) This now supports: - Numeric: _7, _7-8, _1-2 - Alphanumeric with dashes: _gnomad-sv-vcf, _test-case-1 - Any other naming: _custom-name, _dataset-v2 Examples: - overlap_gnomad-sv-vcf.csv -> operation="overlap", test_case="gnomad-sv-vcf" - overlap-single-4tools_7-8.csv -> operation="overlap", test_case="7-8" - nearest_custom-test.csv -> operation="nearest", test_case="custom-test" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Added UTC timestamp to the header showing when the report was generated. Format: YYYY-MM-DD HH:MM:SS UTC The timestamp appears in the subtitle line alongside baseline and PR info: "Baseline: v0.18.0 | PR: issue-234 | Generated: 2025-10-24 18:30:45 UTC" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed from ubuntu-latest to self-hosted runner with huge-c10m25 label. This allows benchmarks to run on dedicated hardware with consistent performance characteristics for more reliable benchmark comparisons. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Updated runner label from huge-c10m25 to extra-c20m50 for benchmarks. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Added /home/gha/.local/bin to PATH for self-hosted runner. This ensures tools installed in the user's local bin directory (like poetry) are available to all workflow steps. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed from env-level PATH override to using GITHUB_PATH. This properly prepends /home/gha/.local/bin to PATH without removing system paths (where tar and other utilities are located). Using GITHUB_PATH ensures the path is available to all subsequent steps while preserving the existing PATH. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
**Major Changes:**
1. **Workflow Structure:**
- Added `runners` input parameter (default: 'linux', supports: 'linux,macos')
- New `prepare` job that parses runners and creates matrix
- `benchmark` job now runs in parallel for each selected runner
- New `aggregate` job collects all runner results and generates charts
2. **Runner Configuration:**
- Linux: [self-hosted, Linux, extra-c20m50]
- macOS: [self-hosted, macos, extra-c10m50]
- Easy to add more runners in the matrix setup
3. **Result Storage:**
- Each runner uploads artifacts as `benchmark-results-{runner}`
- Includes `runner_info.json` with metadata (os, arch, timestamp)
- Baseline and PR results stored separately per runner
4. **Chart Generation:**
- Added `--multi-runner` mode to generate_comparison_charts.py
- Discovers all runner results from artifacts directory
- Currently shows first runner (tabbed interface TODO)
- Backward compatible with single-runner mode
5. **Aggregation Job:**
- Downloads all runner artifacts
- Generates combined comparison charts
- Publishes to gh-pages
- Posts PR comments
- Runs even if some benchmarks fail
**Usage:**
- Default (linux only): `gh workflow run benchmark.yml`
- Multiple runners: `gh workflow run benchmark.yml --field runners=linux,macos`
**TODO:**
- Implement full tabbed HTML interface for multiple runners
- Add historical data storage
- Add trend charts across time
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add 'benchmark_suite' input parameter with choice between 'fast' and 'full' - Default is 'fast' which uses conf/benchmark-pull-request-fast.yaml - 'full' option uses conf/benchmark-pull-request.yaml - Benchmark suite info is saved in runner_info.json for tracking - Fixed prepare job outputs to reference actual step IDs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
| runs-on: ubuntu-latest | ||
| outputs: | ||
| matrix: ${{ steps.set-matrix.outputs.matrix }} | ||
| baseline_tag: ${{ steps.baseline.outputs.tag }} | ||
| target_ref: ${{ steps.target.outputs.ref }} | ||
| benchmark_config: ${{ steps.benchmark-config.outputs.config }} | ||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| fetch-depth: 0 | ||
|
|
||
| - name: Determine baseline tag | ||
| id: baseline | ||
| run: | | ||
| if [ -n "${{ inputs.baseline_tag }}" ]; then | ||
| BASELINE_TAG="${{ inputs.baseline_tag }}" | ||
| echo "Using user-specified baseline tag: $BASELINE_TAG" | ||
| else | ||
| BASELINE_TAG=$(git tag --sort=-creatordate | head -1) | ||
| if [ -z "$BASELINE_TAG" ]; then | ||
| echo "Error: No git tags found. Please create a tag or specify baseline_tag input." | ||
| exit 1 | ||
| fi | ||
| echo "Using latest tag as baseline: $BASELINE_TAG" | ||
| fi | ||
| echo "tag=$BASELINE_TAG" >> $GITHUB_OUTPUT | ||
|
|
||
| - name: Determine target reference | ||
| id: target | ||
| run: | | ||
| if [ -n "${{ inputs.target_branch }}" ]; then | ||
| TARGET_REF="${{ inputs.target_branch }}" | ||
| else | ||
| TARGET_REF="${{ github.ref_name }}" | ||
| fi | ||
| echo "ref=$TARGET_REF" >> $GITHUB_OUTPUT | ||
| echo "Target reference: $TARGET_REF" | ||
|
|
||
| - name: Set up runner matrix | ||
| id: set-matrix | ||
| run: | | ||
| RUNNERS="${{ inputs.runners }}" | ||
|
|
||
| # Build matrix JSON | ||
| MATRIX_JSON='{"include":[' | ||
|
|
||
| if [[ "$RUNNERS" == *"linux"* ]]; then | ||
| MATRIX_JSON+='{"runner":"linux","labels":"[\"self-hosted\",\"Linux\",\"extra-c20m50\"]","os":"linux","arch":"amd64"},' | ||
| fi | ||
|
|
||
| if [[ "$RUNNERS" == *"macos"* ]]; then | ||
| MATRIX_JSON+='{"runner":"macos","labels":"[\"self-hosted\",\"macos\",\"extra-c10m50\"]","os":"macos","arch":"arm64"},' | ||
| fi | ||
|
|
||
| # Remove trailing comma and close JSON | ||
| MATRIX_JSON="${MATRIX_JSON%,}]}" | ||
|
|
||
| echo "Matrix: $MATRIX_JSON" | ||
| echo "matrix=$MATRIX_JSON" >> $GITHUB_OUTPUT | ||
|
|
||
| - name: Determine benchmark config | ||
| id: benchmark-config | ||
| run: | | ||
| SUITE="${{ inputs.benchmark_suite }}" | ||
| if [ "$SUITE" == "full" ]; then | ||
| CONFIG="conf/benchmark-pull-request.yaml" | ||
| else | ||
| CONFIG="conf/benchmark-pull-request-fast.yaml" | ||
| fi | ||
| echo "Using benchmark config: $CONFIG" | ||
| echo "config=$CONFIG" >> $GITHUB_OUTPUT | ||
|
|
||
| benchmark: |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 4 months ago
To fix this problem, add an explicit permissions block with the minimal necessary scopes to the prepare job. Since the job only checks out code and computes conditions/outputs (it does not push, create PRs, or access secrets), only contents: read is required (read-only access to code). Place this block immediately under the runs-on: declaration for clarity and to follow standard YAML structure. No additional methods, imports, or definitions are needed: the change is a single-line YAML addition.
| @@ -35,6 +35,8 @@ | ||
| jobs: | ||
| prepare: | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: read | ||
| outputs: | ||
| matrix: ${{ steps.set-matrix.outputs.matrix }} | ||
| baseline_tag: ${{ steps.baseline.outputs.tag }} |
Fast mode (default): - Uses GitHub-hosted runners (ubuntu-latest, macos-latest) - Runs on BOTH linux and macos by default - Suitable for quick PR validation Full mode: - Uses self-hosted runners (extra-c20m50, extra-c10m50) - Respects runners parameter (default: linux only) - Suitable for comprehensive benchmarking Changes: - Modified matrix generation to select runner type based on benchmark_suite - Made PATH setup conditional (only for self-hosted runners) - Split system dependencies installation for Linux (apt-get) vs macOS (brew) - Fixed typo: jqwould -> jq 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Linux AMD64: Uses -Ctarget-cpu=skylake for x86_64 optimization - macOS ARM64: Uses -Ctarget-cpu=native for Apple Silicon optimization - Added dynamic Rust target CPU step that selects based on matrix.arch - Updated Cargo cache key to include target CPU for proper cache separation This ensures optimal performance on each platform while maintaining separate build caches for different architectures. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Changed from -Ctarget-cpu=native to -Ctarget-cpu=apple-m1 for macOS ARM64 to match the publish workflow configuration and ensure consistent builds across different workflows. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
For macOS ARM64 builds, now includes the same RUSTFLAGS as the publish workflow: - -Clink-arg=-undefined - -Clink-arg=dynamic_lookup - -Ctarget-cpu=apple-m1 Linux AMD64 continues to use: - -Ctarget-cpu=skylake This ensures consistent build configuration between benchmark and publish workflows, and resolves potential linking issues on macOS. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit implements a complete solution for interactive benchmark
comparison with historical data storage on GitHub Pages.
Key changes:
1. Fix aggregation error:
- Make comparison_report_combined.md optional in workflow summary
- Prevents failure when file doesn't exist in multi-runner scenarios
2. Add interactive comparison generator:
- New benchmarks/generate_interactive_comparison.py
- Generates HTML report with dropdown selection (no architecture clutter)
- Dynamic tab rendering based on available data
- Preserves original chart layout and styling
- Supports "Baseline vs Target" terminology
3. Add helper scripts:
- scripts/generate_benchmark_metadata.py: Creates metadata.json for each run
- scripts/update_benchmark_index.py: Maintains master index of all datasets
4. Update benchmark workflow:
- Clone/create gh-pages branch automatically
- Store benchmark data in structured format:
* Tags: benchmark-data/tags/{version}/{runner}/
* Commits: benchmark-data/commits/{sha}/{runner}/
- Generate metadata.json for each dataset
- Update master index.json with new datasets
- Generate interactive report at benchmark-comparison/index.html
- Commit and push all changes to gh-pages
Features:
- Historical data storage for all release tags
- Automatic cleanup for old commit data (via cleanup script, to be added)
- Dropdown selection without architecture labels
- Dynamic tabs for multi-runner comparisons
- Pure static HTML (no server required)
- Automatic deployment to GitHub Pages
Resolves: https://github.com/biodatageeks/polars-bio/actions/runs/18800632149/job/53648469437
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…nteractive report - Change static chart filename from static.html to index.html - Remove main branch special case (always use branch-specific directory) - This ensures old PR links work correctly - Main branch URL points to interactive report at root - PR branch URLs point to static charts in subdirectories
Previously only PR/branch data was being stored from pr_results,
but the baseline tag data (from baseline_results) was not saved.
Changes:
- Add new step 'Store baseline benchmark data to gh-pages'
- Process baseline_results directory for each runner
- Store baseline data to tags/{baseline_tag}/{runner}/
- Skip if baseline data already exists (don't overwrite)
- Mark baseline as latest tag in index
- This allows the interactive comparison to show tag data
Now when workflow runs, it will store BOTH:
1. Target/PR data (branch or tag being tested)
2. Baseline data (comparison tag, e.g., 0.18.0)
This populates the dropdown with historical tag data for comparison.
The 'Store baseline benchmark data' step needs to run git rev-parse on the baseline tag, but tags weren't being fetched in the aggregate job checkout. Changes: - Add fetch-depth: 0 to get full history - Add fetch-tags: true to fetch all tags This fixes: fatal: ambiguous argument '0.18.0': unknown revision
Previously the static report only showed data for the first runner when multiple runners were present. Now it generates a tabbed interface similar to the interactive report. Changes: - Add _create_tabbed_html() helper function - Modify generate_multi_runner_html() to generate charts for all runners - Extract body content from each runner's HTML - Create tabbed interface with styled tabs - First tab is active by default - Tab switching with JavaScript - Runner labels: Linux AMD64, macOS ARM64 This allows PR comments to show comprehensive multi-architecture benchmark results with easy tab switching between Linux and macOS.
The previous implementation had all runners using the same chart element
IDs (e.g., 'chart-overlap-total'), causing all scripts to render to the
same elements regardless of which tab was active.
Changes:
- Add runner suffix to all chart div IDs
- Update Plotly.newPlot references to match new IDs
- Each runner now has unique IDs: chart-{operation}-{type}-{runner}
Example:
- Linux: chart-overlap-total-linux
- macOS: chart-overlap-total-macos
This ensures each tab displays its own unique data correctly.
Previously only the latest commit per branch was shown in the interactive comparison dropdown. Now keeps the 10 most recent commits sorted by timestamp descending. Changes: - Modify add_dataset() to keep multiple commits per branch - Avoid duplicates by checking commit SHA - Limit to N most recent commits per branch+runner combination - Add --max-commits parameter (default: 10) - Tags still kept as single entry (only latest) This allows users to compare with recent historical commits, not just the absolute latest one. Dropdown will now show: - All tags (0.18.0, 0.19.0, etc.) - 10 most recent commits for issue-234 - 10 most recent commits for any other branch Example: issue-234 (70eaf33) - 2025-10-25 10:54 UTC issue-234 (00945fc) - 2025-10-25 10:42 UTC issue-234 (0ffaa45) - 2025-10-25 10:22 UTC ...
Previously, multiple commits from the same branch were collapsed into a single dropdown entry. Now each commit gets its own entry in the dropdown. Changes: - Group branch datasets by commit SHA instead of just ref name - Use unique key format (ref@sha) to differentiate commits - Update JavaScript to handle per-commit entries - Users can now select and compare any of the 10 most recent commits 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Previously, all commits from the same branch shared the same dataset ID, causing later commits to overwrite earlier ones when loading data. Only the last commit's data was accessible, making comparison ineffective. Changes: - Include commit SHA in dataset storage keys for branches - Update refs_by_type to reference unique dataset keys - Now each commit's data is stored separately and accessible Example: - Before: all commits used 'branch-issue-234-linux' - After: 'branch-issue-234-linux@f4398c1', 'branch-issue-234-linux@28ed1c2', etc. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Previously, both Linux and macOS tabs used the same variable names (data_count_overlaps_total, layout_count_overlaps_total, etc.), causing the second tab's declarations to overwrite the first tab's data. Both tabs ended up showing the same data (from whichever runner was processed last). Changes: - Add runner suffix to all data variable declarations - Add runner suffix to all layout variable declarations - Update Plotly.newPlot references to use renamed variables Example transformation: - var data_count_overlaps_total -> var data_count_overlaps_total_linux - var data_count_overlaps_total -> var data_count_overlaps_total_macos - Plotly.newPlot(..., data_count_overlaps_total_linux, layout_count_overlaps_total_linux) Now each runner tab has uniquely named variables and displays its own data. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The switchTab function was crashing because it relied on the global event.target which is undefined when called manually or in certain contexts. This prevented users from actually switching between Linux and macOS tabs in static PR reports. Fix: Find and activate the button using the runnerName parameter instead of relying on event.target. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The previous fix using onclick attribute matching was unreliable. Now using a data-runner attribute on buttons to properly identify and activate the correct tab button when switching. Changes: - Add data-runner attribute to tab buttons - Use button.dataset.runner for reliable button identification - Simplifies switchTab logic and ensures tabs switch correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Charts rendered in hidden tabs (display: none) don't render correctly in Plotly. When tabs switch, we need to call Plotly.redraw() to force the charts to re-render with correct dimensions and data. Changes: - Add data-runner attribute to tab buttons for reliable identification - Call Plotly.redraw() on all charts when switching tabs - Use setTimeout to ensure tab is visible before redrawing Tested with Playwright automation - confirms charts show different data for Linux (936.386ms) vs macOS (833.433ms). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Charts rendered in hidden tabs (display: none) have incorrect dimensions and don't update visually when tabs switch. This commit implements lazy chart initialization to solve the problem: - Store chart data/layout in window.chartConfigs instead of creating charts immediately - Initialize charts only when their tab becomes visible for the first time - Track which tabs have been initialized to avoid recreating charts - Initialize the first (active) tab on page load via DOMContentLoaded This approach ensures charts are always created in visible containers with correct dimensions, fixing the issue where switching tabs showed identical data despite having different underlying values. Tested with Playwright automation confirming different data for Linux (936.386ms) vs macOS (833.433ms) benchmark results. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
No description provided.