spec

  • Name: benchkit
  • Version: 1.0.0
  • Date: 2025-08-08
  • Status: DRAFT

Table of Contents

  • Part I: Public Contract (Mandatory Requirements)
      1. Vision & Scope
      • 1.1. Core Vision: Practical Benchmarking Toolkit
      • 1.2. In Scope: The Toolkit Philosophy
      • 1.3. Out of Scope
      2. System Actors
      3. Ubiquitous Language (Vocabulary)
      4. Core Functional Requirements
      • 4.1. Measurement & Timing
      • 4.2. Data Generation
      • 4.3. Report Generation
      • 4.4. Analysis Tools
      5. Critical Bug Fixes and Security Requirements
      6. Non-Functional Requirements
      7. Feature Flags & Modularity
      8. Standard Directory Requirements
  • Part II: Internal Design (Design Recommendations)
      9. Architectural Principles
      10. Integration Patterns
      11. Key Learnings from unilang/strs_tools Benchmarking
  • Part III: Development Guidelines
      12. Lessons Learned Reference
      13. Cargo Bench Integration Requirements
      14. Implementation Priorities

Part I: Public Contract (Mandatory Requirements)

1. Vision & Scope

1.1. Core Vision: Practical Benchmarking Toolkit

benchkit is designed as a toolkit, not a framework. Unlike opinionated frameworks that impose specific workflows, benchkit provides flexible building blocks that developers can combine to create custom benchmarking solutions tailored to their specific needs.

Key Philosophy:

  • Standard Directory Compliance: ALL benchmark files must be in standard benches/ directory
  • Automatic Documentation: benches/readme.md automatically updated with comprehensive reports
  • Research-Grade Statistical Rigor: Professional statistical analysis meeting publication standards
  • Toolkit over Framework: Provide tools, not constraints
  • Optimization-Focused: Surface key metrics that guide optimization decisions
  • Integration-Friendly: Work alongside existing tools, not replace them

1.2. In Scope: The Toolkit Philosophy

Core Capabilities:

  1. Standard Directory Integration: ALL benchmark files organized in standard benches/ directory following Rust conventions
  2. Automatic Report Generation: benches/readme.md automatically updated with comprehensive benchmark results and analysis
  3. Flexible Measurement: Time, memory, throughput, custom metrics with statistical rigor
  4. Data Generation: Configurable test data generators for common patterns
  5. Analysis Tools: Statistical analysis, comparative benchmarking, regression detection, git-style diffing, visualization
  6. Living Documentation: Automatically maintained performance documentation that stays current with code changes

Target Use Cases:

  • Performance analysis for optimization work
  • Before/after comparisons for feature implementation
  • Historical performance tracking across commits/versions
  • Continuous performance monitoring in CI/CD
  • Documentation generation for performance characteristics
  • Research and experimentation with algorithm variants

1.3. Out of Scope

Not Provided:

  • Opinionated benchmark runner (use criterion for that)
  • Automatic CI/CD integration (provide tools for manual integration)
  • Real-time monitoring (focus on analysis, not monitoring)
  • GUI interfaces (command-line and programmatic APIs only)

2. System Actors

| Actor | Description | Primary Use Cases |
|---|---|---|
| Performance Engineer | Optimizes code performance | Algorithmic comparisons, bottleneck identification |
| Library Author | Maintains high-performance libraries | Before/after analysis, performance documentation |
| CI/CD System | Automated testing and reporting | Performance regression detection, report generation |
| Researcher | Analyzes algorithmic performance | Experimental comparison, statistical analysis |

3. Ubiquitous Language (Vocabulary)

| Term | Definition |
|---|---|
| Benchmark Suite | A collection of related benchmarks measuring different aspects of performance |
| Test Case | A single benchmark measurement with specific parameters |
| Performance Profile | A comprehensive view of performance across multiple dimensions |
| Comparative Analysis | Side-by-side comparison of two or more performance profiles |
| Performance Regression | A decrease in performance compared to a baseline |
| Performance Diff | Git-style comparison showing changes between benchmark results |
| Optimization Insight | Actionable recommendation derived from benchmark analysis |
| Report Template | A customizable format for presenting benchmark results |
| Data Generator | A function that creates test data for benchmarking |
| Metric Collector | A component that gathers specific performance measurements |

4. Core Functional Requirements

4.1. Measurement & Timing (FR-TIMING)

FR-TIMING-1: Flexible Timing Interface

  • Must provide simple timing functions for arbitrary code blocks
  • Must support nested timing for hierarchical analysis
  • Must collect statistical measures (mean, median, min, max, percentiles)
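
A minimal sketch of how this timing surface could be used, building on the bench_function and generate_list_data helpers shown in the patterns later in this spec; the statistical accessor names here are illustrative, not a committed API:

use benchkit::prelude::*;

fn timing_example()
{
  // Time a code block over repeated runs and collect statistics (FR-TIMING-1).
  let result = bench_function( "sort_medium", || my_sort( &generate_list_data( 100 ) ) );

  // Statistical measures the result is required to expose (accessor names illustrative).
  println!( "mean:   {:?}", result.mean_time() );
  println!( "median: {:?}", result.median_time() );
  println!( "p95:    {:?}", result.percentile( 95.0 ) );
}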

FR-TIMING-2: Custom Metrics

  • Must support user-defined metrics beyond timing (memory, throughput, etc.)
  • Must provide extensible metric collection interface
  • Must allow metric aggregation and statistical analysis

FR-TIMING-3: Baseline Comparison

  • Must support comparing current performance against saved baselines
  • Must detect performance regressions automatically
  • Must provide percentage improvement/degradation calculations
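
A hedged sketch of a baseline-comparison workflow under these requirements; the load helper, the baseline path, and the 5% threshold are illustrative choices rather than a committed API:

use benchkit::prelude::*;

fn baseline_check()
{
  let current = bench_function( "parse_config", || parse_config( "key = value" ) );

  // Load a previously saved baseline (illustrative helper and path).
  let baseline = BenchmarkResult::load( "benches/baselines/parse_config.json" ).unwrap();

  // Percentage degradation relative to the baseline mean.
  let change_pct = ( current.mean_time().as_secs_f64() / baseline.mean_time().as_secs_f64() - 1.0 ) * 100.0;
  if change_pct > 5.0
  {
    eprintln!( "⚠️ Regression: parse_config is {:.1}% slower than baseline", change_pct );
  }
}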

4.2. Data Generation (FR-DATAGEN)

FR-DATAGEN-1: Common Patterns

  • Must provide generators for common benchmark data patterns:
    • Lists of varying sizes (small: 10, medium: 100, large: 1000, huge: 10000)
    • Maps with configurable key-value distributions
    • Strings with controlled length and character sets
    • Nested data structures with configurable depth

FR-DATAGEN-2: Parameterizable Generation

  • Must allow easy parameterization of data size and complexity
  • Must provide consistent seeding for reproducible benchmarks
  • Must optimize data generation to minimize benchmark overhead

FR-DATAGEN-3: Domain-Specific Generators

  • Must allow custom data generators for specific domains
  • Must provide composition tools for combining generators
  • Must support lazy generation for large datasets
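
An illustrative sketch of how these generator requirements compose; generate_list_data is used elsewhere in this spec, while DataGenerator and its builder methods are assumed names shown only to make the seeding and sizing requirements concrete:

use benchkit::prelude::*;

fn generator_examples()
{
  // Standard size tiers from FR-DATAGEN-1.
  for size in [ 10, 100, 1000, 10000 ]
  {
    let data = generate_list_data( size );
    let _ = bench_function( &format!( "process_{}", size ), || process( &data ) );
  }

  // Reproducible generation through explicit seeding (FR-DATAGEN-2).
  let a = DataGenerator::new().with_seed( 42 ).with_string_length( 64 ).generate_strings( 1000 );
  let b = DataGenerator::new().with_seed( 42 ).with_string_length( 64 ).generate_strings( 1000 );
  assert_eq!( a, b ); // identical seeds must yield identical data
}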

4.3. Report Generation (FR-REPORTS)

FR-REPORTS-1: Standard Directory Reporting ⭐ CRITICAL REQUIREMENT

  • Must generate comprehensive reports in benches/readme.md following Rust conventions
  • Must automatically update benches/readme.md with latest benchmark results
  • Must preserve existing content while updating benchmark sections
  • Must support updating specific sections of existing markdown files
  • Must use exact section matching to prevent section duplication - Critical bug fix requirement
  • Must validate section names to prevent conflicts and misuse
  • Must provide conflict detection for overlapping section names

FR-REPORTS-2: Multiple Output Formats

  • Must support markdown, HTML, and JSON output formats
  • Must provide customizable templates for each format
  • Must allow embedding of charts and visualizations

FR-REPORTS-3: Living Documentation

  • Must generate reports that serve as comprehensive performance documentation
  • Must provide clear, actionable summaries of performance characteristics
  • Must highlight key optimization opportunities and bottlenecks
  • Must include timestamps and configuration details for reproducibility
  • Must maintain historical context and trends in benches/readme.md

FR-REPORTS-4: Safe API Design ⭐ CRITICAL REQUIREMENT

  • Must provide section name validation to prevent invalid names (empty, too long, invalid characters)
  • Must offer both safe (validated) and unchecked API variants for backwards compatibility
  • Must detect and warn about potential section name conflicts before they cause issues
  • Must use proper error types (MarkdownError) with clear, actionable error messages
  • Must prevent the critical substring matching bug through exact section matching
  • Must guide users toward safe section naming practices through API design

4.4. Analysis Tools (FR-ANALYSIS)

FR-ANALYSIS-1: Research-Grade Statistical Analysis ⭐ CRITICAL REQUIREMENT

  • Must provide research-grade statistical rigor meeting publication standards
  • Must calculate proper confidence intervals using t-distribution (not normal approximation)
  • Must perform statistical significance testing (Welch's t-test for unequal variances)
  • Must calculate effect sizes (Cohen's d) for practical significance assessment
  • Must detect outliers using statistical methods (IQR method)
  • Must assess normality of data distribution (Shapiro-Wilk test)
  • Must calculate statistical power for detecting meaningful differences
  • Must provide coefficient of variation for measurement reliability assessment
  • Must flag unreliable results based on statistical criteria
  • Must document statistical methodology in reports
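
To make the required methodology concrete, the following self-contained sketch shows the two core statistics (Welch's t statistic and Cohen's d) computed from raw timing samples; it illustrates the math, not benchkit's internal implementation:

fn mean( xs : &[ f64 ] ) -> f64
{
  xs.iter().sum::< f64 >() / xs.len() as f64
}

fn variance( xs : &[ f64 ] ) -> f64
{
  let m = mean( xs );
  xs.iter().map( | x | ( x - m ).powi( 2 ) ).sum::< f64 >() / ( xs.len() as f64 - 1.0 )
}

/// Welch's t statistic for two samples with possibly unequal variances.
fn welch_t( a : &[ f64 ], b : &[ f64 ] ) -> f64
{
  let se = ( variance( a ) / a.len() as f64 + variance( b ) / b.len() as f64 ).sqrt();
  ( mean( a ) - mean( b ) ) / se
}

/// Cohen's d effect size using the pooled standard deviation.
fn cohens_d( a : &[ f64 ], b : &[ f64 ] ) -> f64
{
  let ( na, nb ) = ( a.len() as f64, b.len() as f64 );
  let pooled = ( ( ( na - 1.0 ) * variance( a ) + ( nb - 1.0 ) * variance( b ) ) / ( na + nb - 2.0 ) ).sqrt();
  ( mean( a ) - mean( b ) ) / pooled
}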

FR-ANALYSIS-2: Comparative Analysis

  • Must support before/after performance comparisons
  • Must provide A/B testing capabilities for algorithm variants
  • Must generate comparative reports highlighting differences

FR-ANALYSIS-3: Git-Style Performance Diffing

  • Must compare benchmark results across different implementations or commits
  • Must generate git-style diff output showing performance changes
  • Must classify changes as improvements, regressions, or minor variations
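
One possible classification rule for the diff output, sketched with an assumed ±5% threshold; the actual thresholds are an implementation decision:

enum ChangeKind { Improvement, Regression, Minor }

/// Classify a timing change given baseline and current mean times in seconds.
fn classify( baseline_s : f64, current_s : f64 ) -> ChangeKind
{
  let change_pct = ( current_s / baseline_s - 1.0 ) * 100.0;
  if change_pct <= -5.0 { ChangeKind::Improvement }
  else if change_pct >= 5.0 { ChangeKind::Regression }
  else { ChangeKind::Minor }
}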

FR-ANALYSIS-4: Visualization and Charts

  • Must generate performance charts for scaling analysis and framework comparison
  • Must support multiple output formats (SVG, PNG, HTML)
  • Must provide high-level plotting functions for common benchmarking scenarios

FR-ANALYSIS-5: Optimization Insights

  • Must analyze results to suggest optimization opportunities
  • Must identify performance scaling characteristics
  • Must provide actionable recommendations based on measurement patterns
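
One way scaling characteristics can be surfaced is by estimating the growth exponent from timings at the standard data sizes; this sketch (not a committed API) averages the log-log slopes between consecutive measurements:

/// Estimate the exponent k in time ≈ c * n^k from ( size, seconds ) pairs,
/// using the slope between consecutive points on a log-log scale.
fn scaling_exponent( samples : &[ ( usize, f64 ) ] ) -> f64
{
  let slopes : Vec< f64 > = samples.windows( 2 ).map( | w |
  {
    let ( n0, t0 ) = w[ 0 ];
    let ( n1, t1 ) = w[ 1 ];
    ( t1 / t0 ).ln() / ( n1 as f64 / n0 as f64 ).ln()
  }).collect();
  slopes.iter().sum::< f64 >() / slopes.len() as f64
}

// An exponent near 1 suggests linear scaling; a value near 2 suggests quadratic
// behaviour worth flagging as an optimization opportunity.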

5. Critical Bug Fixes and Security Requirements

CBF-1: Markdown Section Duplication Prevention ⭐ CRITICAL FIX

Background: A critical substring matching bug was discovered where MarkdownUpdater.replace_section_content() used line.contains() instead of exact matching for section headers. This caused severe section duplication when section names shared common substrings.

Impact Evidence:

  • wflow project: readme.md grew from 5,865 to 7,751 lines (+1,886 lines) in one benchmark run
  • 37 duplicate "Performance Benchmarks" sections created
  • 201 duplicate table headers generated
  • Documentation became unusable and contradictory

Root Cause: src/reporting.rs:56 contained:

if line.contains(self.section_marker.trim_start_matches("## ")) {

This matched ANY section containing the substring, so:

  • "Performance Benchmarks" ✓ (intended)
  • "Language Operations Performance" ✓ (unintended - contains "Performance")
  • "Realistic Scenarios Performance" ✓ (unintended - contains "Performance")

Required Fix: Changed to exact matching:

if line.trim() == self.section_marker.trim() {

Prevention Requirements:

  • Must use exact section name matching in all markdown processing
  • Must provide comprehensive regression tests for section matching edge cases
  • Must validate section names to prevent conflicts
  • Must detect and warn about potential substring conflicts
  • Must maintain backwards compatibility through unchecked API variants
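
A hedged sketch of the kind of regression test these requirements call for, using the MarkdownUpdater API from the patterns below; the file path and section names are illustrative:

use benchkit::prelude::*;

#[ test ]
fn update_does_not_touch_sections_sharing_a_substring()
{
  // Two headers sharing the substring "Performance" must be treated as distinct sections.
  let document = "## Performance Benchmarks\nold\n\n## Language Operations Performance\nkeep\n";
  std::fs::write( "target/section_test.md", document ).unwrap();

  let updater = MarkdownUpdater::new( "target/section_test.md", "Performance Benchmarks" ).unwrap();
  updater.update_section( "new content" ).unwrap();

  let updated = std::fs::read_to_string( "target/section_test.md" ).unwrap();
  assert!( updated.contains( "keep" ) ); // unrelated section preserved
  assert_eq!( updated.matches( "## Performance Benchmarks" ).count(), 1 ); // no duplication
}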

6. Non-Functional Requirements

NFR-PERFORMANCE-1: Low Overhead

  • Measurement overhead must be <1% of measured operation time for operations >1ms
  • Data generation must not significantly impact benchmark timing
  • Report generation must complete within 10 seconds for typical benchmark suites

NFR-USABILITY-1: Simple Integration

  • Must integrate into existing projects with <10 lines of code
  • Must provide sensible defaults for common benchmarking scenarios
  • Must allow incremental adoption alongside existing benchmarking tools

NFR-COMPATIBILITY-1: Environment Support

  • Must work in std environments (primary target)
  • Should provide no_std compatibility for core timing functions
  • Must support all major platforms (Linux, macOS, Windows)

NFR-RELIABILITY-1: Reproducible Results

  • Must provide consistent results across multiple runs (±5% variance)
  • Must support deterministic seeding for reproducible data generation
  • Must handle system noise and provide statistical confidence measures

7. Feature Flags & Modularity

| Feature | Description | Default | Dependencies |
|---|---|---|---|
| enabled | Core benchmarking functionality | ✓ | - |
| markdown_reports | Safe markdown report generation with exact section matching | ✓ | pulldown-cmark |
| data_generators | Common data generation patterns | ✓ | rand |
| criterion_compat | Compatibility layer with criterion | ✓ | criterion |
| html_reports | HTML report generation | - | tera |
| json_reports | JSON report output | - | serde_json |
| statistical_analysis | Research-grade statistical analysis | - | statistical |
| comparative_analysis | A/B testing and comparisons | - | - |
| diff_analysis | Git-style benchmark result diffing | - | - |
| visualization | Chart generation and plotting | - | plotters |
| optimization_hints | Performance optimization suggestions | - | statistical_analysis |

Critical Note: The markdown_reports feature now includes mandatory safety features:

  • Section name validation and conflict detection
  • Exact section matching (prevents duplication bug)
  • MarkdownError type for proper error handling
  • Safe/unchecked API variants for backwards compatibility

8. Standard Directory Requirements

SR-DIRECTORY-1: ABSOLUTE benches/ Directory Requirement ⭐ MANDATORY - NO EXCEPTIONS

  • ALL benchmark-related files MUST be located EXCLUSIVELY in the benches/ directory
  • This is NON-NEGOTIABLE for cargo bench compatibility and ecosystem standards
  • Benchmark binaries, data generation scripts, and analysis tools MUST ALL reside in benches/
  • 🚫 STRICTLY PROHIBITED: ANY benchmark files in tests/, examples/, or src/bin/
  • 🚫 ENFORCEMENT: benchkit will ERROR if benchmarks detected outside benches/

SR-DIRECTORY-2: Automatic Documentation Generation ⭐ MANDATORY

  • benches/readme.md must be automatically generated and updated with benchmark results
  • The file must serve as comprehensive performance documentation for the project
  • Updates must preserve existing content while refreshing benchmark sections
  • Reports must include timestamps, configuration details, and historical context

SR-DIRECTORY-3: Structured Organization

project/
├── benches/
│   ├── readme.md              # Automatically updated comprehensive reports
│   ├── algorithm_comparison.rs # Comparative benchmarks
│   ├── performance_suite.rs    # Main benchmark suite
│   ├── memory_benchmarks.rs    # Memory-specific benchmarks
│   └── data_generation.rs      # Custom data generators
├── src/
│   └── lib.rs                  # Main library code
└── tests/
    └── unit_tests.rs           # Unit tests (NO benchmarks)

SR-DIRECTORY-4: Integration with Rust Toolchain

  • Must work seamlessly with cargo bench command
  • Must support standard Rust benchmark discovery and execution patterns
  • Must integrate with existing Rust development workflows
  • Must provide compatibility with IDE tooling and cargo extensions

Part II: Internal Design (Design Recommendations)

9. Architectural Principles

AP-1: Toolkit over Framework

  • Provide composable functions rather than monolithic framework
  • Allow users to choose which components to use
  • Minimize assumptions about user workflow

AP-2: Markdown-First Reporting

  • Treat markdown as first-class output format
  • Optimize for readability and version control
  • Support inline updates of existing documentation

AP-3: Zero-Copy Where Possible

  • Minimize allocations during measurement
  • Use borrowing and references for data passing
  • Optimize hot paths for measurement accuracy

AP-4: Statistical Rigor

  • Provide proper statistical analysis of results
  • Handle measurement noise and outliers appropriately
  • Offer confidence intervals and significance testing

10. Integration Patterns

Pattern 1: Standard Directory Benchmarking

// benches/performance_suite.rs
use benchkit::prelude::*;

fn main()
{
  let mut suite = BenchmarkSuite::new( "Core Function Performance" );
  
  suite.benchmark( "small_input", ||
  {
    let data = generate_list_data( 10 );
    bench_block( || my_function( &data ) )
  });
  
  let results = suite.run_all();
  
  // Automatically update benches/readme.md with safe API
  let updater = MarkdownUpdater::new( "benches/readme.md", "Performance Results" ).unwrap();
  updater.update_section( &results.generate_markdown_report() ).unwrap();
}

Pattern 2: Comparative Analysis

// benches/algorithm_comparison.rs
use benchkit::prelude::*;

fn main()
{
  let data = generate_list_data( 1000 );

  let comparison = ComparativeAnalysis::new( "Algorithm Performance Comparison" )
    .algorithm( "original", || original_algorithm( &data ) )
    .algorithm( "optimized", || optimized_algorithm( &data ) )
    .with_data_sizes( &[ 10, 100, 1000, 10000 ] );
  
  let report = comparison.run_comparison();
  
  // Update benches/readme.md with comparison results using safe API
  let updater = MarkdownUpdater::new( "benches/readme.md", "Algorithm Comparison" ).unwrap();
  updater.update_section( &report.generate_markdown_report() ).unwrap();
}

Pattern 3: Comprehensive Benchmark Suite

// benches/comprehensive_suite.rs
use benchkit::prelude::*;

fn main()
{
  let mut suite = BenchmarkSuite::new( "Comprehensive Performance Suite" );
  
  // Add multiple benchmark categories
  suite.benchmark( "data_processing", || process_large_dataset() );
  suite.benchmark( "memory_operations", || memory_intensive_task() );
  suite.benchmark( "io_operations", || file_system_benchmarks() );
  
  let results = suite.run_all();
  
  // Generate comprehensive benches/readme.md report with safe API
  let comprehensive_report = results.generate_comprehensive_report();
  let updater = MarkdownUpdater::new( "benches/readme.md", "Performance Analysis" ).unwrap();
  updater.update_section( &comprehensive_report ).unwrap();
  
  println!( "Updated benches/readme.md with comprehensive performance analysis" );
}

Pattern 4: Git-Style Performance Diffing

use benchkit::prelude::*;

fn compare_implementations()
{
  // Baseline results (old implementation)
  let baseline_results = vec!
  [
    ( "string_ops".to_string(), bench_function( "old_string_ops", || old_implementation() ) ),
    ( "hash_compute".to_string(), bench_function( "old_hash", || old_hash_function() ) ),
  ];
  
  // Current results (new implementation) 
  let current_results = vec!
  [
    ( "string_ops".to_string(), bench_function( "new_string_ops", || new_implementation() ) ),
    ( "hash_compute".to_string(), bench_function( "new_hash", || new_hash_function() ) ),
  ];
  
  // Generate git-style diff
  let diff_set = diff_benchmark_sets( &baseline_results, &current_results );
  
  // Show summary and detailed analysis
  for diff in &diff_set.diffs
  {
    println!( "{}", diff.to_summary() );
  }
  
  // Check for regressions in CI/CD
  for regression in diff_set.regressions()
  {
    eprintln!( "⚠️ Performance regression detected: {}", regression.benchmark_name );
  }
}

Pattern 5: Custom Metrics

use benchkit::prelude::*;

fn memory_benchmark()
{
  let mut collector = MetricCollector::new()
    .with_timing()
    .with_memory_usage()
    .with_custom_metric( "cache_hits", || count_cache_hits() );
    
  let results = collector.measure( || expensive_operation() );
  println!( "{}", results.to_markdown_table() );
}

Pattern 6: Visualization and Charts

use benchkit::prelude::*;
use std::path::Path;

fn generate_performance_charts()
{
  // Scaling analysis chart
  let scaling_results = vec!
  [
    (10, bench_function( "test_10", || algorithm_with_n( 10 ) )),
    (100, bench_function( "test_100", || algorithm_with_n( 100 ) )),
    (1000, bench_function( "test_1000", || algorithm_with_n( 1000 ) )),
  ];
  
  plots::scaling_analysis_chart(
    &scaling_results,
    "Algorithm Scaling Performance", 
    Path::new( "docs/scaling_chart.svg" )
  );
  
  // Framework comparison chart
  let framework_results = vec!
  [
    ("Fast Framework".to_string(), bench_function( "fast", || fast_framework() )),
    ("Slow Framework".to_string(), bench_function( "slow", || slow_framework() )),
  ];
  
  plots::framework_comparison_chart(
    &framework_results,
    "Framework Performance Comparison",
    Path::new( "docs/comparison_chart.svg" )
  );
}

Pattern 7: Safe Section Management with Conflict Detection ⭐ CRITICAL FEATURE

// benches/safe_section_management.rs
use benchkit::prelude::*;

fn main() -> Result<(), benchkit::reporting::MarkdownError>
{
  // Safe API with validation - prevents the critical substring matching bug
  let updater = MarkdownUpdater::new("benches/readme.md", "Performance Results")?;
  
  // Check for potential conflicts before proceeding
  let conflicts = updater.check_conflicts()?;
  if !conflicts.is_empty() {
    println!("⚠️ Warning: Potential section name conflicts detected:");
    for conflict in conflicts {
      println!("  - {}", conflict);
    }
    println!("Consider using more specific section names to avoid duplication.");
  }
  
  // Safe to proceed - exact matching prevents duplication
  let suite = BenchmarkSuite::new("Core Performance");
  let results = suite.run_all();
  updater.update_section(&results.generate_markdown_report())?;
  
  // Example of problematic section names that would be caught:
  // ✅ Good: "Performance Results", "Memory Benchmarks", "API Tests"  
  // ⚠️ Risky: "Performance", "Benchmarks", "Test" (too generic, likely to conflict)
  
  // For backwards compatibility, unchecked API is still available:
  // let unchecked = MarkdownUpdater::new_unchecked("benches/readme.md", "");
  
  Ok(())
}

Pattern 8: Research-Grade Statistical Analysis ⭐ CRITICAL FEATURE

use benchkit::prelude::*;

fn research_grade_performance_analysis()
{
  // Collect benchmark data with proper sample size
  let algorithm_a_result = bench_function_n( "algorithm_a", 20, || algorithm_a() );
  let algorithm_b_result = bench_function_n( "algorithm_b", 20, || algorithm_b() );
  
  // Professional statistical analysis 
  let analysis_a = StatisticalAnalysis::analyze( &algorithm_a_result, SignificanceLevel::Standard ).unwrap();
  let analysis_b = StatisticalAnalysis::analyze( &algorithm_b_result, SignificanceLevel::Standard ).unwrap();
  
  // Check statistical quality before drawing conclusions
  if analysis_a.is_reliable() && analysis_b.is_reliable()
  {
    // Perform statistical comparison with proper hypothesis testing
    let comparison = StatisticalAnalysis::compare(
      &algorithm_a_result,
      &algorithm_b_result, 
      SignificanceLevel::Standard
    ).unwrap();
    
    println!( "Statistical comparison:" );
    println!( "  Effect size: {:.3} ({})", comparison.effect_size, comparison.effect_size_interpretation() );
    println!( "  P-value: {:.4}", comparison.p_value );
    println!( "  Significant: {}", comparison.is_significant );
    println!( "  Conclusion: {}", comparison.conclusion() );
    
    // Generate research-grade report with methodology
    let results = vec![ algorithm_a_result, algorithm_b_result ];
    let report = ReportGenerator::new( "Algorithm Comparison", results );
    let statistical_report = report.generate_statistical_report();
    println!( "{}", statistical_report );
  }
  else
  {
    println!( "⚠️ Results do not meet statistical reliability criteria - collect more data" );
  }
}

11. Key Learnings from unilang/strs_tools Benchmarking

Lesson 1: Focus on Key Metrics

  • Surface 2-3 critical performance indicators
  • Hide detailed statistics behind optional analysis
  • Provide clear improvement/regression percentages

Lesson 2: Markdown Integration is Critical

  • Developers want to update documentation automatically
  • Version-controlled performance results are valuable
  • Manual report copying is error-prone and time-consuming

Lesson 3: Data Generation Patterns

  • Common patterns: small (10), medium (100), large (1000), huge (10000)
  • Parameterizable generators reduce boilerplate significantly
  • Reproducible seeding is essential for consistent results

Lesson 4: Statistical Rigor Matters

  • Raw numbers without confidence intervals are misleading
  • Outlier detection and handling improves result quality
  • Multiple sampling provides more reliable measurements

Lesson 5: Git-Style Diffing for Performance

  • Developers are familiar with git diff workflow and expect similar experience
  • Performance changes should be as easy to review as code changes
  • Historical comparison across commits/implementations is essential for CI/CD

Lesson 6: Integration Simplicity

  • Developers abandon tools that require extensive setup
  • Default configurations should work for 80% of use cases
  • Incremental adoption is more successful than wholesale replacement


Part III: Development Guidelines

12. Lessons Learned Reference

CRITICAL: All development decisions for benchkit are based on real-world experience from unilang and strs_tools benchmarking work. The complete set of requirements, anti-patterns, and mandatory standards is documented in usage.md.

Key lessons that shaped benchkit design:

12.1. Toolkit vs Framework Decision

  • Problem: Criterion's framework approach was too restrictive for our use cases
  • Solution: benchkit provides building blocks, not rigid workflows
  • Evidence: "I don't want to mess with all that problem I had" - User feedback on complexity

12.2. Markdown-First Integration

  • Problem: Manual copy-pasting of performance results into documentation
  • Solution: Automated markdown section updating with version control friendly output
  • Evidence: Frequent need to update README performance sections during optimization

12.3. Standard Data Size Patterns

  • Problem: Inconsistent data sizes across different benchmarks made comparison difficult
  • Solution: Standardized DataSize enum with proven effective sizes
  • Evidence: "Common patterns: small (10), medium (100), large (1000), huge (10000)"

12.4. Feature Flag Philosophy

  • Problem: Heavy dependencies slow compilation and increase complexity
  • Solution: Granular feature flags for all non-core functionality
  • Evidence: "put every extra feature under cargo feature" - Explicit requirement

12.5. Focus on Key Metrics

  • Problem: Statistical details overwhelm users seeking optimization guidance
  • Solution: Surface 2-3 key indicators, hide details behind optional analysis
  • Evidence: "expose just few critical parameters of optimization and hid the rest deeper"

12.6. Critical Substring Matching Bug ⭐ CRITICAL LESSON

  • Problem: Markdown section updates used substring matching causing exponential duplication
  • Impact: Files grew from 5,865 to 7,751 lines in one run, 37 duplicate sections created
  • Root Cause: line.contains() matched overlapping section names like "Performance"
  • Solution: Exact matching with line.trim() == section_marker.trim() + API validation
  • Prevention: Safe API with conflict detection, comprehensive regression tests, backwards compatibility

For complete requirements and mandatory standards, see usage.md.

13. Cargo Bench Integration Requirements ⭐ CRITICAL

REQ-CARGO-001: Seamless cargo bench Integration

Priority: FOUNDATIONAL - without this, benchkit will not be adopted by the Rust community.

Requirements:

  • MUST integrate seamlessly with cargo bench as the primary interface
  • MUST support the standard benches/ directory structure
  • MUST work with Rust's built-in benchmark harness and custom harnesses
  • MUST automatically update documentation during benchmark execution
  • MUST provide regression analysis as part of the benchmark process
  • MUST be compatible with existing cargo bench workflows

Technical Implementation Requirements:

# In Cargo.toml - Standard Rust benchmark setup
[[bench]]
name = "performance_suite"
harness = false  # Use benchkit as the harness

[dev-dependencies]
benchkit = { version = "0.8.0", features = ["cargo_bench"] }

// In benches/performance_suite.rs - Works with cargo bench
use benchkit::prelude::*;

fn main() -> Result< (), Box< dyn std::error::Error > >
{
  let mut suite = BenchmarkSuite::new( "Algorithm Performance" );
  suite.benchmark( "algorithm_a", || algorithm_a_implementation() );

  // Automatically update documentation during cargo bench
  let results = suite.run_with_auto_docs( &[
    ( "README.md", "## Performance" ),
    ( "PERFORMANCE.md", "## Latest Results" ),
  ])?;

  // Automatic regression analysis
  results.check_regressions_and_alert()?;

  Ok( () )
}

Expected User Workflow:

# User expectation - this MUST work without additional setup
cargo bench

# Should automatically:
# - Run all benchmarks in benches/
# - Update README.md and PERFORMANCE.md
# - Check for performance regressions
# - Generate professional performance reports
# - Maintain historical data for trend analysis

Success Criteria:

  • cargo bench runs benchkit benchmarks without additional setup
  • Documentation updates automatically during benchmark execution
  • Zero additional commands needed for typical benchmark workflows
  • Works in existing Rust projects without structural changes
  • Integrates with CI/CD pipelines using standard cargo bench
  • Provides regression analysis automatically during benchmarks
  • Compatible with existing criterion-based projects
  • Supports migration from criterion with <10 lines of code changes

14. Implementation Priorities

Based on real-world usage patterns and critical path analysis from unilang/strs_tools work:

Phase 1: Core Functionality (MVP) + Mandatory cargo bench

Justification: Essential for any benchmarking work + Rust ecosystem adoption

  1. cargo bench integration (cargo_bench_runner) - CRITICAL REQUIREMENT
  2. Automatic markdown updates (markdown_auto_update) - CRITICAL REQUIREMENT
  3. Basic timing and measurement (enabled)
  4. Simple markdown report generation (markdown_reports)
  5. Standard data generators (data_generators)

Phase 2: Enhanced cargo bench + Analysis Tools

Justification: Essential for professional performance analysis

  1. Regression analysis during cargo bench - HIGH PRIORITY
  2. Historical data management for cargo bench - HIGH PRIORITY
  3. Research-grade statistical analysis (statistical_analysis) ⭐ CRITICAL
  4. Comparative analysis (comparative_analysis)
  5. Git-style performance diffing (diff_analysis)

Phase 3: Advanced Features

Justification: Nice-to-have for comprehensive analysis

  1. Multi-environment cargo bench configurations - HIGH PRIORITY
  2. Chart generation and visualization (visualization)
  3. HTML and JSON reports (html_reports, json_reports)
  4. Enhanced criterion compatibility (criterion_compat)
  5. Optimization hints and recommendations (optimization_hints)

Phase 4: Ecosystem Integration

Justification: Long-term adoption and CI/CD integration

  1. CI/CD cargo bench automation - HIGH PRIORITY
  2. IDE integration and tooling support
  3. Performance monitoring and alerting
  4. Advanced regression detection and alerting

Success Criteria

User Experience Success Metrics:

  • New users can run first benchmark in <5 minutes
  • Integration requires <10 lines of code
  • Documentation updates happen automatically
  • Performance regressions detected within 1% accuracy
  • Critical substring matching bug eliminated - No more section duplication
  • Safe API prevents common mistakes - Validation guides users to best practices

Technical Success Metrics:

  • Measurement overhead <1% for operations >1ms
  • All features work independently
  • Compatible with existing criterion benchmarks
  • Memory usage scales linearly with data size
  • Exact section matching prevents document corruption
  • Comprehensive regression tests prevent bug recurrence
  • Backwards compatibility maintained through unchecked API variants

Reference Documents

  • usage.md - Mandatory standards and compliance requirements from production systems
  • readme.md - Usage-focused documentation with examples
  • examples/ - Comprehensive usage demonstrations