
Enable Post-Processing and Export of Benchmark Results to External Data Warehouses #116

@acalhounRH

Description

Summary

Add post-processing capabilities to Zathras that convert benchmark results into structured JSON documents and export them to external data warehouses (OpenSearch, Horreum) for long-term storage, querying, and analysis.


Problem Statement / Current State

Currently, Zathras:

  • ✅ Successfully orchestrates benchmark execution across cloud and bare metal systems
  • ✅ Collects extensive metadata (hardware config, system config, cloud metadata)
  • ✅ Retrieves test results as compressed tarballs (results_<test>.zip)
  • ✅ Stores results in directory structures: results_prefix/os_vendor/cloud_type/instance_type_N/

However:

  • ❌ Results are stored as unstructured tarballs on the controller filesystem
  • ❌ No centralized database for historical result queries
  • ❌ Difficult to perform trend analysis across multiple test runs
  • ❌ No automated regression detection capabilities
  • ❌ Limited visibility into performance trends over time
  • ❌ Results are isolated per test run with no cross-run correlation
  • ❌ Executive reporting requires manual data extraction

Impact:

  • Performance engineers must manually extract and analyze results
  • Historical comparisons require custom scripts
  • No dashboards or visualization of trends
  • Difficult to answer questions like:
    • "How has STREAM performance on m5.xlarge changed over the last 6 months?"
    • "Which tuned profile performs best for linpack across instance types?"
    • "Are we seeing performance regressions after OS updates?"

Proposed Solution

Implement a post-processing and export pipeline that:

  1. Extracts results from Zathras archive directories
  2. Transforms test outputs into structured JSON documents
  3. Enriches with metadata (hardware, system config, cloud details)
  4. Validates data integrity and schema compliance
  5. Exports to configurable data warehouse targets
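The five steps above can be sketched as a single orchestration loop. Every function body here is a stand-in for the planned modules (archive_handler, processors, exporters); the names and the hard-coded metadata are illustrative assumptions, not the final API:

```python
import json


def extract(archive_path: str) -> str:
    # 1. Unpack results_<test>.zip; stubbed here as returning a raw-output string.
    return '{"triad_mb_s": 41000}'


def transform(raw: str) -> dict:
    # 2. Parse raw test output into the unified structure, keeping the raw text.
    return {"results": {"metrics": json.loads(raw), "raw_output": raw}}


def enrich(doc: dict, metadata: dict) -> dict:
    # 3. Attach infrastructure/hardware context from Zathras metadata files.
    doc["infrastructure"] = metadata
    return doc


def validate(doc: dict) -> bool:
    # 4. Minimal integrity check: results and infrastructure must be present.
    return "results" in doc and "infrastructure" in doc


def process_run(archive_path: str, exporter) -> dict:
    # Stubbed metadata; the real pipeline would read hw_config.yml etc.
    doc = enrich(transform(extract(archive_path)), {"type": "aws"})
    if not validate(doc):
        raise ValueError("document failed validation")
    exporter(doc)  # 5. push to the configured warehouse target
    return doc
```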

Architecture

post_processing/
├── main.py                          # Orchestrator script
├── processors/
│   ├── base_processor.py           # Abstract base class
│   ├── fio_processor.py            # FIO-specific parser
│   ├── streams_processor.py        # STREAM-specific parser
│   └── ...                          # One per test type
├── exporters/
│   ├── opensearch_exporter.py      # OpenSearch integration
│   └── horreum_exporter.py         # Horreum integration
└── utils/
    ├── metadata_extractor.py       # Parse Zathras metadata
    └── archive_handler.py          # Handle zip/tar extraction
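The processor hierarchy above could look like the following sketch. The class and method names are assumptions about the eventual interface; the FIO field names (`jobname`, `read.bw`) follow fio's JSON output format:

```python
import json
from abc import ABC, abstractmethod


class BaseProcessor(ABC):
    """Abstract base for per-test result parsers (hypothetical interface)."""

    #: test name this processor handles, e.g. "fio" or "streams"
    test_name: str = ""

    @abstractmethod
    def parse(self, raw_output: str) -> dict:
        """Turn raw wrapper output into a test-specific metrics dict."""

    def process(self, raw_output: str) -> dict:
        # Wrap test-specific metrics in the shared result envelope,
        # keeping the raw output so results can be reprocessed later.
        return {
            "test": {"name": self.test_name},
            "results": {"metrics": self.parse(raw_output), "raw_output": raw_output},
        }


class FioProcessor(BaseProcessor):
    """FIO already emits JSON, so parsing is mostly a passthrough."""

    test_name = "fio"

    def parse(self, raw_output: str) -> dict:
        data = json.loads(raw_output)
        # Keep only job-level read bandwidth for the unified schema (illustrative).
        return {
            "jobs": [
                {"name": job["jobname"], "read_bw_kib": job["read"]["bw"]}
                for job in data.get("jobs", [])
            ]
        }
```

Adding a new test type then means subclassing `BaseProcessor` and implementing `parse`, with no changes to the orchestrator or exporters.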

Unified JSON Schema

{
  "test_run": {
    "id": "uuid",
    "timestamp": "ISO-8601",
    "zathras_version": "3.2"
  },
  "infrastructure": {
    "type": "aws|azure|gcp|local",
    "instance_type": "m5.xlarge",
    "region": "us-east-1",
    "os": { "vendor": "rhel", "version": "9.3" }
  },
  "hardware": {
    "cpu": { "model": "...", "cores": 16, ... },
    "memory": { "total_gb": 32, ... }
  },
  "test": {
    "name": "streams",
    "version": "v1.0",
    "status": "passed|failed",
    "duration_seconds": 235
  },
  "results": {
    "metrics": { /* test-specific */ },
    "raw_output": "..."
  }
}
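A document builder for this schema, paired with the "flexible schema with core required fields" validation described under the design decisions, might look like this sketch (field names follow the schema above; the required-field list is an assumption):

```python
import uuid
from datetime import datetime, timezone

# Core fields every exported document must carry; everything else is optional.
REQUIRED_PATHS = [("test_run", "id"), ("test_run", "timestamp"), ("test", "name")]


def build_document(test_name: str, metrics: dict, infrastructure: dict) -> dict:
    """Assemble a unified result document following the schema sketch."""
    return {
        "test_run": {
            "id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "infrastructure": infrastructure,
        "test": {"name": test_name},
        "results": {"metrics": metrics},
    }


def validate(doc: dict) -> bool:
    """Check only the core required fields; extra fields are always allowed."""
    return all(
        section in doc and key in doc[section] for section, key in REQUIRED_PATHS
    )
```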

Justification

Why This Approach?

Option 1 Considered: Modify all test wrappers to output JSON

  • ❌ Requires coordinating changes across 18+ separate repositories
  • ❌ Each wrapper maintained by different teams
  • ❌ No standardization guarantee
  • ❌ Months of coordination overhead
  • ❌ Breaks backward compatibility

Option 2 (Proposed): Post-processing at Zathras level

  • ✅ Single point of implementation - change only Zathras
  • ✅ Wrapper independence - no changes to external test repos
  • ✅ Standardized schema - consistent structure across all tests
  • ✅ Metadata enrichment - combine test results with infrastructure context
  • ✅ Faster deployment - weeks instead of months
  • ✅ Backward compatible - doesn't break existing workflows
  • ✅ Historical data support - can reprocess old results

Why Not Store in Zathras Database?

  • Performance testing requires time-series analysis capabilities
  • Need sophisticated querying across multiple dimensions
  • Require visualization/dashboard integration
  • Benefit from existing data warehouse infrastructure
  • Scale to millions of data points over time

Benefits

For Performance Engineers

  • 📊 Real-time dashboards showing current and historical trends
  • 🔍 Ad-hoc queries across any dimension (test, OS, instance type, date range)
  • 📈 Trend visualization to spot performance changes over time
  • 🚨 Regression detection via automated alerts
  • 📝 Automated reporting for stakeholders

For Engineering Managers

  • 📉 Executive dashboards with performance KPIs
  • 💰 Cost analysis (test runtime × cloud pricing = spend tracking)
  • 🎯 Goal tracking (performance targets vs actual results)
  • 📊 Team productivity metrics (tests run, pass rates)

For CI/CD Integration

  • 🔄 Automated performance gates (fail CI if regression detected)
  • 📧 Notifications on performance changes
  • 🔗 Integration with existing monitoring systems
  • 📦 Data portability (JSON export to other tools)

For Research & Analysis

  • 🔬 Cross-test correlation (does better STREAM predict better HammerDB?)
  • 🌡️ Environmental impact (how do tuned profiles affect results?)
  • 📐 Statistical analysis (standard deviation, percentiles)
  • 🗂️ Dataset creation for ML models

Data Warehouse Options

Option A: OpenSearch (Elasticsearch fork)

Best for: General-purpose search, analytics, and visualization

Pros:

  • ✅ Powerful query language (SQL + DSL)
  • ✅ Kibana dashboards for visualization
  • ✅ Time-series analysis built-in
  • ✅ Widely adopted, large community
  • ✅ Real-time indexing and search
  • ✅ Flexible schema (add fields without migration)
  • ✅ REST API for easy integration

Cons:

  • ⚠️ Not performance-testing-specific
  • ⚠️ Requires infrastructure setup
  • ⚠️ May need custom dashboards

Use Cases:

  • Real-time monitoring during test runs
  • Ad-hoc queries across result history
  • Executive dashboards
  • Log correlation with performance data

Option B: Horreum (Performance Test Results Repository)

Best for: Performance regression tracking and historical comparison

Pros:

  • ✅ Built specifically for performance testing
  • ✅ Automatic regression detection
  • ✅ Change point analysis (detects when performance shifts)
  • ✅ Comparison views (before/after, baseline/candidate)
  • ✅ Schema validation for test results
  • ✅ Native understanding of performance metrics
  • ✅ Integration with CI/CD pipelines
  • ✅ Run comparison and annotation

Cons:

  • ⚠️ More specialized, smaller community
  • ⚠️ Less flexible for non-performance queries
  • ⚠️ Steeper learning curve

Use Cases:

  • Performance regression tracking in CI/CD
  • Baseline management (golden results)
  • Automated alerting on regressions
  • Historical performance comparison

Option C: Both (Recommended)

Strategy: Support multiple exporters, let users configure based on needs

# In scenario file
global:
  results_export:
    enabled: true
    targets:
      - opensearch:
          url: "https://opensearch.example.com"
          index: "zathras-results"
      - horreum:
          url: "https://horreum.example.com"
          test: "zathras-benchmark-suite"
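Dispatching to multiple configured targets could be as simple as the sketch below. The config dict mirrors the scenario-file structure above after YAML parsing; `export_all` and the exporter-callable convention are assumptions:

```python
def export_all(config: dict, document: dict, exporters: dict) -> list:
    """Send one document to each enabled target; returns the target names used.

    `config` is the parsed `results_export` block; `exporters` maps target
    names ("opensearch", "horreum") to callables taking (settings, document).
    """
    sent = []
    if not config.get("enabled", False):
        return sent
    for target in config.get("targets", []):
        # Each target entry is a one-key mapping: {"opensearch": {...}} etc.
        (name, settings), = target.items()
        exporters[name](settings, document)
        sent.append(name)
    return sent
```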

Benefits:

  • OpenSearch for real-time monitoring and dashboards
  • Horreum for regression detection and CI/CD gates
  • Users choose based on infrastructure and needs
  • Not mutually exclusive

Option D: Other Data Warehouses

The architecture supports adding exporters for:

  • InfluxDB (time-series focused)
  • PostgreSQL/TimescaleDB (SQL queries)
  • Prometheus (metrics and alerting)
  • Custom REST APIs (organization-specific systems)

Implementation Approach

Phase 1: Foundation

  • Create post_processing/ directory structure
  • Implement base processor interface
  • Build metadata extractor (parse hw_config.yml, ansible_vars.yml)
  • Implement archive handler (zip/tar extraction)
  • Create one processor (FIO - already outputs JSON)
  • Build OpenSearch exporter
  • Create main.py orchestrator
  • Test end-to-end with sample FIO results

Deliverable: Working prototype that processes FIO results to OpenSearch
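The archive handler from this phase could start as a thin wrapper over the standard library; the function name and return value are assumptions (tar support would be added the same way via `tarfile`):

```python
import zipfile
from pathlib import Path


def extract_results(archive: Path, dest: Path) -> list:
    """Unpack a results_<test>.zip archive and return the extracted member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()
```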


Phase 2: Expansion

  • Add 3-5 more processors (streams, linpack, coremark, uperf)
  • Implement Horreum exporter
  • Add configuration file support
  • Error handling and retry logic
  • Logging and debugging capabilities
  • CLI improvements (dry-run, batch mode)

Deliverable: Multi-test support with dual exporter capability
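The retry logic planned for this phase could follow a standard exponential-backoff pattern; the wrapper below is a sketch (function name, attempt count, and delays are assumptions):

```python
import time


def export_with_retry(send, document, attempts=3, base_delay=1.0):
    """Call send(document); retry with exponential backoff on failure.

    `send` is any exporter callable; with the defaults, delays between
    attempts are 1s then 2s. The last failure is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return send(document)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```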


Phase 3: Production Ready

  • Complete remaining processors
  • Integration tests with real scenarios
  • Documentation (README, usage examples)
  • Schema documentation
  • Performance optimization
  • Optional integration into burden script

Deliverable: Production-ready, documented solution


Key Design Decisions

1. Standalone vs Integrated

Decision: Build as standalone tool first, integrate later

Rationale:

  • Allows independent development and testing
  • Can process historical data
  • Users can run manually or in cron jobs
  • Optional integration into burden when stable

2. Push vs Pull

Decision: Push model (Zathras pushes to data warehouse)

Rationale:

  • Simpler architecture
  • Real-time availability
  • Data warehouse needs no inbound access to Zathras
  • Standard pattern for observability

3. Synchronous vs Asynchronous

Decision: Synchronous initially, async later if needed

Rationale:

  • Simpler implementation
  • Export time is small relative to test runtime
  • Can optimize later if bottleneck

4. Schema Strict vs Flexible

Decision: Flexible schema with core required fields

Rationale:

  • Tests evolve over time
  • New metadata may be added
  • Graceful degradation if parsing fails
  • OpenSearch handles schema evolution well

Success Criteria

Must Have

  • Process at least 3 test types (FIO, STREAM, Linpack)
  • Export to OpenSearch successfully
  • Unified JSON schema documented
  • Metadata enrichment working (hardware + cloud + test config)
  • Error handling (graceful failures)
  • Can process historical results

Should Have

  • Export to Horreum
  • 5-7 test processors implemented
  • Configuration file support
  • Batch processing mode
  • Unit test coverage >70%

Nice to Have

  • All 18 test processors
  • Integration into burden script
  • Sample Kibana dashboards
  • Performance optimization
  • Resumability (skip already-exported results)

Non-Goals (Out of Scope)

  • ❌ Modifying test wrappers
  • ❌ Real-time streaming during test execution
  • ❌ Building custom visualization UI
  • ❌ Changing Zathras core functionality
  • ❌ Result validation/correctness checking
  • ❌ Test execution scheduling

Risks & Mitigations

  • Test output format changes break parsers (Medium): keep raw output; version detection; graceful fallback
  • Data warehouse unavailable during export (Low): retry logic; queue exports; offline mode
  • Schema evolution over time (Medium): flexible schema; version field; backward compatibility
  • Performance overhead (Low): async export; make optional; optimize critical paths
  • Adoption resistance (Medium): make optional; show value first; gather feedback

Open Questions

  1. Authentication: How should credentials be managed? (env vars, config file, secrets manager?)
  2. Data retention: Who manages data warehouse retention policies?
  3. Schema governance: Who approves schema changes?
  4. Priority tests: Which 3-5 tests should we implement first?
  5. Infrastructure: Is OpenSearch/Horreum already deployed or needs setup?
  6. Access control: Who can export results? Any restrictions?

Labels

enhancement, feature, observability, data-export
