
Benchmark stability of Perf, HWPC Sensor and SmartWatts against RAPL stability #1


Description

@Inkedstinct

From https://docs.google.com/document/d/1iKMhEt-780Ub3iqzNHx1hTi-nJlUZosVuoWUSwtYoAU/edit?tab=t.0#heading=h.nbhjcz20m98s

H_0

RAPL metrics are stable under coherent workloads, and the metrics produced by Perf, HWPC Sensor and SmartWatts remain stable (i.e., they do not introduce instability) when using this interface as their base input.
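
One possible way to state H_0 quantitatively, assuming the coefficient of variation (CV) of the energy readings is the stability indicator (in line with the std/variance/CV checks below) and ε is a tolerance still to be fixed:

```latex
% Hypothetical formalization: the CV of each tool's energy readings stays
% within a tolerance \varepsilon of the CV of the raw RAPL readings.
H_0:\quad
\mathrm{CV}(E_{\mathrm{tool}})
  = \frac{\sigma(E_{\mathrm{tool}})}{\mu(E_{\mathrm{tool}})}
  \;\le\; \mathrm{CV}(E_{\mathrm{RAPL}}) + \varepsilon,
\qquad
\mathrm{tool} \in \{\mathrm{Perf},\, \mathrm{HWPC},\, \mathrm{SmartWatts}\}
```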


Open Questions

  • Q: Is stress-ng a suitable candidate?
    • stress-ng can be configured to reproduce the same load at a given report frequency (see the sketch after this list)
  • Q: What is the current baseline/SOTA to be considered?
    • RAPL
  • Q: What are the metrics to consider, both for the Measured Metric (MM) and the Outside Metrics (OM)?
    • MM
      • Joules
    • OM
      • CPU consumption
      • Memory usage
      • Global system-call impact
  • Q: What is the "most specific" zoom to consider in sub-benchmark results?

To Avoid: Benchmarking Crimes

  1. Selective benchmarking
    1. Not evaluating degradation in other areas: performance must significantly improve in the area of interest AND not significantly degrade elsewhere
      • Check for a decrease of the std, variance and CV of the Measured Metric (MM) while the Outside Metrics (OM) stay stable (see the variability sketch after this list)
    2. Cherry-picking subsets of a suite without justification (and concluding about the whole suite): justify the avoided subsets, not arbitrary choices
      • Run the benchmark against all available types of infrastructure (with reasonable combinations)
    3. Selective data sets hiding deficiencies: do not restrict a data set if the removed data tells a different story
      • Present the whole benchmark results
  2. Improper handling of benchmark results
    1. Micro-benchmarks cannot be pictured alone; they can be used as an example before presenting macro-benchmarking of real-world workloads
      • Should this benchmark conclude stability against stress-ng use cases, ensure it still holds for a more complex, yet still controlled, use case
    2. "Throughput degraded by X% => overhead is X%": accompany throughput comparisons with the complete CPU load AND compare I/O throughput in terms of processing time per bit
      • If comparing two stability results for different PowerAPI versions, do so against the OM
    3. Downplaying overheads (see the significance sketch after this list)
      • Going from X% to Y% is not a (Y-X)% increase/decrease: 1% to 2% is a doubling
      • The baseline is the denominator in a relative comparison
    4. No indication of the significance of data: always refer to the variance/standard deviation and any obvious indicators (R²...), even consider min/max; use Student's t-test to check significance
    5. The geometric mean shall be preferred to the arithmetic mean, at least with normalized values
  3. Using the wrong benchmarks
    1. Benchmarking a simplified simulated system: use a representative system AND do not make simplifying assumptions that impact the results
      • Ensure that stress-ng is a suitable candidate for benchmarking
    2. Inappropriate and misleading benchmarks: measure what you are changing
      • Use the same formula/sensor inputs on a given benchmark
    3. Same data set for calibration and validation: keep both data sets disjoint
      • Use different programs/options to calibrate SmartWatts during the development process
  4. Improper comparison of results
    1. No proper baseline: always compare to the "real" baseline (SOTA solution, theoretical best)
      • Resolve the OQ about the baseline
    2. Only evaluating against yourself: even a re-edition/correction shall refer to the current "real" baseline
      • Present the evolution from the previous paper and from the current SOTA/baseline
    3. Unfair benchmarking of competitors: be explicit and complete about how a competitor's tool is used (even consider having them validate the results)
      • If comparing to Scaphandre...
    4. Inflating gains by not comparing against the current SOTA: avoid "they improved by X and we by Y"; prefer "we improved the newly established baseline X by Z"
  5. Missing information
    1. Missing specification of the evaluation platform: give as many details as possible about ALL used hardware to ensure reproducibility: processor architecture, number of cores, clock rate, memory sizes, size of all cache levels, core type, microarchitecture, OS & version, hypervisor & version
      • Enrich with as much metadata as possible to provide supplementary resources
    2. Presenting only top-level/aggregated results: sub-benchmarks shall be presented to avoid loss of information
      • Resolve the OQ about the adequate grain for the "most specific" sub-benchmarks
    3. Relative numbers only: present raw values in addition to ratios to allow people to sanity-check
      • To do! :)
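
For crime 1.1, a minimal sketch of the variability check on the Measured Metric (joules) together with a stability check on an Outside Metric; the sample values and the OM threshold are made-up assumptions:

```python
import statistics

def variability(samples: list[float]) -> dict[str, float]:
    """Std, variance and coefficient of variation of a series of measurements."""
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    return {"mean": mean, "std": std, "variance": std ** 2, "cv": std / mean}

# Hypothetical data: joules reported per run (MM) and CPU load per run (OM).
joules_per_run = [101.8, 102.1, 101.9, 102.0, 102.2]
cpu_load_per_run = [49.8, 50.1, 50.0, 49.9, 50.2]

mm = variability(joules_per_run)
om = variability(cpu_load_per_run)

# Only conclude on MM stability if the OM stayed stable as well (threshold assumed).
OM_CV_THRESHOLD = 0.01
if om["cv"] <= OM_CV_THRESHOLD:
    print(f"MM cv={mm['cv']:.4%} with stable OM (cv={om['cv']:.4%})")
else:
    print("OM unstable: MM variability cannot be interpreted in isolation")
```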
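
For crimes 2.3 to 2.5, a short sketch of how relative changes, significance tests and the geometric mean could be computed; scipy is assumed to be available and the sample values are made up:

```python
from scipy import stats
from statistics import geometric_mean

# 2.3 Relative change: the baseline is the denominator, and percentage points
# are not percentages (1% -> 2% overhead is a 100% relative increase).
baseline_overhead, new_overhead = 0.01, 0.02
relative_increase = (new_overhead - baseline_overhead) / baseline_overhead
print(f"relative increase: {relative_increase:.0%}")  # 100%, not 1%

# 2.4 Significance: Student's t-test between two series of runs (made-up samples).
rapl_joules = [101.8, 102.1, 101.9, 102.0, 102.2]
smartwatts_joules = [101.9, 102.3, 102.0, 102.1, 102.4]
t_stat, p_value = stats.ttest_ind(rapl_joules, smartwatts_joules)
print(f"t={t_stat:.3f}, p={p_value:.3f}")

# 2.5 Geometric mean of per-benchmark results normalized to the baseline.
normalized_ratios = [1.02, 0.97, 1.05, 1.01]
print(f"geometric mean of ratios: {geometric_mean(normalized_ratios):.3f}")
```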

Best Practice

  • Document what I do
  • Run warm-up iterations that aren't timed
    • May be useful to literally warm up the CPU
  • Do several runs and check the std (expected to be < 0.1%); see the harness sketch after this list
    • [ ] We may define 0.1% as the threshold under which we move on to something else (and then keep 0.1% as an acceptance threshold for further features)
  • Use a combination of successive and separate runs
    • Same workload twice in a row (may exhibit caching effects)
      • May be relevant for HWPC Sensor
    • Explore the data set in both directions
      • "Directions" might mean going from "more stressed" to "more relaxed" and back; this may need a long "cool off" step
  • If making use of regular strides (2, 4, 8, 16...), also use "random" points to avoid pathological cases, but keep the regular ones in order to identify those pathological cases
    • Might be considered in the formulas?
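
A minimal sketch of the run harness described above: untimed warm-up iterations, several measured runs with a relative-std check against the assumed 0.1% threshold, and regular strides mixed with random points. The measure_joules() hook and the load values are hypothetical placeholders, not part of any existing tool:

```python
import random
import statistics

def measure_joules(load_percent: int) -> float:
    """Placeholder for one measured run at the given load.

    In a real harness this would run the fixed stress-ng workload and read an
    energy counter (e.g. RAPL); here it returns a fake value so the sketch runs.
    """
    return 100.0 + load_percent * 0.5 + random.uniform(-0.05, 0.05)

def benchmark(load_percent: int, warmups: int = 2, runs: int = 10,
              rel_std_threshold: float = 0.001) -> float:
    # Untimed warm-up iterations, literally warming up the CPU.
    for _ in range(warmups):
        measure_joules(load_percent)
    # Measured runs, then check the relative std against the assumed 0.1% threshold.
    samples = [measure_joules(load_percent) for _ in range(runs)]
    rel_std = statistics.stdev(samples) / statistics.fmean(samples)
    if rel_std > rel_std_threshold:
        print(f"load={load_percent}%: relative std {rel_std:.4%} above threshold")
    return rel_std

# Regular strides plus a few random points to expose pathological cases.
regular_loads = [2, 4, 8, 16, 32, 64]
random_loads = random.sample(range(1, 100), 3)
for load in regular_loads + random_loads:
    benchmark(load)
```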
